Exam: Certified Data Engineer Professional
Question 8

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day, as indicated by the date variable. (The code itself is not reproduced in this dump; per the discussion below, it reads the day's Parquet directory, calls .dropDuplicates(["customer_id", "order_id"]), and appends the result to the orders table.)

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

    Correct Answer: B

    Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table. The provided code uses the .dropDuplicates method, which removes duplicates in the current batch of data based on the specified keys (customer_id and order_id) before writing to the orders table. However, this method does not check for duplicates that might already exist in the target table from previous writes. Therefore, while each new batch of data will be de-duplicated, duplicates may still persist in the orders table if they were written in earlier batches.
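To illustrate the failure mode, here is a minimal pure-Python simulation (not PySpark; the batch contents are invented for illustration): deduplicating each batch on the composite key before appending still leaves duplicates across batches in the target table.

```python
def drop_duplicates(batch, keys=("customer_id", "order_id")):
    """Keep the first record per composite key, mimicking dropDuplicates on one batch."""
    seen, out = set(), []
    for rec in batch:
        key = tuple(rec[k] for k in keys)
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out

orders_table = []  # target table; plain appends, like .write.mode("append")

# The same order (c1, o1) arrives in two different batches, hours apart.
batch_1 = [{"customer_id": "c1", "order_id": "o1", "amount": 10},
           {"customer_id": "c1", "order_id": "o1", "amount": 10},  # dup within the batch: removed
           {"customer_id": "c2", "order_id": "o2", "amount": 20}]
batch_2 = [{"customer_id": "c1", "order_id": "o1", "amount": 10}]  # dup across batches: NOT removed

for batch in (batch_1, batch_2):
    orders_table.extend(drop_duplicates(batch))  # each individual write is internally unique

keys = [(r["customer_id"], r["order_id"]) for r in orders_table]
print(len(orders_table))          # 3 rows in the target
print(keys.count(("c1", "o1")))   # 2: the cross-batch duplicate persists
```

Each call to drop_duplicates only sees one batch, so it has no way to know that (c1, o1) was already written the previous night; that is exactly why option B is correct.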

Discussion
Eertyy - Option: B

B. Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table. Explanation: In the provided code, the .dropDuplicates(["customer_id","order_id"]) operation is performed on the data loaded from the Parquet files. This operation ensures that only unique records, based on the composite key of "customer_id" and "order_id," are retained in the DataFrame before writing to the "orders" table. However, this operation does not consider duplicates that may already exist in the "orders" table. It only filters duplicates from the current batch of data. If there are duplicates in the "orders" table from previous batches, they will remain in the table. So, newly written records will not have duplicates within the batch being written, but duplicates from previous batches may still exist in the target table.

thxsgod - Option: B

Correct

panya - Option: B

Yes it should be B

imatheushenrique - Option: B

B. Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table. Using MERGE instead, this problem would not happen.

DavidRou - Option: B

B is the right answer. The above code only removes duplicates from the batch being processed; no logic is applied to already-saved records.

Jay_98_11 - Option: B

B is correct

5ffcd04 - Option: B

Answer B

kz_data - Option: B

B is correct

vivekla - Option: B

Correct, B

sturcu - Option: B

Correct

Starvosxant - Option: B

Correct. B