
Certified Data Engineer Professional Exam - Question 8


An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:
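The code block itself is not reproduced in this capture. Based on the explanation below, it presumably resembles the following PySpark sketch; the source path, directory layout, and placeholder date value are assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    date = "2024-06-04"  # placeholder; the real job derives the previous day's date

    (spark.read
        .format("parquet")
        .load(f"/mnt/raw_orders/{date}")              # assumed directory layout
        .dropDuplicates(["customer_id", "order_id"])  # de-duplicates within this batch only
        .write
        .mode("append")
        .saveAsTable("orders"))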

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

Correct Answer: B

Each write to the orders table will contain only unique records, but newly written records may duplicate entries already present in the target table. The provided code uses .dropDuplicates(["customer_id", "order_id"]), which removes duplicates within the current batch based on the composite key before writing to the orders table. However, it does not check for records that already exist in the target table from previous writes. Therefore, while each new batch is de-duplicated internally, duplicates can still accumulate in the orders table when duplicate entries straddle the daily boundary and land in different batches.

Discussion

11 comments
Eertyy (Option: B)
Sep 21, 2023

B. Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.

Explanation: In the provided code, the .dropDuplicates(["customer_id","order_id"]) operation is performed on the data loaded from the Parquet files. This operation ensures that only unique records, based on the composite key of "customer_id" and "order_id," are retained in the DataFrame before writing to the "orders" table. However, this operation does not consider duplicates that may already exist in the "orders" table. It only filters duplicates from the current batch of data. If there are duplicates in the "orders" table from previous batches, they will remain in the table. So, newly written records will not have duplicates within the batch being written, but duplicates from previous batches may still exist in the target table.

thxsgod (Option: B)
Sep 7, 2023

Correct

Starvosxant (Option: B)
Oct 9, 2023

Correct. B

sturcu (Option: B)
Oct 11, 2023

Correct

vivekla (Option: B)
Nov 30, 2023

correct B

kz_data (Option: B)
Dec 21, 2023

B is correct

5ffcd04 (Option: B)
Jan 1, 2024

Answer B

Jay_98_11 (Option: B)
Jan 13, 2024

B is correct

DavidRou (Option: B)
Mar 9, 2024

B is the right answer. The above code only removes duplicates from the batch being processed; no logic is applied to records that were already saved.

imatheushenrique (Option: B)
Jun 5, 2024

B. Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table. Using MERGE, this problem would not happen.
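A minimal sketch of the MERGE-based alternative this comment alludes to, assuming orders is a Delta table and reusing the spark session, date variable, and assumed path from the sketch under the question: an insert-only merge skips any incoming (customer_id, order_id) pair that already exists in the target.

    from delta.tables import DeltaTable

    # De-duplicate the incoming batch, then merge so that keys already present
    # in the target table are ignored rather than appended a second time.
    batch_df = (spark.read
        .format("parquet")
        .load(f"/mnt/raw_orders/{date}")
        .dropDuplicates(["customer_id", "order_id"]))

    (DeltaTable.forName(spark, "orders").alias("t")
        .merge(
            batch_df.alias("s"),
            "t.customer_id = s.customer_id AND t.order_id = s.order_id")
        .whenNotMatchedInsertAll()
        .execute())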

panya (Option: B)
Jun 24, 2024

Yes it should be B