Certified Data Engineer Professional Exam - Question 75

Question

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

Examice · Accepted Answer

To deduplicate incoming data against previously processed records as it is inserted into a Delta table, perform an insert-only merge with a matching condition on a unique key. An insert-only merge has only a WHEN NOT MATCHED THEN INSERT clause: if an incoming record matches an existing record on the unique key, it is ignored and the existing record is left unchanged; if there is no match, the new record is inserted. Unlike a full upsert, nothing is updated. The merge compares the batch only against rows already in the table, so duplicates within the batch itself still have to be removed beforehand, as the question states.
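
As a minimal sketch of that approach with the delta-spark Python API: the function below assumes `target` is a `DeltaTable`, `batch_df` is a Spark DataFrame, and the unique key is a column named `id`. The function name and the column name are illustrative assumptions, not part of the question.

```python
def insert_only_merge(target, batch_df, key="id"):
    """Insert rows from batch_df whose key is not already in the Delta table.

    target   : a delta.tables.DeltaTable (assumed)
    batch_df : a pyspark.sql.DataFrame holding the incoming batch (assumed)
    key      : name of the unique key column (hypothetical default)
    """
    # Step 1: drop duplicates inside the incoming batch. The merge below only
    # compares the batch with rows already in the table.
    deduped = batch_df.dropDuplicates([key])

    # Step 2: insert-only merge. There is no whenMatched clause, so rows whose
    # key already exists in the table are left unchanged.
    (
        target.alias("t")
        .merge(deduped.alias("s"), f"t.{key} = s.{key}")
        .whenNotMatchedInsertAll()
        .execute()
    )
```

It could be called as `insert_only_merge(DeltaTable.forName(spark, "events"), batch_df)`, where the table name `events` is hypothetical. The equivalent SQL form is `MERGE INTO ... USING ... ON <key match> WHEN NOT MATCHED THEN INSERT *`.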

sturcu · Answer

Merge with WHEN NOT MATCHED THEN INSERT

hm358 · Answer

merge will be more efficient

Dileepvikram · Answer

Answer is C

aragorn_brego · Answer

To handle deduplication against previously processed records in a Delta table, the MERGE INTO command can be used to perform an upsert operation. This means that if the incoming data has a record that matches an existing record based on a unique key, the MERGE INTO operation can update the existing record (if needed) or simply ignore the duplicate. If there is no match (i.e., the record is new), then the record will be inserted.
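
The matched/not-matched behaviour can be illustrated with a small pure-Python model. This is not the Delta Lake API: tables are plain lists of dicts, and the function names and the `id` key are assumptions made for the illustration. It models the insert-only case, where a matched record is ignored rather than updated.

```python
def dedupe_batch(batch, key="id"):
    """Keep the first row seen for each key within the incoming batch."""
    seen = set()
    unique = []
    for row in batch:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def insert_only_merge_model(table, batch, key="id"):
    """Model of an insert-only merge: append batch rows whose key is not in the table."""
    existing = {row[key] for row in table}
    # Matched rows are skipped (no WHEN MATCHED clause); unmatched rows are inserted.
    # Each batch row is checked against the existing table only, not against the
    # other rows of the same batch.
    return table + [row for row in batch if row[key] not in existing]


table = [{"id": 1, "v": "old"}]
batch = [
    {"id": 1, "v": "late duplicate"},  # already processed: ignored
    {"id": 2, "v": "new"},             # new: inserted
    {"id": 2, "v": "new"},             # duplicate within the batch
]
merged = insert_only_merge_model(table, dedupe_batch(batch))
print(merged)
```

In this model, skipping `dedupe_batch` lets both `id` 2 rows through, which is why the question treats de-duplication within the batch as a separate step.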

Crocjun · Answer

C
Reference: file:///C:/Users/yuen1/Downloads/databricks-certified-data-engineer-professional-exam-guide.pdf

60ties · Answer

answer is C
