Certified Data Engineer Professional Exam - Question 75


A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

Correct Answer: C

To deduplicate data against previously processed records as it is inserted into a Delta table, the appropriate approach is an insert-only merge with a matching condition on a unique key. An insert-only merge defines only a not-matched clause: if an incoming record matches an existing record on the unique key, it is ignored and the existing record is left unchanged; if there is no match, the new record is inserted. Because the merge compares incoming records only against the target table, the batch itself must still be deduplicated beforehand, as the question states. Together, the two steps handle duplicates both within the current batch and against previously processed records.
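
The semantics described above can be modeled with a minimal sketch in plain Python. This is not Delta Lake code: the table is assumed to be a dictionary keyed by the unique key, and the names `dedupe_batch`, `insert_only_merge`, `id`, and `value` are hypothetical, chosen only for illustration.

```python
def dedupe_batch(batch, key):
    """Keep the first record seen for each key within one batch."""
    seen = {}
    for record in batch:
        seen.setdefault(record[key], record)
    return list(seen.values())


def insert_only_merge(target, batch, key):
    """Insert records whose key is absent from target; leave matches untouched."""
    for record in dedupe_batch(batch, key):
        if record[key] not in target:  # models WHEN NOT MATCHED THEN INSERT
            target[record[key]] = record
    return target


# Hypothetical data: id 1 was processed earlier, id 2 arrives twice in the batch.
table = {1: {"id": 1, "value": "a"}}
batch = [
    {"id": 1, "value": "late duplicate"},
    {"id": 2, "value": "b"},
    {"id": 2, "value": "b again"},
]
insert_only_merge(table, batch, "id")
print(sorted(table))  # [1, 2]
```

The late duplicate for id 1 is ignored rather than used to overwrite the stored record, which is what distinguishes an insert-only merge from a full upsert.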

Discussion

6 comments
sturcu Option: C
Oct 25, 2023

MERGE with WHEN NOT MATCHED THEN INSERT.

hm358 Option: C
Oct 29, 2023

MERGE will be more efficient.

Dileepvikram Option: C
Nov 9, 2023

Answer is C

aragorn_brego Option: C
Nov 21, 2023

To handle deduplication against previously processed records in a Delta table, the MERGE INTO command can be used to perform an upsert operation. If the incoming data has a record that matches an existing record based on a unique key, the MERGE INTO operation can update the existing record (if needed) or simply ignore the duplicate. If there is no match (i.e., the record is new), the record is inserted.
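
As a rough illustration of the "ignore the duplicate" variant described in this comment, the sketch below builds a MERGE INTO statement that has only a not-matched clause. The helper `insert_only_merge_sql` and the names `events`, `deduped_batch`, and `event_id` are hypothetical; the source is assumed to be already deduplicated within the batch.

```python
def insert_only_merge_sql(target, source, key):
    """Build a MERGE INTO statement with only a WHEN NOT MATCHED clause."""
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical table, view, and key names.
statement = insert_only_merge_sql("events", "deduped_batch", "event_id")
print(statement)
# In a Spark session with Delta Lake, this could be run with spark.sql(statement).
```

Leaving out any matched clause is what makes the merge insert-only: rows whose key already exists in the target are skipped.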

Crocjun Option: C
Oct 22, 2023

C Reference: file:///C:/Users/yuen1/Downloads/databricks-certified-data-engineer-professional-exam-guide.pdf

mouad_attaqi
Oct 25, 2023

You are referencing a local PDF on your computer!

60ties Option: C
Nov 15, 2023

Answer is C.