Exam: Certified Data Engineer Professional
Question 95

A task orchestrator has been configured to run two hourly tasks. First, an external system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:
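(The code block itself did not survive into this text dump. Based on the answer explanation and the API calls quoted in the discussion below, it is approximately the following PySpark sketch; the schema, checkpoint path, and trigger setting are assumptions, not from the original question.)

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.getOrCreate()

    # Assumed schema: streaming file sources require one, and only these
    # three fields are named in the question.
    orders_schema = StructType([
        StructField("customer_id", StringType()),
        StructField("order_id", StringType()),
        StructField("time", TimestampType()),
    ])

    (spark.readStream
        .format("parquet")
        .schema(orders_schema)
        .load("/mnt/raw_orders/")
        .withWatermark("time", "2 hours")
        .dropDuplicates(["customer_id", "order_id"])
        .writeStream
        .option("checkpointLocation", "/mnt/checkpoints/orders")  # assumed path
        .trigger(availableNow=True)                               # assumed trigger
        .toTable("orders"))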

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system.

If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

    Correct Answer: A

    The job reads a stream of Parquet data, applies a 2-hour watermark on the time field, and deduplicates on the composite key (customer_id, order_id). With a watermark, Structured Streaming retains deduplication state for a key only until the watermark passes it; after that, the state is evicted. A duplicate enqueued more than 2 hours after the original therefore finds no matching state, is not recognized as a duplicate, and is written to the orders table as a new row. The table may consequently contain records sharing the same customer_id and order_id whenever the copies were queued more than 2 hours apart.
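    (If duplicates enqueued hours apart must also be suppressed, one common alternative, not part of the question, is to MERGE each micro-batch into the target Delta table on the composite key via foreachBatch. A minimal sketch, assuming orders is a Delta table and reusing the assumed schema from the sketch above:)

    from delta.tables import DeltaTable

    def upsert_orders(batch_df, batch_id):
        # Drop duplicates inside the micro-batch, then insert only keys
        # that are not already present in the target table.
        deduped = batch_df.dropDuplicates(["customer_id", "order_id"])
        target = DeltaTable.forName(batch_df.sparkSession, "orders")
        (target.alias("t")
            .merge(deduped.alias("s"),
                   "t.customer_id = s.customer_id AND t.order_id = s.order_id")
            .whenNotMatchedInsertAll()
            .execute())

    (spark.readStream
        .format("parquet")
        .schema(orders_schema)          # assumed schema, sketched above
        .load("/mnt/raw_orders/")
        .writeStream
        .foreachBatch(upsert_orders)
        .option("checkpointLocation", "/mnt/checkpoints/orders_merge")  # assumed
        .trigger(availableNow=True)
        .start())

    (The trade-off: MERGE scans the target for matching keys on every batch, so it is slower than stateful dropDuplicates, but it stays correct no matter how far apart the duplicate copies arrive.)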

Discussion
alexvno · Option: A

Only A seems logical

Isio05 · Option: A

It's A: rows are deduplicated only within the 2-hour window, so the final table may eventually contain duplicates.

QuangTrinh · Option: E

Should be E. Watermarking (withWatermark("time", "2 hours")) sets a 2-hour watermark on the time column; the watermark defines the event-time threshold for data completeness, so data arriving more than 2 hours late may be dropped. Deduplication (dropDuplicates(["customer_id", "order_id"])) removes duplicates on the composite key, but only within the state retained under the watermark.
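
(Worth noting alongside these comments: Spark 3.5+ also offers dropDuplicatesWithinWatermark, which guarantees deduplication for copies whose event times fall within the watermark delay of each other. It does not change the answer here, since copies queued more than 2 hours apart still fall outside the retained state. A minimal sketch, reusing the assumed schema above:)

    # Requires Spark 3.5+. Still bounded by the watermark: duplicates more
    # than 2 hours apart in event time are not guaranteed to be caught.
    deduped = (spark.readStream
        .format("parquet")
        .schema(orders_schema)
        .load("/mnt/raw_orders/")
        .withWatermark("time", "2 hours")
        .dropDuplicatesWithinWatermark(["customer_id", "order_id"]))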