Exam Certified Data Engineer Professional All QuestionsBrowse all questions from this exam
Question 71

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

    Correct Answer: A

    In a streaming data pipeline using Spark Structured Streaming, to handle and manage late-arriving data while performing aggregations, the `withWatermark` function is used. This function defines how long the framework should wait for late data to arrive before it is considered too late. The watermark value, in this case, 10 minutes, specifies the threshold for lateness. Therefore, the code block should use `withWatermark` to ensure incremental state information is maintained for 10 minutes for late-arriving data.

Discussion
sturcuOption: A

withWatermark. There sliding window is doe through the window function

aragorn_bregoOption: A

To handle late-arriving data in a streaming aggregation, you need to specify a watermark, which tells the streaming query how long to wait for late data. The withWatermark method is used for this purpose in Spark Structured Streaming. It defines the threshold for how late the data can be relative to the latest data that has been seen in the same window.

DileepvikramOption: A

Answer is A