
Certified Data Engineer Professional Exam - Question 21


A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.

Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

Correct Answer: E

To ensure that records are processed in less than 10 seconds, the key is to handle microbatch processing more efficiently during peak hours. Decreasing the trigger interval to 5 seconds can help achieve this by triggering batches more frequently, which may prevent records from backing up and large batches from causing spill. This allows more consistent batch processing times and utilizes available resources effectively, reducing the risk of exceeding the 10-second processing requirement.
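For illustration, here is a minimal PySpark sketch of the adjustment described above: lowering the processing-time trigger from 10 seconds to 5 seconds. The source, sink format, and paths are placeholders rather than details from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-interval-sketch").getOrCreate()

# Placeholder source: a rate stream stands in for the production input.
events = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 100)
    .load()
)

query = (
    events.writeStream
    .format("delta")  # assumed sink; any streaming sink is configured the same way
    .option("checkpointLocation", "/tmp/checkpoints/demo")  # placeholder path
    .trigger(processingTime="5 seconds")  # previously "10 seconds"
    .start("/tmp/tables/demo")  # placeholder output path
)
```

With a shorter trigger, each micro-batch picks up less data, which tends to keep per-batch processing times more uniform under peak load.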

Discussion

15 comments
sturcu | Option: E
Oct 11, 2023

Changing the trigger interval to "once" will cause this to run as a single batch and it will not execute in micro-batches. This will not help at all.
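For context, a small PySpark sketch (placeholder source and query names) contrasting a fixed processing-time trigger with the run-once trigger this comment refers to, which drains whatever is available as a single batch and then stops:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-modes-sketch").getOrCreate()

# Placeholder source standing in for the production input.
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Repeated micro-batches on a fixed 5-second interval.
micro_batched = (
    events.writeStream
    .format("memory").queryName("micro_demo")
    .trigger(processingTime="5 seconds")
    .start()
)

# Process everything currently available in one run, then stop.
# (Recent runtimes offer trigger(availableNow=True) as the newer replacement.)
one_shot = (
    events.writeStream
    .format("memory").queryName("once_demo")
    .trigger(once=True)
    .start()
)
```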

asmayassineg | Option: E
Aug 2, 2023

The correct answer is E. D means a job would need to acquire resources every 10 seconds, which is impractical without serverless compute.

Eertyy | Option: E
Aug 30, 2023

correct answer is E

cotardo2077 | Option: E
Sep 5, 2023

for sure E

RafaelCFC | Option: E
Jan 5, 2024

I believe this is a case of the least bad option, not exactly the best option possible.
- A is wrong because in streaming you very rarely have any executors idle; all cores are engaged in processing the current window of data.
- B is wrong because triggering every 30 seconds will not meet the 10-second processing target.
- C is wrong in two ways: increasing shuffle partitions above the number of available cores in the cluster will worsen streaming performance, and the checkpoint folder has no connection to the trigger time (see the sketch after this list).
- D is wrong because, keeping all other things the same as described in the problem, keeping the trigger time at 10 seconds will not change the underlying cause of the delay (i.e., too much data to process in a timely manner).
E is the only option that might improve processing time.
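A minimal sketch of the shuffle-partition point in C (the core count and config values are assumptions, not details from the question):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-partitions-sketch").getOrCreate()

# Assume a cluster with 16 worker cores; align shuffle partitions with it
# instead of the 200 default, so each micro-batch does not schedule far
# more shuffle tasks than there are cores to run them.
spark.conf.set("spark.sql.shuffle.partitions", "16")

# Note: for stateful streaming queries the partition count is recorded in the
# checkpoint, so changing it later requires a new checkpoint location.
```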

ojudz08 | Option: E
Feb 14, 2024

E is the answer. Enabling the setting uses 128 MB as the target file size: https://learn.microsoft.com/en-us/azure/databricks/delta/tune-file-size
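For reference, a hedged example of the file-size tuning the linked doc covers (the table name is hypothetical, and the right target size depends on the workload):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("target-file-size-sketch").getOrCreate()

# Hypothetical Delta table; delta.targetFileSize asks Delta Lake on Databricks
# to aim for files of roughly the given size when writing and optimizing.
spark.sql("""
    ALTER TABLE example_events
    SET TBLPROPERTIES ('delta.targetFileSize' = '128mb')
""")
```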

azurearch | Option: E
Sep 9, 2023

Option A is also about setting the trigger interval to 5 seconds; just to understand, why is it not the answer?

azurearch | Option: D
Sep 9, 2023

What if more records arrive within that 5-second trigger interval? That would still increase the time it takes to process, so I doubt E is correct. I will go with answer D. The point is not to execute all queries within 10 seconds; it is to execute a trigger-now batch every 10 seconds.

azurearch | Option: C
Sep 9, 2023

Sorry, the caveat is holding all other variables constant; does that mean we are not allowed to change the trigger interval? Is C the answer then?

Eertyy | Option: E
Sep 21, 2023

correct answer is E

ofed | Option: C
Nov 7, 2023

Only C. Even if you trigger more frequently, you only decrease both the load and the time per batch for that load; E doesn't change anything.

ervinshang | Option: E
Dec 25, 2023

correct answer is E

kz_data | Option: E
Jan 10, 2024

I think E is correct

DAN_H | Option: E
Jan 31, 2024

E is correct; A is wrong because in streaming you very rarely have any executors idle.

imatheushenrique | Option: E
Jun 1, 2024

Considering performance gain, the best option is E: decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.