
Certified Data Engineer Professional Exam - Question 21


A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.

Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

Correct Answer: E

To ensure that records are processed in less than 10 seconds, the key is to handle microbatch processing more efficiently during peak hours. Decreasing the trigger interval to 5 seconds can help achieve this by triggering batches more frequently, which may prevent records from backing up and large batches from causing spill. This allows more consistent batch processing times and utilizes available resources effectively, reducing the risk of exceeding the 10-second processing requirement.
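For illustration, here is a minimal PySpark sketch of the adjustment described above: lowering the processing-time trigger from 10 seconds to 5 seconds. The source, sink format, and paths are placeholders rather than details from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-interval-sketch").getOrCreate()

# Placeholder source: a rate stream stands in for the production input.
events = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 100)
    .load()
)

query = (
    events.writeStream
    .format("delta")  # assumed sink; any streaming sink is configured the same way
    .option("checkpointLocation", "/tmp/checkpoints/demo")  # placeholder path
    .trigger(processingTime="5 seconds")  # previously "10 seconds"
    .start("/tmp/tables/demo")  # placeholder output path
)
```

With a shorter trigger, each micro-batch picks up less data, which tends to keep per-batch processing times more uniform under peak load.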

Discussion

15 comments
sturcu | Option: E
Oct 11, 2023

Changing the trigger interval to "once" will cause this to run as a single batch and it will not execute in micro-batches. This will not help at all.
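For context, a small PySpark sketch (placeholder source and query names) contrasting a fixed processing-time trigger with the run-once trigger this comment refers to, which drains whatever is available as a single batch and then stops:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-modes-sketch").getOrCreate()

# Placeholder source standing in for the production input.
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Repeated micro-batches on a fixed 5-second interval.
micro_batched = (
    events.writeStream
    .format("memory").queryName("micro_demo")
    .trigger(processingTime="5 seconds")
    .start()
)

# Process everything currently available in one run, then stop.
# (Recent runtimes offer trigger(availableNow=True) as the newer replacement.)
one_shot = (
    events.writeStream
    .format("memory").queryName("once_demo")
    .trigger(once=True)
    .start()
)
```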

asmayassineg | Option: E
Aug 2, 2023

The correct answer is E. D means a job would need to acquire resources every 10 seconds, which is impractical without serverless compute.

Eertyy | Option: E
Aug 30, 2023

correct answer is E

cotardo2077 | Option: E
Sep 5, 2023

for sure E

RafaelCFC | Option: E
Jan 5, 2024

I believe this is a case of the least bad option, not exactly the best option possible.
- A is wrong because in streaming you very rarely have any executors idle; all cores are engaged in processing the current window of data.
- B is wrong because triggering every 30 seconds will not meet the 10-second processing target.
- C is wrong in two ways: increasing shuffle partitions above the number of available cores in the cluster will worsen streaming performance, and the checkpoint folder has no connection to the trigger time (see the sketch after this list).
- D is wrong because, keeping all other things the same as described in the problem, keeping the trigger time at 10 seconds will not change the underlying cause of the delay (i.e., too much data to process in a timely manner).
E is the only option that might improve processing time.
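A minimal sketch of the shuffle-partition point in C (the core count and config values are assumptions, not details from the question):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-partitions-sketch").getOrCreate()

# Assume a cluster with 16 worker cores; align shuffle partitions with it
# instead of the 200 default, so each micro-batch does not schedule far
# more shuffle tasks than there are cores to run them.
spark.conf.set("spark.sql.shuffle.partitions", "16")

# Note: for stateful streaming queries the partition count is recorded in the
# checkpoint, so changing it later requires a new checkpoint location.
```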

ojudz08 | Option: E
Feb 14, 2024

E is the answer. Enabling the setting uses 128 MB as the target file size: https://learn.microsoft.com/en-us/azure/databricks/delta/tune-file-size
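For reference, a hedged example of the file-size tuning the linked doc covers (the table name is hypothetical, and the right target size depends on the workload):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("target-file-size-sketch").getOrCreate()

# Hypothetical Delta table; delta.targetFileSize asks Delta Lake on Databricks
# to aim for files of roughly the given size when writing and optimizing.
spark.sql("""
    ALTER TABLE example_events
    SET TBLPROPERTIES ('delta.targetFileSize' = '128mb')
""")
```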

azurearch | Option: E
Sep 9, 2023

Option A is also about setting the trigger interval to 5 seconds; just to understand, why is it not the answer?

azurearch | Option: D
Sep 9, 2023

What if more records arrive within that 5-second trigger interval? That would still increase the time it takes to process, so I doubt E is correct. I will go with answer D. The point is not to execute all queries within 10 seconds; it is to execute a trigger-now batch every 10 seconds.

azurearch | Option: C
Sep 9, 2023

Sorry, the caveat is holding all other variables constant; does that mean we are not allowed to change the trigger interval? Is C the answer then?

Eertyy | Option: E
Sep 21, 2023

correct answer is E

ofed | Option: C
Nov 7, 2023

Only C. Even if you trigger more frequently, you only decrease both the load and the time per batch for that load; E doesn't change anything.

ervinshang | Option: E
Dec 25, 2023

correct answer is E

kz_data | Option: E
Jan 10, 2024

I think E is correct

DAN_H | Option: E
Jan 31, 2024

E is correct; A is wrong because in streaming you very rarely have any executors idle.

imatheushenrique | Option: E
Jun 1, 2024

Considering performance gain, the best option is E: decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.