Exam: Certified Data Engineer Professional
Question 73

A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each microbatch of data is processed in less than 3s; at least 12 times per minute, a microbatch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution.

Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?

    Correct Answer: E

    Setting the trigger interval to 10 minutes would keep a VM allocated permanently, which uses resources inefficiently. Using the 'trigger once' option and scheduling the job to run every 10 minutes minimizes both compute and storage costs. The job takes a VM from the instance pool only when needed, runs, and returns the VM to the pool, so no resources sit idle while records are still processed within the required 10-minute window.
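As a rough sketch of the two configurations being compared, assuming PySpark and a Delta sink (neither is named in the question), the difference comes down to the argument passed to `DataStreamWriter.trigger`. The helper names `trigger_kwargs` and `start_stream` and both path parameters are hypothetical and exist only for illustration.

```python
def trigger_kwargs(mode: str) -> dict:
    """Return keyword arguments for DataStreamWriter.trigger().

    "interval" -> one microbatch every 10 minutes on a long-running cluster
    "once"     -> process all available data in a single batch, then stop
    """
    if mode == "interval":
        return {"processingTime": "10 minutes"}
    if mode == "once":
        return {"once": True}
    raise ValueError(f"unknown mode: {mode!r}")


def start_stream(df, target_path: str, checkpoint_path: str, mode: str):
    """Start a streaming write on `df`, which must be a streaming DataFrame.

    The Delta format and both paths are assumptions, not part of the question.
    """
    return (
        df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(**trigger_kwargs(mode))
        .start(target_path)
    )
```

With `mode="once"` the query stops after its single batch, so an external job schedule has to rerun it every 10 minutes; the checkpoint location lets each run resume where the previous one stopped.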

Discussion
aragorn_brego (Option: E)

Given that there are frequent microbatches with 0 records being processed, it indicates that the job is polling the source too often. Using the "trigger once" option would allow each microbatch to process all available data and then stop. By scheduling the job to run every 10 minutes, you ensure that the system is not constantly checking for new data when there is none, thus reducing the number of read operations from the source storage and potentially reducing costs associated with those reads.
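To put rough numbers on this, here is a back-of-the-envelope sketch that uses only the figures given in the question (at least 12 empty microbatches per minute and a 10-minute requirement). It counts batches only; how each batch translates into storage requests or cost is not stated in the question and is not modeled here.

```python
EMPTY_BATCHES_PER_MINUTE = 12   # lower bound given in the question
WINDOW_MINUTES = 10             # processing requirement from the question


def empty_batches_in_window(per_minute: int, minutes: int) -> int:
    """Minimum number of empty microbatches in one window."""
    return per_minute * minutes


current = empty_batches_in_window(EMPTY_BATCHES_PER_MINUTE, WINDOW_MINUTES)
scheduled_runs = 1  # one "trigger once" run per 10-minute window

print(current)         # 120
print(scheduled_runs)  # 1
```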

Gulenur_GS

In this case, why not C? A processing trigger of 10 minutes achieves the same thing, I guess.

Er5 (Option: E)

The requirement is that records "be processed in less than 10 minutes". With C, "set the trigger interval to 10 minutes" means processing time + interval > 10 minutes. E instead uses "trigger once" and will "execute the query every 10 minutes".
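The arithmetic behind this comment can be sketched as follows. It assumes the worst case, in which a record arrives just after a microbatch has started and therefore waits a full interval before being picked up; the 3 s figure is the upper bound on processing time from the question.

```python
INTERVAL_S = 10 * 60      # option C: 10-minute trigger interval
PROCESSING_S = 3          # question: each microbatch takes less than 3 s
REQUIREMENT_S = 10 * 60   # records must be processed in under 10 minutes

# Assumed worst case: wait one full interval, then one batch of processing.
worst_case_s = INTERVAL_S + PROCESSING_S
meets = worst_case_s < REQUIREMENT_S

print(worst_case_s)  # 603
print(meets)         # False
```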

vikram12apr (Option: E)

The default trigger interval is 0.5 seconds, so there are 120 triggers per minute. Each trigger takes 3 seconds to complete, so 120 × 3 = 360 seconds = 6 minutes, and the job completes in 6 minutes. That leaves a buffer of 4 minutes for compute spin-up, and since we are using instance pools, start-up time is reduced further. I think E is the correct option to decrease the cost.

hidelux (Option: E)

The question indicates that they are using instance pools for fast start-up. Option C would block a VM permanently, which is not intended. E will grab a VM, run the job, and return the VM to the pool so that it is available for the other jobs mentioned in the question.

spaceexplorer (Option: C)

C is more effective than E, as E will incur start-up time for spinning up a new job cluster.

alexvno (Option: C)

In production, records need to be processed in less than 10 minutes, so we need to schedule the trigger every 10 minutes.

ranith (Option: A)

The default trigger interval is 500 ms, but the question says the job processes batches with 0 records and the average processing time is 3 s. If the requirement is to process records in under 10 minutes, the best option here is to trigger every 3 s.

divingbell17 (Option: C)

Both C and E meet the requirement to reduce cloud storage cost. E further reduces compute cost; however, reducing compute cost is not a requirement in the question.