Certified Data Engineer Professional Exam - Question 73

Question

A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each microbatch of data is processed in less than 3s; at least 12 times per minute, a microbatch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution.

Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?

Examice · Accepted Answer

Setting the trigger interval to 10 minutes would block a VM permanently and could lead to inefficient use of resources. Using the 'trigger once' option and scheduling the job to execute every 10 minutes will minimize costs for both compute and storage. This method allows efficient use of instance pools by grabbing a VM only when needed, running the job, and returning the VM to the pool, ensuring no unnecessary resource usage while still processing records within the required 10-minute window.
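A minimal PySpark-style sketch of the "trigger once" write described above. The function name, the Delta format, and both paths are assumptions for illustration and do not come from the question; `df` is expected to be a streaming DataFrame.

```python
from typing import Any


def start_trigger_once_write(df: Any, target_path: str, checkpoint_path: str) -> Any:
    """Start a streaming write that processes all available data once, then stops.

    `df` is assumed to be a streaming PySpark DataFrame. The paths are
    hypothetical placeholders supplied by the caller.
    """
    return (
        df.writeStream
        .format("delta")  # assumed sink format, not stated in the question
        .option("checkpointLocation", checkpoint_path)  # records what was already processed
        .trigger(once=True)  # process everything available, then stop
        .start(target_path)
    )
```

In a scheduled job, the caller would typically call `awaitTermination()` on the returned query so the run ends once the data has been processed and the VM can go back to the pool. Newer Spark releases also offer `trigger(availableNow=True)`, which may be preferable where it is available.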

aragorn_brego · Answer

Given that there are frequent microbatches with 0 records being processed, it indicates that the job is polling the source too often. Using the "trigger once" option would allow each microbatch to process all available data and then stop. By scheduling the job to run every 10 minutes, you ensure that the system is not constantly checking for new data when there is none, thus reducing the number of read operations from the source storage and potentially reducing costs associated with those reads.
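To put rough numbers on the polling argument, here is a small Python sketch. It assumes the question's lower bound of 12 empty microbatches per minute holds around the clock, and that each empty microbatch still issues requests against cloud storage. The variable names are illustrative.

```python
# Lower bound from the question: at least 12 empty microbatches per minute.
EMPTY_BATCHES_PER_MINUTE = 12
MINUTES_PER_DAY = 24 * 60
SCHEDULE_INTERVAL_MINUTES = 10  # cadence proposed in the answers

# Empty microbatches per day under the current always-on configuration.
empty_batches_per_day = EMPTY_BATCHES_PER_MINUTE * MINUTES_PER_DAY

# Job runs per day if the query is triggered once every 10 minutes.
scheduled_runs_per_day = MINUTES_PER_DAY // SCHEDULE_INTERVAL_MINUTES

print(empty_batches_per_day)   # 17280
print(scheduled_runs_per_day)  # 144
```

Under these assumptions the always-on job performs at least 17,280 empty source checks per day, against 144 scheduled runs.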

alexvno · Answer

For production, records need to be processed in less than 10 minutes, so we need to schedule the job every 10 minutes.

spaceexplorer · Answer

C is more effective than E, as E will incur start-up time for spinning up a new job cluster.

hidelux · Answer

The question indicates that they are using instance pools for fast start-up time. Option C would block a VM permanently, which is not intended. E will grab a VM, run the job, and return it to the pool so that it is available for the other jobs mentioned in the question.

vikram12apr · Answer

The default trigger interval is 0.5 seconds,
so there are 120 triggers per minute.
Each trigger takes 3 seconds to complete:
120 * 3 = 360 seconds = 6 minutes.
Hence the job completes in 6 minutes.
That leaves a buffer of 4 minutes, which can be used for compute spin-up,
and since we are using instance pools, the start-up time is reduced further.
I think E is the correct option to decrease the cost.

Er5 · Answer

The requirement is for records "to be processed in less than 10 minutes".
C. "Set the trigger interval to 10 minutes" means processing time + interval > 10 minutes.
E. "Trigger once" and "execute the query every 10 minutes".
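The "processing time + interval" point can be sketched as a worst-case latency calculation. This assumes a record arrives just after a trigger fires and must wait a full interval before being picked up. The 3-second processing time is taken from the question's figure for current microbatches; a batch covering 10 minutes of data may take longer. The function name is hypothetical.

```python
def worst_case_latency_s(interval_s: float, processing_s: float) -> float:
    """Worst-case delay for a record that arrives just after a trigger fires.

    The record waits one full interval for the next trigger, then the
    microbatch itself takes `processing_s` seconds to complete.
    """
    return interval_s + processing_s


REQUIREMENT_S = 10 * 60  # records must be processed in under 10 minutes

latency = worst_case_latency_s(interval_s=10 * 60, processing_s=3)
print(latency)                  # 603
print(latency < REQUIREMENT_S)  # False
```

The same arithmetic applies to any fixed cadence, so this is a sketch of the argument as stated here, not a ruling between the options.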

divingbell17 · Answer

Both C and E meet the requirement to reduce cloud storage cost. E further reduces compute cost; however, reducing compute cost is not a requirement in the question.

ranith · Answer

The default trigger interval is 500ms, but the question says it processes batches with 0 records and the avg time to process is 3s. If the requirement is to process under 10 minutes the best option here is to trigger every 3s.
