Certified Data Engineer Professional Exam Questions

Certified Data Engineer Professional Exam - Question 62


The business reporting team requires that the data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for their dashboards completes in 10 minutes.

Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?

Correct Answer: B

The best option to meet the service-level agreement requirements with the lowest cost is to schedule a job to execute the pipeline once an hour on a new job cluster. This approach ensures that the data is updated every hour, meeting the requirement. Additionally, using a job cluster that is started and stopped for each run is more cost-effective than keeping a cluster running continuously, as it only incurs compute costs for the actual processing time, which is 10 minutes per hour.
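For illustration, a minimal sketch of that configuration, assuming the Databricks Jobs API 2.1 create endpoint; the workspace URL, token, notebook path, and cluster sizing below are placeholders, not values from the question:

```python
# Sketch: create an hourly scheduled job that runs on a new (ephemeral) job cluster.
# All workspace-specific values are placeholders.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                   # placeholder

job_spec = {
    "name": "hourly-dashboard-etl",
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Repos/etl/dashboard_pipeline"},  # hypothetical path
            "new_cluster": {                      # job cluster: created per run, terminated when the run ends
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",  # top of every hour (Quartz syntax)
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```

Because the job cluster is created for each run and terminated when the run finishes, compute is billed only for the roughly 10 minutes of processing each hour rather than for an always-on cluster.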

Discussion

6 comments
divingbell17 (Option: B)
Jan 1, 2024

B is correct, I think. With option C, the cluster remains on 24/7 with a 60-minute trigger, which is more costly. If there were an option using Structured Streaming with trigger = availableNow and the job scheduled hourly, that would be even more efficient. https://www.databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
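A minimal sketch of that availableNow pattern, assuming PySpark 3.3+ (where the availableNow trigger exists); the source table, target table, and checkpoint path are hypothetical:

```python
# The hourly-scheduled job starts, processes whatever new data has arrived,
# then stops so the job cluster can terminate. Names and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = (
    spark.readStream
    .table("raw_transactions")                                    # hypothetical Delta source table
    .writeStream
    .trigger(availableNow=True)                                   # consume all available data, then stop
    .option("checkpointLocation", "/checkpoints/dashboard_etl")   # hypothetical checkpoint path
    .toTable("dashboard_gold")                                    # hypothetical target table
)

query.awaitTermination()  # returns once all currently available data has been processed
```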

alexvno (Option: B)
Dec 18, 2023

B: Job cluster is cheap, hourly = 60 minutes

spaceexplorer (Option: B)
Jan 24, 2024

B is correct

Curious76 (Option: C)
Feb 26, 2024

Databricks recommends using Structured Streaming with trigger AvailableNow for incremental workloads that do not have low latency requirements.

ofed (Option: B)
Nov 16, 2023

It's either B or D. I think B, because we want the lowest cost.

aragorn_brego (Option: B)
Nov 21, 2023

Scheduling a job to execute the pipeline on an hourly basis aligns with the requirement for data to be updated every hour. Using a job cluster (which is brought up for the job and torn down upon completion) rather than a dedicated interactive cluster will usually be more cost-effective. This is because you are only paying for the compute resources when the job is running, which is 10 minutes out of every hour, rather than paying for an interactive cluster that would be up and running (and incurring costs) continuously.