Certified Data Engineer Professional Exam Questions

Certified Data Engineer Professional Exam - Question 62


The business reporting team requires that the data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for their dashboards completes in 10 minutes.

Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?

Correct Answer: B

The best option to meet the service-level agreement requirements with the lowest cost is to schedule a job to execute the pipeline once an hour on a new job cluster. This approach ensures that the data is updated every hour, meeting the requirement. Additionally, using a job cluster that is started and stopped for each run is more cost-effective than keeping a cluster running continuously, as it only incurs compute costs for the actual processing time, which is 10 minutes per hour.
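For illustration, a minimal sketch of that configuration, assuming the Databricks Jobs API 2.1 create endpoint; the workspace URL, token, notebook path, and cluster sizing below are placeholders, not values from the question:

```python
# Sketch: create an hourly scheduled job that runs on a new (ephemeral) job cluster.
# All workspace-specific values are placeholders.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                   # placeholder

job_spec = {
    "name": "hourly-dashboard-etl",
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Repos/etl/dashboard_pipeline"},  # hypothetical path
            "new_cluster": {                      # job cluster: created per run, terminated when the run ends
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",  # top of every hour (Quartz syntax)
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```

Because the job cluster is created for each run and terminated when the run finishes, compute is billed only for the roughly 10 minutes of processing each hour rather than for an always-on cluster.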

Discussion

6 comments
divingbell17 (Option: B)
Jan 1, 2024

B is correct, I think. With option C, the cluster remains on 24/7 with a 60-minute trigger, which is more costly. If there were an option using Structured Streaming with trigger = availableNow and the job scheduled hourly, that would be even more efficient. https://www.databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
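A minimal sketch of that availableNow pattern, assuming PySpark 3.3+ (where the availableNow trigger exists); the source table, target table, and checkpoint path are hypothetical:

```python
# The hourly-scheduled job starts, processes whatever new data has arrived,
# then stops so the job cluster can terminate. Names and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = (
    spark.readStream
    .table("raw_transactions")                                    # hypothetical Delta source table
    .writeStream
    .trigger(availableNow=True)                                   # consume all available data, then stop
    .option("checkpointLocation", "/checkpoints/dashboard_etl")   # hypothetical checkpoint path
    .toTable("dashboard_gold")                                    # hypothetical target table
)

query.awaitTermination()  # returns once all currently available data has been processed
```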

alexvno (Option: B)
Dec 18, 2023

B: Job cluster is cheap, hourly = 60 minutes

spaceexplorer (Option: B)
Jan 24, 2024

B is correct

Curious76 (Option: C)
Feb 26, 2024

Databricks recommends using Structured Streaming with trigger AvailableNow for incremental workloads that do not have low latency requirements.

ofed (Option: B)
Nov 16, 2023

It's either B or D. I think B, because we want the lowest cost.

aragorn_brego (Option: B)
Nov 21, 2023

Scheduling a job to execute the pipeline on an hourly basis aligns with the requirement for data to be updated every hour. Using a job cluster (which is brought up for the job and torn down upon completion) rather than a dedicated interactive cluster will usually be more cost-effective. This is because you are only paying for the compute resources when the job is running, which is 10 minutes out of every hour, rather than paying for an interactive cluster that would be up and running (and incurring costs) continuously.