Certified Data Engineer Professional Exam Questions

Certified Data Engineer Professional Exam - Question 49


A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

Correct Answer: BD

To get a more accurate measure of how code is likely to perform in production, you need to mimic the production environment as closely as possible. This means using production-sized data and production-sized clusters. Interactive execution in notebooks can be helpful during development, but without the same scale of data and resources, it won't provide an accurate representation of production performance. Running all cells together also ensures that any dependencies and sequential processing are respected, providing a more realistic performance measurement.
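
As a rough illustration of this idea, the sketch below times an end-to-end pipeline run by triggering a single write action against production-sized data, instead of timing individual display() calls cell by cell. The paths, column names, and transformations are placeholders rather than anything from the original question.

```python
# Minimal sketch: time the full pipeline with the action production would use
# (a Delta write), not display(). Paths and columns below are hypothetical.
import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically in Databricks

start = time.perf_counter()

# Read production-sized data (placeholder path).
df = spark.read.format("delta").load("/mnt/prod/events")

# Apply the same chain of transformations the pipeline would run.
result = (
    df.filter(F.col("event_date") >= "2023-01-01")
      .groupBy("user_id")
      .agg(F.count("*").alias("event_count"))
)

# Trigger execution with a write so the measurement covers the whole job.
result.write.format("delta").mode("overwrite").save("/mnt/test/perf_check")

elapsed = time.perf_counter() - start
print(f"End-to-end pipeline time: {elapsed:.1f}s")
```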

Discussion

14 comments
agreddy (Option: D)
Feb 21, 2024

D is the correct answer.
A. Scala is the only language accurately tested using notebooks: not true. Spark SQL and PySpark can be accurately tested in notebooks, and production performance doesn't depend solely on language choice.
B. Production-sized data and clusters are necessary: while ideal, this isn't always feasible during development. Smaller datasets and clusters can still provide indicative insights.
C. IDE and local Spark/Delta Lake: local environments won't fully replicate production's scale and configuration.
E. Jobs UI and Photon: it's true that Photon benefits scheduled jobs, but the Jobs UI can track execution times regardless of Photon usage. However, Jobs UI runs might involve additional overhead compared to notebook cells.
Option D addresses the specific limitations of using display() for performance measurement.

halleysg (Option: D)
Mar 6, 2024

D is correct

divingbell17 (Option: B)
Dec 31, 2023

"Calling display() forces a job to trigger" doesn't make sense. display() is used to show a DataFrame/table in tabular format; it has nothing to do with triggering a job.

guillesd
Feb 7, 2024

Actually, they mean a Spark job. This is true: whenever you call display(), Spark needs to execute the transformations up to that point to be able to collect the results.
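
For anyone new to Spark's lazy evaluation, here is a minimal sketch (using a synthetic spark.range DataFrame as a stand-in) of what guillesd describes: transformations only build a query plan, and it is the action at the end, such as display() or count(), that actually triggers the Spark job.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(100_000_000)                     # no Spark job yet: just a plan
doubled = df.withColumn("x2", F.col("id") * 2)    # still no job
filtered = doubled.filter(F.col("x2") % 3 == 0)   # still no job

# The action below forces Spark to execute the whole chain of transformations.
# In a Databricks notebook, display(filtered) triggers a job in the same way.
print(filtered.count())
```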

DAN_H (Option: D)
Feb 3, 2024

B doesn't talk about how to deal with the display() function. We know that to test performance for the whole notebook, we need to avoid using display(), since it is just a way to check the code and view the data.

tkg13 (Option: B)
Aug 24, 2023

Is it not B?

BrianNguyen95
Aug 27, 2023

Option B is one possibility. Option D captures the full meaning.

sturcu (Option: B)
Oct 16, 2023

Yes, D is a true statement, but it does not answer the question. The ask is for "which adjustments will get a more accurate measure of how code is likely to perform in production". Answer D just describes why the chosen approach is not correct; it does not provide a solution.

sturcu
Oct 16, 2023

D would be the answer if it were preceded by something like "we should avoid calling display() too often" or "clear the cache before running each cell".

rok21 (Option: B)
Dec 9, 2023

B is correct

ervinshang (Option: D)
Dec 20, 2023

D is correct

spaceexplorer (Option: D)
Jan 26, 2024

D is correct

zzzzx (Option: B)
Jan 31, 2024

B is correct

guillesd (Option: B)
Feb 7, 2024

Both B and D are correct statements. However, D is not an adjustment (see the question); it is just an assertion that happens to be correct. B, on the other hand, is an adjustment, and it will definitely help with profiling.

Curious76 (Option: D)
Feb 27, 2024

I will go with D

alexvno (Option: B)
Mar 13, 2024

Keep the environment size and data volumes as close to production as possible so the results make sense.

ffsdfdsfdsfdsfdsf (Option: B)
Mar 13, 2024

These people voting D have no reading comprehension.