Certified Data Engineer Professional Exam Questions

Certified Data Engineer Professional Exam - Question 49


A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

Correct Answer: BD

To get a more accurate measure of how code is likely to perform in production, you need to mimic the production environment as closely as possible. This means using production-sized data and production-sized clusters. Interactive execution in notebooks can be helpful during development, but without the same scale of data and resources, it won't provide an accurate representation of production performance. Running all cells together also ensures that any dependencies and sequential processing are respected, providing a more realistic performance measurement.
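
As a rough illustration of this idea, the sketch below times an end-to-end pipeline run by triggering a single write action against production-sized data, instead of timing individual display() calls cell by cell. The paths, column names, and transformations are placeholders rather than anything from the original question.

```python
# Minimal sketch: time the full pipeline with the action production would use
# (a Delta write), not display(). Paths and columns below are hypothetical.
import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically in Databricks

start = time.perf_counter()

# Read production-sized data (placeholder path).
df = spark.read.format("delta").load("/mnt/prod/events")

# Apply the same chain of transformations the pipeline would run.
result = (
    df.filter(F.col("event_date") >= "2023-01-01")
      .groupBy("user_id")
      .agg(F.count("*").alias("event_count"))
)

# Trigger execution with a write so the measurement covers the whole job.
result.write.format("delta").mode("overwrite").save("/mnt/test/perf_check")

elapsed = time.perf_counter() - start
print(f"End-to-end pipeline time: {elapsed:.1f}s")
```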

Discussion

14 comments
agreddy (Option: D)
Feb 21, 2024

D is the correct answer.
A. Scala is the only language accurately tested using notebooks: not true. Spark SQL and PySpark can be accurately tested in notebooks, and production performance doesn't depend solely on language choice.
B. Production-sized data and clusters are necessary: while ideal, this isn't always feasible during development. Smaller datasets and clusters can still provide indicative insights.
C. IDE and local Spark/Delta Lake: local environments won't fully replicate production's scale and configuration.
E. Jobs UI and Photon: it's true that Photon benefits scheduled jobs, but the Jobs UI can track execution times regardless of Photon usage. However, Jobs UI runs might involve additional overhead compared to notebook cells.
Option D addresses the specific limitations of using display() for performance measurement.

halleysg (Option: D)
Mar 6, 2024

D is correct

divingbell17 (Option: B)
Dec 31, 2023

"Calling display() forces a job to trigger" doesn't make sense. display() is used to show a DataFrame/table in tabular format; it has nothing to do with triggering a job.

guillesd
Feb 7, 2024

Actually, they mean a Spark job. This is true: whenever you call display(), Spark needs to execute the transformations up to that point to be able to collect the results.
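
For anyone new to Spark's lazy evaluation, here is a minimal sketch (using a synthetic spark.range DataFrame as a stand-in) of what guillesd describes: transformations only build a query plan, and it is the action at the end, such as display() or count(), that actually triggers the Spark job.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(100_000_000)                     # no Spark job yet: just a plan
doubled = df.withColumn("x2", F.col("id") * 2)    # still no job
filtered = doubled.filter(F.col("x2") % 3 == 0)   # still no job

# The action below forces Spark to execute the whole chain of transformations.
# In a Databricks notebook, display(filtered) triggers a job in the same way.
print(filtered.count())
```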

DAN_H (Option: D)
Feb 3, 2024

B doesn't talk about how to deal with the display() function. We know that to test performance for the whole notebook, we need to avoid using display(), since it is just a way to check the code and view the data.

tkg13 (Option: B)
Aug 24, 2023

Is it not B?

BrianNguyen95
Aug 27, 2023

Option B is one possibility. Option D captures the full meaning.

sturcu (Option: B)
Oct 16, 2023

Yes, D is a true statement, but it does not answer the question. The ask is for "which adjustments will get a more accurate measure of how code is likely to perform in production". Answer D just describes why the chosen approach is not correct; it does not provide a solution.

sturcu
Oct 16, 2023

D would be the answer if it were preceded by something like "we should avoid calling display() too often" or "clear the cache before running each cell".

rok21 (Option: B)
Dec 9, 2023

B is correct

ervinshang (Option: D)
Dec 20, 2023

D is correct

spaceexplorer (Option: D)
Jan 26, 2024

D is correct

zzzzx (Option: B)
Jan 31, 2024

B is correct

guillesd (Option: B)
Feb 7, 2024

Both B and D are correct statements. However, D is not an adjustment (see the question); it is just an assertion that happens to be correct. B, on the other hand, is an adjustment, and it will definitely help with profiling.

Curious76 (Option: D)
Feb 27, 2024

I will go with D

alexvno (Option: B)
Mar 13, 2024

Keep the environment size and data volumes as close to production as possible so the results make sense.

ffsdfdsfdsfdsfdsf (Option: B)
Mar 13, 2024

These people voting D have no reading comprehension.