Exam: Certified Data Engineer Professional
Question 49

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

    Correct Answer: B

    To get a more accurate measure of how code is likely to perform in production, you need to mimic the production environment as closely as possible. This means using production-sized data and production-sized clusters. Interactive execution in notebooks can be helpful during development, but without the same scale of data and resources, it won't provide an accurate representation of production performance. Running all cells together also ensures that any dependencies and sequential processing are respected, providing a more realistic performance measurement.
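
As a rough illustration, here is a minimal PySpark sketch of timing a pipeline end-to-end with a production-style action (a table write) instead of per-cell display() calls; the table and column names are placeholders, and spark is the session a Databricks notebook provides:

    import time
    from pyspark.sql import functions as F

    start = time.perf_counter()

    # Build the pipeline against production-sized input (placeholder table name).
    df = (spark.read.table("sales_raw")
            .filter(F.col("status") == "complete")
            .groupBy("region")
            .agg(F.sum("amount").alias("total")))

    # Force full execution with the same kind of action production uses (a write),
    # rather than display(), which may only materialize a preview of rows.
    df.write.mode("overwrite").saveAsTable("sales_summary")

    print(f"End-to-end pipeline time: {time.perf_counter() - start:.1f}s")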

Discussion
agreddy (Option: D)

D is the correct answer.

A. Scala is the only language accurately tested using notebooks: not true. Spark SQL and PySpark can be tested in notebooks just as well, and production performance doesn't depend solely on language choice.
B. Production-sized data and clusters: ideal, but not always feasible during development; smaller datasets and clusters can still give indicative insights.
C. IDE with local Spark/Delta Lake: a local environment won't fully replicate production's scale and configuration.
E. Jobs UI and Photon: Photon does benefit scheduled jobs, but the Jobs UI tracks execution times regardless of Photon usage, and job runs can add overhead compared to notebook cells.

Option D addresses the specific limitation of using display() for performance measurement.

halleysg (Option: D)

D is correct

DAN_H (Option: D)

B doesn't address how to deal with the display() function. To measure performance for the whole notebook you need to avoid display(), since it is only there to verify the code and show the data.

divingbell17 (Option: B)

"Calling display() forces a job to trigger" doesn't make sense; display is used to show a DataFrame/table in tabular format and has nothing to do with triggering a job.

guillesd

Actually they mean a Spark job, and that is true: whenever you call display, Spark needs to execute the transformations up to that point in order to collect the results.
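
A minimal sketch of that laziness, assuming a Databricks notebook where spark and display() are predefined and "events" is a placeholder table name:

    from pyspark.sql import functions as F

    # Transformations alone are lazy -- no Spark job runs at this point.
    df = (spark.read.table("events")
            .withColumn("day", F.to_date("timestamp"))
            .groupBy("day")
            .count())

    # display(df) must collect rows to render, so Spark executes the plan up
    # to this point. In plain PySpark the equivalent would be an action such
    # as df.show() or df.count().
    display(df)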

ffsdfdsfdsfdsfdsf (Option: B)

These people voting D have no reading comprehension.

alexvno (Option: B)

Keep the environment size and data volumes as close to production as possible so the results make sense.

Curious76 (Option: D)

I will go with D

guillesd (Option: B)

Both B and D are correct statements. However, D is not an adjustment (see the question); it is just an assertion that happens to be correct. B, on the other hand, is an adjustment, and it will definitely help with profiling.

zzzzx (Option: B)

B is correct

spaceexplorer (Option: D)

D is correct

ervinshang (Option: D)

D is correct

rok21 (Option: B)

B is correct

sturcu (Option: B)

Yes, D is a true statement, but it does not answer the question. The ask is "which adjustments will get a more accurate measure of how code is likely to perform in production". Answer D only describes why the current approach is flawed; it does not provide a solution.

sturcu

D would be the answer if it were preceded by something like: we should avoid calling display() too often, or clear the cache before running each cell.
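
As a rough, hypothetical sketch of the "clear the cache before running each cell" idea (the table name is a placeholder; clearCache() is the standard PySpark catalog call):

    import time

    df = spark.read.table("events")      # placeholder table name

    spark.catalog.clearCache()           # drop cached tables/DataFrames first
    start = time.perf_counter()
    n = df.count()                       # an action that forces full execution
    print(f"Cold-cache run: {time.perf_counter() - start:.1f}s over {n} rows")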

tkg13 (Option: B)

Is it not B?

BrianNguyen95

Option B is just one possibility. Option D captures the full meaning.