Certified Data Engineer Professional Exam - Question 66


The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and produces logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

Correct Answer: E

The %sh magic command in Databricks executes shell code on the driver node only. Commands in such a block therefore cannot use the cluster's distributed computing capabilities, which rely on multiple worker nodes processing data in parallel. The Git clone, the Python script execution, and the file move are all limited to the resources of the driver node. This lack of parallelism is a likely reason for the long execution time when handling 1 GB of data.
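As a rough illustration of the point above, the following Python sketch launches a shell command from a Python process and checks that the shell is an ordinary child process of that interpreter, running on the same machine. It is not the code from the question (which is not shown here) and it calls no Databricks API. The helper name `shell_parent_pid` is hypothetical, and the sketch assumes a POSIX `sh` is available.

```python
import os
import subprocess


def shell_parent_pid() -> int:
    """Run a shell command locally and return the shell's parent PID."""
    # `sh -c 'echo $PPID'` prints the PID of the process that started the shell.
    result = subprocess.run(
        ["sh", "-c", "echo $PPID"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    # The shell was started by this Python process, so it runs on the same
    # machine and competes for the same CPU, memory, and disk. No other
    # node takes part in the work.
    print(shell_parent_pid() == os.getpid())
```

Shell commands issued from a notebook behave in a comparable way: they are local processes on one machine, so adding worker nodes to the cluster does not make them faster.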

Discussion

5 comments
aragorn_brego - Option: E
Nov 21, 2023

When using %sh in a Databricks notebook, the commands are executed in a shell environment on the driver node. This means that only the resources of the driver node are used, and the execution does not leverage the distributed computing capabilities of the worker nodes in the Spark cluster. This can result in slower performance, especially for data-intensive tasks, compared to an approach that distributes the workload across all nodes in the cluster using Spark.

sturcu - Option: E
Oct 30, 2023

%sh runs Bash commands on the driver node of the cluster. See https://www.databricks.com/blog/2020/08/31/introducing-the-databricks-web-terminal.html

Dileepvikram - Option: E
Nov 9, 2023

E is the answer, as the command is run on the driver node and the other nodes in the cluster are not used.

sturcu - Option: C
Oct 24, 2023

You can use mv with %sh, but the syntax is not correct: it is missing the destination operand.

sturcu
Oct 30, 2023

I just noticed there is a space between the paths, so the syntax is correct.

Freyr - Option: E
May 28, 2024

Option E: Correct. The %sh magic command in Databricks runs shell commands on the driver node only. This means the operations within %sh do not leverage the distributed nature of the Databricks cluster. Consequently, the Git clone, Python script execution, and file move operations are all performed on a single node (the driver), which explains why it takes a long time to process and move 1 GB of data. This approach does not utilize the parallel processing capabilities of the worker nodes or the optimization features of Databricks Spark.
Option C: Incorrect. %sh does not inherently distribute any operations, but the issue here is broader than just file moving operations. Using %fs for file operations is a best practice, but it does not resolve the inefficiency of running all commands on the driver node.
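For contrast with the driver-only approach discussed in this comment, here is a minimal sketch of how an extract-and-load step might be expressed with the Spark DataFrame API so that the worker nodes share the work. The function name `copy_with_spark`, the paths, and the default source format are all placeholders chosen for illustration; none of them come from the question, whose code is not shown. The sketch assumes an existing SparkSession is passed in as `spark` and that the cluster can write the Delta format.

```python
def copy_with_spark(spark, source_path, target_path, source_format="json"):
    """Read source files with Spark and write them out in Delta format.

    `spark` is an existing SparkSession (Databricks notebooks provide one).
    The paths and the source format are placeholders, not values from the
    question.
    """
    # The read is planned lazily; the files are split into partitions that
    # executors on the worker nodes process in parallel.
    df = spark.read.format(source_format).load(source_path)

    # The write triggers the job. Each executor writes its own partitions,
    # so the data does not have to pass through the driver.
    df.write.format("delta").mode("overwrite").save(target_path)
    return df
```

In a notebook this would be called as something like `copy_with_spark(spark, "/path/to/source", "/path/to/target")`, with both paths replaced by real locations that every node in the cluster can reach.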