Certified Data Engineer Professional Exam - Question 66

Question

The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and produces logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

Examice · Accepted Answer

The %sh magic command in Databricks executes shell code on the driver node only. Commands in such a cell therefore do not use the distributed computing capabilities of Databricks, which spread work across multiple nodes for parallel processing. Operations such as the Git clone, the Python script execution, and the file move are limited to the resources of the driver node. This lack of parallelism is a likely reason why handling 1 GB of data takes so long.

aragorn_brego · Answer

When using %sh in a Databricks notebook, the commands are executed in a shell environment on the driver node. This means that only the resources of the driver node are used, and the execution does not leverage the distributed computing capabilities of the worker nodes in the Spark cluster. This can result in slower performance, especially for data-intensive tasks, compared to an approach that distributes the workload across all nodes in the cluster using Spark.
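A minimal sketch of the distributed approach described above, for contrast with a %sh cell. The original code is not shown, so everything specific here is an assumption: the JSON source format, the helper name load_with_spark, and the example path and table name are all hypothetical. It assumes the SparkSession that Databricks notebooks expose as spark.

```python
def load_with_spark(spark, source_path, target_table):
    """Read source files with Spark and append them to a table.

    Unlike a %sh cell, the read and write are planned by the driver
    but carried out as tasks on the cluster's executors.
    """
    # Assumption: the source data is JSON; the original format is not shown.
    df = spark.read.format("json").load(source_path)

    # Append the rows to the target table; executors write in parallel.
    df.write.mode("append").saveAsTable(target_table)
    return df


# Hypothetical usage inside a Databricks notebook, where `spark` is predefined:
# load_with_spark(spark, "/mnt/raw/events/", "bronze.events")
```

Whether this is actually faster depends on how the source files are laid out; a single large unsplittable file may still be read by one task.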

sturcu · Answer

%sh runs Bash commands on the driver node of the cluster.
https://www.databricks.com/blog/2020/08/31/introducing-the-databricks-web-terminal.html

Dileepvikram · Answer

E is the answer, as the command is run on the driver node and the other nodes in the cluster are not used.

sturcu · Answer

You can use mv with %sh, but the syntax is not correct: it is missing the destination operand.
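To illustrate the point about operands, here is a small sketch of a move that passes both a source and a destination, using dbutils.fs.mv instead of a shell mv. The helper name move_with_dbutils and the example paths are hypothetical, and dbutils is assumed to be the utilities object that Databricks notebooks provide.

```python
def move_with_dbutils(dbutils, source, destination):
    """Move a file with Databricks utilities, requiring both operands."""
    if not source or not destination:
        raise ValueError("both a source and a destination path are required")
    # dbutils.fs.mv takes the source path first and the destination second.
    return dbutils.fs.mv(source, destination)


# Hypothetical usage inside a Databricks notebook, where `dbutils` is predefined:
# move_with_dbutils(dbutils, "dbfs:/tmp/extract/data.json", "dbfs:/mnt/raw/data.json")
```

This only addresses the missing operand; a corrected move on its own does not distribute the rest of the workload.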

Freyr · Answer

Option E: Correct. The %sh magic command in Databricks runs shell commands on the driver node only. This means the operations within %sh do not leverage the distributed nature of the Databricks cluster. Consequently, the Git clone, Python script execution, and file move operations are all performed on a single node (the driver), which explains why it takes a long time to process and move 1 GB of data. This approach does not utilize the parallel processing capabilities of the worker nodes or the optimization features of Databricks Spark.

Option C: Incorrect. %sh does not inherently distribute any operations, but the issue here is broader than just file moving operations. Using %fs for file operations is a best practice, but it does not resolve the inefficiency of running all commands on the driver node.
