Certified Associate Developer for Apache Spark Exam


Question 54

The code block below contains a logical error that makes it inefficient. It is intended to perform an efficient broadcast join of DataFrame storesDF and the much larger DataFrame employeesDF on the key column storeId. Identify the logical error.

Code block:

storesDF.join(broadcast(employeesDF), "storeId")

A. The larger DataFrame employeesDF is being broadcasted rather than the smaller DataFrame storesDF.

B. There is never a need to call the broadcast() operation in Apache Spark 3.

C. The entire line of code should be wrapped in broadcast() rather than just DataFrame employeesDF.

D. The broadcast() operation will only perform a broadcast join if the Spark property spark.sql.autoBroadcastJoinThreshold is manually set.

E. Only one of the DataFrames is being broadcasted rather than both of the DataFrames.

Correct Answer: A

The logical error is that the larger DataFrame, employeesDF, is being broadcast instead of the smaller DataFrame, storesDF. This defeats the purpose of a broadcast join, which is to send a full copy of the smaller DataFrame to every executor so that the larger DataFrame can be joined where it already resides, without being shuffled over the network. To perform the join efficiently, broadcast storesDF instead.
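To make the reasoning above concrete, here is a minimal pure-Python sketch of the idea behind a broadcast hash join. It is not Spark's implementation: the function name broadcast_hash_join, the sample rows, and the "rows copied" counter are all hypothetical and exist only for illustration. The small side is turned into a lookup table and copied to each partition of the large side, whose rows never move.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Toy model: join every partition of the large side against a copy of the small side."""
    # Build the lookup table once from the small side. In a real broadcast
    # join, this is the relation that gets shipped to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    rows_copied = 0
    for partition in large_partitions:
        # Each partition receives its own copy of the small side...
        rows_copied += len(small_rows)
        # ...and the large side's rows are joined where they already sit.
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined, rows_copied


# Hypothetical data: 2 stores, 7 employees spread over 3 partitions.
stores = [{"storeId": 1, "city": "Oslo"}, {"storeId": 2, "city": "Lima"}]
employees = [
    [{"storeId": 1, "employeeId": 10}, {"storeId": 2, "employeeId": 11},
     {"storeId": 2, "employeeId": 12}],
    [{"storeId": 1, "employeeId": 13}, {"storeId": 1, "employeeId": 14}],
    [{"storeId": 2, "employeeId": 15}, {"storeId": 2, "employeeId": 16}],
]

joined, rows_copied = broadcast_hash_join(employees, stores, "storeId")

# Toy comparison: copying all employee rows to the same number of tasks instead.
total_employees = sum(len(p) for p in employees)
rows_copied_if_reversed = total_employees * len(employees)

print(len(joined), rows_copied, rows_copied_if_reversed)
```

In this toy model the cost of the broadcast grows with the size of the side being copied, which is why the smaller DataFrame is the one worth broadcasting.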

Discussion

4be8126 (Option: A)

The answer is A. The larger DataFrame, employeesDF, is being broadcast instead of the smaller DataFrame, storesDF. This defeats the purpose of a broadcast join, which is to optimize performance by sending the smaller DataFrame to all the worker nodes so that the larger one does not have to be shuffled over the network. The smaller DataFrame, which in this case is storesDF, should be the one that is broadcast. The corrected code should be:

broadcast(storesDF).join(employeesDF, "storeId")

juliom6 (Option: A)

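A is correct:

# https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.broadcast.html
# Assumes an active SparkSession named spark.
from pyspark.sql import types
from pyspark.sql.functions import broadcast

# The larger DataFrame has a single integer column named "value".
df = spark.createDataFrame([1, 2, 3, 3, 4], types.IntegerType())
# The smaller DataFrame has a single column named "id".
df_small = spark.range(3)
# Broadcast the smaller DataFrame and join it to the larger one.
df.join(broadcast(df_small), df.value == df_small.id).show()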