Exam: Certified Associate Developer for Apache Spark
Question 54

The code block below contains a logical error that makes it inefficient. It is intended to perform a broadcast join of DataFrame storesDF and the much larger DataFrame employeesDF on the key column storeId. Identify the logical error.

Code block:

storesDF.join(broadcast(employeesDF), "storeId")

    Correct Answer: A

    The logical error is that the larger DataFrame, employeesDF, is broadcast instead of the smaller DataFrame, storesDF. This defeats the purpose of a broadcast join, which sends a full copy of the smaller DataFrame to every worker node so that the larger DataFrame can be joined in place without being shuffled over the network. To perform the join efficiently, broadcast storesDF instead.
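To illustrate why the choice of broadcast side matters, here is a minimal plain-Python sketch of the idea behind a broadcast hash join. It is not Spark's implementation. The function name broadcast_hash_join, the sample rows, and the assumption of one partition per executor are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def broadcast_hash_join(partitions, small_rows, key):
    """Join every partition of the large side against a hash map of the small side."""
    # Build the lookup table once. Conceptually, this is the data that
    # would be copied to every executor in a broadcast join.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    for partition in partitions:  # the large side is processed where it already lives
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Hypothetical sample data: a small stores table and a larger, partitioned employees table.
stores = [{"storeId": 1, "city": "Oslo"}, {"storeId": 2, "city": "Lima"}]
employee_partitions = [
    [{"storeId": 1, "name": "Ana"}, {"storeId": 2, "name": "Bo"}],
    [{"storeId": 1, "name": "Cy"}, {"storeId": 3, "name": "Di"}],
]

result = broadcast_hash_join(employee_partitions, stores, "storeId")

# Rough count of row copies sent out, assuming one partition per executor.
num_executors = len(employee_partitions)
copies_if_small_broadcast = len(stores) * num_executors
copies_if_large_broadcast = sum(len(p) for p in employee_partitions) * num_executors
```

Under these assumptions, the broadcast side is copied once per executor, so the number of copied rows grows with the size of whichever table is broadcast. Broadcasting the larger table multiplies the larger row count, which is the inefficiency the question asks about.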

Discussion
4be8126 Option: A

The answer is A. The larger DataFrame, employeesDF, is broadcast instead of the smaller DataFrame, storesDF. A broadcast join is meant to copy the smaller DataFrame to all worker nodes so that the larger one does not have to be shuffled over the network, so the DataFrame to broadcast here is storesDF. The corrected code should be: broadcast(storesDF).join(employeesDF, "storeId")

juliom6 Option: A

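A is correct:

# https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.broadcast.html
# Assumes `spark` is an existing SparkSession (e.g. in a notebook or the pyspark shell).
from pyspark.sql import types
from pyspark.sql.functions import broadcast

# Larger DataFrame with a single column named "value"
df = spark.createDataFrame([1, 2, 3, 3, 4], types.IntegerType())
# Smaller DataFrame with a single column named "id" (values 0, 1, 2)
df_small = spark.range(3)
# Broadcast the smaller DataFrame, then join on value == id
df.join(broadcast(df_small), df.value == df_small.id).show()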