Exam: Certified Associate Developer for Apache Spark
Question 92

The code block shown below should efficiently perform a broadcast join of DataFrame storesDF and the much larger DataFrame employeesDF using key column storeId.

Choose the response that correctly fills in the numbered blanks within the code block to complete this task.

Code block:

__1__.join(__2__(__3__), "storeId")

    Correct Answer: A

    To perform a broadcast join efficiently, the smaller DataFrame should be the one that is broadcast. Here storesDF is the smaller DataFrame and employeesDF is much larger, so the completed code is employeesDF.join(broadcast(storesDF), "storeId"). The broadcast() hint asks Spark to send a full copy of storesDF to every executor, so the larger employeesDF can be joined where its partitions already sit instead of being shuffled across the cluster.
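
As a minimal sketch of the answer above, the join can be tried on a local SparkSession. Only the `storeId` key and the DataFrame names come from the question; the `storeName` and `employeeId` columns, the sample rows, and the `joinedDF` name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Local session for illustration only; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("broadcast-join-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: a small lookup table and a (notionally) much larger fact table.
val storesDF = Seq((1, "Downtown"), (2, "Airport")).toDF("storeId", "storeName")
val employeesDF = Seq((101, 1), (102, 1), (103, 2), (104, 3)).toDF("employeeId", "storeId")

// The larger DataFrame is the left side; the smaller one is wrapped in broadcast().
val joinedDF = employeesDF.join(broadcast(storesDF), "storeId")

// With the hint in place, the physical plan should contain a BroadcastHashJoin
// rather than a SortMergeJoin.
joinedDF.explain()
joinedDF.show()
```

Because this is an inner join on `storeId`, the hypothetical employee whose store has no match in `storesDF` drops out of the result.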

Discussion
ryanmu (Option: A)

The correct answer is A. storesDF is smaller and should be broadcast.

cookiemonster42

Agreed!

veli4ko (Option: A)

A is the correct answer!

Ram459 (Option: A)

The smaller dataset needs to be broadcast.

cookiemonster42 (Option: A)

The larger dataset has to be the initial (left) one, and the smaller one should be broadcast.

azure_bimonster (Option: A)

I would go with A, as storesDF is smaller and is the right one to broadcast.

thanab (Option: A)

The correct answer is A:

1. employeesDF
2. broadcast
3. storesDF

So the correct code would be:

```scala
employeesDF.join(broadcast(storesDF), "storeId")
```

This code performs a broadcast join of the DataFrame `storesDF` (which is smaller) with the much larger DataFrame `employeesDF` using the key column `storeId`. The `broadcast()` function marks a DataFrame to be broadcast when performing a join operation. The smaller DataFrame `storesDF` is broadcast to all nodes, where it is joined with the larger DataFrame `employeesDF`.