Certified Associate Developer for Apache Spark Exam - Question 92


The code block shown below should efficiently perform a broadcast join of DataFrame storesDF and the much larger DataFrame employeesDF using key column storeId.

Choose the response that correctly fills in the numbered blanks within the code block to complete this task.

Code block:

__1__.join(__2__(__3__), "storeId")

Correct Answer: A

In a broadcast join, the smaller DataFrame is the one that should be broadcast. Here storesDF is the smaller DataFrame and employeesDF is much larger, so the completed code is employeesDF.join(broadcast(storesDF), "storeId"). Spark then sends a full copy of storesDF to every executor, so each partition of employeesDF can be joined locally without shuffling the much larger DataFrame across the cluster.
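The answer above can be tried out with a minimal sketch like the one below. It assumes a local Spark session (for example in spark-shell); the sample rows and the column names other than `storeId` are hypothetical and only serve to make the join observable.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Assumption: a local session is enough for this illustration.
// In spark-shell, getOrCreate() simply reuses the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("broadcast-join-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: a small lookup table of stores...
val storesDF = Seq(
  (1, "Downtown"),
  (2, "Airport")
).toDF("storeId", "storeName")

// ...and a stand-in for the "much larger" employees table.
val employeesDF = Seq(
  (100, 1, "Ana"),
  (101, 1, "Ben"),
  (102, 2, "Cy"),
  (103, 3, "Dee") // storeId 3 has no matching store
).toDF("employeeId", "storeId", "employeeName")

// The larger DataFrame calls join; broadcast() marks the smaller one
// so that Spark ships a copy of it to every executor.
val joinedDF = employeesDF.join(broadcast(storesDF), "storeId")

// Print the physical plan and the joined rows for inspection.
joinedDF.explain()
joinedDF.show()
```

With the hint in place, the plan printed by `explain()` would typically contain a broadcast hash join rather than a sort-merge join. Because the default join type is inner, the employee whose `storeId` has no matching store is dropped from the result.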

Discussion

6 comments
ryanmu (Option: A)
Jun 23, 2023

The correct answer is A. storesDF is the smaller DataFrame, so it should be broadcast.

cookiemonster42
Aug 2, 2023

Agreed!

cookiemonster42 (Option: A)
Aug 2, 2023

The larger dataset should be the one calling join, and the smaller one should be broadcast.

Ram459 (Option: A)
Aug 16, 2023

The smaller dataset needs to be broadcast.

veli4ko (Option: A)
Sep 29, 2023

A is the correct answer!

thanab (Option: A)
Sep 11, 2023

The correct answer is A:

1. employeesDF
2. broadcast
3. storesDF

So the correct code would be:

```scala
employeesDF.join(broadcast(storesDF), "storeId")
```

This code will perform a broadcast join of the DataFrame `storesDF` (which is smaller) with the much larger DataFrame `employeesDF` using the key column `storeId`. The `broadcast()` function is used to mark a DataFrame to be broadcast when performing a join operation. The smaller DataFrame `storesDF` is broadcasted to all nodes, where it's joined with the larger DataFrame `employeesDF`.

azure_bimonster (Option: A)
Feb 9, 2024

I would go with A, as storesDF is smaller and is the right one to broadcast.