Certified Associate Developer for Apache Spark Exam - Question 92

Question

The code block shown below should efficiently perform a broadcast join of DataFrame storesDF and the much larger DataFrame employeesDF using key column storeId.

Choose the response that correctly fills in the numbered blanks within the code block to complete this task.

Code block:

__1__.join(__2__(__3__), "storeId")

Examice · Accepted Answer

To perform a broadcast join efficiently, broadcast the smaller DataFrame. Here storesDF is the smaller one and employeesDF is much larger, so the correct completion is employeesDF.join(broadcast(storesDF), "storeId"). This sends a full copy of storesDF to every executor, so the join can run without shuffling the larger employeesDF across the cluster.
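To illustrate why broadcasting the smaller side is the cheap option, here is a minimal conceptual sketch in plain Scala collections. It is not Spark's implementation: the names `Store`, `Employee`, `broadcastHashJoin`, and the sample rows are all hypothetical, and the nested sequences only stand in for partitions of the larger DataFrame.

```scala
object BroadcastJoinConcept {
  // Hypothetical row types standing in for storesDF and employeesDF.
  final case class Store(storeId: Int, storeName: String)
  final case class Employee(employeeId: Int, storeId: Int)

  // Inner join on storeId: the small side is turned into one lookup table,
  // and every partition of the large side probes that table locally.
  def broadcastHashJoin(
      largePartitions: Seq[Seq[Employee]],
      small: Seq[Store]
  ): Seq[(Employee, Store)] = {
    // "Broadcast" step: build the lookup table once from the whole small side.
    val lookup: Map[Int, Seq[Store]] = small.groupBy(_.storeId)

    // Probe step: rows of the large side stay in their own partition.
    largePartitions.flatMap { partition =>
      for {
        employee <- partition
        store <- lookup.getOrElse(employee.storeId, Seq.empty)
      } yield (employee, store)
    }
  }

  def main(args: Array[String]): Unit = {
    val stores = Seq(Store(1, "Downtown"), Store(2, "Airport"))
    val employeePartitions = Seq(
      Seq(Employee(101, 1), Employee(102, 2)),
      Seq(Employee(103, 1), Employee(104, 3)) // storeId 3 has no matching store
    )
    broadcastHashJoin(employeePartitions, stores).foreach(println)
  }
}
```

The point of the sketch is that only the small collection is copied; the large one is read in place. Swapping the roles would mean copying the large side everywhere, which is what the question's wording ("efficiently") rules out.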

ryanmu · Answer

Correct answer is A. storesDF is smaller and should be broadcasted.

cookiemonster42 · Answer

The larger DataFrame should be the one join is called on, and the smaller one should be broadcast.

Ram459 · Answer

The smaller dataset needs to be broadcast.

veli4ko · Answer

A is the correct answer!

thanab · Answer

The correct answer is A:

1. employeesDF
2. broadcast
3. storesDF

So the correct code would be:

```scala
import org.apache.spark.sql.functions.broadcast
employeesDF.join(broadcast(storesDF), "storeId")
```

This code performs a broadcast join of the smaller DataFrame `storesDF` with the much larger DataFrame `employeesDF` on the key column `storeId`. The `broadcast()` function marks a DataFrame to be broadcast when it takes part in a join. The smaller DataFrame `storesDF` is sent to every executor, where it is joined with the local partitions of `employeesDF`.
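For anyone who wants to try this locally, a minimal runnable sketch follows. It assumes Spark SQL is on the classpath and uses a local session; the object name, the helper `joinEmployeesToStores`, the column names other than `storeId`, and the sample rows are all hypothetical and not part of the exam question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  // Hypothetical helper: the larger DataFrame calls join,
  // the smaller one is wrapped in broadcast().
  def joinEmployeesToStores(employeesDF: DataFrame, storesDF: DataFrame): DataFrame =
    employeesDF.join(broadcast(storesDF), "storeId")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .master("local[*]") // assumption: local run, no cluster
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data for illustration only.
    val storesDF = Seq((1, "Downtown"), (2, "Airport")).toDF("storeId", "storeName")
    val employeesDF = Seq((101, 1), (102, 1), (103, 2)).toDF("employeeId", "storeId")

    val joinedDF = joinEmployeesToStores(employeesDF, storesDF)

    // Print the physical plan; with the broadcast hint it would typically
    // show a BroadcastHashJoin rather than a SortMergeJoin.
    joinedDF.explain()
    joinedDF.show()

    spark.stop()
  }
}
```

Checking the plan with `explain()` is a quick way to confirm that the hint was picked up, since with tiny sample tables Spark may choose a broadcast join on its own anyway.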

azure_bimonster · Answer

I would go with A, as storesDF is smaller and is the right one to broadcast.
