Certified Associate Developer for Apache Spark Exam - Question 117

Question

Which of the following describes the difference between DataFrame.repartition(n) and DataFrame.coalesce(n)?

Examice · Accepted Answer

DataFrame.repartition(n) will split a DataFrame into n number of new partitions with data distributed evenly. DataFrame.coalesce(n) will more quickly combine the existing partitions of a DataFrame but might result in an uneven distribution of data across the new partitions. The repartition() method involves a full shuffle to redistribute data evenly, while coalesce() is an optimized way to reduce the number of partitions without a full shuffle, which can result in uneven data distribution.

thanab · Answer

A
The correct answer is A. DataFrame.repartition(n) will split a DataFrame into n number of new partitions with data distributed evenly. DataFrame.coalesce(n) will more quickly combine the existing partitions of a DataFrame but might result in an uneven distribution of data across the new partitions. The `repartition()` method can be used to increase or decrease the number of partitions in a DataFrame, while the `coalesce()` method can only decrease the number of partitions, and does so efficiently. The `repartition()` method does a full shuffle and creates new partitions with data that's distributed evenly. On the other hand, `coalesce()` avoids a full shuffle by merging existing partitions, which is why it can only reduce their number.
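
To make the distinction concrete, here is a toy model in plain Python. It is only a sketch and not Spark's actual implementation: the helper names `toy_repartition` and `toy_coalesce` are made up for illustration, a "partition" is just a list of rows, and the round-robin dealing and index-based grouping are simplifying assumptions. It shows why a full shuffle can even out skewed data while merging whole partitions cannot, and why merging can never increase the partition count.

```python
from typing import List

Partition = List[int]


def toy_repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy full shuffle: every row may move; rows are dealt round-robin into n new partitions."""
    new_parts: List[Partition] = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        new_parts[i % n].append(row)
    return new_parts


def toy_coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy narrow merge: whole existing partitions are concatenated; none is ever split."""
    total = len(partitions)
    if n >= total:
        # Merging cannot create partitions, so the count stays as it is.
        return [list(part) for part in partitions]
    new_parts: List[Partition] = [[] for _ in range(n)]
    for idx, part in enumerate(partitions):
        # Assumption: neighbouring partitions are grouped by index.
        new_parts[idx * n // total].extend(part)
    return new_parts


# A skewed input: four partitions holding 7, 1, 1 and 1 rows.
skewed = [list(range(0, 7)), [7], [8], [9]]

even = toy_repartition(skewed, 2)
merged = toy_coalesce(skewed, 2)

print([len(p) for p in even])           # [5, 5]  evenly spread after the shuffle
print([len(p) for p in merged])         # [8, 2]  skew survives the merge
print(len(toy_coalesce(skewed, 8)))     # 4       asking for more partitions has no effect
print(len(toy_repartition(skewed, 8)))  # 8       the shuffle can increase the count
```

In real PySpark the partition count of a DataFrame can be checked with `df.rdd.getNumPartitions()` before and after calling `df.repartition(n)` or `df.coalesce(n)`.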

cookiemonster42 · Answer

IMO it's A:
B - repartition is less efficient because it involves shuffling --> false
C - same reason as B --> false
D - it's because of shuffling, not because of some column --> false
E - coalesce is the faster one --> false

saryu · Answer

It's A

siva1280 · Answer

A is correct

SaiPavan10 · Answer

A is the right choice
