Certified Associate Developer for Apache Spark Exam - Question 117


Which of the following describes the difference between DataFrame.repartition(n) and DataFrame.coalesce(n)?

Correct Answer: A

DataFrame.repartition(n) splits a DataFrame into n new partitions with the data distributed roughly evenly. DataFrame.coalesce(n) combines the existing partitions more quickly but might leave the data unevenly distributed across the resulting partitions. The repartition() method performs a full shuffle to redistribute the data, while coalesce() is an optimized way to reduce the number of partitions without a full shuffle, which is why its result can be uneven.
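
The even-versus-uneven outcome can be illustrated with a toy model in plain Python. This is only a sketch and not Spark's actual implementation: the helper names `toy_repartition` and `toy_coalesce` and the skewed input sizes are hypothetical, and the model assumes a full shuffle behaves like a round-robin redistribution of rows while a coalesce merges whole existing partitions.

```python
from itertools import chain


def toy_repartition(partitions, n):
    """Toy full shuffle: every row is redistributed round-robin into n new partitions."""
    rows = list(chain.from_iterable(partitions))
    return [rows[i::n] for i in range(n)]


def toy_coalesce(partitions, n):
    """Toy narrow merge: whole existing partitions are concatenated; rows are never split."""
    if n >= len(partitions):
        # Asking for more partitions than exist leaves the count unchanged.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Map adjacent input partitions onto the same output partition.
        merged[i * n // len(partitions)].extend(part)
    return merged


# Hypothetical skewed input: 4 partitions holding 100, 10, 10 and 10 rows.
sizes = [100, 10, 10, 10]
partitions = []
start = 0
for size in sizes:
    partitions.append(list(range(start, start + size)))
    start += size

repartitioned_sizes = [len(p) for p in toy_repartition(partitions, 2)]
coalesced_sizes = [len(p) for p in toy_coalesce(partitions, 2)]

print(repartitioned_sizes)  # [65, 65]
print(coalesced_sizes)      # [110, 20]
```

In this model the shuffle touches every row and ends up balanced, while the merge moves nothing between the partitions it combines and so inherits whatever skew the input already had.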

Discussion

5 comments
thanab (Option: A)
Sep 9, 2023

The correct answer is A. DataFrame.repartition(n) will split a DataFrame into n new partitions with data distributed evenly. DataFrame.coalesce(n) will more quickly combine the existing partitions of a DataFrame but might result in an uneven distribution of data across the new partitions. The `repartition()` method can be used to increase or decrease the number of partitions in a DataFrame, while the `coalesce()` method can only decrease the number of partitions, which it does efficiently. The `repartition()` method does a full shuffle and creates new partitions with data that's distributed evenly. On the other hand, `coalesce()` avoids a full shuffle by allowing only the reduction of partitions.

cookiemonster42 (Option: A)
Aug 2, 2023

IMO it's A:
B - repartition is less efficient because it involves shuffling --> false
C - same reason as B --> false
D - it's because of shuffling, not because of some column --> false
E - coalesce is the faster one --> false

gwq1968
Aug 6, 2023

A is correct

saryu (Option: A)
Feb 2, 2024

It's A

siva1280 (Option: A)
Mar 31, 2024

A is correct

SaiPavan10 (Option: A)
Apr 4, 2024

A is the right choice