Which of the following operations can be used to return a new DataFrame from DataFrame storesDF without inducing a shuffle?
storesDF.rdd.getNumPartitions() is a method that simply returns the number of partitions that the underlying RDD of the DataFrame has. This operation does not alter the DataFrame or cause any data movement or shuffle across the cluster. As such, it does not induce a shuffle and returns a new DataFrame reflecting the number of partitions without modifying the original data.
Though union does not cause a shuffle, you need another DataFrame to perform a union, and this question is limited to storesDF. coalesce(1) is the correct answer: it does not cause a shuffle but instead combines multiple partitions into one, i.e. reducing partitions = no shuffle. Execute storesDF.coalesce(1) and check the DAG.
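One way to follow the "check the DAG" suggestion is to look for an Exchange operator in the physical plan that explain() prints. The sketch below is only a heuristic under stated assumptions: the helper names plan_text and has_exchange are hypothetical, it assumes explain() writes the plan to Python's standard output, and it treats the word "Exchange" in the plan as the sign of a shuffle (a BroadcastExchange is deliberately not counted).

```python
import contextlib
import io
import re


def plan_text(df) -> str:
    """Capture whatever df.explain() prints (the physical plan) as a string.

    Assumption: explain() prints through Python's stdout.
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def has_exchange(df) -> bool:
    """Heuristic: True if the printed plan contains a standalone 'Exchange' node.

    The word boundary keeps 'BroadcastExchange' from matching, since a
    broadcast is not a shuffle of the DataFrame's own partitions.
    """
    return re.search(r"\bExchange\b", plan_text(df)) is not None


# Possible usage, assuming a PySpark DataFrame named storesDF already exists:
#   has_exchange(storesDF.coalesce(1))
#   has_exchange(storesDF.repartition(4))
#   has_exchange(storesDF.union(storesDF))
```

Running the three commented calls against a real storesDF and comparing the results is a quick way to test the claims made in this thread instead of taking any single comment on trust.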
The correct answer is D. coalesce() can be used to return a new DataFrame with a reduced number of partitions, without inducing a shuffle. A shuffle is an expensive operation that involves the redistribution of data across a cluster, so it's important to minimize its use whenever possible. In this case, repartition() and union() both involve shuffles, while intersect() returns only the common rows between two DataFrames, and rdd.getNumPartitions() returns the number of partitions in the RDD underlying the DataFrame.
union is the only operation mentioned here that won't cause a shuffle. And as @ZSun mentioned, do not follow any of the 4be8126 answers; they are all taken blindly from GPT.
I think this question contains an error: it should ask which operation causes a shuffle, not which one avoids it. union is a narrow transformation and does not cause a shuffle. coalesce simply combines partitions together into one and does not shuffle them. rdd.getNumPartitions just evaluates the number of partitions of a DataFrame, with no shuffle. Even for repartition(1), since there is only one partition in the end, it also does not cause a shuffle; it simply combines all partitions together. Therefore, the answer should be A, the only one inducing a shuffle; or B, C, D and E, all without inducing a shuffle.
The answer is C: union rather than coalesce. union is a narrow transformation. Unlike a wide transformation, a narrow transformation does not require a shuffle. coalesce is a wide transformation that combines multiple partitions into a smaller number of partitions. Doesn't this process require shuffling partitions together? If you ask ChatGPT, it will tell you what 4be8126 commented.
This is an incorrect explanation; delete it.
The problem with the union answer is that it returns an error if we run it without an argument.
Answer C: union. Narrow transformation: all transformation logic is performed within one partition. Wide transformation: a transformation that needs a shuffle/exchange, i.e. distribution of data to other partitions. union is a narrow transformation.
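The narrow/wide distinction can be illustrated with a toy model in plain Python. This is not Spark's implementation: partitions are modelled as lists of unique row ids, and the functions union, coalesce, repartition and max_fan_out are hypothetical stand-ins written for this sketch. A result is treated as narrow when every parent partition feeds exactly one child partition.

```python
from typing import List

Partitions = List[List[int]]


def union(a: Partitions, b: Partitions) -> Partitions:
    """Toy union: parent partitions are carried over unchanged, side by side."""
    return [list(p) for p in a] + [list(p) for p in b]


def coalesce(parts: Partitions, n: int) -> Partitions:
    """Toy coalesce: merge whole neighbouring partitions into at most n."""
    n = max(1, min(n, len(parts)))  # this model never increases the count
    out: Partitions = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        out[i * n // len(parts)].extend(p)  # a parent lands in one child only
    return out


def repartition(parts: Partitions, n: int) -> Partitions:
    """Toy repartition: deal rows round-robin, scattering each parent."""
    out: Partitions = [[] for _ in range(n)]
    k = 0
    for p in parts:
        for row in p:
            out[k % n].append(row)
            k += 1
    return out


def max_fan_out(parents: Partitions, children: Partitions) -> int:
    """Largest number of child partitions fed by any single parent partition.

    Assumes row ids are unique and at least one parent is non-empty.
    """
    where = {row: ci for ci, child in enumerate(children) for row in child}
    return max(len({where[r] for r in p}) for p in parents if p)


stores = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
other = [[12, 13], [14]]

print(max_fan_out(stores, coalesce(stores, 2)))             # 1: narrow
print(max_fan_out(stores + other, union(stores, other)))    # 1: narrow
print(max_fan_out(stores, repartition(stores, 2)))          # 2: wide
```

In this model coalesce and union keep each parent partition intact, while repartition spreads every parent across several children, which is the data movement a shuffle performs.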
C is correct; coalesce may induce a partial shuffle.