Which of the following operations can be used to return a DataFrame with no duplicate rows? Please select the most complete answer.
To return a DataFrame with no duplicate rows in PySpark, both DataFrame.distinct() and DataFrame.dropDuplicates() can be used. DataFrame.drop_duplicates() is best known from pandas, but PySpark also provides it as an alias for DataFrame.dropDuplicates(). Therefore, the most complete and accurate answer is the one that includes all three: DataFrame.dropDuplicates(), DataFrame.drop_duplicates(), and DataFrame.distinct().
Answer E. drop_duplicates() is an alias for dropDuplicates(), so it also works in PySpark.
The question asks for the "most complete" answer, so E would be correct, as all three options work in PySpark.
B. The most complete answer is B: DataFrame.dropDuplicates() and DataFrame.distinct(). Both methods can be used in PySpark to return a new DataFrame with duplicate rows removed. The DataFrame.drop_duplicates() method is used in pandas, not in PySpark.
It should be E; drop_duplicates() works in PySpark too.