Certified Associate Developer for Apache Spark Exam - Question 73

Question

Which of the following operations can be used to return a DataFrame with no duplicate rows? Please select the most complete answer.

Examice · Accepted Answer

To return a DataFrame with no duplicate rows in PySpark, both DataFrame.distinct() and DataFrame.dropDuplicates() can be used. Although DataFrame.drop_duplicates() is commonly used in pandas, it is not typically used in PySpark for this purpose. Therefore, the most complete and accurate answer is the combination of DataFrame.dropDuplicates() and DataFrame.distinct().

thanab · Answer

B
The most complete answer is B. DataFrame.dropDuplicates() and DataFrame.distinct(). Both DataFrame.distinct() and DataFrame.dropDuplicates() methods in PySpark can be used to return a new DataFrame with duplicate rows removed. The DataFrame.drop_duplicates() method is used in pandas, not in PySpark.

azure_bimonster · Answer

The question asks for the "most complete" answer, so E would be correct, since all three of these operations work in PySpark.

Ahlo · Answer

Answer: E
drop_duplicates() is an alias for dropDuplicates(), so it also works in PySpark.
