Exam: Certified Associate Developer for Apache Spark
Question 73

Which of the following operations can be used to return a DataFrame with no duplicate rows? Please select the most complete answer.

    Correct Answer: B

    To return a DataFrame with no duplicate rows in PySpark, both DataFrame.distinct() and DataFrame.dropDuplicates() can be used. Although DataFrame.drop_duplicates() is commonly used in pandas, it is not typically used in PySpark for this purpose. Therefore, the most complete and accurate answer is the combination of DataFrame.dropDuplicates() and DataFrame.distinct().

Discussion
Ahlo (Option: E)

Answer E. drop_duplicates() is an alias for dropDuplicates(), so it also works in PySpark.

azure_bimonster (Option: E)

The question asks for the "most complete" answer, so E would be correct, as all three of these options work in PySpark.

thanab (Option: B)

The most complete answer is B: DataFrame.dropDuplicates() and DataFrame.distinct(). Both methods can be used in PySpark to return a new DataFrame with duplicate rows removed. The DataFrame.drop_duplicates() method is used in pandas, not in PySpark.

juadaves

It should be E; drop_duplicates() works in PySpark too.
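
The disagreement in this thread can be checked directly. Below is a minimal sketch, not an official reference: the helper name dedup_row_counts, the demo function, and the sample rows are all made up for illustration. It calls distinct(), dropDuplicates(), and drop_duplicates() on the same DataFrame and returns the row count after each call. It assumes a local PySpark installation.

```python
def dedup_row_counts(df):
    """Return the row count after each of the three de-duplication calls.

    `df` is expected to be a pyspark.sql.DataFrame.
    """
    return {
        "distinct": df.distinct().count(),
        "dropDuplicates": df.dropDuplicates().count(),
        # pyspark.sql.DataFrame exposes drop_duplicates as an alias of dropDuplicates.
        "drop_duplicates": df.drop_duplicates().count(),
    }


def demo():
    # Imported inside the function so the helper above carries no Spark dependency.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("dedup-demo").getOrCreate()
    )
    # Hypothetical sample data: three rows, two of them identical.
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "label"])
    counts = dedup_row_counts(df)
    spark.stop()
    return counts
```

Calling demo() on a machine with PySpark installed should report the same count, 2, under all three keys, since the sample data contains two distinct rows.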