Exam: Certified Associate Developer for Apache Spark
Question 29

The code block shown contains an error. The code block is intended to return a new DataFrame where column sqft from DataFrame storesDF has had its missing values replaced with the value 30,000. Identify the error.

A sample of DataFrame storesDF is displayed below:

Code block:

storesDF.na.fill(30000, col("sqft"))

    Correct Answer: A

    The argument to the subset parameter of fill() should be a string column name or a list of string column names rather than a Column object. The correct way to replace missing values in the 'sqft' column with 30,000 in PySpark is to use a string for the column name, like so: storesDF.na.fill(30000, 'sqft'). This will correctly fill the missing values in the specified column.

Discussion
ZSun (Option: A)

The correct answer is A. Even in the most recent version, Spark 3.4, na.fill() still works; it is an alias of fillna(). Mr. 4be8126, you are really just making things up.

4be8126 (Option: E)

The correct answer is either A or E, depending on the version of Spark being used. In Spark 2.x, the correct method to replace missing values is na.fill(). Option A is correct in Spark 2.x, as it correctly specifies the column to apply the fill operation to using a Column object. However, in Spark 3.x, the method has been renamed to fillna(). Therefore, in Spark 3.x, the correct answer is E, as it uses the correct method name. Both A and E accomplish the same task of replacing missing values in the sqft column with 30,000, so either answer can be considered correct depending on the version of Spark being used.

peekaboo15 (Option: A)

The answer should be A. See this link for reference: https://sparkbyexamples.com/pyspark/pyspark-fillna-fill-replace-null-values/

azure_bimonster (Option: A)

The method does not need to be replaced here, so A is correct. In PySpark, both fillna() and na.fill() replace missing or null values in a DataFrame. Functionally they behave the same, so either can be used based on preference.

TC007 (Option: E)

The error in the code block is that the method na.fill() should be replaced by fillna() to fill the missing values in the column "sqft" with the value 30,000.