Certified Associate Developer for Apache Spark Exam - Question 29

Question

The code block shown contains an error. The code block is intended to return a new DataFrame where column sqft from DataFrame storesDF has had its missing values replaced with the value 30,000. Identify the error.

A sample of DataFrame storesDF is displayed below:

Code block:

storesDF.na.fill(30000, col("sqft"))

Examice · Accepted Answer

The argument to the subset parameter of fill() should be a string column name or a list of string column names rather than a Column object. The correct way to replace missing values in the 'sqft' column with 30,000 in PySpark is to use a string for the column name, like so: storesDF.na.fill(30000, 'sqft'). This will correctly fill the missing values in the specified column.
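A minimal sketch of the accepted fix, assuming a local PySpark installation. The sample rows and the helper names `fill_missing_sqft` and `run_demo` are made up for illustration, since the sample of storesDF is not reproduced in this thread. The pyspark imports sit inside `run_demo` so the helper itself does not need Spark to be importable. The Column-object call from the question should be rejected, but the exact exception type appears to vary by Spark version, so the sketch catches both TypeError and ValueError.

```python
def fill_missing_sqft(stores_df, value=30000):
    """Return a new DataFrame with nulls in 'sqft' replaced by `value`.

    The subset is passed as a string column name, not a Column object.
    """
    return stores_df.na.fill(value, subset="sqft")


def run_demo():
    """Build a tiny, made-up storesDF and try the working and failing calls."""
    # Imported here so the helper above does not require Spark to be importable.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.master("local[1]").appName("fill-demo").getOrCreate()

    # Hypothetical sample rows; the real storesDF has more columns.
    stores_df = spark.createDataFrame(
        [(0, 43161), (1, None), (2, 50501)],
        "storeId INT, sqft INT",
    )

    # String column name as the subset: the null sqft is filled.
    fill_missing_sqft(stores_df).show()

    try:
        # The call from the question: a Column object as the subset.
        stores_df.na.fill(30000, col("sqft"))
    except (TypeError, ValueError) as exc:
        # Assumption: the exception class depends on the Spark version.
        print("Rejected:", exc)

    spark.stop()
```

Calling `run_demo()` in an environment with pyspark installed runs both cases. Passing a list such as `["sqft"]` as the subset is the other accepted form.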

ZSun · Answer

The correct answer is A.
Even in the latest version, Spark 3.4, na.fill() still works; it is an alias of fillna().
Mr. 4be8126, you really are just making things up.

peekaboo15 · Answer

The answer should be A. See this link for reference:
https://sparkbyexamples.com/pyspark/pyspark-fillna-fill-replace-null-values/

4be8126 · Answer

The correct answer is either A or E, depending on the version of Spark being used.

In Spark 2.x, the correct method to replace missing values is na.fill(). Option A is correct in Spark 2.x, as it correctly specifies the column to apply the fill operation to using a Column object.

However, in Spark 3.x, the method has been renamed to fillna(). Therefore, in Spark 3.x, the correct answer is E, as it uses the correct method name.

Both A and E accomplish the same task of replacing missing values in the sqft column with 30,000, so either answer can be considered correct depending on the version of Spark being used.

TC007 · Answer

The error in the code block is that the method na.fill() should be replaced by fillna() to fill the missing values in the column "sqft" with the value 30,000.

azure_bimonster · Answer

We don't need any replacement here. A would be correct.

In PySpark, both fillna() and na.fill() replace missing or null values in a DataFrame. Functionally they do the same thing, so either can be used based on preference.
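To make the equivalence concrete, here is a small sketch that applies both spellings to the same DataFrame. The helper name `fill_both_ways` is hypothetical; it only relies on `DataFrame.fillna()` and `DataFrame.na.fill()`, which the PySpark documentation describes as aliases of each other.

```python
def fill_both_ways(stores_df, value=30000, column="sqft"):
    """Fill nulls in one column using both spellings of the same operation.

    Returns a pair (via_fillna, via_na_fill) so the caller can compare them.
    """
    # DataFrame.fillna(): method directly on the DataFrame.
    via_fillna = stores_df.fillna(value, subset=[column])
    # DataFrameNaFunctions.fill(): reached through the .na accessor.
    via_na_fill = stores_df.na.fill(value, subset=[column])
    return via_fillna, via_na_fill
```

With a real storesDF, one way to check the two results is `a, b = fill_both_ways(storesDF)` followed by `a.exceptAll(b).count()` and `b.exceptAll(a).count()`, both of which should be zero if the results match.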
