
Certified Associate Developer for Apache Spark Exam - Question 29


The code block shown contains an error. The code block is intended to return a new DataFrame where column sqft from DataFrame storesDF has had its missing values replaced with the value 30,000. Identify the error.

A sample of DataFrame storesDF is displayed below:

Code block:

storesDF.na.fill(30000, col("sqft"))

Correct Answer: A

The subset argument of fill() must be a column name given as a string, or a list or tuple of such names, not a Column object. In PySpark, the corrected call is storesDF.na.fill(30000, 'sqft'), which replaces missing values only in the sqft column.

Discussion

5 comments
ZSun (Option: A)
Jun 7, 2023

The correct answer is A. Even in the most recent version, Spark 3.4, na.fill() still works; it is an alias of fillna(). Mr. 4be8126, you really are just making things up.

peekaboo15 (Option: A)
Apr 13, 2023

The answer should be A. See this link for reference: https://sparkbyexamples.com/pyspark/pyspark-fillna-fill-replace-null-values/

4be8126 (Option: E)
Apr 26, 2023

The correct answer is either A or E, depending on the version of Spark being used. In Spark 2.x, the correct method to replace missing values is na.fill(). Option A is correct in Spark 2.x, as it correctly specifies the column to apply the fill operation to using a Column object. However, in Spark 3.x, the method has been renamed to fillna(). Therefore, in Spark 3.x, the correct answer is E, as it uses the correct method name. Both A and E accomplish the same task of replacing missing values in the sqft column with 30,000, so either answer can be considered correct depending on the version of Spark being used.

TC007 (Option: E)
Apr 3, 2023

The error in the code block is that the method na.fill() should be replaced by fillna() to fill the missing values in the column "sqft" with the value 30,000.

azure_bimonster (Option: A)
Feb 8, 2024

We don't need to replace the method here, so A is correct. In PySpark, both fillna() and na.fill() replace missing or null values in a DataFrame. They are functionally equivalent, so you can choose either one based on preference.