Certified Associate Developer for Apache Spark Exam - Question 32

Question

The code block shown below contains an error. The code block is intended to return a new DataFrame with the mean of column sqft from DataFrame storesDF in column sqftMean. Identify the error.

Code block:

storesDF.agg(mean("sqft").alias("sqftMean"))

Examice · Accepted Answer

The argument to the mean() operation should be a Column object rather than a string column name. The mean function in PySpark's sql.functions module is designed to operate on a Column object, not a string column name. Therefore, the correct approach is to use the col() function to convert the string column name into a Column object before passing it to the mean function. The code block should be written as storesDF.agg(mean(col('sqft')).alias('sqftMean')).

4be8126 · Answer

The code block shown is correct and should return a new DataFrame with the mean of column sqft from DataFrame storesDF in column sqftMean. Therefore, the answer is E - none of the options identify a valid error in the code block.

Here's an explanation for each option:

A. The argument to the mean() operation can be either a Column object or a string column name, so there is no error in using a string column name in this case.

E. This option is incorrect because the code block shown is a valid way to compute the mean of a column using PySpark. Another way to compute the mean of a column is with the mean() method from a DataFrame, but that doesn't mean the code block shown is invalid.

Mohitsain · Answer

agg is not required here.

cookiemonster42 · Answer

There's a similar question in the official Databricks samples and the right answer there is: 
Code block:
storesDF.__1__(__2__(__3__).alias("sqftMean"))
A.
1. agg
2. mean
3. col("sqft")

If we stick to this logic, the answer is A.

thanab · Answer

A.
The error in the code block is **A**, the argument to the `mean` operation should be a Column object rather than a string column name. The `mean` function takes a Column object as an argument, not a string column name. To fix the error, the code block should be rewritten as `storesDF.agg(mean(col("sqft")).alias("sqftMean"))`, where the `col` function is used to create a Column object from the string column name `"sqft"`.

Here is the corrected code:
storesDF.agg(mean(col("sqft")).alias("sqftMean"))

juliom6 · Answer

Correct answer is A:

from pyspark.sql.functions import col, mean

# Assumes `spark` is an existing SparkSession (as in a Databricks notebook).
students = [
    {'rollno': '001', 'name': 'sravan', 'sqft': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
    {'rollno': '002', 'name': 'ojaswi', 'sqft': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'}]
storesDF = spark.createDataFrame(students)
storesDF.agg(mean(col('sqft')).alias('sqftMean')).show()

azurearch · Answer

from pyspark.sql.functions import col, mean

# Assumes `spark` is an existing SparkSession (as in a Databricks notebook).
students = [
    {'rollno': '001', 'name': 'sravan', 'sqft': 23, 'height': 5.79, 'weight': 67, 'address': 'guntur'},
    {'rollno': '002', 'name': 'ojaswi', 'sqft': 16, 'height': 3.79, 'weight': 34, 'address': 'hyd'}]
storesDF = spark.createDataFrame(students)
storesDF.agg(mean('sqft').alias('sqftMean')).show()

This works as well! Not sure which one is wrong, then.

ajayrtk · Answer

The error in the code is A. The argument to the mean() operation should be a Column object rather than a string column name.
In the provided code block, "sqft" is passed as a string column name to the mean() function. However, the correct approach is to use a Column object. This can be achieved by referencing the column using the storesDF DataFrame and the col() function. Here's the corrected code:
storesDF.agg(mean(col("sqft")).alias("sqftMean"))

zozoshanky · Answer

df.agg(mean("amountpaid").alias("amountpaid")).show()
df.agg(mean(col("amountpaid")).alias("sqftMean")).show()

Both produce the result.

halouanne · Answer

The correct answer is:

B. The argument to the mean() operation should not be quoted.

In the context of Apache Spark, the mean function takes a column name as its argument. Therefore, you would write it without quotes. The corrected code line would look something like this:

juadaves · Answer

D
withColumn() for new calculated column.

outwalker · Answer

It appears that there is some flexibility in how the mean function can be used: it accepts either a string column name or a Column object created with col(). However, the most accurate and recommended approach is to use the col() function to create a Column object explicitly.

With this in mind, the best choice is:

A. The argument to the mean() operation should be a Column object rather than a string column name. The mean function takes a Column object as an argument, not a string column name. To fix the error, the code block should be rewritten as storesDF.agg(mean(col("sqft")).alias("sqftMean")), where the col function is used to create a Column object from the string column name "sqft".

While there might be situations where using a string column name works, following the standard practice of creating a Column object with col() ensures compatibility and clarity in code.

Saurabh_prep · Answer

A should be the answer, considering the Databricks practice PDF. The mean() function should take a Column object as input.

azure_bimonster · Answer

A is most likely correct here.
