Certified Associate Developer for Apache Spark Exam - Question 23

Question

Which of the following code blocks returns a new DataFrame with a new column employeesPerSqft that is the quotient of column numberOfEmployees and column sqft, both of which are from DataFrame storesDF? Note that column employeesPerSqft is not in the original DataFrame storesDF.

Examice · Accepted Answer

The correct code block uses the withColumn() method, which returns a new DataFrame with the named column added. The correct syntax is storesDF.withColumn("employeesPerSqft", col("numberOfEmployees") / col("sqft")). The first argument is the name of the new column; the second is a Column expression that divides each value in numberOfEmployees by the corresponding value in sqft. The col() function comes from pyspark.sql.functions.

SonicBoom10C9 · Answer

C and D are wrong because employeesPerSqft cannot be selected; it doesn't exist. That is also not proper select syntax anyway. B does not reference the existing columns using col(), and E refers to employeesPerSqft as an existing column; a col() expression also cannot be the first argument to withColumn().

4be8126 · Answer

The correct code block to return a new DataFrame with a new column employeesPerSqft that is the quotient of column numberOfEmployees and column sqft from DataFrame storesDF is:

storesDF.withColumn("employeesPerSqft", col("numberOfEmployees") / col("sqft"))

Option A correctly uses the withColumn() function to create a new column employeesPerSqft by dividing column numberOfEmployees by column sqft.

Option B is incorrect because it references column names with quotation marks instead of the col() function, so the division is applied to plain strings rather than to Column objects.

Option C is also incorrect: it references column names with quotation marks instead of the col() function, and it uses the select() function instead of withColumn() to create the new column.

Option D correctly references column names using col() and uses the select() function to return a DataFrame with only the two selected columns.

Option E is incorrect because it passes col() as the first argument to withColumn(). The first argument must be the new column's name as a string; the Column expression goes second.

Therefore, the correct answer is A.

storesDF.withColumn("employeesPerSqft", col("numberOfEmployees") / col("sqft"))
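The option texts are not reproduced in this thread, so the following is only a sketch of why quoted names fail, assuming Option B looks something like storesDF.withColumn("employeesPerSqft", "numberOfEmployees" / "sqft"). Python evaluates the arguments before Spark sees them, and dividing two plain strings is not defined, so this is a runtime TypeError rather than a syntax error. The helper name divide_plain_strings is made up for illustration.

```python
# Assumption: Option B passes plain string literals to the division operator.
# No Spark session is needed to see the failure; it happens in plain Python.
def divide_plain_strings():
    # str / str is not a defined operation, so this line raises TypeError.
    return "numberOfEmployees" / "sqft"


try:
    divide_plain_strings()
    outcome = "no error"
except TypeError as e:
    outcome = f"TypeError: {e}"

print(outcome)
```

Wrapping each name in col() turns it into a Column object, and Column defines the / operator, which is why Option A works.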

4be8126 · Answer

storesDF.select("employeesPerSqft", col("numberOfEmployees") / col("sqft"))

This code block selects the column "employeesPerSqft" and the quotient of "numberOfEmployees" and "sqft" from the DataFrame storesDF. However, since "employeesPerSqft" is not a column in the original storesDF, Spark cannot resolve it and this code block would raise an AnalysisException.

To create a new column "employeesPerSqft" in the resulting DataFrame, we need to use the withColumn() method instead of select(). Here's the corrected code block:

storesDF.withColumn("employeesPerSqft", col("numberOfEmployees") / col("sqft"))

This code block returns a new DataFrame that contains every column of storesDF plus the new column "employeesPerSqft"; storesDF itself is not modified. The new column is created by dividing the values in column "numberOfEmployees" by the values in column "sqft".

newusername · Answer

Test:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initializing Spark session (if not already initialized)
spark = SparkSession.builder.appName("databricks_example").getOrCreate()

# Creating some synthetic data for storesDF
data = [
    {"storeId": 1, "numberOfEmployees": 10, "sqft": 500},
    {"storeId": 2, "numberOfEmployees": 15, "sqft": 750},
    {"storeId": 3, "numberOfEmployees": 8, "sqft": 400}
]

storesDF = spark.createDataFrame(data)

# Option A:
try:
    df_a = storesDF.withColumn("employeesPerSqft", col("numberOfEmployees") / col("sqft"))
    df_a.show()
    print("Option A works")
except Exception as e:
    print("Option A doesn't work:", str(e))

Discussion