Which of the following code blocks returns a new DataFrame where column productCategories only has one word per row, resulting in a DataFrame with many more rows than DataFrame storesDF?
A sample of storesDF is displayed below:
The code block that returns a new DataFrame where column productCategories only has one word per row is the one that uses the explode function together with the col function. explode transforms a column containing arrays into multiple rows, one row per array element, so the result has many more rows than the original DataFrame. Wrapping the name in col("productCategories") passes a Column expression to explode, and the Column that explode returns is what withColumn expects as its second argument.
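As a minimal sketch of that transformation (the real storesDF sample is not shown here, so the schema and column names below are assumed):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("explode-demo").getOrCreate()

    # Hypothetical stand-in for storesDF: each row holds an array of categories
    storesDF = spark.createDataFrame(
        [(1, ["electronics", "toys"]), (2, ["groceries"])],
        ["storeId", "productCategories"],
    )

    # One output row per array element
    storesDF.withColumn("productCategories", explode(col("productCategories"))).show()
    # Expected output:
    # +-------+-----------------+
    # |storeId|productCategories|
    # +-------+-----------------+
    # |      1|      electronics|
    # |      1|             toys|
    # |      2|        groceries|
    # +-------+-----------------+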
Both A and E are correct.
A is correct; use the code below to test:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    # Initializing Spark session
    spark = SparkSession.builder.appName("test").getOrCreate()

    # 1. Creating DataFrame with an array column
    data_array = [
        (1, ["electronics", "clothes", "toys"]),
        (2, ["groceries", "electronics"]),
        (3, ["books", "clothes"]),
    ]
    storesDF = spark.createDataFrame(data_array, ["ID", "productCategories"])
    storesDF.show()

    # Option A: explode with a col() wrapper
    df_array = storesDF.withColumn("productCategories", explode(col("productCategories")))
    df_array.show()
But E works as well, sadly. What has to be chosen then?

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode

    # Initializing Spark session
    spark = SparkSession.builder.appName("test").getOrCreate()

    # 1. Creating DataFrame with an array column
    data_array = [
        (1, ["electronics", "clothes", "toys"]),
        (2, ["groceries", "electronics"]),
        (3, ["books", "clothes"]),
    ]
    storesDF = spark.createDataFrame(data_array, ["ID", "productCategories"])
    storesDF.show()

    # Option E: explode with the column name as a plain string
    df_array = storesDF.withColumn("productCategories", explode("productCategories"))
    df_array.show()
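One way to settle whether A and E really behave identically is to compare their outputs directly; a sketch reusing the storesDF built above (exceptAll is a standard DataFrame method since Spark 2.4):

    from pyspark.sql.functions import col, explode

    df_a = storesDF.withColumn("productCategories", explode(col("productCategories")))
    df_e = storesDF.withColumn("productCategories", explode("productCategories"))

    # Same row count (3 + 2 + 2 = 7) and no rows that differ in either direction
    assert df_a.count() == df_e.count() == 7
    assert df_a.exceptAll(df_e).count() == 0
    assert df_e.exceptAll(df_a).count() == 0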
E for Spark 3.0
While the explode function accepts either a str or a Column as input, the answer here uses the col() wrapper because the expression is passed to a withColumn() call, whose second parameter must be a Column object: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html?highlight=withcolumn#pyspark.sql.DataFrame.withColumn
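Note, though, that explode itself returns a Column whether it is given a string name or a Column object, so withColumn's second-argument requirement is met by both A and E. A quick check (assuming a running PySpark 3.x session):

    from pyspark.sql import Column, SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("check").getOrCreate()

    # Both call forms yield a Column expression before withColumn ever sees it
    assert isinstance(explode("productCategories"), Column)
    assert isinstance(explode(col("productCategories")), Column)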
A and E are correct
Both A and E are correct according to the new version
Option A is correct: storesDF.withColumn("productCategories", explode(col("productCategories"))). Explanation: The explode function is used to transform a column of arrays or maps into multiple rows, one for each element in the array or map. In this case, productCategories is a column with arrays of strings. The withColumn function is used to add a new column or update an existing column. The first argument is the name of the new or existing column, and the second argument is the expression that defines the values for the column.
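To see that type change concretely, the column goes from an array of strings to a plain string after the explode; a sketch with assumed column names:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("schema-demo").getOrCreate()

    df = spark.createDataFrame([(1, ["books", "toys"])], ["storeId", "productCategories"])

    df.printSchema()
    # root
    #  |-- storeId: long (nullable = true)
    #  |-- productCategories: array (nullable = true)
    #  |    |-- element: string (containsNull = true)

    df.withColumn("productCategories", explode(col("productCategories"))).printSchema()
    # root
    #  |-- storeId: long (nullable = true)
    #  |-- productCategories: string (nullable = true)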