Certified Associate Developer for Apache Spark Exam Questions

Certified Associate Developer for Apache Spark Exam - Question 26


Which of the following code blocks returns a new DataFrame where column productCategories only has one word per row, resulting in a DataFrame with many more rows than DataFrame storesDF?

A sample of storesDF is displayed below:

Correct Answer: A

The code block that returns a new DataFrame where column productCategories has only one word per row is the one that uses the explode function together with the col function: storesDF.withColumn("productCategories", explode(col("productCategories"))). The explode function transforms a column of arrays into multiple rows, one row per array element, so the result has as many rows as there are array elements in total, replicating the remaining column values for each element. Wrapping the column name in col("productCategories") passes a Column object to explode, and explode's result becomes the second argument of withColumn, which expects a Column. This ensures the proper transformation of the productCategories column into multiple rows.

Discussion

6 comments
NickWerbung
Jul 1, 2023

Both A and E are correct.

newusername (Option: A)
Sep 11, 2023

A is correct, use the code below to test:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode  # this import was missing in the original snippet

# Initializing Spark session
spark = SparkSession.builder.appName("test").getOrCreate()

# 1. Creating DataFrame with an array column
data_array = [
    (1, ["electronics", "clothes", "toys"]),
    (2, ["groceries", "electronics"]),
    (3, ["books", "clothes"]),
]
storesDF = spark.createDataFrame(data_array, ["ID", "productCategories"])
storesDF.show()

df_array = storesDF.withColumn("productCategories", explode(col("productCategories")))
df_array.show()

newusername
Sep 11, 2023

But E works as well, sadly. Which one should be chosen then?

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode  # this import was missing in the original snippet

# Initializing Spark session
spark = SparkSession.builder.appName("test").getOrCreate()

# 1. Creating DataFrame with an array column
data_array = [
    (1, ["electronics", "clothes", "toys"]),
    (2, ["groceries", "electronics"]),
    (3, ["books", "clothes"]),
]
storesDF = spark.createDataFrame(data_array, ["ID", "productCategories"])
storesDF.show()

# Check E: explode also accepts the column name as a string
df_array = storesDF.withColumn("productCategories", explode("productCategories"))
df_array.show()

newusername
Nov 6, 2023

E also works as of Spark 3.0.

mhaskins (Option: A)
May 22, 2023

While the explode function allows either a str or a Column as input, the intended answer uses the col() wrapper because the call sits inside withColumn(), whose second parameter requires a Column object. https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html?highlight=withcolumn#pyspark.sql.DataFrame.withColumn

4be8126 (Option: A)
Apr 26, 2023

Option A is correct: storesDF.withColumn("productCategories", explode(col("productCategories"))). Explanation: The explode function is used to transform a column of arrays or maps into multiple rows, one for each element in the array or map. In this case, productCategories is a column with arrays of strings. The withColumn function is used to add a new column or update an existing column. The first argument is the name of the new or existing column, and the second argument is the expression that defines the values for the column.

arturffsi (Option: E)
Mar 7, 2024

Both A and E are correct in newer Spark versions.

bettermakeme
Apr 3, 2024

A and E are correct