Certified Associate Developer for Apache Spark Exam - Question 25

Question

Which of the following code blocks returns a DataFrame where column storeCategory from DataFrame storesDF is split at the underscore character into column storeValueCategory and column storeSizeCategory?

A sample of DataFrame storesDF is displayed below:

Examice · Accepted Answer

The correct code block uses the split function, together with col, to split the values in the storeCategory column at the underscore character and place the parts in columns storeValueCategory and storeSizeCategory. The correct code is:

storesDF.withColumn('storeValueCategory', split(col('storeCategory'), '_')[0]).withColumn('storeSizeCategory', split(col('storeCategory'), '_')[1])

Here, split(col('storeCategory'), '_') splits each value in storeCategory at the '_' character and returns an array of strings. Indexes [0] and [1] select the first and second elements of that array, which are assigned to the new columns storeValueCategory and storeSizeCategory, respectively.

ronfun · Answer

Both C and D are correct. The split function accepts either a Column (col) or a column name given as a str.
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.split.html?highlight=split#pyspark.sql.functions.split

4be8126 · Answer

Option C returns a DataFrame where column storeCategory from DataFrame storesDF is split at the underscore character into column storeValueCategory and column storeSizeCategory.

The correct code is:

(storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[0])
.withColumn("storeSizeCategory", split(col("storeCategory"), "_")[1]))

Explanation:

split(col("storeCategory"), "_") splits the values in column storeCategory by the "_" character and returns an array of strings.

[0] gets the first element of the resulting array and assigns it to the new column storeValueCategory.

[1] gets the second element of the resulting array and assigns it to the new column storeSizeCategory.

withColumn is used to create the new columns and returns a new DataFrame.

newusername · Answer

C. You can check by running the code below:

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("split_test").getOrCreate()

# Create synthetic data
data = [
    {"storeCategory": "value1_size1"},
    {"storeCategory": "value2_size2"},
    {"storeCategory": "value3_size3"},
]

storesDF = spark.createDataFrame(data)
storesDF.show()

from pyspark.sql.functions import split, col

# Option C

newDF = (storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[0])
.withColumn("storeSizeCategory", split(col("storeCategory"), "_")[1]))
newDF.show()

zozoshanky · Answer

C is correct.
