Certified Associate Developer for Apache Spark Exam - Question 25


Which of the following code blocks returns a DataFrame where column storeCategory from DataFrame storesDF is split at the underscore character into column storeValueCategory and column storeSizeCategory?

A sample of DataFrame storesDF is displayed below:

Correct Answer: C

Option C splits the values in column storeCategory at the underscore character by passing the column, wrapped in col, to the split function. The correct code is:

storesDF.withColumn('storeValueCategory', split(col('storeCategory'), '_')[0]).withColumn('storeSizeCategory', split(col('storeCategory'), '_')[1])

Here, split(col('storeCategory'), '_') returns an array of strings for each row. The second argument is a regular-expression pattern, and '_' has no special meaning in it, so it matches a literal underscore. The indexes [0] and [1] select the first and second elements of that array, and withColumn assigns them to the new columns storeValueCategory and storeSizeCategory, respectively.

Discussion

4 comments
ronfun
Apr 9, 2023

Both C and D are correct. The split function accepts either a Column or a column-name string as its first argument. https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.split.html?highlight=split#pyspark.sql.functions.split

4be8126
Apr 26, 2023

Option D is not correct because the split function should be used with the col function to split the values in a column. In option D, the split function is used with a string literal rather than a column, which will result in an error.

NickWerbung
Jul 1, 2023

Both C and D are correct!

4be8126 Option: C
Apr 26, 2023

Option C returns a DataFrame where column storeCategory from DataFrame storesDF is split at the underscore character into column storeValueCategory and column storeSizeCategory. The correct code is:

(storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[0])
    .withColumn("storeSizeCategory", split(col("storeCategory"), "_")[1]))

Explanation:

split(col("storeCategory"), "_") splits the values in column storeCategory by the "_" character and returns an array of strings.

[0] gets the first element of the resulting array and assigns it to the new column storeValueCategory.

[1] gets the second element of the resulting array and assigns it to the new column storeSizeCategory.

withColumn is used to create the new columns and returns a new DataFrame.

newusername Option: C
Sep 11, 2023

C. You can check by running the code below:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

# Initialize Spark session
spark = SparkSession.builder.appName("split_test").getOrCreate()

# Create synthetic data
data = [
    {"storeCategory": "value1_size1"},
    {"storeCategory": "value2_size2"},
    {"storeCategory": "value3_size3"},
]
storesDF = spark.createDataFrame(data)
storesDF.show()

# Option C
newDF = (storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[0])
    .withColumn("storeSizeCategory", split(col("storeCategory"), "_")[1]))
newDF.show()

zozoshanky Option: C
Jul 30, 2023

C is correct.