Certified Associate Developer for Apache Spark Exam - Question 27

Question

Which of the following code blocks returns a new DataFrame with column storeDescription where the pattern "Description: " has been removed from the beginning of column storeDescription in DataFrame storesDF?

A sample of DataFrame storesDF is below:

Examice · Accepted Answer

To remove the pattern 'Description: ' from the beginning of the column storeDescription in the DataFrame storesDF, the correct approach is to use the regexp_replace function. The syntax for this function is regexp_replace(column, pattern, replacement), where the column is specified with the col function, the pattern is the regular expression to look for, and the replacement is what will replace the found pattern. Option E uses this correct syntax, making it the right choice.

4be8126 · Answer

The correct answer is option E: storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: ", "")).

This code block uses the withColumn() function to create a new column called storeDescription. It uses the regexp_replace() function to replace the pattern "^Description: " at the beginning of the string in the storeDescription column with an empty string. This effectively removes the pattern from the beginning of the string in each row of the column.
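
To make the role of the ^ anchor concrete, here is a minimal sketch using Python's standard re module rather than Spark itself. This is an analogy: Spark's regexp_replace uses Java regular expressions, but for this simple pattern the anchor should behave the same way. The sample string and the helper names strip_leading and strip_everywhere are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical sample value: the pattern appears at the start and again later.
sample = "Description: This is a tech store. Description: This"


def strip_leading(text: str) -> str:
    """Remove "Description: " only when it starts the string (anchored with ^)."""
    return re.sub(r"^Description: ", "", text)


def strip_everywhere(text: str) -> str:
    """Remove every occurrence of "Description: " (no anchor)."""
    return re.sub(r"Description: ", "", text)


anchored = strip_leading(sample)
unanchored = strip_everywhere(sample)

print(anchored)    # only the leading occurrence is removed
print(unanchored)  # both occurrences are removed
```

Without the ^, the second "Description: " in the middle of the string would be removed as well, which is not what the question asks for.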

TC007 · Answer

This should actually be D; sorry for the wrong answer. Refer to https://sparkbyexamples.com/pyspark/pyspark-replace-column-values/

ronfun · Answer

Both D and E are correct answers.

pierre_grns · Answer

Correct answer is D.
First, regexp_replace/regexp_extract are from sql.functions. They cannot be called as methods on a Column object => B is incorrect.
Second, regexp_replace/regexp_extract accept a string as the first argument to specify the column. Check the documentation here: https://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#module-pyspark.sql.functions => A, C, E are incorrect.

sly75 · Answer

Correct answer is E indeed
- According to the pyspark doc, the syntax is regexp_replace(str, pattern, replacement)
   -> it means that it's not a function of the column object
- storeDescription is a String field

https://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_replace

zozoshanky · Answer

E is the answer; I tested it.

Dgohel · Answer

regexp_replace(str, regexp, rep [, position] )
This is what the Databricks documentation says. You can debate between D and E, but the question clearly says to remove the pattern from the beginning of the string. If you take answer D, it passes only the single constant string "storeDescription" to be matched against the pattern, and the part after "Description" would come back as an empty string for each row.

So if the debate is between D and E, then E is the correct answer.

newusername · Answer

Both work:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace

spark = SparkSession.builder.appName("test").getOrCreate()

# The first row repeats the pattern to show that only the leading match is removed.
data = [
    (1, "Description: This is a tech store. Description: This"),
    (2, "Description: This is a grocery store."),
    (3, "Description: This is a book store."),
]
storesDF = spark.createDataFrame(data, ["storeID", "storeDescription"])
storesDF.show(truncate=False)

# Case D: the column is passed by name, as a string.
# Each case starts from the original storesDF so the two are tested independently.
print("Case D")
resultD = storesDF.withColumn("storeDescription", regexp_replace("storeDescription", "^Description: ", ""))
resultD.show(truncate=False)

# Case E: the column is passed as a Column object via col().
print("Case E")
resultE = storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: ", ""))
resultE.show(truncate=False)

TC007 · Answer

The regexp_replace function is used to remove the pattern "Description: " from the beginning of the column storeDescription. The ^ symbol indicates the beginning of the string, and the pattern "Description: " is replaced with an empty string. This results in a new DataFrame with column storeDescription where the pattern "Description: " has been removed from the beginning of each cell in that column.

4be8126 · Answer

Option A is correct: storesDF.withColumn("productCategories", explode(col("productCategories"))).

Explanation:

The explode function is used to transform a column of arrays or maps into multiple rows, one for each element in the array or map. In this case, productCategories is a column with arrays of strings.

The withColumn function is used to add a new column or update an existing column. The first argument is the name of the new or existing column, and the second argument is the expression that defines the values for the column.
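
As a rough illustration of what explode does, here is a plain-Python analogy (not Spark code). The rows, the column layout, and the helper name explode_rows are hypothetical. The sketch assumes the documented behavior of Spark's explode, which produces no output row for an empty array.

```python
# Hypothetical rows of (storeId, productCategories).
rows = [
    (1, ["books", "toys"]),
    (2, ["food"]),
    (3, []),  # empty array: contributes no output rows
]


def explode_rows(rows):
    """Emit one (storeId, category) row per element of each array."""
    return [
        (store_id, category)
        for store_id, categories in rows
        for category in categories
    ]


exploded = explode_rows(rows)
for row in exploded:
    print(row)
```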

SonicBoom10C9 · Answer

It's between D and E, and D is wrong as there is no replacement string expression (which is a required argument/parameter). Thus, E wins as the correct option.

NickWerbung · Answer

Both D and E are correct.

azure_bimonster · Answer

E is most likely correct in this scenario

arturffsi · Answer

Both D and E are correct according to the new version
