Certified Associate Developer for Apache Spark Exam Questions

Certified Associate Developer for Apache Spark Exam - Question 55


The code block shown below contains an error. The code block is intended to return a new DataFrame that is the result of a cross join between DataFrame storesDF and DataFrame employeesDF. Identify the error.

Code block:

storesDF.join(employeesDF, "cross")

Correct Answer: C

The error in the code block is that a cross join is not implemented by the DataFrame.join() operation; the DataFrame.crossJoin() operation should be used instead. crossJoin() is the method the DataFrame class provides specifically for this purpose.
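A minimal sketch of the corrected code block, using toy stand-ins for storesDF and employeesDF (the column names and sample data are illustrative assumptions, not part of the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the question's DataFrames (schemas assumed for illustration)
storesDF = spark.createDataFrame([(1, "North"), (2, "South")], ["storeId", "division"])
employeesDF = spark.createDataFrame([(10, "Alice"), (11, "Bob")], ["employeeId", "employeeName"])

# Cross join: every store row paired with every employee row (2 x 2 = 4 rows)
storesDF.crossJoin(employeesDF).show()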

Discussion

7 comments
ronfun (Option: D)
Apr 9, 2023

Key is missing. Answer is D.

4be8126
May 3, 2023

No, the issue is not that the key column is missing. In a cross join, there is no key column to join on. The correct answer is C: a cross join is not implemented by the DataFrame.join() operation – the DataFrame.crossJoin() operation should be used instead.

ZSun
Jun 6, 2023

Completely wrong. From the docs:
join(other, on=None, how=None) – joins with another DataFrame, using the given join expression.
Parameters:
other – right side of the join.
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
how – str, default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.

ZSun
Jun 6, 2023

You can specify a cross join with dataframe.join(how='cross'). The reason this code block doesn't work is that the second positional parameter is on: you need to specify the key column and then use how='cross'; otherwise the function will treat 'cross' as on instead of how.
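A quick sketch of this behavior with throwaway DataFrames (the names and columns below are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

storesDF = spark.createDataFrame([(1,), (2,)], ["storeId"])
employeesDF = spark.createDataFrame([(10,), (11,)], ["employeeId"])

# As written in the question, "cross" is taken as the `on` column name, and
# Spark fails because neither DataFrame has a column called "cross":
# storesDF.join(employeesDF, "cross")  # raises an AnalysisException

# Passing it to `how` by keyword performs the cross join:
storesDF.join(employeesDF, on=None, how="cross").show()  # 2 x 2 = 4 rows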

newusername
Nov 7, 2023

ZSun is, as always, right. 4be8126 - it is not a problem to use GPT, but check its answers; otherwise, do not post them anywhere.

newusername (Option: D)
Nov 7, 2023

I know it looks confusing to have a key column for a cross join, but that is the join method syntax: https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.join.html See the example below:

from pyspark.sql import Row

# Sample data for DataFrame 'a'
dataA = [Row(column1=1, column2=2), Row(column1=2, column2=4), Row(column1=3, column2=6)]
dfA = spark.createDataFrame(dataA)

# Sample data for DataFrame 'b'
dataB = [Row(column1=1, column2=2), Row(column1=2, column2=5), Row(column1=3, column2=4)]
dfB = spark.createDataFrame(dataB)

joinedDF = dfA.join(dfB, on=None, how="cross")
joinedDF.show()

It is possible to do a cross join with DataFrame.crossJoin() as well, but answer C states that df.join() doesn't do cross, which is wrong.

peekaboo15 (Option: C)
Apr 13, 2023

A cross join doesn't need a key. The answer is C.

4be8126
May 3, 2023

No, the issue is not that the key column is missing. In a cross join, there is no key column to join on. The correct answer is C: a cross join is not implemented by the DataFrame.join() operation – the DataFrame.crossJoin() operation should be used instead.

4be8126 (Option: C)
May 3, 2023

C. A cross join is not implemented by the DataFrame.join() operation – the DataFrame.crossJoin() operation should be used instead.

juliom6 (Option: C)
Nov 14, 2023

C is correct.

# https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.crossJoin.html
a = spark.createDataFrame([(1, 2), (3, 4)], ['column1', 'column2'])
b = spark.createDataFrame([(5, 6), (7, 8)], ['column3', 'column4'])
df = a.crossJoin(b)
display(df)

azure_bimonster (Option: D)
Feb 8, 2024

D is the answer here, as the key is missing. As per the syntax, a key is needed.

Ahlo (Option: C)
Feb 26, 2024

Correct answer: C.

from pyspark.sql import Row

df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")])
df.crossJoin(df2.select("height")).select("age", "name", "height").show()

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.crossJoin.html