Which of the following pairs of arguments cannot be used in DataFrame.join() to perform an inner join on two DataFrames, named and aliased with "a" and "b" respectively, to specify two key columns?
In PySpark, when joining DataFrames that share column names, each column reference must be qualified with the DataFrame it belongs to, otherwise the reference is ambiguous. The option with 'on = [col("column1"), col("column2")]' throws an error because it does not differentiate between the columns in DataFrame 'a' and those in DataFrame 'b'. The other options either reference the columns directly on the aliased DataFrames (as in 'on = [a.column1 == b.column1, a.column2 == b.column2]') or qualify each column with its DataFrame alias using 'col("a.column1")' and 'col("b.column1")'. Therefore, option B cannot be used to perform an inner join on two key columns.
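For completeness, a minimal sketch (assuming an existing SparkSession named `spark`, as in the test code elsewhere in this thread) showing that DataFrame.join() defaults to an inner join when only `on` is supplied, so none of the options above needs an explicit how="inner":

a = spark.createDataFrame([(1, 2), (3, 4)], ["column1", "column2"]).alias("a")
b = spark.createDataFrame([(1, 2), (5, 6)], ["column1", "column2"]).alias("b")

# `how` defaults to "inner" when only `on` is given, so these two calls are equivalent.
a.join(b, on=["column1", "column2"]).show()
a.join(b, on=["column1", "column2"], how="inner").show()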
According to the following code, only response B returns an error. The key concept here is that dataframes must be "named" AND "aliased".

from pyspark.sql.functions import col

a = spark.createDataFrame([(1, 2), (3, 4)], ['column1', 'column2'])
b = spark.createDataFrame([(1, 2), (5, 6)], ['column1', 'column2'])
a = a.alias('a')
b = b.alias('b')

df = a.join(b, on = [a.column1 == b.column1, a.column2 == b.column2])
display(df)

# df = a.join(b, on = [col("column1"), col("column2")])

df = a.join(b, on = [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")])
display(df)

df = a.join(b, on = ["column1", "column2"])
display(df)
100% B. Below is the code to test:

from pyspark.sql import Row
from pyspark.sql.functions import col

# Sample data for DataFrame 'a'
dataA = [Row(column1=1, column2=2), Row(column1=2, column2=4), Row(column1=3, column2=6)]
dfA = spark.createDataFrame(dataA)

# Sample data for DataFrame 'b'
dataB = [Row(column1=1, column2=2), Row(column1=2, column2=5), Row(column1=3, column2=4)]
dfB = spark.createDataFrame(dataB)

# Alias DataFrames as 'a' and 'b'
a = dfA.alias("a")
b = dfB.alias("b")
a.show()
b.show()

# Option A
joinedDF_A = a.join(b, [a.column1 == b.column1, a.column2 == b.column2])
joinedDF_A.show()

# Option B
# joinedDF_B = a.join(b, [col("column1"), col("column2")])
# joinedDF_B.show()

# Option C
joinedDF_C = a.join(b, [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")])
joinedDF_C.show()

# Option E
joinedDF_E = a.join(b, ["column1", "column2"])
joinedDF_E.show()
Should be C, since in col() we specify only a column name as a string, not a DataFrame.
A. on = [a.column1 == b.column1, a.column2 == b.column2]
This option is valid and can be used to perform an inner join on two key columns. It specifies the key columns using the syntax a.column1 == b.column1 and a.column2 == b.column2.
I think the question "which one cannot be used to perform an inner join" is confusing, because only A works and the rest of the answers are incorrect. The question should be "which one can be used".
B cannot be used, as the column references are ambiguous.
B throws AnalysisException: [AMBIGUOUS_REFERENCE] Reference `column1` is ambiguous, could be: [`a`.`column1`, `b`.`column1`]
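A minimal way to reproduce and inspect that error (assumes an existing SparkSession named `spark`; the exception import path may vary slightly between Spark versions):

from pyspark.sql.functions import col
from pyspark.sql.utils import AnalysisException

a = spark.createDataFrame([(1, 2)], ["column1", "column2"]).alias("a")
b = spark.createDataFrame([(1, 2)], ["column1", "column2"]).alias("b")

try:
    # Option B: unqualified col() references match columns on both sides of the join.
    a.join(b, on=[col("column1"), col("column2")]).show()
except AnalysisException as e:
    print(e)  # [AMBIGUOUS_REFERENCE] Reference `column1` is ambiguous ...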
I tried all of the options and got errors from two of them:
B: [AMBIGUOUS_REFERENCE] Reference `Category` is ambiguous, could be: [`Category`, `Category`].
C: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `df_1`.`Category` cannot be resolved. Did you mean one of the following? [`Category`, `Category`, `Truth`, `Truth`, `Value`].
It's B. It seems you didn't apply the aliases:
a = df1.alias("a")
b = df2.alias("b")
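A quick check of that point (hypothetical df1/df2 with the same two columns; assumes an existing SparkSession named `spark`):

from pyspark.sql.functions import col

df1 = spark.createDataFrame([(1, 2), (3, 4)], ["column1", "column2"])
df2 = spark.createDataFrame([(1, 2), (5, 6)], ["column1", "column2"])

# Without the aliases, the qualified names cannot be resolved (the UNRESOLVED_COLUMN error seen above):
# df1.join(df2, [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")]).show()

# With the aliases in place, option C resolves and the inner join succeeds:
a = df1.alias("a")
b = df2.alias("b")
a.join(b, [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")]).show()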
from pyspark.sql.functions import col

df2.alias('a').join(df3.alias('b'),
                    [col("a.name") == col("b.name"), col("a.name") == col("b.name")],
                    'full_outer').select(df2['name'], 'height', 'age').show()

It worked, so every answer is correct.