Certified Associate Developer for Apache Spark Exam QuestionsBrowse all questions from this exam

Certified Associate Developer for Apache Spark Exam - Question 53


Which of the following pairs of arguments cannot be used in DataFrame.join() to perform an inner join on two DataFrames, named and aliased with "a" and "b" respectively, to specify two key columns?

Show Answer
Correct Answer: BD

In PySpark, when performing a join operation on DataFrames, it is crucial to specify which DataFrame each column belongs to if columns with the same name exist in both DataFrames to avoid ambiguity. The option with 'on = [col("column1"), col("column2")]' will throw an error because this does not explicitly differentiate between the columns in DataFrame 'a' and those in DataFrame 'b'. The other specified options either reference the columns directly from the aliased DataFrames (as in 'on = [a.column1 == b.column1, a.column2 == b.column2]') or clarify the DataFrame each column belongs to using 'col("a.column1")' and 'col("b.column1")'. Therefore, option B cannot be used to perform an inner join with two key columns.

Discussion

8 comments
Sign in to comment
cookiemonster42Option: C
Aug 3, 2023

should be C as in col() we specify only a column name as a string, not a dataframe

newusernameOption: B
Nov 7, 2023

100% B Below code to test: dataA = [Row(column1=1, column2=2), Row(column1=2, column2=4), Row(column1=3, column2=6)] dfA = spark.createDataFrame(dataA)

newusername
Nov 7, 2023

# Sample data for DataFrame 'b' dataB = [Row(column1=1, column2=2), Row(column1=2, column2=5), Row(column1=3, column2=4)] dfB = spark.createDataFrame(dataB) # Alias DataFrames as 'a' and 'b' a = dfA.alias("a") b = dfB.alias("b") a.show() b.show() #Option A joinedDF_A = a.join(b, [a.column1 == b.column1, a.column2 == b.column2]) joinedDF_A.show() #Option B #joinedDF_B = a.join(b, [col("column1"), col("column2")]) #joinedDF_B.show() #Option C joinedDF_C = a.join(b, [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")]) joinedDF_C.show() #Option E joinedDF_E = a.join(b, ["column1", "column2"]) joinedDF_E.show()

juliom6Option: B
Nov 14, 2023

According to the following code, only response B returns an error. The key concept here is that dataframes must be "named" AND "aliased". from pyspark.sql.functions import col a = spark.createDataFrame([(1, 2), (3, 4)], ['column1', 'column2']) b = spark.createDataFrame([(1, 2), (5, 6)], ['column1', 'column2']) a = a.alias('a') b = b.alias('b') df = a.join(b, on = [a.column1 == b.column1, a.column2 == b.column2]) display(df) # df = a.join(b, on = [col("column1"), col("column2")]) df = a.join(b, on = [col("a.column1") == col("b.column1"), col("a.column2") == col("b.column2")]) display(df) df = a.join(b, on = ["column1", "column2"]) display(df)

JticOption: A
May 29, 2023

A. on = [a.column1 == b.column1, a.column2 == b.column2] This option is valid and can be used to perform an inner join on two key columns. It specifies the key columns using the syntax a.column1 == b.column1 and a.column2 == b.column2.

ZSun
Jun 6, 2023

I think the question "which one cannot be used to perform inner join", is confusing, Because only A works, the rest of answer is incorrect. The question should be "which one can be used"

Singh_SumitOption: D
Sep 30, 2023

from pyspark.sql.functions import col df2.alias('a').join(df3.alias('b'), [col("a.name") == col("b.name"), col("a.name") == col("b.name")], 'full_outer').select(df2['name'],'height','age').show() It worked. so every answer is correct.

juadavesOption: B
Oct 18, 2023

I tried all of the options and I got 2 errors from: B AMBIGUOUS_REFERENCE] Reference `Category` is ambiguous, could be: [`Category`, `Category`] C: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `df_1`.`Category` cannot be resolved. Did you mean one of the following? [`Category`, `Category`, `Truth`, `Truth`, `Value`].;

Ahmadkt
Oct 28, 2023

it's B, it seems you didn't do the alias a = df1.alias("a") b = df2.alias("b")

GurdelOption: B
Dec 23, 2023

B throws AnalysisException: [AMBIGUOUS_REFERENCE] Reference `column1` is ambiguous, could be: [`a`.`column1`, `b`.`column1`]

azure_bimonsterOption: B
Feb 8, 2024

B cannot be used as this seems ambiguous