Certified Associate Developer for Apache Spark Exam Questions

Certified Associate Developer for Apache Spark Exam - Question 58


In what order should the lines of code below be run to write DataFrame storesDF to file path filePath as parquet, partitioned by the values in column division?

Lines of code:

1. .write() \

2. .partitionBy("division") \

3. .parquet(filePath)

4. .storesDF \

5. .repartition("division")

6. .write \

7. .path(filePath, "parquet")

Correct Answer: C

To save a DataFrame as a parquet file partitioned by a specific column in PySpark, the calls must be chained in this order: 1) start with the DataFrame (storesDF), 2) access .write to obtain a DataFrameWriter, 3) call .partitionBy to specify the column by which to partition the data, and 4) call .parquet to write the data to the specified file path. Hence, the correct order is storesDF, .write, .partitionBy("division"), .parquet(filePath), i.e. lines 4, 6, 2, and 3. Note that .write is an attribute, not a method, so .write() (line 1) is incorrect, and DataFrameWriter has no .path(filePath, "parquet") method (line 7); .repartition("division") (line 5) only changes the in-memory partitioning of the DataFrame and does not write anything.
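A minimal runnable sketch of the correct chain is shown below. The storesDF contents and the filePath value are placeholders invented for illustration; only the .write / .partitionBy / .parquet chain reflects the answer.

```python
from pyspark.sql import SparkSession

# Assumed setup: a local SparkSession and a small example DataFrame
# with a "division" column. filePath is a hypothetical output location.
spark = SparkSession.builder.appName("partitioned-parquet-write").getOrCreate()

storesDF = spark.createDataFrame(
    [(1, "north"), (2, "south"), (3, "north")],
    ["store_id", "division"],
)

filePath = "/tmp/stores_parquet"

# .write is an attribute that returns a DataFrameWriter (no parentheses),
# .partitionBy("division") sets the partition column,
# and .parquet(filePath) triggers the actual write.
storesDF \
    .write \
    .partitionBy("division") \
    .parquet(filePath)
```

The resulting output directory contains one subdirectory per distinct value of division (e.g. division=north/, division=south/), each holding the parquet files for that partition.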

Discussion

2 comments
newusername - Option: C
Nov 7, 2023

Correct

juliom6 - Option: C
Nov 14, 2023

C is correct: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.parquet.html