
Certified Data Engineer Professional Exam - Question 70


A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

Correct Answer: D

Ingesting the data, executing the narrow transformations, and then repartitioning to 2,048 partitions (1 TB ≈ 1,048,576 MB; 1,048,576 / 512 = 2,048) ensures that each partition will approximate the target part-file size of 512 MB when writing to Parquet. This strategy directly controls the number of output partitions without requiring any shuffling, which can be expensive in terms of performance.
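A rough PySpark sketch of that strategy (the paths are placeholders, and the 2,048 figure follows from the 1 TB / 512 MB arithmetic above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths; the question does not name the source or destination.
df = spark.read.json("/mnt/raw/events/")            # ~1 TB of JSON

transformed = df.select("id", "ts", "payload")      # example narrow transformations

# 1 TB / 512 MB target part-file size ~= 2,048 output partitions.
# Note that repartition() performs a full shuffle of the data.
(transformed
    .repartition(2048)
    .write
    .mode("overwrite")
    .parquet("/mnt/curated/events_parquet/"))
```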

Discussion

14 comments
aragorn_brego (Option: A)
Nov 21, 2023

This strategy aims to control the size of the output Parquet files without shuffling the data. The spark.sql.files.maxPartitionBytes parameter sets the maximum size of a partition that Spark will read. By setting it to 512 MB, you are aligning the read partition size with the desired output file size. Since the transformations are narrow (meaning they do not require shuffling), the number of partitions should roughly correspond to the number of output files when writing out to Parquet, assuming the data is evenly distributed and there is no data expansion during processing.
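A minimal sketch of this approach, assuming placeholder paths and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Align the size of each read partition with the target output file size of 512 MB.
spark.conf.set("spark.sql.files.maxPartitionBytes", 512 * 1024 * 1024)

df = spark.read.json("/mnt/raw/events/")              # placeholder source path

# Narrow transformations preserve the read partitioning (no shuffle).
cleaned = df.filter(F.col("payload").isNotNull())     # "payload" is a placeholder column

# With no shuffle in between, the number of output files roughly tracks the
# number of read partitions, each covering ~512 MB of input data.
cleaned.write.mode("overwrite").parquet("/mnt/curated/events_parquet/")
```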

Def21 (Option: D)
Jan 24, 2024

D is the only one that does the trick; note that we cannot shuffle. The wrong answers:
A: spark.sql.files.maxPartitionBytes is about reading, not writing (the maximum number of bytes to pack into a single partition when reading files; effective only for file-based sources such as Parquet, JSON and ORC).
B: spark.sql.adaptive.advisoryPartitionSizeInBytes takes effect while shuffling, and sorting does not make sense here (it is the advisory size in bytes of a shuffle partition during adaptive optimization, when spark.sql.adaptive.enabled is true; it applies when Spark coalesces small shuffle partitions or splits a skewed shuffle partition).
C: Would work, but spark.sql.adaptive.advisoryPartitionSizeInBytes would need a shuffle to take effect.
E: spark.sql.shuffle.partitions configures the number of partitions used when shuffling data for joins or aggregations; it is not about writing.
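One way to sanity-check which of these operations actually introduce a shuffle is to look for an Exchange node in the physical plan; a small illustrative check:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)   # toy DataFrame, just to inspect plans

# coalesce() is a narrow operation: no Exchange (shuffle) appears in the plan.
df.coalesce(8).explain()

# repartition() adds a round-robin Exchange, i.e. a full shuffle.
df.repartition(8).explain()

# Joins and aggregations are where spark.sql.shuffle.partitions and
# spark.sql.adaptive.advisoryPartitionSizeInBytes come into play.
df.groupBy((df.id % 10).alias("bucket")).count().explain()
```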

petrv (Option: C)
Nov 30, 2023

Here's a breakdown of the reasons:
spark.sql.adaptive.advisoryPartitionSizeInBytes: this parameter provides advisory partition sizes to the adaptive query execution framework, helping control partition sizes without triggering unnecessary shuffling.
coalesce(2048): coalescing to a specific number of partitions after the narrow transformations controls the number of output files without triggering a shuffle, which helps reach the target part-file size without the overhead of a full shuffle.
Setting a specific target: the strategy aims for a part-file size of 512 MB, which matches the requirement.
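A sketch of the pipeline this comment describes (placeholder paths; note that the advisory size only takes effect when AQE is coalescing or splitting shuffle partitions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", True)
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", 512 * 1024 * 1024)

df = spark.read.json("/mnt/raw/events/")      # placeholder source path

# coalesce() only merges existing partitions (narrow, no shuffle); it cannot
# increase the partition count, so 2,048 acts as an upper bound here.
(df.coalesce(2048)
   .write
   .mode("overwrite")
   .parquet("/mnt/curated/events_parquet/"))
```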

vctrhugo (Option: A)
Feb 7, 2024

This approach ensures that each partition will be approximately the target part-file size, which can improve the efficiency of the data write. It also avoids the need for a shuffle operation, which can be expensive in terms of performance.

sturcu (Option: B)
Oct 24, 2023

The number of output files saved to the disk is equal to the number of partitions in the Spark executors when the write operation is performed.
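A quick toy check of that relationship:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).repartition(4)    # toy DataFrame with 4 partitions

print(df.rdd.getNumPartitions())              # 4

df.write.mode("overwrite").parquet("/tmp/parts_demo/")
# The output directory holds roughly that many part-*.parquet files
# (plus a _SUCCESS marker).
```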

divingbell17 (Option: A)
Jan 1, 2024

A is correct. The question asks "Which strategy will yield the best performance without shuffling data?" The other options involve shuffling, either manually or through AQE.

spaceexplorer (Option: A)
Jan 24, 2024

The rest of the answers trigger shuffles.

ocaj90 (Option: D)
Nov 14, 2023

Obviously D. It allows you to control both the number of partitions and the final part-file size, which aligns with the requirements. Option B shuffles partitions, which is not allowed.

alexvno (Option: A)
Dec 18, 2023

- spark.sql.files.maxPartitionBytes: 128MB (The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.)

911land (Option: C)
Dec 20, 2023

C is the correct answer.

adenis (Option: C)
Jan 31, 2024

C is correct.

Curious76 (Option: D)
Feb 29, 2024

D is most suitable.

hal2401me (Option: D)
Mar 7, 2024

ChatGPT says D: This strategy directly addresses the desired part-file size by repartitioning the data. It avoids shuffling during narrow transformations. Recommended for achieving the desired part-file size without unnecessary shuffling.

vikram12apr (Option: A)
Mar 9, 2024

D is not correct, as it would create 2,048 target files of 0.5 MB each. Only A will do the job: it will read this file into 2 partitions (1 TB = 512*2 MB), and since we are not doing any shuffling (none is mentioned in the option), it will create that many part files, i.e. 2 part files.

hal2401me
Mar 12, 2024

Hey, 1 TB = 1,000 GB = 10^6 MB.
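For reference, the arithmetic behind the 2,048 figure depends on whether 1 TB is read as decimal or binary:

```python
# Decimal: 1 TB = 1,000 GB = 1,000,000 MB -> ~1,953 part files of 512 MB.
print(1_000_000 / 512)      # 1953.125

# Binary: 1 TiB = 1,024 GiB = 1,048,576 MiB -> exactly 2,048 part files.
print(1_048_576 / 512)      # 2048.0
```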