
Certified Data Engineer Professional Exam - Question 70


A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

Correct Answer: D

Ingesting the data, executing the narrow transformations, and then repartitioning to 2,048 partitions (1 TB ≈ 1,048,576 MB; 1,048,576 / 512 = 2,048) ensures that each partition will approximate the target part-file size of 512 MB when writing to Parquet. This strategy directly controls the number of output partitions without requiring any shuffling, which can be expensive in terms of performance.
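A rough PySpark sketch of that strategy (the paths are placeholders, and the 2,048 figure follows from the 1 TB / 512 MB arithmetic above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths; the question does not name the source or destination.
df = spark.read.json("/mnt/raw/events/")            # ~1 TB of JSON

transformed = df.select("id", "ts", "payload")      # example narrow transformations

# 1 TB / 512 MB target part-file size ~= 2,048 output partitions.
# Note that repartition() performs a full shuffle of the data.
(transformed
    .repartition(2048)
    .write
    .mode("overwrite")
    .parquet("/mnt/curated/events_parquet/"))
```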

Discussion

14 comments
aragorn_brego (Option: A)
Nov 21, 2023

This strategy aims to control the size of the output Parquet files without shuffling the data. The spark.sql.files.maxPartitionBytes parameter sets the maximum size of a partition that Spark will read. By setting it to 512 MB, you are aligning the read partition size with the desired output file size. Since the transformations are narrow (meaning they do not require shuffling), the number of partitions should roughly correspond to the number of output files when writing out to Parquet, assuming the data is evenly distributed and there is no data expansion during processing.
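A minimal sketch of this approach, assuming placeholder paths and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Align the size of each read partition with the target output file size of 512 MB.
spark.conf.set("spark.sql.files.maxPartitionBytes", 512 * 1024 * 1024)

df = spark.read.json("/mnt/raw/events/")              # placeholder source path

# Narrow transformations preserve the read partitioning (no shuffle).
cleaned = df.filter(F.col("payload").isNotNull())     # "payload" is a placeholder column

# With no shuffle in between, the number of output files roughly tracks the
# number of read partitions, each covering ~512 MB of input data.
cleaned.write.mode("overwrite").parquet("/mnt/curated/events_parquet/")
```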

Def21 (Option: D)
Jan 24, 2024

D is the only one that does the trick; note that we cannot shuffle. The wrong answers:
A: spark.sql.files.maxPartitionBytes is about reading, not writing (the maximum number of bytes to pack into a single partition when reading files; effective only for file-based sources such as Parquet, JSON and ORC).
B: spark.sql.adaptive.advisoryPartitionSizeInBytes takes effect while shuffling, and sorting does not make sense here (it is the advisory size in bytes of a shuffle partition during adaptive optimization, when spark.sql.adaptive.enabled is true; it applies when Spark coalesces small shuffle partitions or splits a skewed shuffle partition).
C: Would work, but spark.sql.adaptive.advisoryPartitionSizeInBytes would need a shuffle to take effect.
E: spark.sql.shuffle.partitions configures the number of partitions used when shuffling data for joins or aggregations; it is not about writing.
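One way to sanity-check which of these operations actually introduce a shuffle is to look for an Exchange node in the physical plan; a small illustrative check:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)   # toy DataFrame, just to inspect plans

# coalesce() is a narrow operation: no Exchange (shuffle) appears in the plan.
df.coalesce(8).explain()

# repartition() adds a round-robin Exchange, i.e. a full shuffle.
df.repartition(8).explain()

# Joins and aggregations are where spark.sql.shuffle.partitions and
# spark.sql.adaptive.advisoryPartitionSizeInBytes come into play.
df.groupBy((df.id % 10).alias("bucket")).count().explain()
```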

petrv (Option: C)
Nov 30, 2023

Here's a breakdown of the reasons:
spark.sql.adaptive.advisoryPartitionSizeInBytes: this parameter provides advisory partition sizes to the adaptive query execution framework, helping control partition sizes without triggering unnecessary shuffling.
coalesce(2048): coalescing to a specific number of partitions after the narrow transformations controls the number of output files without triggering a shuffle, which helps reach the target part-file size without the overhead of a full shuffle.
Setting a specific target: the strategy aims for a part-file size of 512 MB, which matches the requirement.
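A sketch of the pipeline this comment describes (placeholder paths; note that the advisory size only takes effect when AQE is coalescing or splitting shuffle partitions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", True)
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", 512 * 1024 * 1024)

df = spark.read.json("/mnt/raw/events/")      # placeholder source path

# coalesce() only merges existing partitions (narrow, no shuffle); it cannot
# increase the partition count, so 2,048 acts as an upper bound here.
(df.coalesce(2048)
   .write
   .mode("overwrite")
   .parquet("/mnt/curated/events_parquet/"))
```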

vctrhugo (Option: A)
Feb 7, 2024

This approach ensures that each partition will be approximately the target part-file size, which can improve the efficiency of the data write. It also avoids the need for a shuffle operation, which can be expensive in terms of performance.

sturcu (Option: B)
Oct 24, 2023

The number of output files saved to the disk is equal to the number of partitions in the Spark executors when the write operation is performed.
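A quick toy check of that relationship:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).repartition(4)    # toy DataFrame with 4 partitions

print(df.rdd.getNumPartitions())              # 4

df.write.mode("overwrite").parquet("/tmp/parts_demo/")
# The output directory holds roughly that many part-*.parquet files
# (plus a _SUCCESS marker).
```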

divingbell17 (Option: A)
Jan 1, 2024

A is correct. The question asks "Which strategy will yield the best performance without shuffling data?" The other options involve shuffling, either manually or through AQE.

spaceexplorer (Option: A)
Jan 24, 2024

The rest of the answers trigger shuffles.

ocaj90 (Option: D)
Nov 14, 2023

Obviously D. It allows you to control both the number of partitions and the final part-file size, which aligns with the requirements. Option B shuffles partitions, which is not allowed.

alexvno (Option: A)
Dec 18, 2023

- spark.sql.files.maxPartitionBytes: 128MB (The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.)

911land (Option: C)
Dec 20, 2023

C is the correct answer.

adenis (Option: C)
Jan 31, 2024

C is correct.

Curious76 (Option: D)
Feb 29, 2024

D is most suitable.

hal2401me (Option: D)
Mar 7, 2024

ChatGPT says D: This strategy directly addresses the desired part-file size by repartitioning the data. It avoids shuffling during narrow transformations. Recommended for achieving the desired part-file size without unnecessary shuffling.

vikram12apr (Option: A)
Mar 9, 2024

D is not correct, as it would create 2,048 target files of 0.5 MB each. Only A will do the job: it will read this file into 2 partitions (1 TB = 512*2 MB), and since we are not doing any shuffling (none is mentioned in the option), it will create that many part files, i.e. 2 part files.

hal2401me
Mar 12, 2024

Hey, 1 TB = 1,000 GB = 10^6 MB.
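For reference, the arithmetic behind the 2,048 figure depends on whether 1 TB is read as decimal or binary:

```python
# Decimal: 1 TB = 1,000 GB = 1,000,000 MB -> ~1,953 part files of 512 MB.
print(1_000_000 / 512)      # 1953.125

# Binary: 1 TiB = 1,024 GiB = 1,048,576 MiB -> exactly 2,048 part files.
print(1_048_576 / 512)      # 2048.0
```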