Exam: Certified Data Engineer Professional
Question 132

A data ingestion task requires a 1 TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize and Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

    Correct Answer: D

    To hit the 512 MB target part-file size, the strategy is to ingest the data, execute the narrow transformations, and then repartition to 2,048 partitions before writing to Parquet. The partition count comes directly from the target size: 1 TB divided by 512 MB is 2,048. With the data spread evenly across 2,048 partitions, each written part file should come out at roughly 512 MB, and beyond this single repartition no additional shuffling is required.
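
    A minimal PySpark sketch of this approach. The source path, output path, and column names are hypothetical placeholders, and the "narrow transformations" are assumed to be simple filters and projections:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

    # Ingest the raw JSON dataset (path is a placeholder for illustration).
    df = spark.read.json("/mnt/raw/events")

    # Narrow transformations only: filters and column projections do not shuffle.
    cleaned = (
        df.filter(F.col("event_type").isNotNull())
          .select("event_id", "event_type", "event_ts", "payload")
    )

    # 1 TB / 512 MB = 2,048 target part files; repartition performs one shuffle
    # that redistributes the rows evenly across 2,048 partitions before the write.
    cleaned.repartition(2048).write.mode("overwrite").parquet("/mnt/curated/events_parquet")
    ```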

Discussion
03355a2 | Option: A

best performance without shuffling data

hpkr | Option: D

option D

Freyr | Option: D

Correct Answer D: Repartition to 2,048 partitions and write to Parquet. This option directly controls the number of output files by repartitioning the data into 2,048 partitions, since 1 TB at 512 MB per file works out to roughly 2,048 files. Repartitioning does involve a shuffle, but it is a deliberate shuffle designed to achieve a specific partitioning that benefits the write. After repartitioning, the data is written to Parquet, with each file expected to be approximately 512 MB if the data is uniformly distributed across partitions.
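
As an illustration of the arithmetic in this comment, here is a small hedged helper (the function name and values are illustrative, not from the exam) that derives the partition count from the input size and the target part-file size:

```python
import math

def target_partitions(dataset_bytes: int, target_file_bytes: int) -> int:
    """Partition count needed so each written part file lands near the
    target size, assuming rows are roughly uniform in size."""
    return max(1, math.ceil(dataset_bytes / target_file_bytes))

# 1 TB of data with 512 MB target part files -> 2,048 partitions.
one_tb = 1024 ** 4
target = 512 * 1024 ** 2
print(target_partitions(one_tb, target))  # 2048
```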