Exam: Certified Data Engineer Professional
Question 132

A data ingestion task requires a 1 TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize and Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

    Correct Answer: D

    To hit the 512 MB target part-file size, the strategy is to ingest the data, execute the narrow transformations, and then repartition to 2,048 partitions before writing to Parquet. The partition count comes directly from the target size: 1 TB divided by 512 MB is 2,048. With the data spread evenly across 2,048 partitions, each written part file should come out at roughly 512 MB, and beyond this single repartition no additional shuffling is required.
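
    A minimal PySpark sketch of this approach. The source path, output path, and column names are hypothetical placeholders, and the "narrow transformations" are assumed to be simple filters and projections:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

    # Ingest the raw JSON dataset (path is a placeholder for illustration).
    df = spark.read.json("/mnt/raw/events")

    # Narrow transformations only: filters and column projections do not shuffle.
    cleaned = (
        df.filter(F.col("event_type").isNotNull())
          .select("event_id", "event_type", "event_ts", "payload")
    )

    # 1 TB / 512 MB = 2,048 target part files; repartition performs one shuffle
    # that redistributes the rows evenly across 2,048 partitions before the write.
    cleaned.repartition(2048).write.mode("overwrite").parquet("/mnt/curated/events_parquet")
    ```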

Discussion
03355a2 | Option: A

best performance without shuffling data

hpkr | Option: D

option D

Freyr | Option: D

Correct Answer D: Repartition to 2,048 partitions and write to Parquet. This option directly controls the number of output files by repartitioning the data into 2,048 partitions, since 1 TB at 512 MB per file works out to roughly 2,048 files. Repartitioning does involve a shuffle, but it is a deliberate shuffle designed to achieve a specific partitioning that benefits the write. After repartitioning, the data is written to Parquet, with each file expected to be approximately 512 MB if the data is uniformly distributed across partitions.
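
As an illustration of the arithmetic in this comment, here is a small hedged helper (the function name and values are illustrative, not from the exam) that derives the partition count from the input size and the target part-file size:

```python
import math

def target_partitions(dataset_bytes: int, target_file_bytes: int) -> int:
    """Partition count needed so each written part file lands near the
    target size, assuming rows are roughly uniform in size."""
    return max(1, math.ceil(dataset_bytes / target_file_bytes))

# 1 TB of data with 512 MB target part files -> 2,048 partitions.
one_tb = 1024 ** 4
target = 512 * 1024 ** 2
print(target_partitions(one_tb, target))  # 2048
```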