
Certified Data Engineer Professional Exam - Question 17


A production workload incrementally applies updates from an external Change Data Capture (CDC) feed to a Delta Lake table as an always-on Structured Streaming job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. A recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.

Which of the following likely explains these smaller file sizes?

Correct Answer: A

Databricks has likely autotuned to a smaller target file size to reduce the duration of MERGE operations. With Auto Optimize and Auto Compaction enabled, Databricks applies workload-based autotuning to tables that are frequently rewritten by MERGE, targeting much smaller files because rewriting many small files on each update is far cheaper than rewriting 1 GB files. The observed file sizes under 64 MB, despite each partition containing at least 1 GB of data and the table exceeding 10 TB, point to this workload-based tuning: size-based autotuning alone would target 1 GB files for a table this large.
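Workload-based autotuning can also be requested explicitly via a table property. The following is a minimal PySpark sketch, assuming a hypothetical table named `cdc_target`; `delta.tuneFileSizesForRewrites` is the documented property for MERGE-heavy tables, and Databricks may also set it automatically when it detects frequent rewrites.

```python
# Minimal sketch (PySpark on Databricks); the table name `cdc_target`
# is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Opt the table in to workload-based file-size autotuning explicitly;
# Databricks can also infer this from a history of MERGE/UPDATE/DELETE.
spark.sql("""
    ALTER TABLE cdc_target
    SET TBLPROPERTIES ('delta.tuneFileSizesForRewrites' = 'true')
""")

# Inspect the effective table properties.
spark.sql("SHOW TBLPROPERTIES cdc_target").show(truncate=False)
```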

Discussion

12 comments
cotardo2077 (Option: A)
Sep 5, 2023

https://docs.databricks.com/en/delta/tune-file-size.html#autotune-table 'Autotune file size based on workload'

Eertyy (Option: E)
Aug 30, 2023

E is the right answer.

Eertyy
Sep 21, 2023

Correction: option A is the correct answer, as it is the likely explanation for the smaller file sizes.

Jay_98_11 (Option: A)
Jan 13, 2024

A is correct

PrashantTiwari (Option: A)
Feb 9, 2024

The target file size is based on the current size of the Delta table. For tables smaller than 2.56 TB, the autotuned target file size is 256 MB. For tables with a size between 2.56 TB and 10 TB, the target size will grow linearly from 256 MB to 1 GB. For tables larger than 10 TB, the target file size is 1 GB. Correct answer is A
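The schedule quoted above amounts to a simple linear interpolation. The sketch below is illustrative only (not Databricks' implementation) and shows why size-based autotuning alone would target 1 GB files for this 10+ TB table, so the sub-64 MB files must come from workload-based tuning instead.

```python
# Illustrative only: the size-based autotune schedule quoted above,
# expressed as a linear interpolation. Not Databricks' actual code.
def autotuned_target_file_size_mb(table_size_tb: float) -> float:
    """Target file size in MB for a Delta table of the given size in TB."""
    if table_size_tb < 2.56:
        return 256.0
    if table_size_tb >= 10.0:
        return 1024.0  # 1 GB cap for tables of 10 TB and above
    # Grow linearly from 256 MB at 2.56 TB to 1024 MB at 10 TB.
    fraction = (table_size_tb - 2.56) / (10.0 - 2.56)
    return 256.0 + fraction * (1024.0 - 256.0)

print(autotuned_target_file_size_mb(5.0))    # ~508 MB
print(autotuned_target_file_size_mb(12.0))   # 1024 MB, i.e. 1 GB
```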

azurearch (Option: A)
Sep 9, 2023

A is correct answer

sturcu (Option: A)
Oct 11, 2023

Correct

sen411 (Option: E)
Oct 21, 2023

E is the right answer, because the question asks why there are small files.

BIKRAM063 (Option: A)
Nov 2, 2023

Auto Optimize targets smaller files (Auto Compaction's default output size is 128 MB) to facilitate quick MERGE operations.
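For context, the Auto Optimize features mentioned in the question are controlled by two table properties. A minimal sketch, reusing the hypothetical `cdc_target` table (128 MB is Auto Compaction's default output file size):

```python
# Minimal sketch: enable the two documented Auto Optimize features on a
# hypothetical table. Auto Compaction coalesces small files into files
# of roughly 128 MB by default.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    ALTER TABLE cdc_target
    SET TBLPROPERTIES (
      'delta.autoOptimize.optimizeWrite' = 'true',
      'delta.autoOptimize.autoCompact' = 'true'
    )
""")
```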

kz_data (Option: A)
Jan 10, 2024

correct answer is A

AziLa (Option: A)
Jan 21, 2024

correct ans is A

RiktRikt007 (Option: A)
Feb 10, 2024

How is A correct? While Databricks does have autotuning capabilities, it primarily considers the table size. In this case, the table is over 10 TB, which would typically lead to a target file size of 1 GB, not under 64 MB.

imatheushenrique (Option: A)
Jun 1, 2024

One of the purposes of an OPTIMIZE execution is the gain in MERGE operations, so: A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations.
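To make the scenario concrete, the workload in the question is an always-on stream that MERGEs CDC changes into the table. A minimal sketch of that pattern follows; the table name, source path, and `id` join key are all hypothetical. It is exactly this steady stream of MERGEs that triggers workload-based autotuning toward small files.

```python
# Minimal sketch of the question's workload: a Structured Streaming job
# applying CDC updates to a Delta table via MERGE in foreachBatch.
# Table name, path, and join key are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def upsert_batch(microbatch_df, batch_id):
    # MERGE each micro-batch of CDC records into the target table.
    target = DeltaTable.forName(spark, "cdc_target")
    (target.alias("t")
           .merge(microbatch_df.alias("s"), "t.id = s.id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream
      .format("delta")
      .load("/mnt/cdc_feed")          # hypothetical CDC feed location
      .writeStream
      .foreachBatch(upsert_batch)
      .start())
```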