Exam: Certified Data Engineer Professional
Question 76

A data pipeline uses Structured Streaming to ingest data from Apache Kafka to Delta Lake. The data is stored in a bronze table and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline is deployed, the data engineering team notices latency issues during certain times of the day.

A senior data engineer updates the Delta table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use these additional metadata fields to diagnose the transient processing delays.
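A minimal sketch of what the updated ingestion logic might look like. The broker address, topic name, checkpoint path, table name, and the new column name ingest_time are assumptions for illustration; the question does not specify them.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read from Kafka; the Kafka source exposes key, value, timestamp, topic, and partition columns.
bronze_stream = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
        .option("subscribe", "events")                      # placeholder topic
        .load()
        .select(
            "key",
            "value",
            "timestamp",                                     # Kafka-generated timestamp (already ingested)
            "topic",                                         # new metadata field
            "partition",                                     # new metadata field
            F.current_timestamp().alias("ingest_time"),      # Spark-recorded timestamp, new field
        )
)

# Append to the bronze Delta table, allowing the new columns to be added to the schema.
(bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/bronze")     # placeholder path
    .option("mergeSchema", "true")
    .outputMode("append")
    .toTable("bronze"))
```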

Which limitation will the team face while diagnosing this problem?

    Correct Answer: A

When the schema of a Delta table is updated to include new fields, those fields are populated only for records ingested after the schema change; for records already stored in the table they are NULL and cannot be retroactively computed. The new metadata fields (the Spark-recorded ingestion timestamp, Kafka topic, and partition) will therefore be available only for data ingested after the update, not for the historical data, limiting the diagnostic scope to newly ingested records.
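A short sketch of the limitation, using the assumed table and column names from the ingestion sketch above: every record written before the schema change has NULL in the new fields, so they cannot explain past latency.

```python
# Count how many bronze records do and do not carry the new metadata.
# Assumes `spark` is an active SparkSession and the table is named "bronze".
spark.sql("""
    SELECT
        count_if(ingest_time IS NULL)     AS records_without_new_metadata,  -- historic rows
        count_if(ingest_time IS NOT NULL) AS records_with_new_metadata      -- rows ingested after the change
    FROM bronze
""").show()
```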

Discussion
dmov (Option: A)

Looks like A to me. Does anyone think otherwise?

vctrhugo (Option: A)

When the schema of a Delta table is updated to include new fields, these fields will only be populated for new records ingested after the schema update. The new fields will not be retroactively computed for historic records already stored in the Delta table. Therefore, the additional metadata fields (current timestamp, Kafka topic, and partition) will not exist in the historic data, limiting the scope of the diagnosis to new data ingested after the schema update.