Certified Data Engineer Professional Exam Questions

Certified Data Engineer Professional Exam - Question 13


An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.

For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.

Which solution meets these requirements?

Correct Answer: E

The solution should involve ingesting all log information into a bronze table to preserve the raw change data capture (CDC) logs. Using the MERGE INTO operation will then allow the system to insert, update, or delete the most recent entry for each pk_id into a silver table, effectively reconstructing the current table state. This approach ensures the bronze table maintains a full history of changes for auditing purposes, while the silver table provides the most recent values for analytical needs.
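A minimal PySpark sketch of this pattern, assuming a Databricks notebook where spark is predefined; the table names (cdc_bronze, cdc_silver), the change_type and changed_at fields, and the source path are illustrative assumptions, with only pk_id coming from the question:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

# Bronze: append every raw CDC record as-is, preserving the full audit history.
raw_logs = spark.read.json("s3://example-bucket/cdc-logs/")  # hypothetical path
raw_logs.write.format("delta").mode("append").saveAsTable("cdc_bronze")

# Each pk_id may have changed several times during the hour, so keep only the
# most recent change per key from this batch.
latest = (
    raw_logs
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("pk_id").orderBy(F.col("changed_at").desc())))
    .filter("rn = 1")
    .drop("rn")
)

# Silver: MERGE the latest change per key into the current-state table.
silver = DeltaTable.forName(spark, "cdc_silver")
(silver.alias("t")
    .merge(latest.alias("s"), "t.pk_id = s.pk_id")
    .whenMatchedDelete(condition="s.change_type = 'delete'")
    .whenMatchedUpdateAll(condition="s.change_type != 'delete'")
    .whenNotMatchedInsertAll(condition="s.change_type != 'delete'")
    .execute())
```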

Discussion

8 comments
Eertyy (Option: B)
Aug 27, 2023

The answer given is correct

Eertyy
Sep 21, 2023

I want to correct my response. It seems the right answer is Option D: it leverages Delta Lake's built-in capabilities for handling CDC data. It is designed to efficiently capture, process, and propagate changes, making it a more robust and scalable solution, particularly for large-scale data scenarios with frequent updates and auditing requirements.

Starvosxant
Oct 9, 2023

Databricks is NOT able to process CDC alone. It needs an intermediary tool to land the changes in object storage and then ingest them. So how can it be D?

sturcu
Oct 11, 2023

Option D states: process CDC data from an external system, i.e., with Delta CDF.

sturcu (Option: E)
Oct 11, 2023

E is correct

hamzaKhribi (Option: D)
Dec 2, 2023

For me the answer is D. The question states that the CDC logs are emitted to external storage, meaning they can be ingested into the bronze layer into a table with CDF enabled. In this case we let Databricks handle the complexity of tracking changes and only worry about data quality. With CDF enabled, Databricks will already maintain the audit data for us via table_changes, with the pre-image and post-image, and also give us the last updated value for our use case. Here is a similar example: https://www.databricks.com/blog/2021/06/09/how-to-simplify-cdc-with-delta-lakes-change-data-feed.html
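For reference, this is roughly what Delta Change Data Feed exposes once a Delta table has it enabled. The table name and columns are illustrative assumptions, and, as the replies below point out, CDF only tracks changes made to a Delta table itself, not raw CDC files sitting in object storage:

```python
# Hypothetical Delta table created with CDF enabled from the start, so the
# change feed is available from version 0. CDF only records changes made to
# the Delta table itself, and only from the version where it was enabled.
spark.sql("""
    CREATE TABLE IF NOT EXISTS cdc_silver (pk_id BIGINT, value STRING)
    USING DELTA
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Query the change feed. Each row carries _change_type (insert, delete,
# update_preimage, update_postimage), _commit_version, and _commit_timestamp.
changes = spark.sql("SELECT * FROM table_changes('cdc_silver', 0)")
changes.show()
```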

spaceexplorer
Jan 24, 2024

That article shows exactly why D is not right, since "CDF captures changes only from a Delta table and is only forward-looking once enabled."

a560fe1 (Option: E)
Jan 9, 2024

CDF captures changes only from a Delta table and is only forward-looking once enabled. The CDC logs are being written to object storage, so you would need to ingest those and merge them into downstream tables; hence the answer is E.

spaceexplorer (Option: E)
Jan 24, 2024

The answer is E

kz_data (Option: E)
Jan 10, 2024

Answer E is correct. Because the CDC captured from the external database may contain multiple entries for the same pk_id (key) due to multiple updates within the processed hour, we need to take the most recent update for each pk_id and then MERGE it into the silver table.
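This dedup-then-merge step can also be written as a single SQL statement run from a notebook; the table and column names other than pk_id (including cdc_bronze_hourly_batch, standing in for just this hour's ingested records) are illustrative assumptions:

```python
# One-statement version: pick the latest change per pk_id in the USING
# subquery, then apply insert/update/delete against the silver table.
spark.sql("""
    MERGE INTO cdc_silver AS t
    USING (
      SELECT * EXCEPT (rn) FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY pk_id ORDER BY changed_at DESC) AS rn
        FROM cdc_bronze_hourly_batch
      ) WHERE rn = 1
    ) AS s
    ON t.pk_id = s.pk_id
    WHEN MATCHED AND s.change_type = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND s.change_type != 'delete' THEN INSERT *
""")
```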

RafaelCFC (Option: E)
Jan 11, 2024

Complementing kz_data's response: be aware that the data being consumed is not a Databricks Change Data Feed object but CDC coming from somewhere else, that is, just regular data. So, indeed, it can't be processed without another tool.

imatheushenrique (Option: E)
Jun 1, 2024

E. This is the correct answer because it meets the requirements of maintaining a full record of all values that have ever been valid in the source system and recreating the current table state with only the most recent value for each record. The code ingests all log information into a bronze table, which preserves the raw CDC data as-is. Then it uses MERGE INTO to perform an upsert on a silver table, meaning it will insert new records or update or delete existing records based on the change type and the pk_id column. This way the silver table always reflects the current state of the source table, while the bronze table keeps the history of all changes.