
Certified Data Engineer Professional Exam - Question 20


A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams.

The proposed directory structure is displayed below:

Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?

Correct Answer: E

When multiple Structured Streaming jobs write concurrently to a Delta table, each stream must maintain its own checkpoint directory. A checkpoint stores one query's progress (source offsets and commit metadata) along with any state, which is what lets that query recover consistently after a failure. If two streams share a checkpoint directory, they overwrite each other's progress information, because the checkpointing mechanism is not designed for concurrent access by multiple streams; the result can be data corruption and inconsistencies. The depicted checkpoint directory structure is therefore not valid: each stream should have its own checkpoint directory.
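The rule above can be sketched as follows. This is only an illustration, not code from the exam: the function names, the topic names in the usage note, and the layout of one checkpoint subdirectory per topic are all assumptions. `spark` is expected to be an existing `SparkSession` with the Kafka and Delta connectors available.

```python
def start_topic_to_bronze(spark, bootstrap_servers, topic, table_path, checkpoint_root):
    """Start one streaming query for one Kafka topic with its own checkpoint.

    The checkpoint path is derived from the topic name (a hypothetical
    convention), so two topics can never share a checkpoint directory.
    """
    checkpoint_path = f"{checkpoint_root.rstrip('/')}/{topic}"
    query = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
        .writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)  # unique per stream
        .start(table_path)  # both streams append to the same bronze table
    )
    return checkpoint_path, query


def start_all(spark, bootstrap_servers, topics, table_path, checkpoint_root):
    """Start one query per topic; refuse duplicates that would reuse a checkpoint."""
    if len(set(topics)) != len(topics):
        raise ValueError("duplicate topic would reuse a checkpoint directory")
    return {
        topic: start_topic_to_bronze(
            spark, bootstrap_servers, topic, table_path, checkpoint_root
        )
        for topic in topics
    }
```

A call such as `start_all(spark, "host:9092", ["orders", "payments"], "/bronze", "/checkpoints/bronze")` would start two independent queries that write to the same table but track their progress in separate directories.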

Discussion

7 comments
thxsgod
Option: E
Sep 7, 2023

Correct, E. Source: https://docs.databricks.com/en/optimizations/isolation-level.html#:~:text=If%20a%20streaming%20query%20using%20the%20same%20checkpoint%20location%20is%20started%20multiple%20times%20concurrently%20and%20tries%20to%20write%20to%20the%20Delta%20table%20at%20the%20same%20time.%20You%20should%20never%20have%20two%20streaming%20queries%20use%20the%20same%20checkpoint%20location%20and%20run%20at%20the%20same%20time.

Eertyy
Aug 30, 2023

answer is correct

sturcu
Option: E
Oct 11, 2023

E is correct. If the user wants a single checkpoint directory, they need to union the streams before writing.
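A minimal sketch of the union approach mentioned in this comment, assuming an existing `SparkSession` named `spark` with the Kafka and Delta connectors available. The function names and all paths are hypothetical. Because the union produces a single streaming query, a single checkpoint directory is appropriate here.

```python
def read_topic(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame for one Kafka topic."""
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )


def start_unioned_bronze_stream(
    spark, bootstrap_servers, topics, table_path, checkpoint_path
):
    """Union one streaming DataFrame per topic and start a single query."""
    streams = [read_topic(spark, bootstrap_servers, t) for t in topics]
    combined = streams[0]
    for other in streams[1:]:
        # The streams share a schema, so a positional union is safe.
        combined = combined.union(other)
    return (
        combined.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)  # one query, one checkpoint
        .start(table_path)
    )
```

The trade-off is that both topics now succeed or fail together as one query, whereas two separate queries with separate checkpoints can be restarted independently.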

kz_data
Option: E
Jan 10, 2024

E is correct

Jay_98_11
Option: E
Jan 13, 2024

correct E

svik
Option: B
May 10, 2024

It is not clear from the question whether year_week=2020_01 and year_week=2020_02 are used by stream 1 and stream 2 respectively. If the streams use a common parent checkpoint directory with individual subfolders for checkpointing, that should work fine. In that case the answer should be B.

Kill9
Jun 21, 2024

Those are table partitions. They are not used to build the checkpoint address. The address ends at /bronze.

imatheushenrique
Option: E
Jun 1, 2024

E. No; each of the streams needs to have its own checkpoint directory. The mapping between streams and checkpoint directories is one-to-one.