Exam: Certified Data Engineer Associate
Question 27

In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which of the following two approaches is used by Spark to record the offset range of the data being processed in each trigger?

    Correct Answer: A

    Spark Structured Streaming uses checkpointing and write-ahead logs to reliably track the exact progress of the processing and handle any kind of failure. Checkpointing saves the state of the streaming computation periodically to a durable storage system so that in case of failure, processing can be resumed from the last checkpoint. Write-ahead logs record the offset range of the data being processed, which allows for recovery by replaying the logs from the last checkpoint. These methods ensure fault tolerance and exactly-once semantics.
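
For illustration, here is a minimal sketch (paths and the app name are hypothetical) of where this is configured in practice: the checkpointLocation option names the durable directory under which Spark records each trigger's offset range and write-ahead log.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

# Built-in "rate" test source: emits rows on a schedule, handy for demos
# because its offsets are simple increasing counters.
stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# checkpointLocation is where the engine persists the offset range of each
# trigger (the write-ahead log lives under this directory too), which is
# what makes restart-and-resume after a failure possible.
query = (stream.writeStream
         .format("parquet")
         .option("path", "/tmp/output/checkpoint-demo")
         .option("checkpointLocation", "/tmp/checkpoints/checkpoint-demo")
         .start())
```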

Discussion
XiltroX (Option: A)

E is a partial answer; together, A and E describe the full mechanism. Structured Streaming uses these two methods to ensure fault tolerance and an exactly-once guarantee for the data.

juadaves (Option: A)

The answer is checkpointing and idempotent sinks. How does Structured Streaming achieve end-to-end fault tolerance? First, it uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval. Next, the streaming sinks are designed to be idempotent; that is, multiple writes of the same data (as identified by the offset) do not result in duplicates being written to the sink. Taken together, replayable data sources and idempotent sinks allow Structured Streaming to ensure end-to-end, exactly-once semantics under any failure condition.
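
To make the idempotent-sink idea concrete, here is a minimal sketch (paths and the app name are hypothetical) using foreachBatch: the batch_id Spark passes in is stable across retries of the same micro-batch, so keying the write on it makes a replayed batch overwrite its own earlier output instead of appending duplicates.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idempotent-sink-demo").getOrCreate()

stream = spark.readStream.format("rate").load()

def write_idempotent(batch_df, batch_id):
    # batch_id is stable across retries of the same micro-batch, so a
    # replayed batch rewrites the same path rather than duplicating rows.
    (batch_df.write
     .mode("overwrite")
     .parquet(f"/tmp/output/idempotent-demo/batch={batch_id}"))

query = (stream.writeStream
         .foreachBatch(write_idempotent)
         .option("checkpointLocation", "/tmp/checkpoints/idempotent-demo")
         .start())
```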

Atnafu (Option: A)

A. Checkpointing and Write-ahead Logs. Checkpointing is a process of periodically saving the state of the streaming computation to a durable storage system. This ensures that if the streaming computation fails, it can be restarted from the last checkpoint and resume processing from where it left off. Write-ahead logs are a type of log that records all changes made to a dataset. This allows Structured Streaming to recover from failures by replaying the write-ahead logs from the last checkpoint.
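
To see the resume-from-checkpoint behaviour described here, a small sketch (hypothetical paths): restarting a query with the same checkpointLocation picks up from the last recorded offsets rather than reprocessing the stream from the beginning.

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("restart-demo").getOrCreate()

def start_query():
    stream = spark.readStream.format("rate").load()
    return (stream.writeStream
            .format("parquet")
            .option("path", "/tmp/output/restart-demo")
            .option("checkpointLocation", "/tmp/checkpoints/restart-demo")
            .start())

q = start_query()
time.sleep(5)
q.stop()            # simulate a shutdown or failure

# Same checkpointLocation: the engine resumes from the last committed
# offsets instead of starting over.
q = start_query()
```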

chays (Option: A)

Answer is A: The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=The%20engine%20uses%20checkpointing%20and,being%20processed%20in%20each%20trigger.

squidy24 (Option: A)

The answer is A "Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. ... Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs." - Apache Spark Structured Streaming Programming Guide

benni_ale (Option: A)

1. Checkpointing and write-ahead logs record the offset range of the data being processed.
2. Checkpointing and idempotent sinks achieve end-to-end fault tolerance.

vctrhugo (Option: A)

A. Checkpointing and Write-ahead Logs To reliably track the exact progress of processing and handle failures in Spark Structured Streaming, Spark uses both checkpointing and write-ahead logs. Checkpointing allows Spark to periodically save the state of the streaming application to a reliable distributed file system, which can be used for recovery in case of failures. Write-ahead logs are used to record the offset range of data being processed, ensuring that the system can recover and reprocess data from the last known offset in the event of a failure.

akk_1289 (Option: A)

A: The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. In the link, search for "The engine uses" and you'll find the answer. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=The%20engine%20uses%20checkpointing%20and,being%20processed%20in%20each%20trigger.

mimzzz

Why I think both A and E are correct: https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-streaming-exactly-once#:~:text=Use%20idempotent%20sinks

ZSun

Spark handles streaming failures through: 1. tracking the progress/offset (this is option A); 2. fixing the failure (this is option E). But the question asks which "two approaches ... record the offset range", so the answer is A.

prasioso (Option: A)

Answer is A. From Spark documentation: Every streaming source is assumed to have offsets to track the read position in the stream. The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger.
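
Those recorded offsets are visible from the driver, too. A short sketch (console sink, hypothetical paths): each completed trigger's progress report exposes the startOffset/endOffset per source.

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("progress-demo").getOrCreate()

query = (spark.readStream.format("rate").load()
         .writeStream.format("console")
         .option("checkpointLocation", "/tmp/checkpoints/progress-demo")
         .start())

time.sleep(5)  # let a few triggers complete
progress = query.lastProgress  # report for the most recent trigger
if progress:
    for src in progress["sources"]:
        # the offset range the engine recorded for this micro-batch
        print(src["startOffset"], "->", src["endOffset"])
query.stop()
```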

3fbc31b (Option: A)

A is the correct answer.

bita7 (Option: E)

The answer is checkpointing and idempotent sinks (E). How does Structured Streaming achieve end-to-end fault tolerance?
• First, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.
• Next, the streaming sinks are designed to be idempotent; that is, multiple writes of the same data (as identified by the offset) do not result in duplicates being written to the sink.
Taken together, replayable data sources and idempotent sinks allow Structured Streaming to ensure end-to-end, exactly-once semantics under any failure condition.

SerGrey (Option: A)

The correct answer is A.

Majjjj (Option: E)

E. Checkpointing and idempotent sinks are the two approaches Spark uses to record the offset range of the data processed in each trigger, letting Structured Streaming reliably track the exact progress of the processing so it can handle any kind of failure by restarting and/or reprocessing. Checkpointing periodically saves the state of the streaming query to a fault-tolerant storage system, while idempotent sinks ensure that data can be written multiple times to the sink without affecting the final result.

Majjjj

Answer is A: The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=The%20engine%20uses%20checkpointing%20and,being%20processed%20in%20each%20trigger.

4be8126 (Option: E)

The answer is E. Checkpointing and Idempotent Sinks are used by Spark to record the offset range of the data being processed in each trigger. Checkpointing helps to recover the query from the point of failure and Idempotent Sinks ensure that the output of a streaming query is consistent even in the face of failures and retries.

XiltroX

Wrong answer. Please check the official Databricks documentation to confirm that the right answer is A.