Professional Data Engineer Exam Questions

Professional Data Engineer Exam - Question 268


You created a new version of a Dataflow streaming data ingestion pipeline that reads from Pub/Sub and writes to BigQuery. The previous version of the pipeline that runs in production uses a 5-minute window for processing. You need to deploy the new version of the pipeline without losing any data, creating inconsistencies, or increasing the processing latency by more than 10 minutes. What should you do?

Correct Answer: C

Draining the old pipeline ensures that it finishes processing all in-flight data before stopping, which prevents data loss and inconsistencies. After draining, you can then start the new pipeline, which will begin processing new data from where the old pipeline left off. This approach maintains a smooth transition between the old and new versions, minimizing latency increases and avoiding data gaps or overlaps.
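As a sketch of this drain-then-launch sequence (the job ID, region, and template path below are hypothetical, not from the question), the gcloud CLI supports draining a streaming job and checking its state before launching the replacement:

```shell
# Drain the running pipeline: it stops pulling new Pub/Sub messages but
# finishes processing all in-flight data (hypothetical job ID and region).
gcloud dataflow jobs drain 2024-01-05_old_pipeline_id --region=us-central1

# Check the job state; wait until it reports JOB_STATE_DRAINED
# before starting the new version.
gcloud dataflow jobs describe 2024-01-05_old_pipeline_id \
    --region=us-central1 --format="value(currentState)"

# Launch the new pipeline version (assuming it is packaged as a
# Dataflow Flex Template at a hypothetical GCS path).
gcloud dataflow flex-template run new-ingestion-pipeline \
    --region=us-central1 \
    --template-file-gcs-location=gs://my-bucket/templates/ingestion-v2.json
```

Because the drain itself typically completes within the open window (5 minutes here), the handoff stays inside the 10-minute latency budget.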

Discussion

7 comments
raaad (Option: C)
Jan 5, 2024

- Graceful data transition: Draining the old pipeline ensures it processes all existing data in its buffers and watermarks before shutting down, preventing data loss or inconsistencies.
- Minimal latency increase: The latency increase will be limited to the amount of time it takes to drain the old pipeline, typically within the acceptable 10-minute threshold.

AlizCert (Option: B)
Feb 11, 2024

I don't think C is correct, as draining will immediately fire the open window: "Draining can result in partially filled windows. In that case, if you restart the drained pipeline, the same window might fire a second time, which can cause issues with your data." https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#effects Maybe "A" means launching a replacement job? https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#Launching

d11379b
Mar 24, 2024

So why not B? It is the better choice: it saves intermediate state and is easy to use.

scaenruy (Option: C)
Jan 3, 2024

C. Drain the old pipeline, then start the new pipeline.

d11379b (Option: B)
Mar 24, 2024

I would choose B. As mentioned by AlizCert, a simple drain may cause problems. Dataflow snapshots save the state of a streaming pipeline, which lets you start a new version of your Dataflow job without losing state. Snapshots are useful for backup and recovery, testing and rolling back updates to streaming pipelines, and other similar scenarios.
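For reference, the snapshot route this commenter describes can also be sketched with the gcloud CLI (the job ID and region below are hypothetical); the new job would then be launched with its state restored from the snapshot, e.g. via the Java SDK's createFromSnapshot pipeline option:

```shell
# Snapshot the running streaming job's state; --snapshot-sources also
# captures unacknowledged Pub/Sub source data (hypothetical job ID).
gcloud dataflow snapshots create \
    --job-id=2024-01-05_old_pipeline_id \
    --snapshot-sources=true \
    --region=us-central1

# List snapshots for the job to obtain the snapshot ID needed when
# launching the replacement pipeline from that state.
gcloud dataflow snapshots list \
    --job-id=2024-01-05_old_pipeline_id \
    --region=us-central1
```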

Matt_108 (Option: C)
Jan 13, 2024

Option C: draining the old pipeline satisfies all the requirements.

hanoverquay (Option: C)
Mar 16, 2024

C option

Ouss_123 (Option: C)
Jun 12, 2024

- Draining the old pipeline ensures that it finishes processing all in-flight data before stopping, which prevents data loss and inconsistencies.
- After draining, you can start the new pipeline, which will begin processing new data from where the old pipeline left off.
- This approach maintains a smooth transition between the old and new versions, minimizing latency increases and avoiding data gaps or overlaps.

Other options, such as updating, snapshotting, or canceling, might not provide the same level of consistency and could lead to data loss or increased latency beyond the acceptable 10-minute window. Draining is the safest method to ensure a seamless transition.