Professional Data Engineer Exam - Question 282


You are using a Dataflow streaming job to read messages from a message bus that does not support exactly-once delivery. Your job then applies some transformations, and loads the result into BigQuery. You want to ensure that your data is being streamed into BigQuery with exactly-once delivery semantics. You expect your ingestion throughput into BigQuery to be about 1.5 GB per second. What should you do?

Correct Answer: A

To achieve exactly-once delivery semantics when streaming data into BigQuery, use the BigQuery Storage Write API. The API is designed for high-throughput, low-latency ingestion and provides mechanisms (stream offsets) to prevent duplicate writes, which is what exactly-once delivery requires. A regional target BigQuery table is recommended because it can provide better performance and lower latency when the Dataflow job runs in the same region, which suits the expected ingestion throughput of about 1.5 GB per second.
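For illustration, here is a minimal Apache Beam (Python SDK) sketch of such a pipeline. It assumes the message bus is exposed through a Pub/Sub subscription and that messages are JSON; the project, subscription, table, and schema names are placeholders. In Beam, selecting the STORAGE_WRITE_API write method uses the Storage Write API with exactly-once streaming semantics.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Streaming pipeline options (runner, region, and project flags omitted).
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            # Assumption: the message bus is consumed via a Pub/Sub subscription.
            | "ReadFromBus" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/my-sub")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            # ... apply the job's transformations here ...
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="my-project:my_dataset.my_regional_table",       # placeholder
                schema="id:STRING,event_ts:TIMESTAMP,payload:STRING",  # placeholder
                # Storage Write API: Beam's exactly-once streaming write path.
                method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```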

Discussion

19 comments
AlizCert (Option: B)
Jun 5, 2024

It should be B. The Storage Write API has "3 GB per second throughput in multi-regions; 300 MB per second in regions".

rajshiv
Apr 13, 2025

B is incorrect. Multiregional tables are not supported by the Storage Write API for exactly-once delivery. This option is invalid.

raaad (Option: A)
Jan 10, 2024

- BigQuery Storage Write API: This API is designed for high-throughput, low-latency writing of data into BigQuery. It also provides tools to prevent data duplication, which is essential for exactly-once delivery semantics.
- Regional table: Choosing a regional location for the BigQuery table could potentially provide better performance and lower latency, as it would be closer to the Dataflow job if they are in the same region.

AllenChen123
Jan 25, 2024

Agree. https://cloud.google.com/bigquery/docs/write-api#advantages

SamuelTsch (Option: B)
Nov 1, 2024

Looking at this documentation: https://cloud.google.com/bigquery/quotas#write-api-limits. 3 GB/s in multi-regions; 300 MB/s in regions.

Siahara (Option: A)
Feb 6, 2025

A. Implement the BigQuery Storage Write API and guarantee that the target BigQuery table is regional. Here's the breakdown of why Option A is superior:

Exactly-once delivery: The BigQuery Storage Write API intrinsically supports exactly-once delivery using stream offsets. This guarantees that each message is written to BigQuery exactly one time, even in the case of retries, despite the lack of native exactly-once support in your message bus.

High throughput: The Storage Write API is optimized for high-throughput scenarios. It can handle the expected ingestion throughput of 1.5 GB per second.

Regional tables: Using a regional BigQuery table aligns with best practices when utilizing the Storage Write API, as it helps to minimize latency and reduce potential cross-region communication costs.
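To make the stream-offset point concrete, here is a small conceptual Python sketch (deliberately not the real google-cloud-bigquery-storage client) showing why offset-based appends make retries idempotent: re-sending a batch at an already-committed offset is acknowledged without writing duplicate rows.

```python
class ConceptualWriteStream:
    """Toy model of a write stream that deduplicates by offset.

    This only mimics the idea behind the Storage Write API's exactly-once
    semantics; it is not the real client library.
    """

    def __init__(self):
        self.rows = []  # committed rows, in offset order

    def append_rows(self, offset, batch):
        # The writer states where the batch should land in the stream.
        if offset < len(self.rows):
            # Offset already committed: this is a retry of a batch that
            # actually succeeded, so acknowledge it without re-applying.
            return "ALREADY_EXISTS"
        if offset > len(self.rows):
            # A gap means an earlier batch is missing and must be retried first.
            return "OUT_OF_RANGE"
        self.rows.extend(batch)
        return "OK"


stream = ConceptualWriteStream()
print(stream.append_rows(0, [{"id": 1}, {"id": 2}]))  # OK
print(stream.append_rows(0, [{"id": 1}, {"id": 2}]))  # ALREADY_EXISTS (retry, no duplicate)
print(stream.append_rows(2, [{"id": 3}]))             # OK
print(len(stream.rows))                               # 3 rows total, no duplicates
```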

gord_nat
Mar 27, 2025

Has to be multi-regional (B). Max throughput for regional is currently only 300 MB/s: https://cloud.google.com/bigquery/quotas

Ed_Kim (Option: A)
Jan 3, 2024

Voting on A

Smakyel79
Jan 7, 2024

This option leverages the BigQuery Storage Write API's capability for exactly-once delivery semantics and a regional table setting that can meet compliance and data locality needs without impacting the delivery semantics. The BigQuery Storage Write API is more suitable for your high-throughput requirements compared to the BigQuery Streaming API.

HermanTan
Sep 30, 2024

To ensure that analysts do not see customer data older than 30 days while minimizing cost and overhead, the best option is: B. Use a timestamp range filter in the query to fetch the customer’s data for a specific range. This approach directly addresses the issue by filtering out data older than 30 days at query time, ensuring that only the relevant data is retrieved. It avoids the overhead and potential delays associated with garbage collection and manual deletion processes

CloudAdrMX (Option: B)
Nov 28, 2024

According to this documentation, it's B: https://cloud.google.com/bigquery/quotas#write-api-limits

NatyNogas (Option: A)
Dec 1, 2024

- Choosing a regional target BigQuery table ensures that data is stored redundantly in a single region, providing high availability and durability.

m_a_p_s (Option: B)
Dec 12, 2024

Streamed into BigQuery with exactly-once delivery semantics >>> Storage Write API. Ingestion throughput into BigQuery of about 1.5 GB per second >>> multiregional (check the throughput rate here: https://cloud.google.com/bigquery/quotas#write-api-limits).

himadri1983 (Option: B)
Dec 14, 2024

3 GB per second throughput in multi-regions; 300 MB per second in regions: https://cloud.google.com/bigquery/quotas#write-api-limits

Pime13 (Option: A)
Jan 6, 2025

https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery: "For new projects, we recommend using the BigQuery Storage Write API instead of the tabledata.insertAll method. The Storage Write API has lower pricing and more robust features, including exactly-once delivery semantics." See also https://cloud.google.com/bigquery/docs/write-api#advantages

Matt_108 (Option: A)
Jan 13, 2024

Option A

hanoverquay (Option: D)
Mar 15, 2024

option D

BennyXu
Apr 7, 2024

you are wrong!!!!!!!!!!!!

imazy (Option: A)
Nov 10, 2024

The Write API supports 2.5 GB/sec throughput and supports exactly-once delivery semantics (https://cloud.google.com/bigquery/docs/write-api#connections), whereas with the streaming API duplicates can occur and need to be removed manually (https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery#dataavailability).
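As a side note on the legacy streaming path: when duplicates do land in a table, the documented cleanup is a query that keeps one row per key. A hedged sketch using the google-cloud-bigquery Python client, with the table name and key column as placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Keep exactly one row per `id`; table and column names are placeholders.
dedup_sql = """
SELECT * EXCEPT (row_number)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id) AS row_number
  FROM `my-project.my_dataset.my_table`
)
WHERE row_number = 1
"""

for row in client.query(dedup_sql).result():
    print(dict(row))
```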

hussain.sain (Option: B)
Dec 27, 2024

B is correct. When aiming for exactly-once delivery in a Dataflow streaming job, the key is to use the BigQuery Storage Write API, as it provides the capability to handle large-scale data ingestion with the correct semantics, including exactly-once delivery.

juliorevk (Option: B)
Jan 31, 2025

- BigQuery Storage Write API: This API is designed for high-throughput, low-latency writing of data into BigQuery. It also provides tools to prevent data duplication, which is essential for exactly-once delivery semantics.
- Multiregional table: The multiregional table ensures that your data is highly available and can be streamed into BigQuery across multiple regions. It is better suited for high-throughput and low-latency workloads, as it provides distributed write capabilities that can handle large data volumes, such as the 1.5 GB per second you expect to stream.

gabbferreira (Option: A)
Apr 23, 2025

It’s A

Aungshuman (Option: B)
May 1, 2025

As per the GCP documentation, multi-region meets the throughput requirement.

aditya_ali (Option: A)
May 5, 2025

You need write throughput of 1.5 GB per second. Given the high throughput requirement, a regional BigQuery table (Option A) is generally preferred over a multi-regional table due to its potentially lower write latency. Simple.