Professional Data Engineer Exam Questions

Professional Data Engineer Exam - Question 296


Your infrastructure team has set up an interconnect link between Google Cloud and the on-premises network. You are designing a high-throughput streaming pipeline to ingest streaming data from an Apache Kafka cluster hosted on-premises. You want to store the data in BigQuery with as little latency as possible. What should you do?

Correct Answer: C

To create a high-throughput streaming pipeline with minimal latency from an on-premises Apache Kafka cluster to BigQuery, the most suitable approach is to use Dataflow. Writing a pipeline that reads data directly from Kafka and writes it to BigQuery ensures minimal processing overhead and eliminates intermediate steps. This direct approach minimizes potential delays introduced by additional layers, such as message replication to Pub/Sub. Dataflow is capable of handling the Kafka stream efficiently, thus providing a streamlined and low-latency data ingestion pipeline.
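For concreteness, below is a minimal sketch of what such a direct Kafka-to-BigQuery Dataflow pipeline could look like with the Apache Beam Python SDK. The project, broker, topic, and table names are placeholders and not part of the question; treat this as an illustration of the shape of option C, not a production implementation.

```python
# Minimal sketch of option C: read directly from on-prem Kafka, write to BigQuery.
# All identifiers (project, bucket, broker, topic, table) are placeholders.
import json

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",                # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
)

with beam.Pipeline(options=options) as p:
    (
        p
        # Read over the interconnect from the on-premises Kafka brokers.
        | "ReadFromKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "kafka-onprem:9092"},  # placeholder broker
            topics=["events"],                                           # placeholder topic
        )
        # Kafka records arrive as (key, value) byte pairs; parse the value as JSON.
        | "ParseJson" >> beam.Map(lambda kv: json.loads(kv[1].decode("utf-8")))
        # Stream the rows into BigQuery.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:my_dataset.events",  # placeholder table
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```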

Discussion

9 comments
scaenruy (Option: C)
Jan 4, 2024

C. Use Dataflow, write a pipeline that reads the data from Kafka, and writes the data to BigQuery.

rahulvin (Option: C)
Dec 30, 2023

Dataflow has templates to read from Kafka. Other options are too complicated https://cloud.google.com/dataflow/docs/kafka-dataflow

Sofiia98
Jan 10, 2024

So, is the answer A? Why C?

Matt_108
Jan 13, 2024

Yeah, the answer is A. C requires you to develop the pipeline yourself and ensure minimal latency, which means you'd have to perform better than a pre-built template from Google... not really the case most of the time :)

saschak94
Jan 29, 2024

But option A introduces additional replication into Pub/Sub, and the question asks for minimal latency. In my opinion, reading from Kafka directly with Dataflow has lower latency than first replicating the messages to Pub/Sub and then subscribing to it with Dataflow.

Matt_108 (Option: A)
Jan 13, 2024

Option A, leverage the Dataflow template for Kafka: https://cloud.google.com/dataflow/docs/kafka-dataflow

AllenChen123
Jan 21, 2024

Agree. "Google provides a Dataflow template that configures a Kafka-to-BigQuery pipeline. The template uses the BigQueryIO connector provided in the Apache Beam SDK."

ML6
Feb 18, 2024

But it includes setting up a Kafka Connect bridge while an interconnect link has already been set up. https://cloud.google.com/dataflow/docs/kafka-dataflow#connect_to_an_external_cluster

T2Clubber (Option: C)
Feb 2, 2024

Option C makes more sense to me because of the "minimal latency as possible". I would have chosen option A if it were "less CODING as possible".

MaxNRG (Option: D)
Feb 26, 2024

Based on the key requirements highlighted:
• Interconnect link between GCP and on-prem Kafka
• High-throughput streaming pipeline
• Minimal latency
• Data to be stored in BigQuery

D - The key reasons this meets the requirements:
• Kafka Connect provides a reliable bridge to Pub/Sub over the interconnect
• Reading from Pub/Sub minimizes latency vs reading directly from Kafka
• Dataflow provides a high-throughput streaming engine
• Writing to BigQuery gives scalable data storage

By leveraging these fully managed GCP services over the dedicated interconnect, a low-latency streaming pipeline from on-prem Kafka into BigQuery can be implemented rapidly. Options A/B/C have higher latencies or custom code requirements, so they do not meet the minimal-latency criteria as well as option D.
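For comparison with the sketch above, here is a minimal sketch of the Dataflow stage this option D argument implies, assuming the Kafka Connect bridge has already replicated the topic into Pub/Sub. The subscription and table names are placeholders, not values from the question.

```python
# Sketch of the Dataflow stage of option D: read from Pub/Sub (fed by Kafka Connect),
# write to BigQuery. All identifiers are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # plus the usual Dataflow runner options

with beam.Pipeline(options=options) as p:
    (
        p
        # Messages arrive in Pub/Sub via the Kafka Connect bridge.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/kafka-events"  # placeholder
        )
        # Pub/Sub payloads are bytes; parse them as JSON rows.
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:my_dataset.events",  # placeholder
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```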

MaxNRG
Feb 26, 2024

Why not C: At first, option C (using a Dataflow pipeline to read directly from Kafka and write to BigQuery) seems reasonable. However, the key requirement stated in the question is minimal latency for the streaming pipeline. Reading directly from Kafka within Dataflow can add latency and processing overhead compared to reading from Pub/Sub, for a few reasons:
1. Pub/Sub acts as a buffer and handles scaling/reliability of streaming data automatically. This reduces the processing burden on the pipeline.
2. Network latency can be lower by leveraging Pub/Sub instead of making constant pull requests for data from Kafka within the streaming pipeline.
3. Any failures have to be handled within the pipeline code itself when reading directly from Kafka. With Pub/Sub, reliability is built-in.

MaxNRG
Feb 26, 2024

So in summary, while option C is technically possible, option D introduces Pub/Sub as a streaming buffer which reduces overall latency for the pipeline, allowing the key requirement of minimal latency to be better satisfied.

SanjeevRoy91
Mar 26, 2024

You are adding an intermediate hop (Pub/Sub) between on-prem Kafka and Dataflow. Why won't this add additional latency?

MaxNRG
Feb 26, 2024

Why choose option D over A? The key advantage with option D is that by writing a custom Dataflow pipeline rather than using a Google-provided template, there is more flexibility to customize performance tuning and optimization for the lowest latency. Some potential optimizations:
• Fine-tuning the number of workers and machine types to meet specific throughput targets
• Custom data parsing/processing logic if applicable
• Experimenting with autoscaling parameters or triggers

MaxNRG
Feb 26, 2024

The Google template may be easier to set up initially, but a custom pipeline provides more control over optimizations specifically for low latency requirements stated in the question. That being said, option A would still work reasonably well - but option D allows squeezing out that extra bit of performance if low millisecond latency is absolutely critical in the pipeline through precise tuning. So in summary, option A is easier to implement but option D provides more optimization flexibility for ultra low latency streaming requirements.
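The tuning knobs mentioned in the two comments above (worker count, machine type, autoscaling) map to standard Dataflow pipeline options. The snippet below is a hedged sketch of how they can be set with the Beam Python SDK; the specific values are illustrative placeholders, not recommendations from the question.

```python
# Sketch of Dataflow worker tuning via Beam pipeline options; values are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",                      # placeholder
    region="us-central1",                      # placeholder
    temp_location="gs://my-bucket/tmp",        # placeholder
    # Worker tuning knobs discussed above:
    num_workers=5,                             # initial worker count
    max_num_workers=50,                        # cap for autoscaling
    machine_type="n2-standard-4",              # worker machine type
    autoscaling_algorithm="THROUGHPUT_BASED",  # scale on throughput/backlog
)
```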

Moss2011 (Option: C)
Mar 1, 2024

From my point of view, the best option is C taking into account this doc: https://cloud.google.com/dataflow/docs/kafka-dataflow

JyoGCP (Option: C)
Feb 21, 2024

A vs C -- not sure which one would have lower latency. Points related to option C:

"Yes, Dataflow can read events from Kafka. Dataflow is a fully managed, serverless streaming analytics service that supports both batch and stream processing. It can read events from Kafka, process them, and write the results to a BigQuery table for further analysis."

"Dataflow has Kafka support, which was added to Apache Beam in 2016. Google provides a Dataflow template that configures a Kafka-to-BigQuery pipeline. The template uses the BigQueryIO connector provided in the Apache Beam SDK."

JyoGCP
Feb 22, 2024

Going with C

DarkLord2104
Feb 22, 2024

Final???

anushree09
Apr 11, 2024

per the text below at https://cloud.google.com/dataflow/docs/kafka-dataflow - "Alternatively, you might have an existing Kafka cluster that resides outside of Google Cloud. For example, you might have an existing workload that is deployed on-premises or in another public cloud."

Anudeep58 (Option: C)
Jul 6, 2024

Latency: Option C, with direct integration between Kafka and Dataflow, offers lower latency by eliminating intermediate steps.
Flexibility: Custom Dataflow pipelines (option C) provide more control over data processing and optimization compared to using a pre-built template.
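One example of the "more control" point: in a custom pipeline you can choose the BigQuery ingestion method yourself. The sketch below shows only the write step, assuming a recent Apache Beam Python SDK where WriteToBigQuery supports the BigQuery Storage Write API; the table name is a placeholder.

```python
# Hedged sketch of the write step only, assuming a recent Beam SDK with
# Storage Write API support in WriteToBigQuery; table name is a placeholder.
import apache_beam as beam

def bigquery_write_step():
    return beam.io.WriteToBigQuery(
        table="my-project:my_dataset.events",                      # placeholder table
        method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,   # lower-latency ingestion path
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
```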