Professional Data Engineer Exam Questions

Professional Data Engineer Exam - Question 296


Your infrastructure team has set up an interconnect link between Google Cloud and the on-premises network. You are designing a high-throughput streaming pipeline to ingest streaming data from an Apache Kafka cluster hosted on-premises. You want to store the data in BigQuery with as little latency as possible. What should you do?

Correct Answer: C

To create a high-throughput streaming pipeline with minimal latency from an on-premises Apache Kafka cluster to BigQuery, the most suitable approach is to use Dataflow. Writing a pipeline that reads data directly from Kafka and writes it to BigQuery ensures minimal processing overhead and eliminates intermediate steps. This direct approach minimizes potential delays introduced by additional layers, such as message replication to Pub/Sub. Dataflow is capable of handling the Kafka stream efficiently, thus providing a streamlined and low-latency data ingestion pipeline.
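For concreteness, below is a minimal sketch of what such a direct Kafka-to-BigQuery Dataflow pipeline could look like with the Apache Beam Python SDK. The project, broker, topic, and table names are placeholders and not part of the question; treat this as an illustration of the shape of option C, not a production implementation.

```python
# Minimal sketch of option C: read directly from on-prem Kafka, write to BigQuery.
# All identifiers (project, bucket, broker, topic, table) are placeholders.
import json

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",                # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
)

with beam.Pipeline(options=options) as p:
    (
        p
        # Read over the interconnect from the on-premises Kafka brokers.
        | "ReadFromKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "kafka-onprem:9092"},  # placeholder broker
            topics=["events"],                                           # placeholder topic
        )
        # Kafka records arrive as (key, value) byte pairs; parse the value as JSON.
        | "ParseJson" >> beam.Map(lambda kv: json.loads(kv[1].decode("utf-8")))
        # Stream the rows into BigQuery.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:my_dataset.events",  # placeholder table
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```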

Discussion

9 comments
scaenruy (Option: C)
Jan 4, 2024

C. Use Dataflow, write a pipeline that reads the data from Kafka, and writes the data to BigQuery.

rahulvin (Option: C)
Dec 30, 2023

Dataflow has templates to read from Kafka. Other options are too complicated https://cloud.google.com/dataflow/docs/kafka-dataflow

Sofiia98
Jan 10, 2024

So, is the answer A? Why C?

Matt_108
Jan 13, 2024

Yeah, the answer is A. C requires you to develop the pipeline yourself and ensure minimal latency, which means you'd have to perform better than a pre-built template from Google... not really the case most of the time :)

saschak94
Jan 29, 2024

But option A introduces additional replication into Pub/Sub, and the question asks for minimal latency. In my opinion, reading from Kafka directly with Dataflow has lower latency than first replicating the messages to Pub/Sub and then subscribing to it with Dataflow.

Matt_108 (Option: A)
Jan 13, 2024

Option A, leverage the Dataflow template for Kafka: https://cloud.google.com/dataflow/docs/kafka-dataflow

AllenChen123
Jan 21, 2024

Agree. "Google provides a Dataflow template that configures a Kafka-to-BigQuery pipeline. The template uses the BigQueryIO connector provided in the Apache Beam SDK."

ML6
Feb 18, 2024

But it includes setting up a Kafka Connect bridge while an interconnect link has already been set up. https://cloud.google.com/dataflow/docs/kafka-dataflow#connect_to_an_external_cluster

T2Clubber (Option: C)
Feb 2, 2024

Option C makes more sense to me because of the "minimal latency as possible". I would have chosen option A if it were "less CODING as possible".

MaxNRG (Option: D)
Feb 26, 2024

Based on the key requirements highlighted:
• Interconnect link between GCP and on-prem Kafka
• High-throughput streaming pipeline
• Minimal latency
• Data to be stored in BigQuery

D - The key reasons this meets the requirements:
• Kafka Connect provides a reliable bridge to Pub/Sub over the interconnect
• Reading from Pub/Sub minimizes latency vs reading directly from Kafka
• Dataflow provides a high-throughput streaming engine
• Writing to BigQuery gives scalable data storage

By leveraging these fully managed GCP services over the dedicated interconnect, a low-latency streaming pipeline from on-prem Kafka into BigQuery can be implemented rapidly. Options A/B/C have higher latencies or custom code requirements, so they do not meet the minimal-latency criteria as well as option D.
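For comparison with the sketch above, here is a minimal sketch of the Dataflow stage this option D argument implies, assuming the Kafka Connect bridge has already replicated the topic into Pub/Sub. The subscription and table names are placeholders, not values from the question.

```python
# Sketch of the Dataflow stage of option D: read from Pub/Sub (fed by Kafka Connect),
# write to BigQuery. All identifiers are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # plus the usual Dataflow runner options

with beam.Pipeline(options=options) as p:
    (
        p
        # Messages arrive in Pub/Sub via the Kafka Connect bridge.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/kafka-events"  # placeholder
        )
        # Pub/Sub payloads are bytes; parse them as JSON rows.
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:my_dataset.events",  # placeholder
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```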

MaxNRG
Feb 26, 2024

Why not C: At first, option C (using a Dataflow pipeline to read directly from Kafka and write to BigQuery) seems reasonable. However, the key requirement stated in the question is minimal latency for the streaming pipeline. Reading directly from Kafka within Dataflow can add latency and processing overhead compared to reading from Pub/Sub, for a few reasons:
1. Pub/Sub acts as a buffer and handles scaling/reliability of streaming data automatically. This reduces the processing burden on the pipeline.
2. Network latency can be lower by leveraging Pub/Sub instead of making constant pull requests for data from Kafka within the streaming pipeline.
3. Any failures have to be handled within the pipeline code itself when reading directly from Kafka. With Pub/Sub, reliability is built-in.

MaxNRG
Feb 26, 2024

So in summary, while option C is technically possible, option D introduces Pub/Sub as a streaming buffer which reduces overall latency for the pipeline, allowing the key requirement of minimal latency to be better satisfied.

SanjeevRoy91
Mar 26, 2024

You are adding an intermediate hop (Pub/Sub) between on-prem Kafka and Dataflow. Why won't this add additional latency?

MaxNRG
Feb 26, 2024

Why choose option D over A? The key advantage with option D is that by writing a custom Dataflow pipeline rather than using a Google-provided template, there is more flexibility to customize performance tuning and optimization for the lowest latency. Some potential optimizations:
• Fine-tuning the number of workers and machine types to meet specific throughput targets
• Custom data parsing/processing logic if applicable
• Experimenting with autoscaling parameters or triggers

MaxNRG
Feb 26, 2024

The Google template may be easier to set up initially, but a custom pipeline provides more control over optimizations specifically for low latency requirements stated in the question. That being said, option A would still work reasonably well - but option D allows squeezing out that extra bit of performance if low millisecond latency is absolutely critical in the pipeline through precise tuning. So in summary, option A is easier to implement but option D provides more optimization flexibility for ultra low latency streaming requirements.
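The tuning knobs mentioned in the two comments above (worker count, machine type, autoscaling) map to standard Dataflow pipeline options. The snippet below is a hedged sketch of how they can be set with the Beam Python SDK; the specific values are illustrative placeholders, not recommendations from the question.

```python
# Sketch of Dataflow worker tuning via Beam pipeline options; values are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",                      # placeholder
    region="us-central1",                      # placeholder
    temp_location="gs://my-bucket/tmp",        # placeholder
    # Worker tuning knobs discussed above:
    num_workers=5,                             # initial worker count
    max_num_workers=50,                        # cap for autoscaling
    machine_type="n2-standard-4",              # worker machine type
    autoscaling_algorithm="THROUGHPUT_BASED",  # scale on throughput/backlog
)
```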

Moss2011 (Option: C)
Mar 1, 2024

From my point of view, the best option is C taking into account this doc: https://cloud.google.com/dataflow/docs/kafka-dataflow

JyoGCP (Option: C)
Feb 21, 2024

A vs C -- not sure which one would have lower latency. Points related to option C:

"Yes, Dataflow can read events from Kafka. Dataflow is a fully managed, serverless streaming analytics service that supports both batch and stream processing. It can read events from Kafka, process them, and write the results to a BigQuery table for further analysis."

"Dataflow has Kafka support, which was added to Apache Beam in 2016. Google provides a Dataflow template that configures a Kafka-to-BigQuery pipeline. The template uses the BigQueryIO connector provided in the Apache Beam SDK."

JyoGCP
Feb 22, 2024

Going with C

DarkLord2104
Feb 22, 2024

Final???

anushree09
Apr 11, 2024

per the text below at https://cloud.google.com/dataflow/docs/kafka-dataflow - "Alternatively, you might have an existing Kafka cluster that resides outside of Google Cloud. For example, you might have an existing workload that is deployed on-premises or in another public cloud."

Anudeep58 (Option: C)
Jul 6, 2024

Latency: Option C, with direct integration between Kafka and Dataflow, offers lower latency by eliminating intermediate steps.
Flexibility: Custom Dataflow pipelines (option C) provide more control over data processing and optimization compared to using a pre-built template.
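One example of the "more control" point: in a custom pipeline you can choose the BigQuery ingestion method yourself. The sketch below shows only the write step, assuming a recent Apache Beam Python SDK where WriteToBigQuery supports the BigQuery Storage Write API; the table name is a placeholder.

```python
# Hedged sketch of the write step only, assuming a recent Beam SDK with
# Storage Write API support in WriteToBigQuery; table name is a placeholder.
import apache_beam as beam

def bigquery_write_step():
    return beam.io.WriteToBigQuery(
        table="my-project:my_dataset.events",                      # placeholder table
        method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,   # lower-latency ingestion path
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
```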