Professional Data Engineer Exam Questions

Professional Data Engineer Exam - Question 235


You want to schedule a number of sequential load and transformation jobs. Data files will be added to a Cloud Storage bucket by an upstream process. There is no fixed schedule for when the new data arrives. Next, a Dataproc job is triggered to perform some transformations and write the data to BigQuery. You then need to run additional transformation jobs in BigQuery. The transformation jobs are different for every table. These jobs might take hours to complete. You need to determine the most efficient and maintainable workflow to process hundreds of tables and provide the freshest data to your end users. What should you do?

Correct Answer: C

To schedule and manage the sequential load and transformation jobs efficiently and maintainably, create a single Apache Airflow DAG that handles all tables in the pipeline. Because new data files can arrive at any time, a Cloud Storage object trigger that launches a Cloud Function, which in turn triggers the DAG, starts processing as soon as new data lands. The Dataproc and BigQuery operators then run the required transformations. This approach avoids managing a separate DAG for each of the hundreds of tables, which keeps the workflow scalable and easier to maintain.
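As a rough illustration of what the single-DAG approach can look like, here is a minimal sketch of a Cloud Composer DAG that builds one Dataproc-then-BigQuery task chain per table from a configuration dictionary. It assumes Airflow 2.4+ with the Google provider package installed; the project, region, cluster, bucket, table names and SQL are placeholders, not values given in the question.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Per-table settings: each table gets its own Spark job and its own BigQuery SQL.
TABLE_CONFIG = {
    "orders": {
        "pyspark_uri": "gs://my-bucket/jobs/transform_orders.py",
        "bq_sql": "CALL my_dataset.transform_orders();",
    },
    "customers": {
        "pyspark_uri": "gs://my-bucket/jobs/transform_customers.py",
        "bq_sql": "CALL my_dataset.transform_customers();",
    },
    # ... hundreds more entries, or load them from a config file
}

with DAG(
    dag_id="load_and_transform_all_tables",
    schedule=None,  # no fixed schedule; runs are triggered externally
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    for table, cfg in TABLE_CONFIG.items():
        # Step 1: Dataproc job reads the new files and writes the table to BigQuery.
        dataproc_load = DataprocSubmitJobOperator(
            task_id=f"dataproc_load_{table}",
            project_id="my-project",
            region="us-central1",
            job={
                "placement": {"cluster_name": "my-cluster"},
                "pyspark_job": {"main_python_file_uri": cfg["pyspark_uri"]},
            },
        )

        # Step 2: table-specific transformation inside BigQuery.
        bq_transform = BigQueryInsertJobOperator(
            task_id=f"bq_transform_{table}",
            configuration={"query": {"query": cfg["bq_sql"], "useLegacySql": False}},
        )

        dataproc_load >> bq_transform
```

Because all tables live in one DAG, supporting an additional table is a matter of adding a configuration entry rather than deploying another DAG.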

Discussion

7 comments
cuadradobertolinisebastiancami (Option: D)
Feb 26, 2024

D.
- The transformations run in Dataproc and BigQuery, so you don't need GCS operators (A and B can be discarded).
- "There is no fixed schedule for when the new data arrives," so you trigger the DAG when a file arrives.
- "The transformation jobs are different for every table," so you need a DAG for each table.
Therefore, D is the most suitable answer.

Jordan18 (Option: C)
Jan 6, 2024

why not C?

AllenChen123
Jan 14, 2024

Same question. Why not use a single DAG to manage them, since there are hundreds of tables?

cuadradobertolinisebastiancami
Feb 26, 2024

It says that the transformations for each table are very different

Matt_108 (Option: D)
Jan 13, 2024

Option D, which gets triggered when the data comes in and accounts for the fact that each table has its own set of transformations.

raaad (Option: D)
Jan 4, 2024

- Option D: Tailored handling and scheduling for each table; triggered by data arrival for more timely and efficient processing.

scaenruy (Option: D)
Jan 3, 2024

D. 1. Create an Apache Airflow directed acyclic graph (DAG) in Cloud Composer with sequential tasks by using the Dataproc and BigQuery operators. 2. Create a separate DAG for each table that needs to go through the pipeline. 3. Use a Cloud Storage object trigger to launch a Cloud Function that triggers the DAG.
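For the trigger step shared by both C and D (Cloud Storage object trigger, Cloud Function, DAG run), a rough sketch might look like the following. It assumes a Cloud Composer 2 environment whose Airflow web server is reachable at the placeholder WEB_SERVER_URL, a 1st-gen background Cloud Function fired on object finalize, and a function service account allowed to call the environment (for example the Composer User role); the DAG id matches the sketch above.

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

WEB_SERVER_URL = "https://example-composer-webserver.composer.googleusercontent.com"
DAG_ID = "load_and_transform_all_tables"


def trigger_dag(event, context):
    """Entry point: the event carries the bucket and object name that just landed."""
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    session = AuthorizedSession(credentials)

    # Pass the new object's location to the DAG run as conf.
    payload = {"conf": {"bucket": event["bucket"], "object": event["name"]}}
    response = session.post(
        f"{WEB_SERVER_URL}/api/v1/dags/{DAG_ID}/dagRuns", json=payload
    )
    response.raise_for_status()
```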

JyoGCP (Option: D)
Feb 18, 2024

Option D

8ad5266 (Option: C)
Jun 26, 2024

This explains why it's not D: the requirement is a maintainable workflow to process hundreds of tables and provide the freshest data to your end users. How is creating a DAG for each of the hundreds of tables maintainable?
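To read this argument concretely: with a single DAG, the per-table differences can live in configuration rather than in DAG code. The snippet below is a small hypothetical sketch (file name and layout invented for illustration) of loading per-table settings from a JSON file kept next to the DAG, so a new table means a config change instead of a new DAG.

```python
import json
import pathlib

CONFIG_PATH = pathlib.Path(__file__).parent / "table_config.json"


def load_table_config() -> dict:
    """Return {table_name: {"pyspark_uri": ..., "bq_sql": ...}} from the config file."""
    return json.loads(CONFIG_PATH.read_text())
```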