Professional Machine Learning Engineer Exam Questions

Professional Machine Learning Engineer Exam - Question 39


You work with a data engineering team that has developed a pipeline to clean your dataset and save it in a Cloud Storage bucket. You have created an ML model and want to use the data to refresh your model as soon as new data is available. As part of your CI/CD workflow, you want to automatically run a Kubeflow Pipelines training job on Google Kubernetes Engine (GKE). How should you architect this workflow?

Correct Answer: C

To automatically run a Kubeflow Pipelines training job on Google Kubernetes Engine (GKE) as soon as new data is available, the workflow should use an event-driven architecture. Configure a Cloud Storage trigger to send a message to a Pub/Sub topic when a new file is available in a storage bucket. Then, use a Pub/Sub-triggered Cloud Function to start the training job on a GKE cluster. This approach ensures that the training job is initiated immediately when new data is available, avoiding the inefficiencies of polling (option B) or scheduling regular checks (option D), and eliminates the need to re-engineer the data pipeline (option A).
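For concreteness, here is a minimal sketch of the Pub/Sub-triggered Cloud Function described above, assuming a Kubeflow Pipelines v1-style SDK client, a training pipeline already uploaded to the KFP instance on the GKE cluster, and hypothetical names (KFP_HOST, PIPELINE_ID, EXPERIMENT_ID, and a training_data pipeline parameter) that are not part of the question:

```python
import base64
import json
import os

import kfp  # Kubeflow Pipelines SDK (v1-style client)

# Hypothetical configuration; none of these names come from the question.
KFP_HOST = os.environ["KFP_HOST"]            # KFP endpoint exposed by the GKE cluster
PIPELINE_ID = os.environ["PIPELINE_ID"]      # ID of the already-uploaded training pipeline
EXPERIMENT_ID = os.environ["EXPERIMENT_ID"]  # experiment to file runs under


def trigger_training(event, context):
    """Entry point for a Pub/Sub-triggered Cloud Function (1st gen).

    The Pub/Sub message body is the Cloud Storage JSON notification for the
    newly finalized object; its bucket/name give the fresh data's location.
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    data_uri = f"gs://{payload['bucket']}/{payload['name']}"

    client = kfp.Client(host=KFP_HOST)
    run = client.run_pipeline(
        experiment_id=EXPERIMENT_ID,
        job_name=f"retrain-{context.event_id}",
        pipeline_id=PIPELINE_ID,
        params={"training_data": data_uri},  # assumes the pipeline accepts this parameter
    )
    print(f"Started pipeline run {run.id} for {data_uri}")
```

The bucket-to-topic wiring itself is a Cloud Storage notification on the OBJECT_FINALIZE event, so the function fires once per newly written file rather than on a schedule.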

Discussion

Paul_Dirac
Dec 26, 2021

C https://cloud.google.com/architecture/architecture-for-mlops-using-tfx-kubeflow-pipelines-and-cloud-build#triggering-and-scheduling-kubeflow-pipelines

ori5225
Feb 12, 2022

On a schedule, using Cloud Scheduler. Responding to an event, using Pub/Sub and Cloud Functions. For example, the event can be the availability of new data files in a Cloud Storage bucket.

tavva_prudhvi
Jan 2, 2024

Option D requires the job to be scheduled at regular intervals, even if there are no new files. This can waste resources and lead to unnecessary delays in the training process.

hiromi (Option: C)
Jun 9, 2023

C. Pub/Sub is the keyword.

Mohamed_Mossad (Option: C)
Jan 9, 2023

An event-driven architecture is better than a polling-based architecture, so I will vote for C.

behzadsw (Option: A)
Jul 4, 2023

The question says: "As part of your CI/CD workflow, you want to automatically run a Kubeflow Pipelines training job..." C is also an option, but it seems more cumbersome. One thing that could be against A is that the data engineering team is a separate team, so they might not have access to your CI/CD if any changes are needed from their side.

tavva_prudhvi
Jan 2, 2024

Option A requires the data engineering team to modify the pipeline, which can be time-consuming and error-prone.

Fatiy (Option: C)
Aug 28, 2023

The scenario involves automatically running a Kubeflow Pipelines training job on GKE as soon as new data becomes available. To achieve this, we can use Cloud Storage to store the cleaned dataset, and then configure a Cloud Storage trigger that sends a message to a Pub/Sub topic whenever a new file is added to the storage bucket. We can then create a Pub/Sub-triggered Cloud Function that starts the training job on a GKE cluster.
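The Cloud Storage-to-Pub/Sub wiring this comment describes can also be set up programmatically; a brief sketch using the google-cloud-storage client, where the bucket and topic names are placeholders and both resources are assumed to already exist:

```python
from google.cloud import storage
from google.cloud.storage.notification import (
    JSON_API_V1_PAYLOAD_FORMAT,
    OBJECT_FINALIZE_EVENT_TYPE,
)

# Placeholder names; substitute your own bucket and Pub/Sub topic.
BUCKET_NAME = "cleaned-training-data"
TOPIC_NAME = "new-training-data"

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Publish to the topic only when a new object is finalized, i.e. when the
# data engineering pipeline finishes writing a cleaned file.
notification = bucket.notification(
    topic_name=TOPIC_NAME,
    event_types=[OBJECT_FINALIZE_EVENT_TYPE],
    payload_format=JSON_API_V1_PAYLOAD_FORMAT,
)
notification.create()
print(f"Created notification {notification.notification_id} on gs://{BUCKET_NAME}")
```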

M25 (Option: C)
Nov 9, 2023

Went with C

Sum_Sum (Option: C)
May 15, 2024

C, because you don't want to re-engineer the pipeline.

fragkris (Option: C)
Jun 5, 2024

C. This is the Google-recommended method.

PhilipKoku (Option: C)
Dec 6, 2024

C) Pub/Sub trigger from Cloud Storage and a Cloud Function.