
Professional Machine Learning Engineer Exam - Question 208


You recently developed a wide and deep model in TensorFlow. You generated training datasets using a SQL script that preprocessed raw data in BigQuery by performing instance-level transformations of the data. You need to create a training pipeline to retrain the model on a weekly basis. The trained model will be used to generate daily recommendations. You want to minimize model development and training time. How should you develop the training pipeline?

A. Use the Kubeflow Pipelines SDK to implement the pipeline. Use the BigQueryJobOp component to run the preprocessing script, and the CustomTrainingJobOp component to launch a Vertex AI training job.
B. Use the Kubeflow Pipelines SDK to implement the pipeline. Use the DataflowPythonJobOp component to preprocess the data, and the CustomTrainingJobOp component to launch a Vertex AI training job.
C. Use the TensorFlow Extended SDK to implement the pipeline. Use the ExampleGen component with the BigQuery executor to ingest the data, the Transform component to preprocess the data, and the Trainer component to launch a Vertex AI training job.
D. Use the TensorFlow Extended SDK to implement the pipeline. Implement the preprocessing steps as part of the input_fn of the model. Use the ExampleGen component with the BigQuery executor to ingest the data, and the Trainer component to launch a Vertex AI training job.

Correct Answer: A

To build a training pipeline that minimizes model development and training time while reusing the existing SQL-based, instance-level transformations in BigQuery, the best approach is the Kubeflow Pipelines SDK with components designed for this purpose. The BigQueryJobOp component runs the existing preprocessing script directly in BigQuery, so the SQL work is reused as-is, and the CustomTrainingJobOp component launches a Vertex AI training job to retrain the existing wide and deep TensorFlow model. This leverages Kubeflow Pipelines for orchestrating the machine learning workflow while adding minimal complexity to the preprocessing and weekly retraining process.
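For concreteness, here is a minimal sketch of what such a pipeline could look like with the google-cloud-pipeline-components library, where the published component names are BigqueryQueryJobOp and CustomTrainingJobOp. The project ID, SQL, table names, machine type, and training container image below are placeholders, not values from the question:

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.bigquery import BigqueryQueryJobOp
from google_cloud_pipeline_components.v1.custom_job import CustomTrainingJobOp

# Placeholder standing in for the existing SQL preprocessing script.
PREPROCESS_SQL = """
CREATE OR REPLACE TABLE `my-project.my_dataset.training_data` AS
SELECT *  -- existing instance-level transformations go here
FROM `my-project.my_dataset.raw_data`
"""

@dsl.pipeline(name="weekly-wide-and-deep-retraining")
def training_pipeline(project: str = "my-project", location: str = "us-central1"):
    # Step 1: run the existing preprocessing SQL inside BigQuery.
    preprocess = BigqueryQueryJobOp(
        project=project,
        location=location,
        query=PREPROCESS_SQL,
    )

    # Step 2: retrain the wide and deep model in a Vertex AI custom job
    # once preprocessing has finished.
    train = CustomTrainingJobOp(
        project=project,
        location=location,
        display_name="wide-and-deep-weekly-train",
        worker_pool_specs=[{
            "machine_spec": {"machine_type": "n1-standard-8"},
            "replica_count": 1,
            "container_spec": {"image_uri": "gcr.io/my-project/wide-deep-train:latest"},
        }],
    )
    train.after(preprocess)
```

The compiled pipeline can then be scheduled to run weekly (for example, as a Vertex AI Pipelines scheduled run), which matches the weekly retraining requirement.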

Discussion

9 comments
guilhermebutzke - Option: A
Feb 18, 2024

My answer: A, according to this documentation: https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/overview

A: Correct. BigQueryJobOp runs the existing preprocessing script where the data already resides, and CustomTrainingJobOp launches a custom training job on Vertex AI, which aligns with the requirement to retrain the existing TensorFlow model.

B: Not correct. While DataflowPythonJobOp can be used for preprocessing, this increases development time compared to the simpler BigQueryJobOp approach, since the SQL script would have to be rewritten.

C and D: Not correct. While possible, using the TensorFlow Extended SDK with its components introduces unnecessary complexity for this specific scenario. For example, why use ExampleGen when the data is already preprocessed in BigQuery? And implementing preprocessing within the model's input_fn is generally not recommended due to efficiency drawbacks and training-serving skew.

BlehMaks - Option: A
Jan 24, 2024

D is wrong. Google doesn't recommend using input_fn for preprocessing: https://www.tensorflow.org/tfx/guide/tft_bestpractices#preprocessing_options_summary
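To illustrate why, here is a sketch (with hypothetical feature names) of what option D's input_fn preprocessing would look like. Any transformation applied here lives only in the training input pipeline, not in the exported model, so it must be re-implemented in the serving path, which is exactly where training-serving skew creeps in:

```python
import tensorflow as tf

def input_fn(file_pattern: str, batch_size: int = 256) -> tf.data.Dataset:
    """Training-only input pipeline with baked-in preprocessing (discouraged)."""
    dataset = tf.data.experimental.make_csv_dataset(
        file_pattern, batch_size=batch_size, label_name="label")

    def preprocess(features, label):
        # Instance-level transformation applied only at training time;
        # the serving path never sees this code.
        features["age"] = tf.cast(features["age"], tf.float32) / 100.0
        return features, label

    return dataset.map(preprocess)
```

Keeping the transformations in the BigQuery SQL script (option A) avoids duplicating that logic in TensorFlow code.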

pikachu007 - Option: D
Jan 13, 2024

Addressing limitations of the other options:

Kubeflow Pipelines (A and B): While Kubeflow offers flexibility, it might require more setup and configuration, potentially increasing development time compared to TFX's integrated approach.

Separate preprocessing (C): Using a separate Transform component for preprocessing can add complexity and overhead, especially for instance-level transformations that can often be integrated directly within the model's input pipeline.

Carlose2108 - Option: C
Mar 2, 2024

Why not C?

Shark0 - Option: C
Apr 5, 2024

Given the requirement to minimize model development and training time while creating a training pipeline for a wide and deep model trained on datasets preprocessed with a SQL script in BigQuery, the most suitable option is C: use the TensorFlow Extended SDK to implement the pipeline, with the ExampleGen component using the BigQuery executor to ingest the data, the Transform component to preprocess the data, and the Trainer component to launch a Vertex AI training job.

This option leverages TensorFlow Extended (TFX), which is designed for scalable, production-ready machine learning pipelines. The ExampleGen component with the BigQuery executor efficiently ingests data from BigQuery, the Transform component applies the preprocessing steps, and the Trainer component launches a Vertex AI training job, minimizing the time and effort required for model development and training.
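For comparison with option A, here is a sketch of what option C's TFX pipeline might involve; the query, module files, step counts, and the extra StatisticsGen/SchemaGen steps that Transform depends on are illustrative, not taken from the question:

```python
from tfx import v1 as tfx

QUERY = "SELECT * FROM `my-project.my_dataset.raw_data`"  # placeholder query

# Ingest directly from BigQuery (the "ExampleGen with the BigQuery executor").
example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(query=QUERY)

# Transform needs a schema, which in turn needs statistics: two extra
# components before any preprocessing happens.
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs["examples"])
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs["statistics"])

# The instance-level SQL transformations would have to be re-implemented
# as a tf.Transform preprocessing_fn in a module file.
transform = tfx.components.Transform(
    examples=example_gen.outputs["examples"],
    schema=schema_gen.outputs["schema"],
    module_file="preprocessing.py",  # placeholder module
)

trainer = tfx.components.Trainer(
    module_file="trainer.py",  # placeholder module defining run_fn
    examples=transform.outputs["transformed_examples"],
    transform_graph=transform.outputs["transform_graph"],
    train_args=tfx.proto.TrainArgs(num_steps=1000),
    eval_args=tfx.proto.EvalArgs(num_steps=100),
)
```

Note that this re-implements preprocessing that already exists as SQL, which is the complexity argument made in the comments favoring option A.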

pinimichele01 - Option: A
Apr 13, 2024

agree with guilhermebutzke

gscharly - Option: A
Apr 20, 2024

agree with guilhermebutzke

SausageMuffins - Option: C
May 19, 2024

ExampleGen directly ingests data from BigQuery, and the Transform component makes preprocessing more efficient than using an input_fn. I chose C over A and B because Kubeflow Pipelines is more sophisticated and requires more setup and effort because of its customizability.

TanTran04 - Option: A
Jul 8, 2024

I go with A.

Kubeflow Pipelines SDK: supports machine learning workflows and includes components for tasks like data preprocessing, model training, and validation.

BigQueryJobOp: enables you to run the SQL preprocessing script efficiently within BigQuery.