Professional Machine Learning Engineer Exam Questions

Professional Machine Learning Engineer Exam - Question 278


You need to train an XGBoost model on a small dataset. Your training code requires custom dependencies. You want to minimize the startup time of your training job. How should you set up your Vertex AI custom training job?

Correct Answer: A

To minimize the startup time of your training job, you should store the data in a Cloud Storage bucket and create a custom container with your training application. This approach ensures that the container image remains lean by not bundling the data within the container, leading to faster download and startup times. The small dataset can be efficiently read from Cloud Storage during training, while the custom container allows you to include any necessary dependencies for your training code.
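As a rough illustration of this setup, here is a minimal sketch using the google-cloud-aiplatform Python SDK; the project ID, staging bucket, container image URI, and data URI below are placeholders, not values taken from the question.

```python
# Minimal sketch: custom container for training, small dataset kept in Cloud Storage.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",                     # placeholder project ID
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",  # placeholder bucket
)

# The custom container already bakes in the custom dependencies;
# the (small) training data stays in Cloud Storage and is read at runtime.
job = aiplatform.CustomContainerTrainingJob(
    display_name="xgboost-small-dataset",
    container_uri="us-central1-docker.pkg.dev/my-project/training/xgb-trainer:latest",  # placeholder image
)

job.run(
    args=["--data-uri", "gs://my-data-bucket/train.csv"],  # data passed by URI, not baked into the image
    replica_count=1,
    machine_type="n1-standard-4",
)
```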

Discussion

7 comments
guilhermebutzke | Option: A
Feb 19, 2024

My answer: A. The question emphasizes that the training code requires custom dependencies and that the startup time of the training job must be minimized, so the best choice is A: use a custom container and read the data from Cloud Storage, which is the fastest way.
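For reference, a minimal sketch of what the training entrypoint inside such a custom container might look like, assuming the small dataset is a CSV in Cloud Storage and that gcsfs is installed so pandas can read gs:// paths directly; the data URI and the "label" column name are placeholders.

```python
# Sketch of a training entrypoint that reads a small CSV from Cloud Storage.
import argparse

import pandas as pd
import xgboost as xgb

parser = argparse.ArgumentParser()
parser.add_argument("--data-uri", default="gs://my-data-bucket/train.csv")  # placeholder URI
args = parser.parse_args()

# Small dataset: reading it from Cloud Storage at startup is cheap compared
# with shipping it inside the container image or the source distribution.
df = pd.read_csv(args.data_uri)
X, y = df.drop(columns=["label"]), df["label"]  # "label" is a placeholder column name

model = xgb.XGBClassifier(n_estimators=100)
model.fit(X, y)
model.save_model("model.bst")  # could instead be written under AIP_MODEL_DIR
```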

omermahgoub | Option: A
Apr 13, 2024

Given the focus on minimizing startup time, and based on the XGBoost prebuilt container dependencies documented at https://cloud.google.com/vertex-ai/docs/training/pre-built-containers#xgboost, A (separate data and a custom container) is the best approach, especially for a small dataset. Keeping the data in Cloud Storage keeps the container image lean, leading to faster download and startup compared to bundling the data within the container. B: the prebuilt container could include unnecessary components, potentially increasing the image size and slowing startup.

Yan_X | Option: B
Mar 8, 2024

B. The XGBoost prebuilt container already includes the XGBoost library and all of its dependencies. A Python source distribution avoids the overhead of reading the data from Cloud Storage a second time. Loading the data into a pandas DataFrame is convenient to work with in Python; pandas is designed for data analysis and manipulation.

tavva_prudhvi
Mar 30, 2024

However, the question specifically says that the training code requires custom dependencies beyond those included in the prebuilt container, so using the prebuilt container alone would not be sufficient. Regarding the use of a Python source distribution to avoid reading data from Cloud Storage multiple times, it is important to weigh the trade-off between startup time and any potential performance gain. Including the data in the source distribution might save some time during training, but it also increases the size of the package that must be staged and can lead to longer startup times. For small datasets, the overhead of reading data from Cloud Storage is typically negligible compared with the benefits of a smaller container and faster startup.

tavva_prudhvi
Mar 30, 2024

Also, creating a Python source distribution that includes the data and installs the dependencies at runtime can increase startup time, since the dependencies have to be installed every time the job runs.
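To make that concrete, here is a sketch of the setup.py such a source distribution might use; the package name and dependency list are placeholders. With a prebuilt container, everything listed in install_requires is pip-installed when the job starts, which is exactly the startup overhead described above.

```python
# Sketch of a setup.py for a Python source distribution submitted to Vertex AI.
from setuptools import find_packages, setup

setup(
    name="trainer",                 # hypothetical package name
    version="0.1",
    packages=find_packages(),
    install_requires=[
        "pandas>=2.0",              # assumed custom dependencies, installed at job startup
        "my-internal-lib==1.2.3",   # placeholder for a private dependency
    ],
)
```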

omribt | Option: C
Jun 20, 2024

The focus is on startup time, and the dataset is small, so the container should still be of reasonable size. Downloading data from Cloud Storage introduces a delay.

CHARLIE2108 | Option: D
Mar 21, 2024

Why not C?

tavva_prudhvi
Mar 30, 2024

Because including the data in the container image is not recommended: it increases the image size and makes the image less reusable.

raidenrock
May 2, 2024

But the description says it is a small dataset and requires minimizing startup time, which makes C the best fit for the requirement; there is no mention of needing to make the container reusable.

bobjr | Option: A
Jun 4, 2024

The dataset is small, and XGBoost is implemented in Python...

bobjr | Option: C
Jun 4, 2024

The dataset is small, and XGBoost is implemented in Python... (correcting my earlier answer of A)