Professional Machine Learning Engineer Exam Questions

Professional Machine Learning Engineer Exam - Question 278


You need to train an XGBoost model on a small dataset. Your training code requires custom dependencies. You want to minimize the startup time of your training job. How should you set up your Vertex AI custom training job?

Correct Answer: A

To minimize the startup time of your training job, you should store the data in a Cloud Storage bucket and create a custom container with your training application. This approach ensures that the container image remains lean by not bundling the data within the container, leading to faster download and startup times. The small dataset can be efficiently read from Cloud Storage during training, while the custom container allows you to include any necessary dependencies for your training code.
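As a rough illustration of this setup, here is a minimal sketch using the google-cloud-aiplatform Python SDK; the project ID, staging bucket, container image URI, and data URI below are placeholders, not values taken from the question.

```python
# Minimal sketch: custom container for training, small dataset kept in Cloud Storage.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",                     # placeholder project ID
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",  # placeholder bucket
)

# The custom container already bakes in the custom dependencies;
# the (small) training data stays in Cloud Storage and is read at runtime.
job = aiplatform.CustomContainerTrainingJob(
    display_name="xgboost-small-dataset",
    container_uri="us-central1-docker.pkg.dev/my-project/training/xgb-trainer:latest",  # placeholder image
)

job.run(
    args=["--data-uri", "gs://my-data-bucket/train.csv"],  # data passed by URI, not baked into the image
    replica_count=1,
    machine_type="n1-standard-4",
)
```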

Discussion

7 comments
guilhermebutzke | Option: A
Feb 19, 2024

My answer: A. The question emphasizes that the training code requires custom dependencies and that the startup time of the training job must be minimized, so the best choice is A: use a custom container and read the data from Cloud Storage, which is the fastest way.
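For reference, a minimal sketch of what the training entrypoint inside such a custom container might look like, assuming the small dataset is a CSV in Cloud Storage and that gcsfs is installed so pandas can read gs:// paths directly; the data URI and the "label" column name are placeholders.

```python
# Sketch of a training entrypoint that reads a small CSV from Cloud Storage.
import argparse

import pandas as pd
import xgboost as xgb

parser = argparse.ArgumentParser()
parser.add_argument("--data-uri", default="gs://my-data-bucket/train.csv")  # placeholder URI
args = parser.parse_args()

# Small dataset: reading it from Cloud Storage at startup is cheap compared
# with shipping it inside the container image or the source distribution.
df = pd.read_csv(args.data_uri)
X, y = df.drop(columns=["label"]), df["label"]  # "label" is a placeholder column name

model = xgb.XGBClassifier(n_estimators=100)
model.fit(X, y)
model.save_model("model.bst")  # could instead be written under AIP_MODEL_DIR
```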

omermahgoub | Option: A
Apr 13, 2024

Given the focus on minimizing startup time, and based on the XGBoost prebuilt container dependencies documented at https://cloud.google.com/vertex-ai/docs/training/pre-built-containers#xgboost, A (separate data and a custom container) is the best approach, especially for a small dataset. Keeping the data in Cloud Storage keeps the container image lean, leading to faster download and startup compared to bundling the data within the container. B: the prebuilt container could include unnecessary components, potentially increasing the image size and slowing startup.

Yan_X | Option: B
Mar 8, 2024

B. The XGBoost prebuilt container already includes the XGBoost library and all of its dependencies. A Python source distribution avoids the overhead of reading the data from Cloud Storage a second time. Loading the data into a pandas DataFrame is convenient to work with in Python; pandas is designed for data analysis and manipulation.

tavva_prudhvi
Mar 30, 2024

However, the question specifically says that the training code requires custom dependencies beyond those included in the prebuilt container, so using the prebuilt container alone would not be sufficient. Regarding the use of a Python source distribution to avoid reading data from Cloud Storage multiple times, it is important to weigh the trade-off between startup time and any potential performance gain. Including the data in the source distribution might save some time during training, but it also increases the size of the package that must be staged and can lead to longer startup times. For small datasets, the overhead of reading data from Cloud Storage is typically negligible compared with the benefits of a smaller container and faster startup.

tavva_prudhvi
Mar 30, 2024

Also, creating a Python source distribution that includes the data and installs the dependencies at runtime can increase startup time, since the dependencies have to be installed every time the job runs.
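To make that concrete, here is a sketch of the setup.py such a source distribution might use; the package name and dependency list are placeholders. With a prebuilt container, everything listed in install_requires is pip-installed when the job starts, which is exactly the startup overhead described above.

```python
# Sketch of a setup.py for a Python source distribution submitted to Vertex AI.
from setuptools import find_packages, setup

setup(
    name="trainer",                 # hypothetical package name
    version="0.1",
    packages=find_packages(),
    install_requires=[
        "pandas>=2.0",              # assumed custom dependencies, installed at job startup
        "my-internal-lib==1.2.3",   # placeholder for a private dependency
    ],
)
```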

omribt | Option: C
Jun 20, 2024

The focus is on startup time, and the dataset is small, so the container should still be of reasonable size. Downloading data from Cloud Storage introduces a delay.

CHARLIE2108 | Option: D
Mar 21, 2024

Why not C?

tavva_prudhvi
Mar 30, 2024

Because including the data in the container image is not recommended: it increases the image size and makes the image less reusable.

raidenrock
May 2, 2024

But the description says it is a small dataset and requires minimizing startup time, which makes C the best fit for the requirement; there is no mention of needing to make the container reusable.

bobjr | Option: A
Jun 4, 2024

The dataset is small, and XGBoost is implemented in Python...

bobjr | Option: C
Jun 4, 2024

The dataset is small, and XGBoost is implemented in Python... (correcting my earlier answer of A)