Professional Machine Learning Engineer Exam - Question 70


You lead a data science team at a large international corporation. Most of the models your team trains are large-scale models using high-level TensorFlow APIs on AI Platform with GPUs. Your team usually takes a few weeks or months to iterate on a new version of a model. You were recently asked to review your team’s spending. How should you reduce your Google Cloud compute costs without impacting the model’s performance?

Correct Answer: C

To reduce your Google Cloud compute costs without impacting model performance, the best approach would be to migrate to training with Kubeflow on Google Kubernetes Engine and use preemptible VMs with checkpoints. Preemptible VMs are significantly cheaper than standard VMs; however, they can be terminated with little notice. Checkpoints ensure that the progress of your model training is saved periodically, so if a VM is terminated, the training can resume from the last checkpoint instead of starting from the beginning, thereby saving time and resources.
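The checkpoint-and-resume idea described above can be sketched in plain Python. This is a minimal simulation under stated assumptions, not the TensorFlow or Kubeflow API: the `train` function, the JSON checkpoint file, and the `preempt_at` parameter are all hypothetical names introduced for illustration, and the "loss" update is a stand-in for a real optimizer step.

```python
import json
import os
import tempfile


def train(total_steps, ckpt_path, checkpoint_every, preempt_at=None):
    """Toy training loop that resumes from ckpt_path when it exists.

    Raises RuntimeError at step `preempt_at` to simulate the VM being
    reclaimed. Returns the final state dict when training completes.
    """
    state = {"step": 0, "loss": 1.0}
    if os.path.exists(ckpt_path):
        # Resume from the last saved checkpoint instead of step 0.
        with open(ckpt_path) as f:
            state = json.load(f)

    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            raise RuntimeError("simulated preemption")
        state["step"] += 1
        state["loss"] *= 0.99  # stand-in for a real optimizer update
        if state["step"] % checkpoint_every == 0:
            # Write to a temp file, then rename, so a preemption during
            # the write cannot leave a half-written checkpoint behind.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, ckpt_path)
    return state


def run_demo():
    """Preempt a 100-step job at step 57, then restart it."""
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "ckpt.json")
        try:
            train(100, ckpt, checkpoint_every=10, preempt_at=57)
        except RuntimeError:
            pass  # the "VM" was reclaimed; only the checkpoint survives
        with open(ckpt) as f:
            resumed_from = json.load(f)["step"]
        final = train(100, ckpt, checkpoint_every=10)
        return resumed_from, final["step"]
```

In this sketch the restarted job picks up from the last checkpoint written before the interruption, so only the steps after that checkpoint are repeated. A shorter checkpoint interval reduces the repeated work at the cost of more frequent writes.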

Discussion

17 comments
seifou Option: C
Dec 16, 2022

https://cloud.google.com/blog/products/ai-machine-learning/reduce-the-costs-of-ml-workflows-with-preemptible-vms-and-gpus?hl=en

hiromi Option: C
Dec 16, 2022

It seems to be C: https://www.kubeflow.org/docs/distributions/gke/pipelines/preemptible/ and https://cloud.google.com/optimization/docs/guide/checkpointing

neochaotic Option: C
Dec 10, 2022

C - Reduce cost with preemptible instances and add checkpoints to snapshot intermediate results.

ares81 Option: A
Dec 11, 2022

"A Preemptible VM (PVM) is a Google Compute Engine (GCE) virtual machine (VM) instance that can be purchased for a steep discount as long as the customer accepts that the instance will terminate after 24 hours." This excludes C and D. Checkpoints are needed for long processing, so A.
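The 24-hour limit quoted above can be turned into a quick back-of-the-envelope calculation. The sketch below is an idealized lower bound under stated assumptions: the function name `min_vm_sessions` is hypothetical, and it assumes that every restart resumes from a checkpoint with no lost work, no startup overhead, and no preemption before the 24-hour cap.

```python
import math


def min_vm_sessions(total_training_hours, max_session_hours=24.0):
    """Smallest number of VM sessions needed to cover a training job
    when each session is capped at `max_session_hours`.

    Idealized: assumes perfect resumption from checkpoints.
    """
    if total_training_hours <= 0:
        return 0
    return math.ceil(total_training_hours / max_session_hours)


# Example: three weeks of training is 21 * 24 = 504 hours.
three_weeks = min_vm_sessions(21 * 24)
```

Under these assumptions a three-week job needs at least 21 sessions, so the 24-hour cap forces restarts rather than ruling out long jobs, provided the job checkpoints and resumes.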

joaquinmenendez Option: C
Sep 19, 2023

C is the best approach because it allows you to reduce your compute costs without impacting the model's performance. Preemptible VMs are much cheaper than standard VMs, but they can be terminated at any time. By using checkpoints, you can ensure that your training job can be resumed if a preemptible VM is terminated. Also, even if training takes days, the checkpoints will prevent losing progress when preemptible VMs are terminated.
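The trade-off described here (cheaper VMs versus work repeated after interruptions) can be sketched as simple arithmetic. All names below are hypothetical, and the rates are parameters rather than real Google Cloud prices; `rework_fraction` is an assumed figure for the share of compute that has to be redone after preemptions.

```python
def training_cost(compute_hours, hourly_rate, rework_fraction=0.0):
    """Cost of a job whose compute hours are inflated by the fraction
    of work that must be repeated after interruptions."""
    return compute_hours * (1.0 + rework_fraction) * hourly_rate


def preemptible_is_cheaper(compute_hours, standard_rate,
                           preemptible_rate, rework_fraction):
    """Compare a preemptible run (with rework) to a standard run."""
    preemptible = training_cost(compute_hours, preemptible_rate,
                                rework_fraction)
    standard = training_cost(compute_hours, standard_rate)
    return preemptible < standard


def break_even_rework_fraction(standard_rate, preemptible_rate):
    """Rework fraction at which both options cost the same."""
    return standard_rate / preemptible_rate - 1.0
```

With placeholder rates where the preemptible VM costs a quarter of the standard one, the break-even rework fraction is 3.0, meaning the job would have to repeat three times its own compute before the discount is lost. Frequent checkpoints keep the rework fraction far below that.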

MultiCloudIronMan Option: C
Apr 1, 2024

Preemptible VMs are cheaper, and checkpoints will enable termination if the result is acceptable.

LearnSodas Option: A
Dec 10, 2022

Saving checkpoints avoids re-running from scratch.

ares81 Option: A
Jan 4, 2023

It's A.

learner Option: A
May 5, 2023

Preemptible VMs are valid for only 24 hours. The question mentions that training takes months to complete, which makes A the answer.

Liting Option: C
Jul 7, 2023

To optimize cost, you should use Kubeflow.

YangG Option: A
Dec 9, 2022

I think it should be A: https://cloud.google.com/ai-platform/training/docs/overview

John_Pongthorn Option: C
Jan 25, 2023

Is C out of date? AI Platform is now Vertex AI, so this is a simple scenario that would accommodate the infrastructure for this case.

tavva_prudhvi
Mar 17, 2023

Additionally, AI Platform's autoscaling feature can automatically adjust the number of resources used based on the workload, further optimizing costs.

tavva_prudhvi
Mar 17, 2023

I think it's A. By using distributed training jobs with checkpoints, you can train your models on multiple GPUs simultaneously, which reduces the training time. Checkpoints allow you to save the progress of your training jobs regularly, so if the training job gets interrupted or fails, you can restart it from the last checkpoint instead of starting from scratch. This saves time and resources, which reduces costs. Additionally, AI Platform's autoscaling feature can automatically adjust the number of resources used based on the workload, further optimizing costs.

CloudKida Option: C
May 8, 2023

https://cloud.google.com/ai-platform/prediction/docs/ai-explanations/overview AI Explanations helps you understand your model's outputs for classification and regression tasks. Whenever you request a prediction on AI Platform, AI Explanations tells you how much each feature in the data contributed to the predicted result. You can then use this information to verify that the model is behaving as expected, recognize bias in your models, and get ideas for ways to improve your model and your training data.

M25 Option: C
May 9, 2023

Went with C

libo1985 Option: C
Sep 27, 2023

I guess distributed training is not cheap. So C.

PhilipKoku Option: C
Jun 7, 2024

C) Preemptible VMs with checkpoints