Professional Machine Learning Engineer Exam - Question 70

Question

You lead a data science team at a large international corporation. Most of the models your team trains are large-scale models using high-level TensorFlow APIs on AI Platform with GPUs. Your team usually takes a few weeks or months to iterate on a new version of a model. You were recently asked to review your team’s spending. How should you reduce your Google Cloud compute costs without impacting the model’s performance?

Examice · Accepted Answer

To reduce your Google Cloud compute costs without impacting model performance, the best approach would be to migrate to training with Kubeflow on Google Kubernetes Engine and use preemptible VMs with checkpoints. Preemptible VMs are significantly cheaper than standard VMs; however, they can be terminated with little notice. Checkpoints ensure that the progress of your model training is saved periodically, so if a VM is terminated, the training can resume from the last checkpoint instead of starting from the beginning, thereby saving time and resources.
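The checkpoint-and-resume idea described above can be sketched without any ML framework. This is only an illustration under stated assumptions: the file name, the `CHECKPOINT_EVERY` interval, the `preempt_at` parameter, and the toy "weight" update are all hypothetical stand-ins, and a real TensorFlow job would use the framework's own checkpointing facilities and durable storage rather than a local JSON file.

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 10  # hypothetical interval: save every 10 steps


def save_checkpoint(path, state):
    # Write to a temp file and rename it, so a preemption in the middle
    # of a write cannot leave a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Resume from the last saved state, or start fresh if none exists.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weight": 0.0}


def train(path, total_steps, preempt_at=None):
    """Return True if training finished, False if it was 'preempted'."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            return False  # simulated preemption: unsaved steps are lost
        state["weight"] += 0.5  # stand-in for one real training step
        state["step"] += 1
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return True


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.json")
        finished = train(path, total_steps=100, preempt_at=37)
        resumed_from = load_checkpoint(path)["step"]
        finished_after_resume = train(path, total_steps=100)
        final = load_checkpoint(path)
    return finished, resumed_from, finished_after_resume, final
```

In this toy run the first attempt is interrupted at step 37, the second attempt restarts from the checkpoint at step 30 rather than from step 0, and only the seven steps since the last save are repeated.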

seifou · Answer

https://cloud.google.com/blog/products/ai-machine-learning/reduce-the-costs-of-ml-workflows-with-preemptible-vms-and-gpus?hl=en

hiromi · Answer

It seems to be C.
- https://www.kubeflow.org/docs/distributions/gke/pipelines/preemptible/
- https://cloud.google.com/optimization/docs/guide/checkpointing

neochaotic · Answer

C - Reduce cost with preemptible instances and add checkpoints to snapshot intermediate results.

ares81 · Answer

"A Preemptible VM (PVM) is a Google Compute Engine (GCE) virtual machine (VM) instance that can be purchased for a steep discount as long as the customer accepts that the instance will terminate after 24 hours."
This excludes C and D. Checkpoints are needed for long processing, so A.

joaquinmenendez · Answer

C is the best approach because it allows you to reduce your compute costs without impacting the model's performance. Preemptible VMs are much cheaper than standard VMs, but they can be terminated at any time. By using checkpoints, you can ensure that your training job can be resumed if a preemptible VM is terminated. 
Also, even if training takes days, the checkpoints will prevent losing progress when preemptible VMs are terminated.
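A back-of-the-envelope sketch of that point, assuming the 24-hour lifetime limit quoted elsewhere in this thread. The function names and the example numbers (a three-week job, a 30-minute checkpoint interval) are hypothetical, and the estimate ignores VM restart time and any preemptions that happen before the 24-hour mark.

```python
import math

MAX_VM_LIFETIME_HOURS = 24  # the 24-hour limit quoted in this thread


def min_vm_runs(total_training_hours):
    # Fewest back-to-back preemptible VM lifetimes needed to cover the
    # job, if each VM survives the full 24 hours.
    return math.ceil(total_training_hours / MAX_VM_LIFETIME_HOURS)


def worst_case_lost_hours(preemptions, checkpoint_interval_hours):
    # Each preemption loses at most the work done since the last
    # checkpoint, so the checkpoint interval bounds the loss per restart.
    return preemptions * checkpoint_interval_hours


three_weeks = 21 * 24  # hypothetical job length in hours
runs = min_vm_runs(three_weeks)
lost = worst_case_lost_hours(runs, 0.5)
```

Under these assumptions a three-week job needs at least 21 VM lifetimes, and with a checkpoint every 30 minutes the repeated work stays at or below 10.5 hours in total, which is small next to the 504 hours of training.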

MultiCloudIronMan · Answer

Preemptible VMs are cheaper, and checkpoints let you stop training early once the result is acceptable.

LearnSodas · Answer

Saving checkpoints avoids re-running training from scratch.

ares81 · Answer

It's A.

learner · Answer

Preemptible VMs last at most 24 hours, and the question says training takes months to complete, so A is the answer.

Liting · Answer

To optimize cost, use Kubeflow.

YangG · Answer

I think it should be A
https://cloud.google.com/ai-platform/training/docs/overview

John_Pongthorn · Answer

Is C out of date? AI Platform is now Vertex AI, so this is a simple scenario whose infrastructure it can accommodate.

tavva_prudhvi · Answer

Additionally, AI Platform's autoscaling feature can automatically adjust the number of resources used based on the workload, further optimizing costs.

CloudKida · Answer

https://cloud.google.com/ai-platform/prediction/docs/ai-explanations/overview
AI Explanations helps you understand your model's outputs for classification and regression tasks. Whenever you request a prediction on AI Platform, AI Explanations tells you how much each feature in the data contributed to the predicted result. You can then use this information to verify that the model is behaving as expected, recognize bias in your models, and get ideas for ways to improve your model and your training data.

M25 · Answer

Went with C

libo1985 · Answer

I guess distributed training is not cheap. So C.

PhilipKoku · Answer

C) Preemptible VMs with checkpoints
