Professional Machine Learning Engineer Exam Questions

Professional Machine Learning Engineer Exam - Question 96


You are training an object detection machine learning model on a dataset that consists of three million X-ray images, each roughly 2 GB in size. You are using Vertex AI Training to run a custom training application on a Compute Engine instance with 32 cores, 128 GB of RAM, and 1 NVIDIA P100 GPU. You notice that model training is taking a very long time. You want to decrease training time without sacrificing model performance. What should you do?

A. Increase the instance memory to 512 GB and increase the batch size.
B. Replace the NVIDIA P100 GPU with a v3-32 TPU in the training job.
C. Enable early stopping in your Vertex AI Training job.
D. Use the tf.distribute.Strategy API and run a distributed training job.

Correct Answer: B

To decrease training time without sacrificing model performance, replacing the NVIDIA P100 GPU with a v3-32 TPU in the training job is the most effective solution. TPUs (Tensor Processing Units) are specifically designed to accelerate machine learning workloads and can handle high computational demands better than GPUs. Given the large amount of data involved (3 million images, each roughly 2 GB in size) and the need for fast processing, using a TPU provides a substantial performance boost and therefore reduces training time effectively.
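If the training application is written in TensorFlow/Keras, switching to the TPU typically also requires a small code change: model construction has to happen inside a tf.distribute.TPUStrategy scope. A minimal sketch, assuming a Keras training loop (the model definition, input shape, and dataset pipeline are illustrative placeholders, not part of the question):

```python
import tensorflow as tf

# Locate and initialize the TPU attached to the training job.
# tpu="" auto-detects the TPU in most managed environments (assumption).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Variables created inside the scope are replicated across the TPU cores.
with strategy.scope():
    model = tf.keras.Sequential([                       # placeholder model
        tf.keras.layers.Conv2D(32, 3, activation="relu",
                               input_shape=(512, 512, 1)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(4),                        # e.g. a bounding-box head
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(train_dataset, epochs=10)  # train_dataset: a tf.data pipeline (not shown)
```

The rest of the Keras training code stays largely unchanged; the main practical requirement is an efficient tf.data input pipeline so the TPU is kept fed.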

Discussion

17 comments
smarques | Option: C
Jan 18, 2023

I would say C. The question asks about time, so the "early stopping" option looks fine because it will not impact the existing accuracy (it may even improve it). Reading the TF docs, tf.distribute.Strategy is used when you want to split training between GPUs, but the question says that we have a single GPU. Open to discuss. :)

djo06
Jul 12, 2023

tf.distribute.OneDeviceStrategy uses parallel training on one GPU

[Removed] | Option: B
Jul 21, 2023

I don't understand why so many people are voting for D (the tf.distribute.Strategy API). If we look at our training infrastructure, we can see the bottleneck is obviously the GPU, which has 12 GB or 16 GB of memory depending on the model (https://www.leadtek.com/eng/products/ai_hpc(37)/tesla_p100(761)/detail). This means we can afford a batch size of only 6-8 images (2 GB each), even if we assume the GPU is 100% utilized. And remember the training set has 3M images, which means each epoch will take 375-500K steps in the best case. With 32 cores and 128 GB of memory we could afford higher batch sizes (e.g., 32), so moving to a TPU will accelerate the training. A is wrong because we can't afford a larger batch size with the current GPU. D is wrong because you don't have multiple GPUs and your current GPU is saturated. C is a viable option, but it seems less optimal than B.
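The arithmetic in this comment can be sketched explicitly; the 16 GB figure is the P100 variant assumed above, and the calculation deliberately ignores memory for model weights and activations:

```python
# Back-of-the-envelope numbers from the comment above (assumptions, not measurements).
gpu_memory_gb = 16            # NVIDIA P100, 16 GB variant
image_size_gb = 2             # per-image size stated in the question
dataset_size = 3_000_000      # number of X-ray images

max_batch_size = gpu_memory_gb // image_size_gb    # ~8 images per batch at best
steps_per_epoch = dataset_size // max_batch_size   # ~375,000 steps per epoch

print(max_batch_size, steps_per_epoch)             # 8 375000
```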

[Removed]
Jul 25, 2023

I should note that the batch size should be even lower than 6-8 images, because the model weights also take up GPU memory.

Krish6488 | Option: B
Nov 11, 2023

I would go with B, as a v3-32 TPU offers much more computational power than a single P100 GPU, and this upgrade should provide a substantial decrease in training time. tf.distribute.Strategy is good for distributed training across multiple GPUs or TPUs, but the current setup has just one GPU, which makes it only the second-best option; it would help if the setup had multiple GPUs. Increasing memory may allow a larger batch size but won't address the fundamental problem, which is the over-utilized GPU. Early stopping is good for avoiding overfitting once the model already performs at its best; it can reduce overall training time, but it won't improve training speed.
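For comparison, a minimal sketch of what option D would look like with tf.distribute.MirroredStrategy; on this machine it would report a single replica, since only one P100 is attached (the model definition is an illustrative placeholder):

```python
import tensorflow as tf

# MirroredStrategy replicates the model across all GPUs visible on the machine.
# With a single P100 it creates exactly one replica, so there is no speed-up.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])   # placeholder model
    model.compile(optimizer="adam", loss="mse")
```

With several worker machines, the related tf.distribute.MultiWorkerMirroredStrategy would be used instead, at the cost of extra setup.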

Mickey321 | Option: B
Nov 15, 2023

B, as there is only one GPU, so the distributed training in D would not be efficient.

PST21 | Option: B
Jul 31, 2023

Option D, using the tf.distribute.Strategy API for distributed training, can be beneficial for improving training efficiency, but it would require additional resources and complexity to set up compared to simply using a TPU. Therefore, replacing the NVIDIA P100 GPU with a v3-32 TPU in the Vertex AI Training job would be the most effective way to decrease training time while maintaining or even improving model performance.

pico | Option: B
Sep 13, 2023

Given the options and the goal of decreasing training time, options B (using TPUs) and D (distributed training) are the most effective ways to achieve this goal. C (enable early stopping in your Vertex AI Training job): early stopping is a technique that can help save training time by monitoring a validation metric and stopping the training process when the metric stops improving. While it can help by cutting off unnecessary training runs, it may not provide as substantial a speedup as the other options.

tavva_prudhvi
Nov 7, 2023

TPUs (Tensor Processing Units) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. They are often faster than GPUs for specific types of computations. However, not all models or training pipelines will benefit from TPUs, and they might require code modification to fully utilize the TPU capabilities.

Werner123 | Option: D
Feb 29, 2024

In my eyes the only solution is distributed training. 3,000,000 x 2 GB = 6 petabytes worth of data. No single device will get you there.
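Whichever accelerator is chosen, a dataset of this size has to be streamed rather than held on any single machine. A minimal sketch of a streaming input pipeline with tf.data, assuming the images have been exported as TFRecord shards in Cloud Storage (the bucket path and feature schema are hypothetical):

```python
import tensorflow as tf

# Hypothetical TFRecord shards produced from the X-ray images.
files = tf.data.Dataset.list_files("gs://example-xray-bucket/tfrecords/*.tfrecord")

def parse_example(record):
    # Hypothetical schema: encoded image bytes plus a bounding-box label.
    features = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "bbox": tf.io.FixedLenFeature([4], tf.float32),
    }
    parsed = tf.io.parse_single_example(record, features)
    image = tf.io.decode_png(parsed["image"], channels=1)
    image = tf.image.resize(image, [512, 512])  # fixed size so examples can be batched
    return image, parsed["bbox"]

dataset = (
    files.interleave(tf.data.TFRecordDataset,
                     num_parallel_calls=tf.data.AUTOTUNE)
         .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
         .batch(8)
         .prefetch(tf.data.AUTOTUNE)
)
```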

andreabrunelli | Option: B
Apr 24, 2024

I would say B. A: increasing memory doesn't necessarily mean a speed-up of the process; it's not a batch-size problem. B: it seems like an image -> TensorFlow situation, so transforming images into tensors means a TPU works better and is probably faster. C: it's not an overfitting problem. D: same here, it's not a memory or input-size problem.

julliet | Option: A
May 24, 2023

went with A

Voyager2 | Option: D
Jun 5, 2023

D. Use the tf.distribute.Strategy API and run a distributed training job. Option B replaces the GPU with a TPU, which is not the best option for image processing. Early stopping will affect model performance.

julliet
Jun 15, 2023

To run a distributed job you need to have more than one GPU; we have exactly one here.

SamuelTsch | Option: C
Jul 8, 2023

went with C

djo06 | Option: D
Jul 13, 2023

tf.distribute.OneDeviceStrategy uses parallel training on one GPU

andresvelasco | Option: C
Sep 10, 2023

A. Increase the instance memory to 512 GB and increase the batch size. > This will not necessarily decrease training time.
B. Replace the NVIDIA P100 GPU with a v3-32 TPU in the training job. > A TPU can sacrifice performance.
C. Enable early stopping in your Vertex AI Training job. > Yes, this decreases training time without sacrificing performance, if set properly.
D. Use the tf.distribute.Strategy API and run a distributed training job. > No idea... But I believe the type of machine and architecture cannot be changed as per the wording of the question.

tavva_prudhvi
Nov 7, 2023

Early stopping is a method that allows you to stop training once the model performance stops improving on a validation dataset. While it can prevent overfitting and save time by stopping unnecessary training epochs, it does not inherently speed up the training process.
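At the framework level, the early stopping described here is usually just a callback; a minimal Keras sketch (model, train_ds, and val_ds are hypothetical), stopping once the validation loss has stopped improving:

```python
import tensorflow as tf

# Stop when val_loss has not improved for 3 consecutive epochs and
# restore the weights from the best epoch seen so far.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True,
)

# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[early_stopping])
```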

pico
Nov 14, 2023

If the question didn't specify the framework used, and you want to choose an option that is more framework-agnostic, it's important to consider the available options. Given the context and the need for a framework-agnostic approach, you might consider a combination of options A and D. Increasing instance memory and batch size can still be beneficial, and if you're using a deep learning framework that supports distributed training (like TensorFlow or PyTorch), implementing distributed training (option D) can further accelerate the process.

ludovikush | Option: B
Feb 28, 2024

Agree with JamesDoes

pinimichele01 | Option: D
Apr 21, 2024

https://www.tensorflow.org/guide/distributed_training#onedevicestrategy

pinimichele01
Apr 21, 2024

https://www.tensorflow.org/guide/distributed_training#onedevicestrategy -> D
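For reference, the strategy linked above pins all variables and computation to one named device; it keeps the code compatible with the tf.distribute API but does not add parallelism by itself. A minimal sketch (the model is an illustrative placeholder):

```python
import tensorflow as tf

# Everything runs on the single specified device.
strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])   # placeholder model
    model.compile(optimizer="adam", loss="mse")
```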

dija123 | Option: B
Jun 20, 2024

Agree with B