
Professional Machine Learning Engineer Exam - Question 155


You are training an object detection machine learning model on a dataset of three million X-ray images, each roughly 2 GB in size. You are using Vertex AI Training to run a custom training application on a Compute Engine instance with 32 cores, 128 GB of RAM, and one NVIDIA P100 GPU. You notice that model training is taking a very long time. You want to decrease training time without sacrificing model performance. What should you do?

Correct Answer: D

Given a dataset of three million 2 GB X-ray images, using the tf.distribute.Strategy API to run a distributed training job is the most effective solution. With such a large dataset and computational load, distributed training splits the workload across multiple machines or GPUs, significantly speeding up training without reducing model performance. Because it scales out the available compute rather than altering the model or cutting training short, it is the best option among those provided.
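
As a rough illustration (not part of the original answer), a minimal tf.distribute sketch might look like the following; build_detection_model() is a hypothetical helper standing in for whatever model-building code the training application actually uses:

```python
import tensorflow as tf

# Minimal sketch of synchronous data-parallel training with tf.distribute.
# MirroredStrategy replicates the model across all GPUs visible on the VM;
# it still runs on a single-GPU machine, but the speedup comes from adding GPUs.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = build_detection_model()              # hypothetical model-building helper
    model.compile(optimizer="adam", loss="mse")  # placeholder loss, for illustration only

# train_dataset is assumed to be a batched tf.data.Dataset of (image, target) pairs.
# model.fit(train_dataset, epochs=10)
```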

Discussion

9 comments
fitri001Option: D
Apr 22, 2024

Large dataset: With millions of images, training on a single machine can be very slow. Distributed training allows you to split the training data and workload across multiple machines, significantly speeding up the process.

Vertex AI Training and tf.distribute: Vertex AI Training supports TensorFlow, and the tf.distribute library provides tools for implementing distributed training strategies. By leveraging this functionality, you can efficiently distribute the training tasks across the available cores and GPU on your Compute Engine instance (32 cores and one NVIDIA P100 GPU).
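
To make the "available cores" point concrete, here is an illustrative tf.data input pipeline; the Cloud Storage path and the parse_example function are hypothetical. Parallel reading and decoding on the CPU cores keeps the GPU fed:

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# Hypothetical TFRecord location for the X-ray images.
files = tf.data.Dataset.list_files("gs://example-bucket/xray/*.tfrecord")

dataset = (
    files.interleave(tf.data.TFRecordDataset,
                     cycle_length=16,                       # read several shards at once
                     num_parallel_calls=AUTOTUNE)
         .map(parse_example, num_parallel_calls=AUTOTUNE)   # hypothetical decode/parse fn
         .batch(8)                                          # small batch to fit GPU memory
         .prefetch(AUTOTUNE)                                # overlap input prep with training
)
```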

guilhermebutzkeOption: D
Feb 4, 2024

D. Use the tf.distribute.Strategy API and run a distributed training job. Here's why:

A. Increase instance memory and batch size: This might not be helpful. While more memory could help load more images at once, the main bottleneck is likely processing these large images, and increasing the batch size can worsen the problem by further straining the GPU's memory.

B. Replace the P100 with a K80 GPU: A weaker GPU would likely slow down training instead of speeding it up.

C. Enable early stopping: This can save time but might stop training before reaching optimal performance.

D. Use tf.distribute.Strategy: This lets you distribute the training workload across multiple GPUs or cores, significantly accelerating training without changing the model itself and making efficient use of the available hardware.

[Removed]Option: B
Jul 25, 2023

The same comment as in Q96. If we look at our training infrastructure, the bottleneck is obviously the GPU, which has 12 GB or 16 GB of memory depending on the model (https://www.leadtek.com/eng/products/ai_hpc(37)/tesla_p100(761)/detail). This means we can afford a batch size of only 6-8 images (2 GB each), even if we assume the GPU is utilized 100% and the model weights take zero memory. And remember the training set has 3M images, which means each epoch will take 375-500K steps even in this unlikely best case. With 32 cores and 128 GB of memory we could afford higher batch sizes (e.g., 32), so moving to a K80 GPU that has 24 GB of memory would accelerate the training. A is wrong because we can't afford a larger batch size with the current GPU. D is wrong because you don't have multiple GPUs and your current GPU is saturated. C is a viable option, but it seems less optimal than B.
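
A quick back-of-the-envelope check of the step counts quoted above (illustrative only):

```python
dataset_size = 3_000_000  # images in the training set

for batch_size in (6, 8, 32):
    steps_per_epoch = dataset_size // batch_size
    print(f"batch size {batch_size}: ~{steps_per_epoch:,} steps per epoch")

# batch size 6:  ~500,000 steps per epoch
# batch size 8:  ~375,000 steps per epoch
# batch size 32: ~93,750 steps per epoch
```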

tavva_prudhvi
Jul 26, 2023

But the tf.distribute.Strategy API is not limited to multi-GPU configurations. Although the current setup has only one GPU, you can still use the API to distribute training across multiple Compute Engine instances, each with its own GPU. By running a distributed training job in this way, you can effectively decrease training time without sacrificing model performance.
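
For example, a multi-worker Vertex AI job could use MultiWorkerMirroredStrategy. This is a hedged sketch; build_detection_model() and detection_loss are hypothetical stand-ins for the actual training code:

```python
import tensorflow as tf

# MultiWorkerMirroredStrategy synchronizes gradients across several worker VMs.
# Vertex AI Training sets the TF_CONFIG environment variable on each worker,
# so the strategy can discover the cluster automatically.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = build_detection_model()                        # hypothetical model-building helper
    model.compile(optimizer="adam", loss=detection_loss)   # hypothetical loss function

# Scale the global batch size with the number of replicas so each GPU
# still receives a per-replica batch that fits in its memory.
global_batch_size = 8 * strategy.num_replicas_in_sync
# model.fit(train_dataset.batch(global_batch_size), epochs=10)
```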

tavva_prudhvi
Nov 15, 2023

Also, replacing the NVIDIA P100 GPU with a K80 is not recommended: the K80 is an older, less powerful GPU than the P100, so it might actually slow down the training process.

Zemni
Aug 27, 2023

What you say makes sense for the most part, except that the K80 GPU has only 12 GB of GDDR5 memory, not 24 (https://cloud.google.com/compute/docs/gpus#nvidia_k80_gpus). So that leaves me with the only viable option, which is C.

powerby35Option: A
Jul 13, 2023

A. Since we have only one GPU, we could not use tf.distribute.Strategy as in D.

powerby35
Jul 13, 2023

And C, early stopping, may hurt the performance.

TLampr
Nov 27, 2023

An increased batch size can also hurt performance if it is not followed by further tuning, with regard to the learning rate for example. If early stopping is applied according to common convention, i.e. stopping when the validation loss starts increasing, it should not hurt performance. Sadly, however, that is not specified in the answer.
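
The convention described here maps directly onto Keras' EarlyStopping callback; a minimal sketch, assuming train_ds and val_ds are the training and validation datasets:

```python
import tensorflow as tf

# Stop once validation loss stops improving, and keep the best weights,
# so training time is saved without degrading the final model.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,                  # allow a few non-improving epochs before stopping
    restore_best_weights=True,
)

# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[early_stop])
```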

PST21Option: D
Jul 20, 2023

To decrease training time without sacrificing model performance, the best approach is to use the tf.distribute.Strategy API and run a distributed training job, leveraging the capabilities of the available GPU(s) for parallelized training.

ciro_liOption: D
Jul 25, 2023

https://www.tensorflow.org/guide/gpu ?

ciro_li
Jul 27, 2023

I was wrong. It's A.

bcamaOption: D
Aug 30, 2023

Perhaps it is implied that a second GPU (or more) is provisioned, in which case the answer is D: https://codelabs.developers.google.com/vertex_multiworker_training#2

pinimichele01Option: D
Apr 21, 2024

https://www.tensorflow.org/guide/distributed_training#onedevicestrategy

Prakzz
Jul 3, 2024

Same question as 96?