Professional Machine Learning Engineer Exam - Question 232


You need to use TensorFlow to train an image classification model. Your dataset is located in a Cloud Storage directory and contains millions of labeled images. Before training the model, you need to prepare the data. You want the data preprocessing and model training workflow to be as efficient, scalable, and low maintenance as possible. What should you do?

Correct Answer: A

To prepare a dataset of this size efficiently, use Google Cloud services that offer scalability and parallel processing. A Dataflow job that shards the images into TFRecord files stored in Cloud Storage lets TensorFlow read and process the data efficiently, combining the scalability of Cloud Storage with the efficient I/O of the TFRecord format. Training on Vertex AI with a V100 GPU then runs the model training on managed hardware and infrastructure built for machine learning workloads. Together this gives a scalable, efficient, and low-maintenance solution.
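As a minimal sketch of what the Dataflow preprocessing step could look like, here is an Apache Beam Python pipeline that serializes labeled images into sharded TFRecord files. The project, bucket, paths, labels, and shard count are placeholders, and in practice the (path, label) pairs would come from a manifest file rather than an in-memory list:

```python
import apache_beam as beam
import tensorflow as tf
from apache_beam.io.filesystems import FileSystems
from apache_beam.options.pipeline_options import PipelineOptions


def to_tf_example(element):
    """Read one image from Cloud Storage and wrap it in a serialized tf.train.Example."""
    image_path, label = element
    with FileSystems.open(image_path) as f:
        image_bytes = f.read()
    features = tf.train.Features(feature={
        'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    })
    return tf.train.Example(features=features).SerializeToString()


def run():
    # Placeholder project, region, and bucket values.
    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',
        region='us-central1',
        temp_location='gs://my-bucket/tmp',
    )
    # Small in-memory list to keep the sketch self-contained; a real job would
    # read a manifest of (gcs_path, label) pairs instead.
    labeled_paths = [
        ('gs://my-bucket/images/cat_001.jpg', 0),
        ('gs://my-bucket/images/dog_001.jpg', 1),
    ]
    with beam.Pipeline(options=options) as p:
        (p
         | 'CreateExamples' >> beam.Create(labeled_paths)
         | 'ToTFExample' >> beam.Map(to_tf_example)
         | 'WriteTFRecords' >> beam.io.WriteToTFRecord(
               file_path_prefix='gs://my-bucket/tfrecords/train',
               file_name_suffix='.tfrecord',
               num_shards=256))  # sharding enables parallel reads during training


if __name__ == '__main__':
    run()
```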

Discussion

4 comments
pinimichele01 (Option: A)
Apr 8, 2024

Millions of labeled images -> Dataflow + TFRecord is faster than a folder-based dataset.

b1a8fae (Option: A)
Jan 17, 2024

Ideally you want to export your data as TFRecord files (the most efficient input format for TensorFlow) to Cloud Storage, not onto a single instance, to keep the pipeline scalable.
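For illustration, a tf.data sketch of reading the sharded TFRecord files back from Cloud Storage during training. The shard pattern, feature names, image size, and batch size are assumptions matching the write-side sketch above:

```python
import tensorflow as tf

FEATURE_SPEC = {
    'image_raw': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    """Decode one serialized tf.train.Example into an (image, label) pair."""
    parsed = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    image = tf.io.decode_jpeg(parsed['image_raw'], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, parsed['label']

# Shard pattern assumed to match the Dataflow output prefix.
files = tf.data.Dataset.list_files('gs://my-bucket/tfrecords/train-*.tfrecord')
dataset = (files
           .interleave(tf.data.TFRecordDataset,
                       num_parallel_calls=tf.data.AUTOTUNE)  # read shards in parallel
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(10_000)
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))
```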

pikachu007 (Option: A)
Jan 13, 2024

B. Folder-based structure: while viable, it is less efficient for large datasets than TFRecord files, potentially leading to slower I/O during training.

C. Workbench processing: local preprocessing on a single instance is less scalable and efficient for millions of images, potentially introducing bottlenecks.

D. Workbench training: while Workbench offers a Jupyter environment, Vertex AI Training is specifically designed for scalable model training, providing optimized hardware and infrastructure.

AzureDP900 (Option: A)
Jul 5, 2024

A is correct. Here's why: you need to prepare the data before training an image classification model. Storing the data as TFRecord files lets TensorFlow read and process it efficiently, and sharding into multiple files allows parallel processing and scalability. Dataflow is a Google Cloud service that processes large datasets in a scalable and reliable way. By using Vertex AI Training with a V100 GPU, you can train the model efficiently and cost-effectively.
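As a rough sketch of the training side, this is one way to submit a Vertex AI custom training job with a V100 GPU using the google-cloud-aiplatform SDK. The project, bucket, script path, and container image tag are placeholders, not values from the question:

```python
from google.cloud import aiplatform

# Placeholder project, region, and staging bucket.
aiplatform.init(project='my-project', location='us-central1',
                staging_bucket='gs://my-bucket/staging')

job = aiplatform.CustomTrainingJob(
    display_name='image-classifier-train',
    script_path='trainer/task.py',  # hypothetical training script using the tf.data pipeline above
    # Pre-built TensorFlow GPU training image; check the current URI for your TF version.
    container_uri='us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12.py310:latest',
    requirements=['tensorflow==2.12.0'],
)

job.run(
    machine_type='n1-standard-8',
    accelerator_type='NVIDIA_TESLA_V100',
    accelerator_count=1,
    replica_count=1,
    args=['--tfrecord-dir=gs://my-bucket/tfrecords'],  # passed through to the training script
)
```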