Professional Machine Learning Engineer Exam - Question 45

Question

You are training a TensorFlow model on a structured dataset with 100 billion records stored in several CSV files. You need to improve the input/output execution performance. What should you do?

Examice · Accepted Answer

To improve input/output execution performance when training a TensorFlow model on a massive structured dataset, the best approach is to convert the CSV files into shards of TFRecords and store the data in Cloud Storage. TFRecords is a TensorFlow-specific binary format optimized for efficiency in data storage and retrieval. Sharding the TFRecords allows for parallel data loading, significantly boosting input/output performance. Cloud Storage provides high throughput and low-latency access, making it an excellent choice for handling large-scale data required for TensorFlow model training.
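
The approach described above can be sketched roughly as follows. This is a minimal illustration rather than a reference pipeline: it assumes TensorFlow 2.4 or later, a CSV with two hypothetical columns named `x` and `label`, and a hypothetical shard naming pattern `train-00000-of-00004.tfrecord`. It writes to a local directory; on Google Cloud the same calls would normally be pointed at `gs://` paths instead.

```python
import csv
import os

import tensorflow as tf


def csv_to_tfrecord_shards(csv_path, out_dir, num_shards=4):
    """Write CSV rows round-robin into `num_shards` TFRecord files.

    The column names "x" and "label" are assumptions for this sketch.
    """
    paths = [
        os.path.join(out_dir, f"train-{i:05d}-of-{num_shards:05d}.tfrecord")
        for i in range(num_shards)
    ]
    writers = [tf.io.TFRecordWriter(p) for p in paths]
    with open(csv_path, newline="") as f:
        for row_idx, row in enumerate(csv.DictReader(f)):
            example = tf.train.Example(features=tf.train.Features(feature={
                "x": tf.train.Feature(
                    float_list=tf.train.FloatList(value=[float(row["x"])])),
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(row["label"])])),
            }))
            # Round-robin assignment keeps the shards roughly equal in size.
            writers[row_idx % num_shards].write(example.SerializeToString())
    for writer in writers:
        writer.close()
    return paths


def make_dataset(file_pattern, batch_size=32, cycle_length=4):
    """Read the shards in parallel and return batches of parsed examples."""
    feature_spec = {
        "x": tf.io.FixedLenFeature([], tf.float32),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    # interleave opens several shard files at once, which is where
    # sharding pays off for input throughput.
    records = files.interleave(
        tf.data.TFRecordDataset,
        cycle_length=cycle_length,
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    parsed = records.map(
        lambda rec: tf.io.parse_single_example(rec, feature_spec),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    return parsed.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

A training job would call `csv_to_tfrecord_shards` once as a preprocessing step and then pass `make_dataset(".../train-*.tfrecord")` to `model.fit`. For a dataset of the size in the question, the conversion itself would more realistically run as a distributed job rather than a single Python loop.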

ralf_cc · Answer

C - not enough info in the question, but C is the "most correct" one

David_ml · Answer

Answer is C. TFRecords in cloud storage for big data is the recommended practice by Google for training TF models.

giaZ · Answer

Google best practices: Use Cloud Storage buckets and directories to group the shards of data (either sharded TFRecord files if using TensorFlow, or Avro if using any other framework). Aim for files of at least 100 MB, and 100 to 10,000 shards.
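
As a rough illustration of how that guidance might translate into a shard count, here is a small sketch. The helper name, the 256 MiB target shard size, and the 100 bytes-per-record figure are all assumptions made for the example; only the 100 MB minimum and the 100 to 10,000 shard range come from the guidance quoted above.

```python
import math

MIN_SHARD_BYTES = 100 * 10**6  # "at least 100 MB" per file
MIN_SHARDS, MAX_SHARDS = 100, 10_000


def suggest_num_shards(total_bytes, target_shard_bytes=256 * 1024**2):
    """Pick a shard count near the target size, clamped to the 100..10,000 range.

    Returns (num_shards, bytes_per_shard, meets_min_size).
    """
    raw = math.ceil(total_bytes / target_shard_bytes)
    num_shards = max(MIN_SHARDS, min(MAX_SHARDS, raw))
    bytes_per_shard = total_bytes / num_shards
    return num_shards, bytes_per_shard, bytes_per_shard >= MIN_SHARD_BYTES


# Hypothetical sizing: 100 billion records at an assumed 100 bytes each.
total = 100 * 10**9 * 100
print(suggest_num_shards(total))
```

Under those assumptions the shard count hits the 10,000 cap and each shard ends up around 1 GB, which still satisfies the minimum file size.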

behzadsw · Answer

https://cloud.google.com/architecture/ml-on-gcp-best-practices#store-tabular-data-in-bigquery
BigQuery for structured data, Cloud Storage for unstructured data.

M25 · Answer

Went with C

Voyager2 · Answer

A. Load the data into BigQuery, and read the data from BigQuery.
https://cloud.google.com/blog/products/ai-machine-learning/tensorflow-enterprise-makes-accessing-data-on-google-cloud-faster-and-easier
The link provided in other comments shows precisely this: the best result with TFRecords is 18,752 records per second, while the same report shows BigQuery at more than 40,000 records per second.
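
To put those two rates in perspective for this question, a quick back-of-the-envelope calculation follows. It assumes a single reader running at exactly the quoted rate, which is a simplification: a real training job would read in parallel, and the BigQuery figure is quoted as a lower bound.

```python
TOTAL_RECORDS = 100 * 10**9  # from the question
SECONDS_PER_DAY = 86_400


def days_for_one_pass(records_per_second, total_records=TOTAL_RECORDS):
    """Days needed to read every record once at a constant rate."""
    return total_records / records_per_second / SECONDS_PER_DAY


tfrecord_days = days_for_one_pass(18_752)  # about 61.7 days
bigquery_days = days_for_one_pass(40_000)  # about 28.9 days
print(f"TFRecords: {tfrecord_days:.1f} days, BigQuery: {bigquery_days:.1f} days")
```

Either way, a single pass at these rates takes weeks, so the degree of parallelism in the input pipeline matters at least as much as the per-reader rate.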

peetTech · Answer

C https://datascience.stackexchange.com/questions/16318/what-is-the-benefit-of-splitting-tfrecord-file-into-shards#:~:text=Splitting%20TFRecord%20files%20into%20shards,them%20through%20a%20training%20process.

Mohamed_Mossad · Answer

"100 billion records stored in several CSV files" means we are dealing with a distributed big-data problem, so HDFS is very suitable. I will choose D.

shankalman717 · Answer

Option C, converting the CSV files into shards of TFRecords and storing the data in Cloud Storage, is the most appropriate solution for improving input/output execution performance in this scenario.

shankalman717 · Answer

Cloud Bigtable is typically used to process unstructured data, such as time-series data, logs, or other types of data that do not conform to a fixed schema. However, Cloud Bigtable can also be used to store structured data if necessary, such as in the case of a key-value store or a database that does not require complex relational queries.

PST21 · Answer

While Bigtable can offer high-performance I/O capabilities, it is important to note that it is primarily designed for structured data storage and real-time access patterns. In this scenario, the focus is on optimizing input/output execution performance, and using TFRecords in Cloud Storage aligns well with that goal.

tavva_prudhvi · Answer

Using BigQuery or Bigtable may not be the most efficient option for input/output operations with TensorFlow. Storing the data in HDFS may be an option, but Cloud Storage is generally a more scalable and cost-effective solution.

ftl · Answer

bard: The correct answer is:

C. Convert the CSV files into shards of TFRecords, and store the data in Cloud Storage.
TFRecords is a TensorFlow-specific binary format that is optimized for performance. Converting the CSV files into TFRecords will improve the input/output execution performance. Sharding the TFRecords will allow the data to be read in parallel, which will further improve performance.

The other options are not as likely to improve performance.

Loading the data into BigQuery or Cloud Bigtable will add an additional layer of abstraction, which can slow down performance.
Storing the TFRecords in HDFS is not likely to improve performance, as HDFS is not optimized for TensorFlow.

Sum_Sum · Answer

C is the correct one as BQ will not help you with performance

fragkris · Answer

C is the Google-recommended approach.

PhilipKoku · Answer

C) The most suitable option for improving input/output execution performance in this scenario is C. Convert the CSV files into shards of TFRecords and store the data in Cloud Storage. This approach leverages the efficiency of TFRecords and the scalability of Cloud Storage, aligning with TensorFlow best practices.
