
Professional Machine Learning Engineer Exam - Question 263


You are developing a custom TensorFlow classification model based on tabular data. Your raw data is stored in BigQuery, contains hundreds of millions of rows, and includes both categorical and numerical features. You need to use a MaxMin scaler on some numerical features, and apply a one-hot encoding to some categorical features such as SKU names. Your model will be trained over multiple epochs. You want to minimize the effort and cost of your solution. What should you do?

Correct Answer: C

When dealing with a large dataset containing hundreds of millions of rows and requiring both numerical scaling and categorical encoding, it is crucial to use a solution that ensures scalability and efficiency. Using TFX (TensorFlow Extended) components with Dataflow offers a distributed, scalable method for feature engineering. TFX provides built-in components to handle transformations such as MinMax scaling and one-hot encoding, and Dataflow ensures that these transformations can be applied efficiently across large volumes of data. Exporting the results to Cloud Storage as TFRecords creates a streamlined workflow for feeding the preprocessed data into Vertex AI Training. This combination minimizes complexity in the preprocessing steps and leverages the distributed processing capabilities of Dataflow, making it an ideal approach for this scenario.
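For illustration, a minimal sketch of the preprocessing_fn that such a Transform step might run on Dataflow with tf.Transform; the column names price and sku_name are assumptions, not taken from the question:

```python
import tensorflow_transform as tft

# Minimal preprocessing_fn sketch; 'price' and 'sku_name' are assumed column names.
def preprocessing_fn(inputs):
    return {
        # Full-pass MaxMin scaling: Dataflow computes the global min/max once,
        # and the same constants are embedded in the serving graph.
        'price_scaled': tft.scale_by_min_max(inputs['price']),
        # Full-pass vocabulary over SKU names; the integer id can then be
        # one-hot encoded in the model (tf.one_hot or a CategoryEncoding layer)
        # using the vocabulary size written by tft.
        'sku_id': tft.compute_and_apply_vocabulary(
            inputs['sku_name'], vocab_filename='sku_vocab'),
    }
```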

Discussion

11 comments
b2aaace (Option: C)
Apr 27, 2024

"Full-pass stateful transformations aren't suitable for implementation in BigQuery. If you use BigQuery for full-pass transformations, you need auxiliary tables to store quantities needed by stateful transformations, such as means and variances to scale numerical features. Further, implementation of full-pass transformations using SQL on BigQuery creates increased complexity in the SQL scripts, and creates intricate dependency between training and the scoring SQL scripts." https://www.tensorflow.org/tfx/guide/tft_bestpractices#where_to_do_preprocessing

Prakzz
Jul 3, 2024

Doesn't Dataflow involve a lot of effort, when the question asks to minimize effort here?

pikachu007 (Option: B)
Jan 13, 2024

Option A: Involves creating a separate lookup table and deploying a Hugging Face model in BigQuery, increasing complexity and cost. Option C: While TFX offers robust preprocessing capabilities, it adds overhead for this use case and requires knowledge of Dataflow. Option D: Performing one-hot encoding in BigQuery can be less efficient than TensorFlow's optimized implementation.

guilhermebutzke (Option: B)
Feb 19, 2024

My answer: B. 1. Use BigQuery to scale the numerical features: simpler and cheaper than using TFX components with Dataflow to scale the numerical features. 2. Feed the features into Vertex AI Training. 3. Allow TensorFlow to perform the one-hot text encoding: TensorFlow handles the one-hot text encoding better than BigQuery.
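For context, a minimal Keras sketch of letting TensorFlow handle the one-hot encoding inside the model, assuming the SKU vocabulary can be supplied (or adapted from a sample of the data); all names here are illustrative, not from the question:

```python
import tensorflow as tf

# Hypothetical vocabulary; in practice it would be loaded from a file or
# adapted from a (sampled) dataset of SKU names.
sku_vocab = ['sku-001', 'sku-002', 'sku-003']

sku_input = tf.keras.Input(shape=(1,), dtype=tf.string, name='sku_name')
num_input = tf.keras.Input(shape=(1,), dtype=tf.float32, name='price_scaled')

# StringLookup maps each SKU string to an index and emits a one-hot vector
# (one extra slot is reserved for out-of-vocabulary SKUs).
sku_one_hot = tf.keras.layers.StringLookup(
    vocabulary=sku_vocab, output_mode='one_hot')(sku_input)

x = tf.keras.layers.Concatenate()([sku_one_hot, num_input])
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=[sku_input, num_input], outputs=output)
```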

b1a8fae (Option: C)
Jan 22, 2024

Inclined to choose C over B. By using TFX components with Dataflow, you can perform feature engineering on large-scale tabular data in a distributed and efficient way. You can use the Transform component to apply the MaxMin scaler and the one-hot encoding to the numerical and categorical features, respectively. You can also use the ExampleGen component to read data from BigQuery and the Trainer component to train your TensorFlow model.
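A rough sketch of the component wiring described above, assuming the standard TFX components and the BigQuery extension; the query, module files, and step counts are placeholders:

```python
from tfx import v1 as tfx

# Placeholder query and module files. When the pipeline's Beam jobs run on the
# DataflowRunner, ExampleGen and Transform do their full-pass work on Dataflow.
example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
    query='SELECT price, sku_name, label FROM `project.dataset.table`')

statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'])

transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file='preprocessing.py')  # contains the preprocessing_fn

trainer = tfx.components.Trainer(
    module_file='trainer.py',
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    train_args=tfx.proto.TrainArgs(num_steps=1000),
    eval_args=tfx.proto.EvalArgs(num_steps=100))
```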

daidai75 (Option: C)
Jan 23, 2024

Key messages: "contains hundreds of millions of rows, and includes both categorical and numerical features. You need to use a MaxMin scaler on some numerical features, and apply a one-hot encoding to some categorical features such as SKU names". Option B is not suitable for processing this volume of data. Option C is better.

gscharly (Option: C)
Apr 20, 2024

agree with daidai75

pinimichele01
Apr 26, 2024

Option B is not suitable for large-volume data processing? BigQuery is not suitable for big volumes? For me it's B.

bobjr (Option: D)
Jun 4, 2024

GPT says D, Gemini says B, Perplexity says C... I say D: stay in one tool, BigQuery, which is cheap and natively scalable. B has a risk of an out-of-memory error.
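For context on the option D approach, a sketch (hypothetical names) of what in-BigQuery one-hot encoding usually looks like: one 0/1 column per category, which is why it becomes hard to maintain once the SKU list runs into the thousands:

```python
from google.cloud import bigquery

# Hypothetical names; with thousands of SKUs this SELECT list would have to be
# generated dynamically, one 0/1 column per SKU value.
one_hot_query = """
SELECT
  price,
  IF(sku_name = 'sku-001', 1, 0) AS sku_001,
  IF(sku_name = 'sku-002', 1, 0) AS sku_002,
  IF(sku_name = 'sku-003', 1, 0) AS sku_003
FROM `project.dataset.raw_features`
"""
rows = bigquery.Client().query(one_hot_query).result()
```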

cruise93 (Option: C)
Apr 24, 2024

Agree with b1a8fae

fitri001 (Option: B)
Apr 27, 2024

BigQuery for preprocessing: BigQuery is a serverless data warehouse optimized for large datasets. It can handle scaling numerical features with built-in functions (for example, ML.MIN_MAX_SCALER in BigQuery ML), reducing the need for complex custom logic or separate lookup tables. TensorFlow for one-hot encoding: TensorFlow excels at in-memory processing. One-hot encoding of categorical features, especially text features like SKU names, can be performed efficiently within your TensorFlow model during training, which avoids unnecessary data movement or transformations in BigQuery. Vertex AI Training: by feeding the preprocessed data (scaled numerical features) directly into Vertex AI Training, you leverage its managed infrastructure for training your custom TensorFlow model.

fitri001
Apr 27, 2024

Option A: creates unnecessary complexity and data movement; BigQuery is better suited for scaling numerical features, and TensorFlow is efficient for one-hot encoding. Option C: TFX is a powerful framework for complex pipelines, but for a simpler scenario like this it may be overkill; additionally, exporting data as TFRecords adds an extra step, potentially increasing cost and complexity. Option D: one-hot encoding in BigQuery can be cumbersome for textual features like SKU names; it can be computationally expensive and result in a data explosion. TensorFlow handles this efficiently within the model.

dija123 (Option: C)
Jul 2, 2024

Agree with using TFX components with Dataflow.

AzureDP900 (Option: C)
Jul 5, 2024

Option C uses TFX (TensorFlow Extended) components with Dataflow, which is a great way to perform complex data preprocessing tasks like one-hot encoding and scaling. This approach allows you to process your data in a scalable and efficient manner, using Cloud Storage as the output location. By exporting the results as TFRecords, you can easily feed this preprocessed data into Vertex AI Training for model development.