
Professional Machine Learning Engineer Exam - Question 263


You are developing a custom TensorFlow classification model based on tabular data. Your raw data is stored in BigQuery, contains hundreds of millions of rows, and includes both categorical and numerical features. You need to use a MaxMin scaler on some numerical features, and apply a one-hot encoding to some categorical features such as SKU names. Your model will be trained over multiple epochs. You want to minimize the effort and cost of your solution. What should you do?

Correct Answer: C

When dealing with a large dataset containing hundreds of millions of rows and requiring both numerical scaling and categorical encoding, it is crucial to use a solution that ensures scalability and efficiency. Using TFX (TensorFlow Extended) components with Dataflow offers a distributed, scalable method for feature engineering. TFX provides built-in components to handle transformations such as MinMax scaling and one-hot encoding, and Dataflow ensures that these transformations can be applied efficiently across large volumes of data. Exporting the results to Cloud Storage as TFRecords creates a streamlined workflow for feeding the preprocessed data into Vertex AI Training. This combination minimizes complexity in the preprocessing steps and leverages the distributed processing capabilities of Dataflow, making it an ideal approach for this scenario.
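For illustration, a minimal sketch of the preprocessing_fn that such a Transform step might run on Dataflow with tf.Transform; the column names price and sku_name are assumptions, not taken from the question:

```python
import tensorflow_transform as tft

# Minimal preprocessing_fn sketch; 'price' and 'sku_name' are assumed column names.
def preprocessing_fn(inputs):
    return {
        # Full-pass MaxMin scaling: Dataflow computes the global min/max once,
        # and the same constants are embedded in the serving graph.
        'price_scaled': tft.scale_by_min_max(inputs['price']),
        # Full-pass vocabulary over SKU names; the integer id can then be
        # one-hot encoded in the model (tf.one_hot or a CategoryEncoding layer)
        # using the vocabulary size written by tft.
        'sku_id': tft.compute_and_apply_vocabulary(
            inputs['sku_name'], vocab_filename='sku_vocab'),
    }
```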

Discussion

11 comments
b2aaace (Option: C)
Apr 27, 2024

"Full-pass stateful transformations aren't suitable for implementation in BigQuery. If you use BigQuery for full-pass transformations, you need auxiliary tables to store quantities needed by stateful transformations, such as means and variances to scale numerical features. Further, implementation of full-pass transformations using SQL on BigQuery creates increased complexity in the SQL scripts, and creates intricate dependency between training and the scoring SQL scripts." https://www.tensorflow.org/tfx/guide/tft_bestpractices#where_to_do_preprocessing

Prakzz
Jul 3, 2024

Doesn't Dataflow involve a lot of effort, when the question asks to minimize effort here?

pikachu007 (Option: B)
Jan 13, 2024

Option A: Involves creating a separate lookup table and deploying a Hugging Face model in BigQuery, increasing complexity and cost. Option C: While TFX offers robust preprocessing capabilities, it adds overhead for this use case and requires knowledge of Dataflow. Option D: Performing one-hot encoding in BigQuery can be less efficient than TensorFlow's optimized implementation.

guilhermebutzke (Option: B)
Feb 19, 2024

My answer: B. 1. Use BigQuery to scale the numerical features: simpler and cheaper than using TFX components with Dataflow to scale the numerical features. 2. Feed the features into Vertex AI Training. 3. Allow TensorFlow to perform the one-hot text encoding: TensorFlow handles the one-hot text encoding better than BigQuery.
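For context, a minimal Keras sketch of letting TensorFlow handle the one-hot encoding inside the model, assuming the SKU vocabulary can be supplied (or adapted from a sample of the data); all names here are illustrative, not from the question:

```python
import tensorflow as tf

# Hypothetical vocabulary; in practice it would be loaded from a file or
# adapted from a (sampled) dataset of SKU names.
sku_vocab = ['sku-001', 'sku-002', 'sku-003']

sku_input = tf.keras.Input(shape=(1,), dtype=tf.string, name='sku_name')
num_input = tf.keras.Input(shape=(1,), dtype=tf.float32, name='price_scaled')

# StringLookup maps each SKU string to an index and emits a one-hot vector
# (one extra slot is reserved for out-of-vocabulary SKUs).
sku_one_hot = tf.keras.layers.StringLookup(
    vocabulary=sku_vocab, output_mode='one_hot')(sku_input)

x = tf.keras.layers.Concatenate()([sku_one_hot, num_input])
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=[sku_input, num_input], outputs=output)
```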

b1a8fae (Option: C)
Jan 22, 2024

Inclined to choose C over B. By using TFX components with Dataflow, you can perform feature engineering on large-scale tabular data in a distributed and efficient way. You can use the Transform component to apply the MaxMin scaler and the one-hot encoding to the numerical and categorical features, respectively. You can also use the ExampleGen component to read data from BigQuery and the Trainer component to train your TensorFlow model.
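A rough sketch of the component wiring described above, assuming the standard TFX components and the BigQuery extension; the query, module files, and step counts are placeholders:

```python
from tfx import v1 as tfx

# Placeholder query and module files. When the pipeline's Beam jobs run on the
# DataflowRunner, ExampleGen and Transform do their full-pass work on Dataflow.
example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
    query='SELECT price, sku_name, label FROM `project.dataset.table`')

statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'])

transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file='preprocessing.py')  # contains the preprocessing_fn

trainer = tfx.components.Trainer(
    module_file='trainer.py',
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    train_args=tfx.proto.TrainArgs(num_steps=1000),
    eval_args=tfx.proto.EvalArgs(num_steps=100))
```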

daidai75 (Option: C)
Jan 23, 2024

Key messages: "contains hundreds of millions of rows, and includes both categorical and numerical features. You need to use a MaxMin scaler on some numerical features, and apply a one-hot encoding to some categorical features such as SKU names". Option B is not suitable for processing this volume of data. Option C is better.

gscharly (Option: C)
Apr 20, 2024

agree with daidai75

pinimichele01
Apr 26, 2024

Option B is not suitable for large-volume data processing? BigQuery is not suitable for big volumes? For me it's B.

bobjr (Option: D)
Jun 4, 2024

GPT says D, Gemini says B, Perplexity says C... I say D: stay in one tool, BigQuery, which is cheap and natively scalable. B has a risk of an out-of-memory error.
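For context on the option D approach, a sketch (hypothetical names) of what in-BigQuery one-hot encoding usually looks like: one 0/1 column per category, which is why it becomes hard to maintain once the SKU list runs into the thousands:

```python
from google.cloud import bigquery

# Hypothetical names; with thousands of SKUs this SELECT list would have to be
# generated dynamically, one 0/1 column per SKU value.
one_hot_query = """
SELECT
  price,
  IF(sku_name = 'sku-001', 1, 0) AS sku_001,
  IF(sku_name = 'sku-002', 1, 0) AS sku_002,
  IF(sku_name = 'sku-003', 1, 0) AS sku_003
FROM `project.dataset.raw_features`
"""
rows = bigquery.Client().query(one_hot_query).result()
```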

cruise93 (Option: C)
Apr 24, 2024

Agree with b1a8fae

fitri001 (Option: B)
Apr 27, 2024

BigQuery for preprocessing: BigQuery is a serverless data warehouse optimized for large datasets. It can handle scaling numerical features with built-in functions (for example, ML.MIN_MAX_SCALER in BigQuery ML), reducing the need for complex custom logic or separate lookup tables. TensorFlow for one-hot encoding: TensorFlow excels at in-memory processing. One-hot encoding of categorical features, especially text features like SKU names, can be performed efficiently within your TensorFlow model during training, which avoids unnecessary data movement or transformations in BigQuery. Vertex AI Training: by feeding the preprocessed data (scaled numerical features) directly into Vertex AI Training, you leverage its managed infrastructure for training your custom TensorFlow model.

fitri001
Apr 27, 2024

Option A: creates unnecessary complexity and data movement; BigQuery is better suited for scaling numerical features, and TensorFlow is efficient for one-hot encoding. Option C: TFX is a powerful framework for complex pipelines, but for a simpler scenario like this it may be overkill; additionally, exporting data as TFRecords adds an extra step, potentially increasing cost and complexity. Option D: one-hot encoding in BigQuery can be cumbersome for textual features like SKU names; it can be computationally expensive and result in a data explosion. TensorFlow handles this efficiently within the model.

dija123 (Option: C)
Jul 2, 2024

Agree with using TFX components with Dataflow.

AzureDP900 (Option: C)
Jul 5, 2024

Option C uses TFX (TensorFlow Extended) components with Dataflow, which is a great way to perform complex data preprocessing tasks like one-hot encoding and scaling. This approach allows you to process your data in a scalable and efficient manner, using Cloud Storage as the output location. By exporting the results as TFRecords, you can easily feed this preprocessed data into Vertex AI Training for model development.