
Professional Machine Learning Engineer Exam - Question 33


You have a demand forecasting pipeline in production that uses Dataflow to preprocess raw data prior to model training and prediction. During preprocessing, you employ Z-score normalization on data stored in BigQuery and write it back to BigQuery. New training data is added every week. You want to make the process more efficient by minimizing computation time and manual intervention. What should you do?

Correct Answer: B

Translating the normalization algorithm into SQL for use with BigQuery will streamline the process, leveraging the built-in compute power of BigQuery to handle data transformations efficiently. This approach minimizes the need for additional services, reducing computation time and manual intervention by processing data directly where it is stored.
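As a sketch, the z-score transform can be expressed directly in BigQuery SQL using analytic aggregate functions. The table and column names below are illustrative placeholders, not from the question:

```sql
-- Z-score normalization entirely inside BigQuery: (x - mean) / stddev.
-- `project.dataset.training_data` and the `demand` column are hypothetical names.
CREATE OR REPLACE TABLE `project.dataset.training_data_normalized` AS
SELECT
  * EXCEPT (demand),
  (demand - AVG(demand) OVER ()) / STDDEV(demand) OVER () AS demand_zscore
FROM
  `project.dataset.training_data`;
```

Saving this as a weekly BigQuery scheduled query would also remove the manual intervention as new training data arrives.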

Discussion

14 comments
maartenalexander (Option: B)
Jun 22, 2021

B, I think. BigQuery definitely minimizes computation time for normalization, and I think it also minimizes manual intervention. For data normalization in Dataflow you'd have to pass the mean and standard deviation values in as a side input. That seems like more work than a simple SQL query.

93alejandrosanchez
Oct 21, 2021

I agree that B would definitely get the job done. But wouldn't D work as well and keep all the data pre-processing in Dataflow?

kaike_reis
Nov 13, 2021

Dataflow uses Beam, unlike Dataproc, which uses Spark. I think D would be wrong because we would be adding one more service to the pipeline for a simple transformation (subtract the mean and divide by the standard deviation).

Mohamed_Mossad (Option: B)
Jun 13, 2022

B is the most efficient, as you avoid the load --> process --> save cycle; you just write some SQL in BigQuery and voila :D

alashin (Option: B)
Jul 6, 2021

I agree with B as well.

xiaoF (Option: B)
Feb 1, 2022

I agree with B.

baimus (Option: B)
Mar 17, 2022

It's B; BigQuery can do this internally, no need for Dataflow.

Fatiy
Feb 28, 2023

SQL is not as flexible as other programming languages like Python, which can limit the ability to customize the normalization process or incorporate new features in the future.

ares81 (Option: B)
Jan 11, 2023

The best way to proceed is B.

Fatiy (Option: D)
Feb 28, 2023

Option D is the best solution because Apache Spark provides a distributed computing platform that can handle large-scale data processing with ease. By using the Dataproc connector for BigQuery, Spark can read data directly from BigQuery and perform the normalization in a distributed manner. This can significantly reduce computation time and manual intervention.

Option A is not a good solution because Kubernetes is a container orchestration platform that does not directly provide data normalization capabilities. Option B is not a good solution because Z-score normalization is a data transformation technique that cannot be easily translated into SQL. Option C is not a good solution because the normalizer_fn argument in TensorFlow's Feature Column API applies only to feature normalization during model training, not to data preprocessing.

SergioRubiano (Option: B)
Mar 24, 2023

The best way is B.

M25 (Option: B)
May 9, 2023

Went with B

elenamatay (Option: B)
Sep 12, 2023

B. Everything maartenalexander said, plus BigQuery already has a built-in function for this: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-standard-scaler. We could even schedule the query so it is recalculated automatically :)
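The built-in scaler mentioned above can replace the hand-written formula. A minimal sketch, again with placeholder table and column names:

```sql
-- Same z-score transform using BigQuery ML's built-in standard scaler.
-- `project.dataset.training_data` and `demand` are hypothetical names.
SELECT
  ML.STANDARD_SCALER(demand) OVER () AS demand_zscore
FROM
  `project.dataset.training_data`;
```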

SamuelTsch (Option: B)
Jul 7, 2023

A and D usually need additional configuration, which could cost much more time.

aaggii (Option: C)
Jul 14, 2023

Every week, when new data is loaded, the mean and standard deviation are calculated for it and passed as parameters to compute the z-score at serving: https://towardsdatascience.com/how-to-normalize-features-in-tensorflow-5b7b0e3a4177

tavva_prudhvi
Jul 20, 2023

However, in the given scenario, you are using Dataflow for preprocessing and BigQuery for storing data. To make the process more efficient by minimizing computation time and manual intervention, you should still opt for option B: translate the normalization algorithm into SQL for use with BigQuery. That way you can perform the normalization directly in BigQuery, which saves time and resources compared to using an external tool.

Sum_Sum (Option: B)
Nov 15, 2023

Z-scores are very easy to compute in BQ; no need for more complex solutions.

PhilipKoku (Option: B)
Jun 6, 2024

B) Using BigQuery