
Professional Machine Learning Engineer Exam - Question 33


You have a demand forecasting pipeline in production that uses Dataflow to preprocess raw data prior to model training and prediction. During preprocessing, you employ Z-score normalization on data stored in BigQuery and write it back to BigQuery. New training data is added every week. You want to make the process more efficient by minimizing computation time and manual intervention. What should you do?

Correct Answer: B

Translating the normalization algorithm into SQL for use with BigQuery will streamline the process, leveraging the built-in compute power of BigQuery to handle data transformations efficiently. This approach minimizes the need for additional services, reducing computation time and manual intervention by processing data directly where it is stored.
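As a sketch, the z-score transform can be expressed directly in BigQuery SQL using analytic aggregate functions. The table and column names below are illustrative placeholders, not from the question:

```sql
-- Z-score normalization entirely inside BigQuery: (x - mean) / stddev.
-- `project.dataset.training_data` and the `demand` column are hypothetical names.
CREATE OR REPLACE TABLE `project.dataset.training_data_normalized` AS
SELECT
  * EXCEPT (demand),
  (demand - AVG(demand) OVER ()) / STDDEV(demand) OVER () AS demand_zscore
FROM
  `project.dataset.training_data`;
```

Saving this as a weekly BigQuery scheduled query would also remove the manual intervention as new training data arrives.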

Discussion

14 comments
maartenalexander (Option: B)
Jun 22, 2021

B, I think. BigQuery definitely minimizes computation time for normalization, and I think it also minimizes manual intervention. For data normalization in Dataflow you'd have to pass the mean and standard deviation values in as a side input. That seems like more work than a simple SQL query.

93alejandrosanchez
Oct 21, 2021

I agree that B would definitely get the job done. But wouldn't D work as well and keep all the data pre-processing in Dataflow?

kaike_reis
Nov 13, 2021

Dataflow uses Beam, unlike Dataproc, which uses Spark. I think D would be wrong because we would be adding one more service to the pipeline for a simple transformation (subtract the mean and divide by the standard deviation).

Mohamed_Mossad (Option: B)
Jun 13, 2022

B is the most efficient, as you avoid the load --> process --> save cycle; you just write some SQL in BigQuery and voila :D

alashin (Option: B)
Jul 6, 2021

I agree with B as well.

xiaoF (Option: B)
Feb 1, 2022

I agree with B.

baimus (Option: B)
Mar 17, 2022

It's B; BigQuery can do this internally, no need for Dataflow.

Fatiy
Feb 28, 2023

SQL is not as flexible as other programming languages like Python, which can limit the ability to customize the normalization process or incorporate new features in the future.

ares81 (Option: B)
Jan 11, 2023

The best way to proceed is B.

Fatiy (Option: D)
Feb 28, 2023

Option D is the best solution because Apache Spark provides a distributed computing platform that can handle large-scale data processing with ease. By using the Dataproc connector for BigQuery, Spark can read data directly from BigQuery and perform the normalization in a distributed manner. This can significantly reduce computation time and manual intervention.

Option A is not a good solution because Kubernetes is a container orchestration platform that does not directly provide data normalization capabilities. Option B is not a good solution because Z-score normalization is a data transformation technique that cannot be easily translated into SQL. Option C is not a good solution because the normalizer_fn argument in TensorFlow's Feature Column API applies only to feature normalization during model training, not to data preprocessing.

SergioRubiano (Option: B)
Mar 24, 2023

The best way is B.

M25 (Option: B)
May 9, 2023

Went with B

elenamatay (Option: B)
Sep 12, 2023

B. Everything maartenalexander said, plus BigQuery already has a built-in function for this: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-standard-scaler. We could even schedule the query so it is recalculated automatically :)
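The built-in scaler mentioned above can replace the hand-written formula. A minimal sketch, again with placeholder table and column names:

```sql
-- Same z-score transform using BigQuery ML's built-in standard scaler.
-- `project.dataset.training_data` and `demand` are hypothetical names.
SELECT
  ML.STANDARD_SCALER(demand) OVER () AS demand_zscore
FROM
  `project.dataset.training_data`;
```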

SamuelTsch (Option: B)
Jul 7, 2023

A and D usually need additional configuration, which could cost much more time.

aaggii (Option: C)
Jul 14, 2023

Every week, when new data is loaded, the mean and standard deviation are calculated for it and passed as parameters to compute the z-score at serving: https://towardsdatascience.com/how-to-normalize-features-in-tensorflow-5b7b0e3a4177

tavva_prudhvi
Jul 20, 2023

However, in the given scenario, you are using Dataflow for preprocessing and BigQuery for storing data. To make the process more efficient by minimizing computation time and manual intervention, you should still opt for option B: translate the normalization algorithm into SQL for use with BigQuery. That way you can perform the normalization directly in BigQuery, which saves time and resources compared to using an external tool.

Sum_Sum (Option: B)
Nov 15, 2023

Z-scores are very easy to compute in BQ; no need for more complex solutions.

PhilipKoku (Option: B)
Jun 6, 2024

B) Using BigQuery