Professional Data Engineer Exam - Question 151

Question

You work for an advertising company, and you've developed a Spark ML model to predict click-through rates at advertisement blocks. You've been developing everything at your on-premises data center, and now your company is migrating to Google Cloud. Your data center will be closing soon, so a rapid lift-and-shift migration is necessary. However, the data you've been using will be migrated to BigQuery. You periodically retrain your Spark ML models, so you need to migrate existing training pipelines to Google Cloud. What should you do?

Examice · Accepted Answer

To migrate the existing Spark ML model training pipelines quickly to Google Cloud, using Dataproc is the most suitable option. Dataproc is a fully managed Spark and Hadoop service that allows you to run Spark jobs seamlessly on Google Cloud. It integrates well with BigQuery, enabling you to read data directly from BigQuery into Dataproc without any need for complex data transformations or exports. This approach leverages the scalability and performance benefits of both Dataproc and BigQuery, allowing for a rapid lift-and-shift migration without the need for significant rewrites or code changes.
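
As a rough illustration of what "read data directly from BigQuery" can look like in a Spark ML training job, here is a minimal PySpark sketch. It assumes the spark-bigquery connector is available on the cluster. The table ID, column names, and the choice of logistic regression are hypothetical placeholders, not details from the question.

```python
from typing import List

# Hypothetical names: replace with your own table and columns.
TABLE = "my-project.ads.click_events"
FEATURES = ["position", "hour_of_day", "block_width"]
LABEL = "clicked"  # assumed to be a numeric 0/1 column


def needed_columns(features: List[str], label: str) -> List[str]:
    """Columns to select so the job only keeps what training needs."""
    return [*features, label]


def train(table: str = TABLE):
    # Imported here so the helper above works without a Spark installation.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ctr-training").getOrCreate()

    # The main change from an on-premises job: the source is a BigQuery
    # table read through the spark-bigquery connector, not HDFS files.
    df = (
        spark.read.format("bigquery")
        .option("table", table)
        .load()
        .select(*needed_columns(FEATURES, LABEL))
    )

    # The rest of the Spark ML pipeline stays as it was on premises.
    assembler = VectorAssembler(inputCols=FEATURES, outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol=LABEL)
    return Pipeline(stages=[assembler, lr]).fit(df)
```

Calling `train()` inside a Dataproc job would return a fitted `PipelineModel`. Only the data-loading step differs from a typical on-premises version, which is why this path fits a lift-and-shift.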

vamgcp · Answer

Option C: It is the most rapid way to migrate your existing training pipelines to Google Cloud.
It allows you to continue using your existing Spark ML models.
It allows you to take advantage of the scalability and performance of Dataproc.
It allows you to read data directly from BigQuery, which is a more efficient way to process large datasets.

vaga1 · Answer

The question is: is it faster to move a Spark ML job to Vertex AI or to Dataproc? I am personally not sure. I would go for Dataproc, as notebooks are not mentioned, but consider this Google article:

https://cloud.google.com/blog/topics/developers-practitioners/announcing-serverless-spark-components-vertex-ai-pipelines/

"Dataproc Serverless components for Vertex AI Pipelines that further simplify MLOps for Spark, Spark SQL, PySpark and Spark jobs."

KC_go_reply · Answer

It is obviously C) Dataproc: we don't want to rewrite the training from scratch, Dataproc is strongly preferred for anything in the Hadoop/Spark ecosystem, and Vertex AI doesn't support *training* with Spark ML (only deploying existing models).

ckanaar · Answer

The updated answer seems to be A, based on the following article:

https://cloud.google.com/blog/topics/developers-practitioners/announcing-serverless-spark-components-vertex-ai-pipelines/

MaxNRG · Answer

Use Cloud Dataproc, BigQuery, and Apache Spark ML for Machine Learning
https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml
Using Apache Spark with TensorFlow on Google Cloud Platform
https://cloud.google.com/blog/products/gcp/using-apache-spark-with-tensorflow-on-google-cloud-platform

knith66 · Answer

If you wanted to use Vertex AI for training Spark ML models, you would typically need to convert your Spark ML code to another supported machine learning framework like TensorFlow or scikit-learn. Then you could use Vertex AI's pre-built training and prediction services for those frameworks.

barnac1es · Answer

Dataproc for Spark: Google Cloud Dataproc is a managed Spark and Hadoop service that allows you to run Spark jobs seamlessly on Google Cloud. It provides the flexibility to run Spark jobs using Spark MLlib and other Spark libraries.

BigQuery Integration: You mentioned that your data is being migrated to BigQuery. Dataproc has native integration with BigQuery, allowing you to read data directly from BigQuery tables. This eliminates the need to export data from BigQuery to another storage system before processing it with Spark.

Rapid Migration: This approach allows you to quickly migrate your existing Spark ML models and training pipelines without the need for a complete rewrite or extensive changes to your existing workflows. You can continue using your Spark ML models while adapting them to read data from BigQuery.

Takshashila · Answer

Use Dataproc for training existing Spark ML models, but start reading data directly from BigQuery

blathul · Answer

Dataproc is a managed Spark and Hadoop service on Google Cloud, which makes it an ideal choice for migrating your existing Spark ML training pipelines. By using Dataproc, you can continue to leverage Spark and its ML capabilities without the need for significant code changes or rewriting your models.
By combining Dataproc and BigQuery, you can create Spark jobs or workflows in Dataproc that read data from BigQuery and train your existing Spark ML models. This approach allows you to quickly migrate your training pipelines to Google Cloud and take advantage of the scalability and performance benefits of both Dataproc and BigQuery.

wan2three · Answer

Through Vertex AI Workbench, Vertex AI is natively integrated with BigQuery, Dataproc, and Spark. You can use BigQuery ML to create and execute machine learning models in BigQuery using standard SQL queries on existing business intelligence tools and spreadsheets, or you can export datasets from BigQuery directly into Vertex AI Workbench and run your models from there. 
https://cloud.google.com/vertex-ai#all-features:~:text=Data%20and%20AI%20integration

FP77 · Answer

The answer is C. Spin up a Cloud Dataproc cluster, migrate the Spark jobs there, and link the cluster to BigQuery with the connector. It's a straightforward solution.
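
A hedged sketch of the "migrate Spark jobs and link with the connector" step: the helper below only assembles a `gcloud dataproc jobs submit pyspark` command that attaches the BigQuery connector jar. The script URI, cluster name, and region are hypothetical. The jar path is the one published in the connector's documentation, so verify it (and ideally pin a version) before relying on it. Recent Dataproc images may already bundle the connector, in which case `--jars` is unnecessary.

```python
import shlex
from typing import List

# Jar location as published in the spark-bigquery connector docs (assumption:
# still current; pin an explicit version for production use).
CONNECTOR_JAR = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"


def submit_command(
    script_uri: str, cluster: str, region: str, jar: str = CONNECTOR_JAR
) -> List[str]:
    """Build the gcloud command that submits a PySpark job to Dataproc."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script_uri,
        f"--cluster={cluster}",
        f"--region={region}",
        f"--jars={jar}",
    ]


# Hypothetical values for illustration only.
cmd = submit_command(
    "gs://my-bucket/jobs/train_ctr.py", "ctr-training", "us-central1"
)
print(shlex.join(cmd))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` on a machine with the Google Cloud CLI installed and authenticated.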

DeepakVenkatachalam · Answer

They are talking about a rapid lift and shift, in which case a Dataproc cluster is the right choice for the Spark ML models. So I think the answer is C.

Nandababy · Answer

Why not option D? Spinning up the Spark cluster on Compute Engine could potentially be the best approach for a rapid migration, as the team won't have to rework the model (maybe only a few configuration changes), and getting data from BigQuery, which is required only periodically rather than all the time, could be easy.
Dataproc would need more code changes, which could eventually take more time.
Vertex AI doesn't support Spark ML natively, and training would be a black box.

For me, the answer should be D.

Matt_108 · Answer

Option C, agreed with other comments

GCP001 · Answer

C looks more suitable as the data is already in BigQuery.
Ref - https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml

mothkuri · Answer

C
The question is about a rapid lift and shift, so code changes should be minimal.

Anudeep58 · Answer

Vertex AI is better suited for TensorFlow or scikit-learn models. Direct Spark ML support isn't native to Vertex AI, making this a less straightforward migration path.
