Professional Data Engineer Exam - Question 151


You work for an advertising company, and you've developed a Spark ML model to predict click-through rates at advertisement blocks. You've been developing everything at your on-premises data center, and now your company is migrating to Google Cloud. Your data center will be closing soon, so a rapid lift-and-shift migration is necessary. However, the data you've been using will be migrated to BigQuery. You periodically retrain your Spark ML models, so you need to migrate existing training pipelines to Google Cloud. What should you do?

Correct Answer: AC

To migrate the existing Spark ML model training pipelines quickly to Google Cloud, using Dataproc is the most suitable option. Dataproc is a fully managed Spark and Hadoop service that allows you to run Spark jobs seamlessly on Google Cloud. It integrates well with BigQuery, enabling you to read data directly from BigQuery into Dataproc without any need for complex data transformations or exports. This approach leverages the scalability and performance benefits of both Dataproc and BigQuery, allowing for a rapid lift-and-shift migration without the need for significant rewrites or code changes.
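As a rough illustration of reading BigQuery data directly into a Spark ML training job, here is a minimal PySpark sketch. It assumes the Spark BigQuery connector is available on the cluster. The function names, the table name, and the column names are hypothetical, and the `VectorAssembler` plus `LogisticRegression` stages are only stand-ins for whatever pipeline already exists on-premises.

```python
from typing import Dict, List


def bigquery_read_options(table: str) -> Dict[str, str]:
    """Options handed to the Spark BigQuery connector's DataFrame reader."""
    return {"table": table}


def train_ctr_model(spark, table: str, feature_cols: List[str], label_col: str):
    """Fit a Spark ML pipeline on rows read directly from BigQuery.

    `spark` is an existing SparkSession. pyspark is imported lazily so the
    helper above can be used without a Spark installation.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Read the table through the connector instead of exporting files first.
    df = (
        spark.read.format("bigquery")
        .options(**bigquery_read_options(table))
        .load()
    )

    # Placeholder stages (assumption): the existing pipeline would go here.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    classifier = LogisticRegression(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[assembler, classifier]).fit(df)
```

On a cluster this might be called as `train_ctr_model(spark, "my-project.ads.clicks", ["position", "hour"], "clicked")`, where every argument value is made up for the example. The only part that changes relative to an on-premises job is the read step, which is why this route tends to count as lift-and-shift.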

Discussion

17 comments
vamgcp Option: C
Jul 26, 2023

Option C: It is the most rapid way to migrate your existing training pipelines to Google Cloud. It allows you to continue using your existing Spark ML models. It allows you to take advantage of the scalability and performance of Dataproc. It allows you to read data directly from BigQuery, which is a more efficient way to process large datasets.

vaga1 Option: A
May 10, 2023

The question is: is it faster to move a Spark ML job to Vertex AI or to Dataproc? I am personally not sure. I would go for Dataproc, as notebooks are not mentioned, but the Google article https://cloud.google.com/blog/topics/developers-practitioners/announcing-serverless-spark-components-vertex-ai-pipelines/ says: "Dataproc Serverless components for Vertex AI Pipelines that further simplify MLOps for Spark, Spark SQL, PySpark and Spark jobs."

emmylou
Nov 20, 2023

But you would need to rewrite your models, which can be a blocker.

KC_go_reply Option: C
Jun 22, 2023

It is obviously C) Dataproc: we don't want to rewrite the training from scratch, Dataproc is highly preferred for anything in the Hadoop/Spark ecosystem, and Vertex AI doesn't support *training* with Spark ML (only deploying existing models).

ckanaar Option: A
Sep 20, 2023

The updated answer seems to be A, based on the following article: https://cloud.google.com/blog/topics/developers-practitioners/announcing-serverless-spark-components-vertex-ai-pipelines/

MaxNRG Option: C
Dec 19, 2023

Use Cloud Dataproc, BigQuery, and Apache Spark ML for Machine Learning: https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml
Using Apache Spark with TensorFlow on Google Cloud Platform: https://cloud.google.com/blog/products/gcp/using-apache-spark-with-tensorflow-on-google-cloud-platform

knith66 Option: C
Jul 27, 2023

If you wanted to use Vertex AI for training Spark ML models, you would typically need to convert your Spark ML code to another supported machine learning framework like TensorFlow or scikit-learn. Then you could use Vertex AI's pre-built training and prediction services for those frameworks.

barnac1es Option: C
Sep 24, 2023

Dataproc for Spark: Google Cloud Dataproc is a managed Spark and Hadoop service that allows you to run Spark jobs seamlessly on Google Cloud. It provides the flexibility to run Spark jobs using Spark MLlib and other Spark libraries.
BigQuery Integration: You mentioned that your data is being migrated to BigQuery. Dataproc has native integration with BigQuery, allowing you to read data directly from BigQuery tables. This eliminates the need to export data from BigQuery to another storage system before processing it with Spark.
Rapid Migration: This approach allows you to quickly migrate your existing Spark ML models and training pipelines without the need for a complete rewrite or extensive changes to your existing workflows. You can continue using your Spark ML models while adapting them to read data from BigQuery.

Takshashila Option: C
Jun 16, 2023

Use Dataproc for training existing Spark ML models, but start reading data directly from BigQuery

blathul Option: C
Jun 23, 2023

Dataproc is a managed Spark and Hadoop service on Google Cloud, which makes it an ideal choice for migrating your existing Spark ML training pipelines. By using Dataproc, you can continue to leverage Spark and its ML capabilities without the need for significant code changes or rewriting your models. By combining Dataproc and BigQuery, you can create Spark jobs or workflows in Dataproc that read data from BigQuery and train your existing Spark ML models. This approach allows you to quickly migrate your training pipelines to Google Cloud and take advantage of the scalability and performance benefits of both Dataproc and BigQuery.

wan2three Option: A
Jul 16, 2023

Through Vertex AI Workbench, Vertex AI is natively integrated with BigQuery, Dataproc, and Spark. You can use BigQuery ML to create and execute machine learning models in BigQuery using standard SQL queries on existing business intelligence tools and spreadsheets, or you can export datasets from BigQuery directly into Vertex AI Workbench and run your models from there. https://cloud.google.com/vertex-ai#all-features:~:text=Data%20and%20AI%20integration

FP77 Option: C
Aug 27, 2023

The answer is C. Spin up a Cloud Dataproc cluster, migrate the Spark jobs there, and link the cluster to BigQuery with the connector. It's a straightforward solution.
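The submit step described above could look roughly like the sketch below, which only assembles a `gcloud dataproc jobs submit pyspark` command line. The helper name, script name, cluster name, and region are hypothetical. The connector jar path is an assumption based on the location used in Google's Dataproc tutorials, so check it against the current documentation (recent Dataproc images may already bundle the connector, in which case `--jars` is unnecessary).

```python
import shlex
from typing import List


def dataproc_submit_command(
    script: str, cluster: str, region: str, connector_jar: str
) -> List[str]:
    """Build the gcloud argument list for submitting a PySpark job to Dataproc."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script,
        f"--cluster={cluster}",
        f"--region={region}",
        # Makes the Spark BigQuery connector available to the job.
        f"--jars={connector_jar}",
    ]


if __name__ == "__main__":
    cmd = dataproc_submit_command(
        script="train_ctr.py",            # hypothetical training script
        cluster="ctr-training",           # hypothetical cluster name
        region="us-central1",             # hypothetical region
        # Assumed jar location; verify before use.
        connector_jar="gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar",
    )
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps quoting predictable if it is later passed to `subprocess.run`.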

DeepakVenkatachalam Option: C
Sep 21, 2023

They are talking about a rapid lift and shift, in which case a Dataproc cluster is the right choice for the Spark ML models. So I think the answer is C.

Nandababy Option: D
Dec 17, 2023

Why not option D? Spinning up a Spark cluster on Compute Engine could be the best approach for a rapid migration, since the team won't have to rework the model (maybe only a few configuration changes), and getting data from BigQuery, which is needed only periodically rather than all the time, could be easy. Dataproc would require more code changes, which could eventually take more time. Vertex AI doesn't support Spark ML natively, and training would be a black box. For me, the answer should be D.

Matt_108 Option: C
Jan 13, 2024

Option C, agreed with other comments

GCP001 Option: C
Jan 17, 2024

C looks more suitable, as the data is already in BigQuery. Ref: https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml

mothkuri Option: C
Mar 2, 2024

C. The question is about a rapid lift and shift, so code changes should be minimal.

Anudeep58 Option: C
Jul 7, 2024

Vertex AI is better suited for TensorFlow or scikit-learn models. Direct Spark ML support isn't native to Vertex AI, making this a less straightforward migration path.