Professional Data Engineer Exam Questions

Professional Data Engineer Exam - Question 17


Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster. What should you do?

A. Create a Google Cloud Dataflow job to process the data.
B. Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.
C. Create a Hadoop cluster on Google Compute Engine that uses persistent disks.
D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.
E. Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.

Correct Answer: D

To migrate an existing 30-node Apache Hadoop cluster to the cloud while minimizing cluster management and enabling data persistence beyond the cluster's life, creating a Cloud Dataproc cluster that uses the Google Cloud Storage connector is the best approach. Cloud Dataproc is a managed Hadoop service, making it easier to manage and operate compared to self-managed solutions. Using the Google Cloud Storage connector allows data to be stored in Google Cloud Storage, which remains intact even if the cluster is shut down, ensuring data persistence beyond the cluster's lifecycle.
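
For illustration, here is a minimal sketch of creating such a cluster with the google-cloud-dataproc Python client, assuming hypothetical project, region, and cluster names. The Cloud Storage connector is preinstalled on Dataproc images, so jobs can reference gs:// paths with no extra setup.

    from google.cloud import dataproc_v1

    project_id = "my-project"                   # hypothetical
    region = "us-central1"                      # hypothetical
    cluster_name = "hadoop-migration-cluster"   # hypothetical

    # Dataproc clients must point at the regional endpoint.
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # A 30-worker cluster mirroring the on-premises node count.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 30, "machine_type_uri": "n1-standard-4"},
        },
    }

    # create_cluster returns a long-running operation; .result() blocks until done.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    print(f"Cluster created: {operation.result().cluster_name}")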

Discussion

17 comments
[Removed] | Option: D
Mar 27, 2020

Answer: D. Description: Dataproc is the service for migrating Hadoop and Spark jobs to GCP. Dataproc connected to GCS through the Google Cloud Storage connector lets data persist beyond the life of the cluster. When a job is highly I/O-intensive, a small persistent disk should be created.
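
To illustrate the job re-use point, a minimal sketch of submitting an existing Hadoop jar to Dataproc with gs:// input and output paths, assuming the google-cloud-dataproc client and hypothetical bucket and jar names:

    from google.cloud import dataproc_v1

    project_id = "my-project"                   # hypothetical
    region = "us-central1"                      # hypothetical
    cluster_name = "hadoop-migration-cluster"   # hypothetical

    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    job = {
        "placement": {"cluster_name": cluster_name},
        "hadoop_job": {
            # Existing job jar, staged to GCS; the jar itself runs unchanged.
            "main_jar_file_uri": "gs://my-bucket/jars/existing-hadoop-job.jar",
            # Only the I/O paths move from hdfs:// to gs://, courtesy of
            # the preinstalled Cloud Storage connector.
            "args": ["gs://my-bucket/input/", "gs://my-bucket/output/"],
        },
    }

    operation = job_client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    response = operation.result()
    print(f"Job finished: {response.driver_output_resource_uri}")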

[Removed] | Option: D
Mar 16, 2020

Correct: D

rtcpost | Option: D
Oct 22, 2023

D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector. Google Cloud Dataproc is a managed Hadoop and Spark service that allows you to easily create and manage Hadoop clusters in the cloud. By using the Google Cloud Storage connector, you can persist data in Google Cloud Storage, which provides durable storage beyond the cluster's lifecycle. This approach ensures data is retained even if the cluster is terminated, and it allows you to reuse your existing Hadoop jobs. Option B (Creating a Dataproc cluster that uses persistent disks for HDFS) is another valid choice. However, using Google Cloud Storage for data storage and processing is often more cost-effective and scalable, especially when migrating to the cloud. Options A, C, and E do not take full advantage of Google Cloud's services and the benefits of cloud-native data storage and processing with Google Cloud Storage and Dataproc.

kmaiti | Option: D
Apr 11, 2022

Two key points:
- Managed Hadoop cluster: Dataproc
- Persistent storage: GCS (Dataproc uses the GCS connector to connect to GCS)

bha11111 | Option: D
Mar 11, 2023

Hadoop --> Dataproc. Persistent storage after the processing --> GCS.

Asheesh1909
Jun 6, 2022

Isn't it A and D both? Dataflow for reusable jobs and GCS for data persistence?

crisimenjivar | Option: D
Aug 19, 2022

Answer: D

nkunwar | Option: D
Sep 28, 2022

The Dataproc cluster can be set up as ephemeral: it runs the Hadoop jobs and can be killed after job execution, which deletes the cluster-attached storage along with the cluster.
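
A minimal sketch of that ephemeral pattern, reusing the hypothetical names from the earlier snippets: deleting the cluster removes its VMs and attached disks, while anything already written to gs:// survives.

    from google.cloud import dataproc_v1

    project_id = "my-project"    # hypothetical
    region = "us-central1"       # hypothetical

    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Deleting the cluster removes its VMs and the disks attached to them
    # (including any HDFS data); objects already in gs:// are untouched.
    operation = cluster_client.delete_cluster(
        request={
            "project_id": project_id,
            "region": region,
            "cluster_name": "hadoop-migration-cluster",  # hypothetical
        }
    )
    operation.result()  # block until deletion completes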

achafill | Option: D
Oct 17, 2022

Correct Answer: D

assU2 | Option: D
Nov 5, 2022

Seems like it is D. https://cloud.google.com/dataproc/docs/concepts/dataproc-hdfs I never saw them mention persistent disks, although those are not deleted with the clusters...

assU2
Nov 5, 2022

although: By default, when no local SSDs are provided, HDFS data and intermediate shuffle data is stored on VM boot disks, which are Persistent Disks.

assU2
Nov 5, 2022

and it says that only VM Boot disks are deleted when the cluster is deleted.
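
Tying this thread together: the disk layout assU2 describes is configurable at cluster creation time. A minimal sketch of the relevant fields using the Dataproc v1 Python types; the values are purely illustrative, not recommendations.

    from google.cloud import dataproc_v1

    # DiskConfig controls the disks for an instance group; field names are
    # from the Dataproc v1 API.
    disk_config = dataproc_v1.DiskConfig(
        boot_disk_type="pd-standard",  # VM boot disks are Persistent Disks
        boot_disk_size_gb=500,         # holds HDFS + shuffle data when no local SSDs
        num_local_ssds=1,              # >0 moves HDFS/shuffle data onto local SSDs
    )

    worker_config = dataproc_v1.InstanceGroupConfig(
        num_instances=30,
        machine_type_uri="n1-standard-4",
        disk_config=disk_config,
    )

    # worker_config would be passed as cluster["config"]["worker_config"]
    # in the create_cluster call shown earlier.
    print(worker_config)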

Nirca | Option: D
Dec 11, 2022

D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.

korntewin | Option: D
Jan 7, 2023

The answer is D! With Dataproc there is no need for us to manage the infrastructure, and Cloud Storage needs no management from us either!

samdhimal | Option: D
Jan 11, 2023

D seems right. Cloud Storage can be used to keep data even after the life of the cluster.

kshehadyx | Option: D
Sep 15, 2023

Correct D

suku2 | Option: D
Sep 15, 2023

D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector. Dataproc clusters can be created to lift and shift existing Hadoop jobs. Data stored in Google Cloud Storage extends beyond the life of a Dataproc cluster.

imran79 | Option: D
Oct 7, 2023

D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector. Here's why: Cloud Dataproc allows you to run Apache Hadoop jobs with minimal management. It is a managed Hadoop service. Using the Google Cloud Storage (GCS) connector, Dataproc can access data stored in GCS, which allows data persistence beyond the life of the cluster. This means that even if the cluster is deleted, the data in GCS remains intact. Moreover, using GCS is often cheaper and more durable than using HDFS on persistent disks.

fahadminhas | Option: B
Jun 28, 2024

Option D is incorrect, as it would not provide persistent HDFS storage within the cluster itself; rather, B should be the correct answer.