Professional Data Engineer Exam Questions

Professional Data Engineer Exam - Question 17


Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster. What should you do?

A. Create a Google Cloud Dataflow job to process the data.
B. Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.
C. Create a Hadoop cluster on Google Compute Engine that uses persistent disks.
D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.
E. Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.

Correct Answer: D

To migrate an existing 30-node Apache Hadoop cluster to the cloud while minimizing cluster management and enabling data persistence beyond the cluster's life, creating a Cloud Dataproc cluster that uses the Google Cloud Storage connector is the best approach. Cloud Dataproc is a managed Hadoop service, making it easier to manage and operate compared to self-managed solutions. Using the Google Cloud Storage connector allows data to be stored in Google Cloud Storage, which remains intact even if the cluster is shut down, ensuring data persistence beyond the cluster's lifecycle.
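
For illustration, here is a minimal sketch of creating such a cluster with the google-cloud-dataproc Python client, assuming hypothetical project, region, and cluster names. The Cloud Storage connector is preinstalled on Dataproc images, so jobs can reference gs:// paths with no extra setup.

    from google.cloud import dataproc_v1

    project_id = "my-project"                   # hypothetical
    region = "us-central1"                      # hypothetical
    cluster_name = "hadoop-migration-cluster"   # hypothetical

    # Dataproc clients must point at the regional endpoint.
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # A 30-worker cluster mirroring the on-premises node count.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 30, "machine_type_uri": "n1-standard-4"},
        },
    }

    # create_cluster returns a long-running operation; .result() blocks until done.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    print(f"Cluster created: {operation.result().cluster_name}")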

Discussion

17 comments
[Removed] | Option: D
Mar 27, 2020

Answer: D. Description: Dataproc is the service for migrating Hadoop and Spark jobs to GCP. Dataproc connected to GCS through the Google Cloud Storage connector lets data persist beyond the life of the cluster. When a job is highly I/O-intensive, a small persistent disk should be created.
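
To illustrate the job re-use point, a minimal sketch of submitting an existing Hadoop jar to Dataproc with gs:// input and output paths, assuming the google-cloud-dataproc client and hypothetical bucket and jar names:

    from google.cloud import dataproc_v1

    project_id = "my-project"                   # hypothetical
    region = "us-central1"                      # hypothetical
    cluster_name = "hadoop-migration-cluster"   # hypothetical

    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    job = {
        "placement": {"cluster_name": cluster_name},
        "hadoop_job": {
            # Existing job jar, staged to GCS; the jar itself runs unchanged.
            "main_jar_file_uri": "gs://my-bucket/jars/existing-hadoop-job.jar",
            # Only the I/O paths move from hdfs:// to gs://, courtesy of
            # the preinstalled Cloud Storage connector.
            "args": ["gs://my-bucket/input/", "gs://my-bucket/output/"],
        },
    }

    operation = job_client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    response = operation.result()
    print(f"Job finished: {response.driver_output_resource_uri}")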

[Removed] | Option: D
Mar 16, 2020

Correct: D

rtcpost | Option: D
Oct 22, 2023

D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector. Google Cloud Dataproc is a managed Hadoop and Spark service that allows you to easily create and manage Hadoop clusters in the cloud. By using the Google Cloud Storage connector, you can persist data in Google Cloud Storage, which provides durable storage beyond the cluster's lifecycle. This approach ensures data is retained even if the cluster is terminated, and it allows you to reuse your existing Hadoop jobs. Option B (Creating a Dataproc cluster that uses persistent disks for HDFS) is another valid choice. However, using Google Cloud Storage for data storage and processing is often more cost-effective and scalable, especially when migrating to the cloud. Options A, C, and E do not take full advantage of Google Cloud's services and the benefits of cloud-native data storage and processing with Google Cloud Storage and Dataproc.

kmaiti | Option: D
Apr 11, 2022

Two key points:
- Managed Hadoop cluster: Dataproc
- Persistent storage: GCS (Dataproc uses the GCS connector to connect to GCS)

bha11111 | Option: D
Mar 11, 2023

Hadoop --> Dataproc. Persistent storage after the processing --> GCS.

Asheesh1909
Jun 6, 2022

Isn't it A and D both? Dataflow for reusable jobs and GCS for data persistence?

crisimenjivar | Option: D
Aug 19, 2022

Answer: D

nkunwar | Option: D
Sep 28, 2022

The Dataproc cluster can be set up as ephemeral: it runs the Hadoop jobs and can be killed after job execution, which deletes the cluster-attached storage along with the cluster.
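
A minimal sketch of that ephemeral pattern, reusing the hypothetical names from the earlier snippets: deleting the cluster removes its VMs and attached disks, while anything already written to gs:// survives.

    from google.cloud import dataproc_v1

    project_id = "my-project"    # hypothetical
    region = "us-central1"       # hypothetical

    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Deleting the cluster removes its VMs and the disks attached to them
    # (including any HDFS data); objects already in gs:// are untouched.
    operation = cluster_client.delete_cluster(
        request={
            "project_id": project_id,
            "region": region,
            "cluster_name": "hadoop-migration-cluster",  # hypothetical
        }
    )
    operation.result()  # block until deletion completes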

achafill | Option: D
Oct 17, 2022

Correct Answer: D

assU2 | Option: D
Nov 5, 2022

Seems like it is D. https://cloud.google.com/dataproc/docs/concepts/dataproc-hdfs I never saw them mention persistent disks, although those are not deleted with the clusters...

assU2
Nov 5, 2022

although: By default, when no local SSDs are provided, HDFS data and intermediate shuffle data is stored on VM boot disks, which are Persistent Disks.

assU2
Nov 5, 2022

and it says that only VM Boot disks are deleted when the cluster is deleted.
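
Tying this thread together: the disk layout assU2 describes is configurable at cluster creation time. A minimal sketch of the relevant fields using the Dataproc v1 Python types; the values are purely illustrative, not recommendations.

    from google.cloud import dataproc_v1

    # DiskConfig controls the disks for an instance group; field names are
    # from the Dataproc v1 API.
    disk_config = dataproc_v1.DiskConfig(
        boot_disk_type="pd-standard",  # VM boot disks are Persistent Disks
        boot_disk_size_gb=500,         # holds HDFS + shuffle data when no local SSDs
        num_local_ssds=1,              # >0 moves HDFS/shuffle data onto local SSDs
    )

    worker_config = dataproc_v1.InstanceGroupConfig(
        num_instances=30,
        machine_type_uri="n1-standard-4",
        disk_config=disk_config,
    )

    # worker_config would be passed as cluster["config"]["worker_config"]
    # in the create_cluster call shown earlier.
    print(worker_config)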

Nirca | Option: D
Dec 11, 2022

D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.

korntewin | Option: D
Jan 7, 2023

The answer is D! With Dataproc there is no need for us to manage the infrastructure, and Cloud Storage needs no management from us either!

samdhimal | Option: D
Jan 11, 2023

D seems right. Cloud Storage can be used to keep data even after the life of the cluster.

kshehadyx | Option: D
Sep 15, 2023

Correct D

suku2 | Option: D
Sep 15, 2023

D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector. Dataproc clusters can be created to lift and shift existing Hadoop jobs. Data stored in Google Cloud Storage extends beyond the life of a Dataproc cluster.

imran79 | Option: D
Oct 7, 2023

D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector. Here's why: Cloud Dataproc allows you to run Apache Hadoop jobs with minimal management. It is a managed Hadoop service. Using the Google Cloud Storage (GCS) connector, Dataproc can access data stored in GCS, which allows data persistence beyond the life of the cluster. This means that even if the cluster is deleted, the data in GCS remains intact. Moreover, using GCS is often cheaper and more durable than using HDFS on persistent disks.

fahadminhas | Option: B
Jun 28, 2024

Option D is incorrect, as it would not provide persistent HDFS storage within the cluster itself; rather, B should be the correct answer.