Certified Data Engineer Professional Exam Questions

Certified Data Engineer Professional Exam - Question 26


Each configuration below is identical in that each cluster has 400 GB of RAM and 160 cores in total, with only one executor per VM.

Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?

Correct Answer: C

To maximize performance, it is crucial to balance the level of parallelism and the resources available to each Executor. Option C provides 16 VMs, with 25 GB of RAM and 10 cores per Executor. This configuration offers a good level of parallelism, which is beneficial for handling wide transformations that require significant data shuffling across multiple nodes. The allocation of 25 GB of RAM per Executor ensures that each Executor can handle its tasks efficiently without being resource-starved. The higher number of VMs compared to other options also improves fault tolerance and workload distribution, which are key to optimizing performance in distributed computing environments.
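As a quick sanity check on the numbers in the explanation, the per-executor resources follow directly from the VM count, since total RAM and cores are fixed for every configuration. A minimal sketch (the VM counts other than 16 are illustrative, since the original answer-choice table is not reproduced in this dump):

```python
# Per-executor resources for a fixed cluster: 400 GB RAM, 160 cores,
# one executor per VM. Only the 16-VM row is confirmed by the explanation;
# the other VM counts are illustrative.
TOTAL_RAM_GB = 400
TOTAL_CORES = 160

for num_vms in (1, 2, 4, 8, 16):
    ram_per_executor = TOTAL_RAM_GB / num_vms
    cores_per_executor = TOTAL_CORES / num_vms
    print(f"{num_vms:2d} VMs -> {ram_per_executor:5.1f} GB RAM, "
          f"{cores_per_executor:4.0f} cores per executor")
```

With 16 VMs this reproduces the explanation's figures: 25 GB of RAM and 10 cores per executor.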

Discussion

15 comments
robson90 (Option: A)
Aug 23, 2023

Option A. The question is about maximum performance. A wide transformation often results in an expensive shuffle; with a single executor this problem is avoided. https://docs.databricks.com/en/clusters/cluster-config-best-practices.html#complex-batch-etl
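For readers wondering why the shuffle is the crux of this argument: during a wide transformation such as groupBy, each record is routed to a partition by key hash, so records sharing a key must end up on the same executor, and with more than one executor some records have to cross the network. A pure-Python illustration (not Spark code; `partition_for` is a made-up helper):

```python
# Pure-Python sketch (not Spark) of how a shuffle routes records:
# a wide transformation must move every record to the executor that
# owns its key's hash partition.
def partition_for(key: str, num_executors: int) -> int:
    """Pick the executor that will own this key after the shuffle."""
    return hash(key) % num_executors

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# With a single executor, every key maps to executor 0:
# no data ever crosses VM boundaries.
assert all(partition_for(key, 1) == 0 for key, _ in records)

# With 16 executors, keys spread across nodes, so grouping requires
# network transfer for any record not already on its key's executor.
placement = {key: partition_for(key, 16) for key, _ in records}
print(placement)
```

This is why the "few large executors" camp argues for option A: the fewer executors there are, the more of the shuffle stays local to a VM.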

dp_learner
Nov 4, 2023

source : https://docs.databricks.com/en/clusters/cluster-config-best-practices.html

Santitoxic (Option: D)
Sep 22, 2023

Considering the need for both memory and parallelism, option D seems to offer the best balance between resources and parallel processing. It provides a reasonable amount of memory and cores per Executor while maintaining a sufficient level of parallelism with 4 Executors. This configuration is likely to result in maximum performance for a job with at least one wide transformation.

stuart_gta1 (Option: C)
Aug 8, 2023

C. More VMs help distribute the workload across the cluster, which results in better fault tolerance and increases the chances of job completion.

mwyopme (Option: C)
Sep 17, 2023

Sorry, response C = 16 VMs for maximizing the wide transformation.

ismoshkov (Option: A)
Nov 4, 2023

Our goal is top performance. Vertical scaling is more performant than horizontal scaling, especially since we know we need cross-VM data exchange. Option A.

ofed (Option: A)
Nov 7, 2023

Option A

vikrampatel5 (Option: A)
Jan 21, 2024

Option A: https://docs.databricks.com/en/clusters/cluster-config-best-practices.html#complex-batch-etl

asmayassineg (Option: E)
Aug 2, 2023

The answer should be E. If at least one transformation is wide, one executor with 200 GB can do the job, and the rest of the tasks can be carried out on the other node.

8605246
Aug 6, 2023

would it be fault-tolerant?

BrianNguyen95 (Option: E)
Aug 20, 2023

The correct answer is E: option E provides a substantial amount of memory and cores per executor, allowing the job to handle wide transformations efficiently. However, performance can also be influenced by factors like the nature of your specific workload, data distribution, and overall cluster utilization. It's good practice to conduct benchmarking and performance testing with various configurations to determine the optimal setup for your specific use case.

taif12340 (Option: D)
Aug 23, 2023

Considering the need for both memory and parallelism, option D seems to offer the best balance between resources and parallel processing. It provides a reasonable amount of memory and cores per Executor while maintaining a sufficient level of parallelism with 4 Executors. This configuration is likely to result in maximum performance for a job with at least one wide transformation.

mwyopme (Option: B)
Sep 17, 2023

The key message is: given a job with at least one wide transformation, performance should maximize the number of concurrent VMs. Selecting response B: 160 / 10 = 16 VMs.

dp_learner (Option: A)
Nov 4, 2023

Response A. Per "Complex batch ETL": "More complex ETL jobs, such as processing that requires unions and joins across multiple tables, will probably work best when you can minimize the amount of data shuffled. Since reducing the number of workers in a cluster will help minimize shuffles, you should consider a smaller cluster like cluster A in the following diagram over a larger cluster like cluster D."

dp_learner
Nov 4, 2023

source: https://docs.databricks.com/en/clusters/cluster-config-best-practices.html
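The two sizing strategies debated in this thread can be written down side by side. The property names below (`spark.executor.instances`, `spark.executor.memory`, `spark.executor.cores`) are real Spark settings, but the specific numbers are assumptions, since the original option table is not reproduced in this dump; both configurations keep the stated totals of 400 GB of RAM and 160 cores:

```python
# Hypothetical sketch of the two sizing strategies from the discussion,
# using real Spark property names. The numbers are assumed, not taken
# from the (missing) answer-choice table.
few_large_executors = {          # "scale up": minimizes cross-VM shuffle
    "spark.executor.instances": 2,
    "spark.executor.memory": "200g",
    "spark.executor.cores": 80,
}
many_small_executors = {         # "scale out": maximizes parallel VMs
    "spark.executor.instances": 16,
    "spark.executor.memory": "25g",
    "spark.executor.cores": 10,
}

def totals(conf):
    """Return (total RAM in GB, total cores) for a configuration."""
    n = conf["spark.executor.instances"]
    ram_gb = int(conf["spark.executor.memory"].rstrip("g"))
    return n * ram_gb, n * conf["spark.executor.cores"]

# Both respect the fixed cluster budget: 400 GB RAM, 160 cores.
assert totals(few_large_executors) == (400, 160)
assert totals(many_small_executors) == (400, 160)
```

The disagreement in the thread is then easy to state: for shuffle-heavy (wide) transformations, the Databricks guidance quoted above favors the first shape, while the "more parallelism and fault tolerance" comments favor the second.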

RafaelCFC (Option: A)
Jan 8, 2024

robson90's response explains it perfectly and has documentation to support it.

PrashantTiwari (Option: A)
Feb 9, 2024

A is correct

arik90 (Option: A)
Mar 27, 2024

Wide transformations fall under complex ETL, which means Option A is correct; the documentation doesn't say to do otherwise in this scenario.