Which of the following cluster configurations is most likely to experience an out-of-memory error in response to data skew in a single partition?
Note: each configuration has roughly the same compute power using 100 GB of RAM and 200 cores.
The most likely configuration to experience an out-of-memory error from skew in a single partition is C. Scenario #6: 12.5 GB Worker Node, 12.5 GB Executor. 1 Driver & 8 Executors. Explanation: Data skew refers to an uneven distribution of data across partitions. Each partition is processed by a single task on a single executor, so a heavily skewed partition must fit in that one executor's memory; the cluster's total memory cannot be pooled to absorb it. Scenario #6 has the smallest worker node and executor configuration, with only 12.5 GB of RAM per executor. With 8 executors the total is still 100 GB, the same as the other scenarios, but the reduced memory per executor makes this configuration the most likely to hit an out-of-memory error when the skewed partition lands on one of its executors.
Options A (Scenario #4), B (Scenario #5), and E (Scenario #1) all have larger worker nodes and executors than Scenario #6, so each executor has more headroom to absorb a skewed partition. Option D claims more information is needed, but the per-executor memory figures given are enough to identify Scenario #6 as the most vulnerable.
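To make the skew concrete before it causes trouble, here is a small PySpark sketch (the session setup and input path are placeholders, not part of the question) that counts rows per partition; one partition towering over the rest is exactly the situation the answer describes.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()

df = spark.read.parquet("/path/to/data")  # placeholder input path

# Tag each row with the ID of the partition it lives in, then count
# rows per partition. A single partition far above the median is the
# skewed partition that must fit in one executor's memory.
per_partition = (
    df.withColumn("pid", F.spark_partition_id())
      .groupBy("pid")
      .count()
      .orderBy(F.desc("count"))
)
per_partition.show(10)
```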
Data skew is when a few partitions are oversized. Because of the initial partitioning, each of these large partitions has to be processed by a single task (one thread on one executor), which is what can cause the OOM.
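A common workaround for the single-task bottleneck described above is key salting. The sketch below assumes hypothetical DataFrames big_df and small_df joined on a column named key; none of these names come from the question.

```python
from pyspark.sql import functions as F

NUM_SALTS = 8  # hypothetical fan-out; tune to the size of the hot key

# Big side: assign each row a random salt bucket, so the hot key's rows
# are scattered across NUM_SALTS shuffle partitions instead of one.
salted_big = big_df.withColumn("salt", (F.rand() * NUM_SALTS).cast("long"))

# Small side: replicate every row once per salt value so the join
# still finds a match in each bucket.
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
salted_small = small_df.crossJoin(salts)

# Join on the original key plus the salt, then drop the helper column.
result = salted_big.join(salted_small, ["key", "salt"]).drop("salt")
```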
D is correct. Even though you have less executor memory in Scenario #6, Spark will still complete the process; it might take more time to do the shuffle, but it finishes nevertheless.
This is the right answer.
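Side note on the "Spark will still complete the process" point above: on Spark 3.x, Adaptive Query Execution can split an oversized shuffle partition automatically, which reduces (though does not eliminate) the OOM risk. A minimal sketch; the config names are real Spark 3 settings, and the values shown here are the documented defaults.

```python
# Spark 3.x: let AQE split skewed shuffle partitions during sort-merge joins.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# A partition counts as skewed when it is both `factor` times the median
# partition size and larger than the byte threshold (defaults shown).
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
```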
Please explain the answer!!