Exam: Certified Data Engineer Professional
Question 138

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task as roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?

    Correct Answer: D

    When the minimum and median task durations are roughly equal but the maximum is about 100 times longer, most tasks finish quickly while a small number of tasks take disproportionately long. This pattern indicates skew: more data is assigned to a subset of Spark partitions, so the tasks processing those partitions run far longer than the rest. Because a stage cannot complete until its slowest task finishes, those few tasks determine the overall job duration. The increased duration is therefore caused by skew from uneven data distribution.
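    The reasoning above can be sketched as a small heuristic in plain Python. This is only an illustration, not part of Spark or the Spark UI: the function names, the thresholds (`ratio_threshold`, the `2 * lo` bound), and the sample durations are all hypothetical assumptions chosen to mirror the numbers in the question.

```python
from statistics import median


def summarize_task_durations(durations_ms):
    """Return (min, median, max) for a list of task durations in milliseconds."""
    return min(durations_ms), median(durations_ms), max(durations_ms)


def looks_skewed(durations_ms, ratio_threshold=10.0):
    """Hypothetical heuristic: the median stays close to the minimum,
    but the maximum is far above the median."""
    lo, mid, hi = summarize_task_durations(durations_ms)
    return mid <= 2 * lo and hi >= ratio_threshold * mid


# Assumed sample stage: 199 tasks take about 1 s each, one task takes 100 s.
durations = [1000 + (i % 5) * 10 for i in range(199)] + [100_000]

lo, mid, hi = summarize_task_durations(durations)
print(f"min={lo} ms, median={mid} ms, max={hi} ms, skewed={looks_skewed(durations)}")
```

    With evenly sized partitions (for example, every task taking about the same time), the same check would return False, since the maximum would sit close to the median.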

Discussion
vexor3, Option: D

D is correct