Exam: Certified Data Engineer Professional
Question 25

A Spark job is taking longer than expected. In the Spark UI, a data engineer observes that for the tasks in a particular stage, the minimum and median durations are roughly the same, but the maximum duration is roughly 100 times the minimum.

Which situation is causing increased duration of the overall job?

    Correct Answer: D

    When the minimum and median task durations are roughly the same but the maximum is far longer, a few tasks are taking much longer than the rest. This is the classic signature of data skew: some partitions hold far more data than others, so the tasks processing the oversized partitions dominate the stage's runtime. The overall job is therefore slowed by skew caused by more data being assigned to a subset of Spark partitions.
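The reasoning above can be sketched without Spark at all. In the toy model below (plain Python; the partition sizes and throughput are hypothetical illustration values, not measurements from any real job), each task's duration is proportional to its partition size, and because tasks run in parallel, the stage finishes only when the slowest task does:

```python
def stage_duration(partition_sizes, rows_per_second=1000):
    """Model a stage: one task per partition, duration proportional to
    partition size; tasks run in parallel, so the stage is gated by
    the slowest task."""
    task_durations = [size / rows_per_second for size in partition_sizes]
    return {
        "min": min(task_durations),
        "median": sorted(task_durations)[len(task_durations) // 2],
        "max": max(task_durations),
        "stage_total": max(task_durations),  # parallel: slowest task wins
    }

# Even distribution: min, median, and max are all the same.
even = stage_duration([10_000] * 8)

# Skewed: one partition holds ~100x the data of the others, so
# max is ~100x min and the whole stage waits on that one task.
skewed = stage_duration([10_000] * 7 + [1_000_000])

print(even)
print(skewed)
```

This mirrors what the Spark UI summary shows: under even distribution all three statistics agree, while under skew the min and median stay small and the max (and the stage total) balloons.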

Discussion
vikram12aprOption: D

Because a few executors are processing the majority of the data while the rest process very little, and the total stage time depends on the slowest executor. Answer is D.
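As a practical follow-up to this comment: on Spark 3.x, Adaptive Query Execution can mitigate skew in joins by splitting oversized partitions at runtime. A minimal config sketch (these are real Spark SQL settings; the factor value shown is just the default, tune for your workload):

```
spark.sql.adaptive.enabled                          true
spark.sql.adaptive.skewJoin.enabled                 true
spark.sql.adaptive.skewJoin.skewedPartitionFactor   5
```

Note AQE skew handling applies to joins; skew from a hot key in an aggregation or repartition may still require techniques such as key salting.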

EertyyOption: D

D is the correct answer

imatheushenriqueOption: D

D. Skew caused by more data being assigned to a subset of spark-partitions.

Jay_98_11Option: D

correct

kz_dataOption: D

I think D is correct

sturcuOption: D

D is correct