
Certified Data Engineer Professional Exam - Question 172


A data engineer is using Spark's MEMORY_ONLY storage level.

Which indicators should the data engineer look for in the Spark UI's Storage tab to signal that a cached table is not performing optimally?

Correct Answer: C

With the MEMORY_ONLY storage level, all cached data is supposed to reside in memory; any data that ends up on disk indicates there is insufficient memory to hold the full dataset, which directly contradicts the purpose of MEMORY_ONLY. Therefore, a Size on Disk greater than 0 in the Storage tab signals that the cached table is not performing optimally.
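To make the scenario concrete, here is a minimal PySpark sketch of caching a table with MEMORY_ONLY. The table name is hypothetical, and the comments note where the Size on Disk indicator appears in the UI:

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("memory-only-check").getOrCreate()

# "sales_facts" is a hypothetical table name used only for illustration.
df = spark.table("sales_facts")

# Pin the DataFrame in memory only. Per the answer above, once this cache is
# materialized, the Storage tab's "Size on Disk" column should read 0;
# anything greater signals the cache is not behaving as MEMORY_ONLY intends.
df.persist(StorageLevel.MEMORY_ONLY)

# Caching is lazy: run an action so the cache is materialized and the
# entry appears in the Spark UI's Storage tab.
df.count()
```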

Discussion

MDWPartners
May 29, 2024

I would say C

hpkr · Option: C
Jun 12, 2024

C is correct here

Hadiler · Option: C
Jul 29, 2024

C is correct

imatheushenrique
Jun 1, 2024

B. The _disk annotation indicates that some partitions of the cached data have been spilled to disk because there wasn't enough memory to hold them.

Freyr · Option: B
Jun 2, 2024

Correct Answer: B. Option B is the most relevant indicator that a cached table is not performing optimally in a MEMORY_ONLY scenario. If an RDD block name includes a "_disk" annotation, it strongly suggests a caching issue: the data could not be cached entirely in memory, which is exactly what MEMORY_ONLY intends. Option C could also be a relevant indicator in general caching scenarios (e.g., MEMORY_AND_DISK), but it directly contradicts the MEMORY_ONLY setting. Therefore, Option B is the choice for the specific storage level described.

Freyr
Jun 10, 2024

*THE CORRECT ANSWER IS: C.* Please ignore my previous answer. Long story short, B would be correct in the context of a non-functional requirement, but the question is based on a functional requirement. Sorry for the confusion.

03355a2 · Option: C
Jun 27, 2024

It's simple: if MEMORY_ONLY is used, anything spilled to disk indicates a problem.

03355a2
Jun 27, 2024

The RDD-annotation answer is incorrect for this question: while the annotation does indicate a failure to cache, it identifies individual blocks that failed to cache rather than providing a general signal that the entire cached table is performing suboptimally.

RuiCarvalhoDEV · Option: C
Nov 21, 2024

It's MEMORY_ONLY, so any data on disk is a problem.

KadELbied · Option: B
May 2, 2025

Correct Answer: B In the Spark UI's Storage tab, an indicator that a cached table is not performing optimally would be the presence of the _disk annotation in the RDD Block Name. This annotation indicates that some partitions of the cached data have been spilled to disk because there wasn't enough memory to hold them. This is suboptimal because accessing data from disk is much slower than from memory. The goal of caching is to keep data in memory for fast access, and a spill to disk means that this goal is not fully achieved.
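For anyone who would rather script this check than eyeball the Storage tab: Spark's monitoring REST API exposes the same storage figures. A minimal sketch, assuming the driver UI is reachable at localhost:4040 (an assumption; adjust for your cluster):

```python
import requests

# Hypothetical driver UI address; adjust for your environment.
UI = "http://localhost:4040"

# Spark's monitoring REST API exposes the same data as the Storage tab.
app_id = requests.get(f"{UI}/api/v1/applications").json()[0]["id"]
rdds = requests.get(f"{UI}/api/v1/applications/{app_id}/storage/rdd").json()

for rdd in rdds:
    # diskUsed mirrors the "Size on Disk" column; under MEMORY_ONLY it
    # should be 0, so anything greater flags a suboptimal cache.
    if rdd["diskUsed"] > 0:
        print(f"RDD {rdd['id']} ({rdd['name']}): {rdd['diskUsed']} bytes on disk")
```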

KadELbied
May 2, 2025

Sorry, it's C.