Certified Associate Developer for Apache Spark Exam Questions

Certified Associate Developer for Apache Spark Exam - Question 14


Of the following situations, in which will it be most advantageous to store DataFrame df at the MEMORY_AND_DISK storage level rather than the MEMORY_ONLY storage level?

Correct Answer: D

When it’s faster to read all the computed data in DataFrame df that cannot fit into memory from disk rather than recompute it based on its logical plan. The MEMORY_AND_DISK storage level stores excess data on disk, making it advantageous when recomputation is more time-consuming than reading from disk.

Discussion

9 comments
sousouka (Option: D)
Mar 29, 2023

D. When it’s faster to read all the computed data in DataFrame df that cannot fit into memory from disk rather than recompute it based on its logical plan.

ZSun (Option: D)
Jun 7, 2023

The other explanations are either wrong or misleading. To understand the question, you need to understand the difference between MEMORY_ONLY and MEMORY_AND_DISK.

1. MEMORY_AND_DISK, the default level for cache or persist on DataFrames: if the data is larger than available memory, the excess partitions are stored on disk. On the next access, Spark reads from memory first and then from disk.
2. MEMORY_ONLY: if the data is larger than available memory, the excess partitions are simply not stored. On the next access, Spark reads what it can from memory and recomputes the partitions that did not fit.

PS: Mr. 4be8126 is wrong about an error being raised when memory runs out. So the trade-off between MEMORY_ONLY and MEMORY_AND_DISK lies in how each handles the data that does not fit in memory, which is exactly option D: if reading the data from disk is faster than recomputing it, use MEMORY_AND_DISK.

Indiee (Option: D)
Apr 25, 2023

The answer is D. This is the whole idea behind caching.

sly75 (Option: B)
May 3, 2023

Yes, but what is the link with the question? I would say B too :)

4be8126 (Option: D)
May 3, 2023

The most advantageous situation in which to store a DataFrame at the MEMORY_AND_DISK storage level instead of the MEMORY_ONLY storage level is option D: when it's faster to read all the computed data in DataFrame df that cannot fit into memory from disk rather than recompute it based on its logical plan. The MEMORY_ONLY storage level stores data only in memory, so partitions that exceed the available memory are not cached and must be recomputed on the next access. The MEMORY_AND_DISK storage level, on the other hand, spills data to disk when there is not enough memory, allowing the full cached dataset to be reused without recomputation. When the computed data fits entirely in memory, MEMORY_ONLY is the better choice because reading from memory is faster than reading from disk. When it does not fit, MEMORY_AND_DISK may be necessary.

SonicBoom10C9 (Option: D)
May 15, 2023

If the data fits in memory and reading from memory is faster, there is no reason to use MEMORY_AND_DISK; MEMORY_ONLY is sufficient. Likewise, if recomputing is faster than reading from disk, that is what you would do. The only remaining case is when the data is too big to fit in memory and too expensive to recompute, so reading it from disk (or rather caching it from disk into memory on the fly) is faster.

singh100 (Option: D)
Aug 1, 2023

D. It is faster to read the computed data from disk than to recompute it from its logical plan when the recomputation is costly and time-consuming.

astone42 (Option: D)
Aug 9, 2023

D is correct.

newusername (Option: D)
Nov 5, 2023

D is correct.