Of the following situations, in which will it be most advantageous to store DataFrame df at the MEMORY_AND_DISK storage level rather than the MEMORY_ONLY storage level?
When it’s faster to read all the computed data in DataFrame df that cannot fit into memory from disk rather than recompute it based on its logical plan. The MEMORY_AND_DISK storage level stores excess data on disk, making it advantageous when recomputation is more time-consuming than reading from disk.
D. When it’s faster to read all the computed data in DataFrame df that cannot fit into memory from disk rather than recompute it based on its logical plan.
All the other explanations are either wrong or misleading. To understand the question, you need to understand the difference between MEMORY_ONLY and MEMORY_AND_DISK. 1. MEMORY_AND_DISK is the default mode for cache or persist. If the data is larger than memory, the extra data is stored on disk; the next time we need to read the data, we read first from memory and then from disk. 2. MEMORY_ONLY means that if the data is larger than memory, the extra data is not stored at all; the next time we read the data, we read from memory first and then recompute the extra data that could not fit in memory. PS. Mr. 4be8126 is wrong: neither level raises an error when the data does not fit in memory. Therefore, the difference between MEMORY_ONLY and MEMORY_AND_DISK lies in how they handle the extra data that does not fit in memory, which is exactly option D: if reading the data from disk is faster than recomputing it, use MEMORY_AND_DISK.
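The tradeoff described above can be sketched with a toy cost model (plain Python, illustrative numbers only; the function and costs are hypothetical, not Spark APIs): partitions that fit in memory are cheap to re-read under either level, and the question is only what happens to the excess partitions.

```python
# Toy cost model of re-reading a cached DataFrame (illustrative, not Spark API).
# Partitions that fit in memory cost nothing to re-read; the excess is either
# read back from disk (MEMORY_AND_DISK) or recomputed from the logical plan
# (MEMORY_ONLY).

def access_cost(n_partitions, mem_capacity, recompute_cost, disk_read_cost, use_disk):
    """Total cost of the excess partitions on a second access."""
    excess = max(0, n_partitions - mem_capacity)
    per_partition = disk_read_cost if use_disk else recompute_cost
    return excess * per_partition

# 100 partitions, memory holds 60; recomputing costs 5s/partition, disk read 1s.
memory_only = access_cost(100, 60, recompute_cost=5, disk_read_cost=1, use_disk=False)
memory_and_disk = access_cost(100, 60, recompute_cost=5, disk_read_cost=1, use_disk=True)
print(memory_only, memory_and_disk)  # 200 40
```

With recomputation more expensive than a disk read, MEMORY_AND_DISK wins, which is exactly the situation option D describes; flip the two costs and MEMORY_ONLY's recompute-on-miss behaviour would be cheaper.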
Answer is D. This is the whole idea behind caching
Yes, but what about the link with the question? I would say B too :)
The most advantageous situation to store a DataFrame at the MEMORY_AND_DISK storage level instead of the MEMORY_ONLY storage level is option D: when it’s faster to read all the computed data in DataFrame df that cannot fit into memory from disk rather than recompute it based on its logical plan. This is because the MEMORY_ONLY storage level stores data only in memory; partitions that do not fit are simply not cached and must be recomputed from the logical plan on the next access. The MEMORY_AND_DISK storage level, on the other hand, spills the excess data to disk, so it can be re-read instead of recomputed. In situations where the computed data fits entirely into memory, it is best to use the MEMORY_ONLY storage level, since reading from memory is faster than reading from disk. However, when there is not enough memory to store all the computed data, the MEMORY_AND_DISK storage level may be preferable.
If the data can fit in memory, there is no reason to use MEMORY_AND_DISK; MEMORY_ONLY is sufficient. Also, if it's faster to recompute the data than to read it from disk, that's what you would do. The only case left is when the data is too big to fit in memory and too expensive to recompute, so reading it from disk (or rather caching it from disk into memory on the fly) is faster.
D. It is faster to read the computed data from disk instead of recomputing it based on its logical plan when the recomputation is costly and time-consuming.
D is correct