The code block shown below contains an error. The code block is intended to cache DataFrame storesDF only in Spark’s memory and then return the number of rows in the cached DataFrame. Identify the error.
Code block:
storesDF.cache().count()
The cache() operation caches DataFrames at the default MEMORY_AND_DISK storage level. To cache strictly in memory, the persist() method must be used with StorageLevel.MEMORY_ONLY. The provided code uses cache(), which does not guarantee storage only in memory, so persist() with MEMORY_ONLY is the appropriate choice.
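As a minimal sketch of the corrected code (assuming storesDF already exists in an active SparkSession):

from pyspark import StorageLevel

# Pin the DataFrame strictly in memory; partitions that do not fit are
# recomputed from lineage on access rather than spilled to disk.
storesDF.persist(StorageLevel.MEMORY_ONLY)

# count() is the action that materializes the cache and returns the rows.
print(storesDF.count())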
E is correct. You cannot set StorageLevel MEMORY_ONLY with cache(); if memory is available it keeps everything in memory, otherwise it spills to disk. To keep everything in memory you need to use persist() with StorageLevel.MEMORY_ONLY.
The answer should be E. See this post for reference: https://stackoverflow.com/questions/26870537/what-is-the-difference-between-cache-and-persist
No, option E is incorrect. The cache() method is the appropriate method to cache a DataFrame in Spark's memory, and it can cache DataFrames at the MEMORY_ONLY level if that's what is desired. The persist() method is a more general-purpose method that allows the user to specify other storage levels (such as MEMORY_AND_DISK), but it is not required for this task.
You should use storesDF.persist(StorageLevel.MEMORY_ONLY).count()
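For completeness, a hedged variant that also verifies the level and releases the cache afterwards (assuming the same storesDF):

from pyspark import StorageLevel

storesDF.persist(StorageLevel.MEMORY_ONLY)
print(storesDF.storageLevel)  # reported level should show useDisk=False
print(storesDF.count())
storesDF.unpersist()          # free the cached partitions once done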
E is wrong. "The cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default)" – note the use of "only" here; cache() can also store on disk if required. B is also wrong: there is no requirement to set the storage level prior to calling cache(). The correct answer is A.
E is correct!

from pyspark.sql.types import IntegerType
from pyspark import StorageLevel

storesDF = spark.createDataFrame([2023, 2024], IntegerType())
print(storesDF.persist(StorageLevel.MEMORY_ONLY).storageLevel)
E

From the cache() docstring (pyspark.sql.DataFrame.cache): "Persists the DataFrame with the default storage level (MEMORY_AND_DISK)." Added in 1.3.0; changed in 3.4.0 to support Spark Connect. Notes: the default storage level changed to MEMORY_AND_DISK to match Scala in 2.0.
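That text can be reproduced locally; a one-line sketch, assuming an existing storesDF:

# Prints the docstring quoted above, including the MEMORY_AND_DISK default.
help(storesDF.cache)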
There are two plausible options here: B and E. For those who chose B: you can't explicitly set the storage level via storesDF.storageLevel because it is a read-only property, so the correct answer is E.
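That read-only behaviour is easy to confirm; a small sketch assuming an existing storesDF:

from pyspark import StorageLevel

# DataFrame.storageLevel is a getter-only property; it cannot be assigned
# prior to calling cache(), which is why option B does not work.
try:
    storesDF.storageLevel = StorageLevel.MEMORY_ONLY
except AttributeError as e:
    print(e)  # the property has no setter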
E

B. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache(). This option is incorrect: the storage level does not need to be set via storesDF.storageLevel prior to calling cache(), and cache() can be used directly on the DataFrame without explicitly setting a storage level.

E. The cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default) – persist() should be used instead. This option is the correct answer: the error in the code block is that cache() is used instead of persist(). While cache() caches DataFrames at the default MEMORY_AND_DISK level, persist() allows a different storage level to be specified, such as MEMORY_ONLY for caching only in memory. Therefore, persist() should be used instead of cache() to achieve the desired caching behavior.
B. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache().

The storage level of a DataFrame cache can be specified as an argument to the cache() operation, but if no storage level has been specified, the default MEMORY_AND_DISK level is used. Therefore, option A is incorrect.

Option C is incorrect because caching and checkpointing are different operations in Spark: caching stores a DataFrame in memory or on disk, while checkpointing saves a DataFrame to a reliable storage system like HDFS, which is necessary for iterative computations.

Option D is incorrect because DataFrames can be cached in memory or on disk using the cache() operation.

Option E is incorrect because cache() is the recommended method for caching DataFrames in Spark, and it supports caching at all storage levels, including MEMORY_ONLY. The persist() operation can be used to specify a storage level, but cache() is simpler and more commonly used.
Wrong explanation. You can call cache() or persist() without setting a storage level; both use the default MEMORY_AND_DISK. You clearly misunderstand the question itself: storesDF.cache().count() is workable code, but it fails the requirement. That is the issue. The question asks for caching "only in memory", meaning that if the data does not fit in memory, it should not be stored on disk but rather recomputed. Therefore, you need to specifically set the storage level to MEMORY_ONLY. A is correct.
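The recompute-instead-of-spill point is visible in the StorageLevel flags themselves; a sketch that needs only a pyspark install:

from pyspark import StorageLevel

# MEMORY_ONLY keeps useDisk=False: partitions that do not fit in memory are
# dropped and recomputed from lineage, never written to local disk.
print(StorageLevel.MEMORY_ONLY.useDisk)      # False
print(StorageLevel.MEMORY_AND_DISK.useDisk)  # True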