The code block shown below contains an error. The code block is intended to cache DataFrame storesDF only in Spark’s memory and then return the number of rows in the cached DataFrame. Identify the error.
Code block:
storesDF.cache().count()
The cache() operation caches DataFrames at the default MEMORY_AND_DISK storage level. To cache strictly in memory, the persist() method must be used with StorageLevel.MEMORY_ONLY. The provided code uses cache(), which does not guarantee storage only in memory, so persist() with MEMORY_ONLY is the appropriate choice.
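As a minimal sketch of the corrected code (assuming storesDF already exists in an active SparkSession):

from pyspark import StorageLevel

# Pin the DataFrame strictly in memory; partitions that do not fit are
# recomputed from lineage on access rather than spilled to disk.
storesDF.persist(StorageLevel.MEMORY_ONLY)

# count() is the action that materializes the cache and returns the rows.
print(storesDF.count())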
E is correct. You cannot set StorageLevel MEMORY_ONLY with cache(); if memory is available it keeps everything in memory, otherwise it spills to disk. To keep everything in memory you need to use persist() with StorageLevel.MEMORY_ONLY.
The answer should be E. See this post for reference: https://stackoverflow.com/questions/26870537/what-is-the-difference-between-cache-and-persist
No, option E is incorrect. The cache() method is the appropriate method to cache a DataFrame in Spark's memory, and it can cache DataFrames at the MEMORY_ONLY level if that's what is desired. The persist() method is a more general-purpose method that allows the user to specify other storage levels (such as MEMORY_AND_DISK), but it is not required for this task.
You should use storesDF.persist(StorageLevel.MEMORY_ONLY).count()
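For completeness, a hedged variant that also verifies the level and releases the cache afterwards (assuming the same storesDF):

from pyspark import StorageLevel

storesDF.persist(StorageLevel.MEMORY_ONLY)
print(storesDF.storageLevel)  # reported level should show useDisk=False
print(storesDF.count())
storesDF.unpersist()          # free the cached partitions once done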
E is wrong. "The cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default)" – note the use of "only" here; cache() can also store on disk if required. B is also wrong: there is no requirement to set the storage level prior to calling cache(). The correct answer is A.
E is correct!

from pyspark.sql.types import IntegerType
from pyspark import StorageLevel

storesDF = spark.createDataFrame([2023, 2024], IntegerType())
print(storesDF.persist(StorageLevel.MEMORY_ONLY).storageLevel)
E

From the cache() docstring (pyspark.sql.DataFrame.cache): "Persists the DataFrame with the default storage level (MEMORY_AND_DISK)." Added in 1.3.0; changed in 3.4.0 to support Spark Connect. Notes: the default storage level changed to MEMORY_AND_DISK to match Scala in 2.0.
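That text can be reproduced locally; a one-line sketch, assuming an existing storesDF:

# Prints the docstring quoted above, including the MEMORY_AND_DISK default.
help(storesDF.cache)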
There are two plausible options here: B and E. For those who chose B: you can't explicitly set the storage level via storesDF.storageLevel because it is a read-only property, so the correct answer is E.
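That read-only behaviour is easy to confirm; a small sketch assuming an existing storesDF:

from pyspark import StorageLevel

# DataFrame.storageLevel is a getter-only property; it cannot be assigned
# prior to calling cache(), which is why option B does not work.
try:
    storesDF.storageLevel = StorageLevel.MEMORY_ONLY
except AttributeError as e:
    print(e)  # the property has no setter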
E

B. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache(). This option is incorrect: the storage level does not need to be set via storesDF.storageLevel prior to calling cache(), and cache() can be used directly on the DataFrame without explicitly setting a storage level.

E. The cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default) – persist() should be used instead. This option is the correct answer: the error in the code block is that cache() is used instead of persist(). While cache() caches DataFrames at the default MEMORY_AND_DISK level, persist() allows a different storage level to be specified, such as MEMORY_ONLY for caching only in memory. Therefore, persist() should be used instead of cache() to achieve the desired caching behavior.
B. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache().

The storage level of a DataFrame cache can be specified as an argument to the cache() operation, but if no storage level has been specified, the default MEMORY_AND_DISK level is used. Therefore, option A is incorrect.

Option C is incorrect because caching and checkpointing are different operations in Spark: caching stores a DataFrame in memory or on disk, while checkpointing saves a DataFrame to a reliable storage system like HDFS, which is necessary for iterative computations.

Option D is incorrect because DataFrames can be cached in memory or on disk using the cache() operation.

Option E is incorrect because cache() is the recommended method for caching DataFrames in Spark, and it supports caching at all storage levels, including MEMORY_ONLY. The persist() operation can be used to specify a storage level, but cache() is simpler and more commonly used.
Wrong explanation. You can call cache() or persist() without setting a storage level; both use the default MEMORY_AND_DISK. You clearly misunderstand the question itself: storesDF.cache().count() is workable code, but it fails the requirement. That is the issue. The question asks for caching "only in memory", meaning that if the data does not fit in memory, it should not be stored on disk but rather recomputed. Therefore, you need to specifically set the storage level to MEMORY_ONLY. A is correct.
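The recompute-instead-of-spill point is visible in the StorageLevel flags themselves; a sketch that needs only a pyspark install:

from pyspark import StorageLevel

# MEMORY_ONLY keeps useDisk=False: partitions that do not fit in memory are
# dropped and recomputed from lineage, never written to local disk.
print(StorageLevel.MEMORY_ONLY.useDisk)      # False
print(StorageLevel.MEMORY_AND_DISK.useDisk)  # True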