The code block shown below contains an error. The code block is intended to read a Parquet file at the file path filePath into a DataFrame. Identify the error.
Code block:
spark.read.load(filePath, source = "parquet")
The load() method in PySpark's DataFrameReader class does not have a 'source' parameter. Instead, the appropriate parameter name is 'format', and its default value is 'parquet'. Therefore, the 'source' parameter should be removed, and the default format will be used.
The correct code block to read a parquet file would be spark.read.parquet(filePath).
The answer should be E: remove source, since the default format is 'parquet' anyway. That said, it is better to use the format-specific method than load(). https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrameReader.load.html?highlight=dataframereader%20load#pyspark.sql.DataFrameReader.load
Intention is to read a parquet at the file path filePath into a DataFrame
The parameters for load() function are: path, format, schema, **options
A. Overall it makes sense, but do we really need to use schema?
B. There is load operation, that's FALSE
C. read is used without parenthesis, FALSE
D. It should indeed, but there's no source parameter, FALSE
E. That's true, but we need to put quotes for the filePath, then it's FALSE
Makes it A, but the question is really strange and not clear.
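To see where a source keyword ends up given the parameter list quoted above, here is a minimal sketch in plain Python. load_like is a hypothetical stand-in that only copies the documented parameter list (path, format, schema, **options); it is not the real DataFrameReader.load and says nothing about how Spark itself treats an unrecognized option.

```python
# Hypothetical stand-in mirroring the documented parameter list of
# DataFrameReader.load: path, format, schema, **options.
# It only reports how Python binds the arguments; it reads no data.
def load_like(path=None, format=None, schema=None, **options):
    return {"path": path, "format": format, "schema": schema, "options": options}


# The call from the question: 'source' is not a named parameter,
# so Python collects it into **options and 'format' stays unset.
question_call = load_like("/tmp/data", source="parquet")

# The corrected call: 'format' is bound explicitly.
fixed_call = load_like("/tmp/data", format="parquet")

print(question_call)
print(fixed_call)
```

Under this signature the source keyword never reaches format, which is why the discussion settles on either dropping it or renaming it to format.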
UPD: Parquet files already carry their schema, so passing one is not needed. In that case I don't know what the answer is.
1. pyspark.sql.SparkSession.read returns a DataFrameReader: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.read.html#pyspark.sql.SparkSession.read
2. We check this DataFrameReader; it contains both "load" and "parquet" methods.
2.1. For load, the signature is load(path, format, schema): https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.load.html#pyspark.sql.DataFrameReader.load
Therefore, the answer is A or E. Typically Parquet contains schema information. I do not like this question, because when reading a Parquet file you would directly use spark.read.parquet().
E is correct. The "format" parameter should be used instead of "source" (default "parquet"): https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.load.html
format: str, optional. Optional string for format of the data source. Default to 'parquet'.
I would go for E
spark.read.load(PARQUET_PATH, format='parquet')
load() is valid if it is given the format.
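The three ways of reading the file that come up in this thread can be put side by side in a small sketch. read_parquet_three_ways is a hypothetical helper name, and spark is assumed to be an existing SparkSession passed in by the caller. The first call assumes the session's default data source (spark.sql.sources.default) has not been changed from parquet.

```python
def read_parquet_three_ways(spark, file_path):
    """Read the same Parquet path with three DataFrameReader calls."""
    # Relies on the session default data source, assumed here to be parquet.
    df_default = spark.read.load(file_path)
    # Names the format explicitly, so it does not depend on the default.
    df_explicit = spark.read.load(file_path, format="parquet")
    # Format-specific shortcut method.
    df_shortcut = spark.read.parquet(file_path)
    return df_default, df_explicit, df_shortcut
```

With a real session (for example one obtained from SparkSession.builder.getOrCreate()), calling read_parquet_three_ways(spark, filePath) should give three DataFrames over the same data. The explicit-format and shortcut forms are the ones that do not depend on session configuration.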