Exam DP-600
Question 45

HOTSPOT -

You have a Fabric workspace that uses the default Spark starter pool and runtime version 1.2.

You plan to read a CSV file named Sales_raw.csv in a lakehouse, select columns, and save the data as a Delta table to the managed area of the lakehouse. Sales_raw.csv contains 12 columns.

You have the following code.

For each of the following statements, select Yes if the statement is true. Otherwise, select No.

NOTE: Each correct selection is worth one point.

    Correct Answer:

Discussion
metiii

1. No. This is called filter pushdown / predicate pushdown / column pruning. The config is available when reading a columnar format like Parquet; I didn't find anything equivalent for CSV. You can push down a predicate on a CSV so that only some rows are read, but that probably doesn't work for selecting columns, so Spark will read the entire file and then filter the columns. 2. Yes. Partitioning creates some overhead, since Spark needs to create more files. 3. Yes. inferSchema forces Spark to read the file twice: once for the schema and once for the data.
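The "read the file twice" point about inferSchema can be sketched without Spark at all. Below is a minimal pure-Python analogy (not Spark's actual implementation): pass 1 scans every row just to decide column types, pass 2 loads the data. The sample data and column names are made up for illustration.

```python
import csv
import io

# Tiny in-memory CSV standing in for a real file.
RAW = "SalesOrderNumber,UnitPrice\nSO1,3.5\nSO2,7.0\n"

def infer_types(text):
    """Pass 1: scan every row to find the narrowest type per column."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    order = [int, float, str]          # widening order: int -> float -> str
    idx = [0] * len(header)
    for row in reader:
        for i, value in enumerate(row):
            while True:
                try:
                    order[idx[i]](value)
                    break
                except ValueError:
                    idx[i] += 1        # widen this column's type and retry
    return header, [order[i] for i in idx]

def load(text, types):
    """Pass 2: re-read the same text, now converting with the known types."""
    reader = csv.reader(io.StringIO(text))
    next(reader)                       # skip header
    return [[t(v) for t, v in zip(types, row)] for row in reader]

header, types = infer_types(RAW)       # first full scan
rows = load(RAW, types)                # second full scan
```

With a schema supplied up front, only the second pass would be needed, which is exactly why inferSchema = 'true' adds execution time.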

282b85d

N-N-Y
•No: The Spark engine will initially read all columns from the CSV file, because the .select() transformation is applied after the data has been read into memory. Therefore all 12 columns of Sales_raw.csv are read before the selection of specific columns is applied.
•No: Removing the partition might not necessarily reduce the execution time. While there is some overhead in writing data to partitions, the overall impact on read performance, especially for large datasets, is usually beneficial. The execution time for saving may be higher due to partitioning, but the read-performance improvement usually outweighs this cost.
•Yes: Adding inferSchema = 'true' will increase the execution time of the query, because Spark needs to read through the entire dataset to determine the data type of each column. This extra pass over the data adds to the initial read time.
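The partitioning overhead mentioned above comes down to file layout: a partitioned write produces one directory (and at least one file) per distinct key instead of a single output file. A plain-Python sketch of that trade-off (not Spark; the `Year=...` folder naming just mirrors Hive-style partition paths, and the data is invented):

```python
import csv
import os
import tempfile
from collections import defaultdict

rows = [("SO1", 2023, 3.5), ("SO2", 2023, 7.0), ("SO3", 2024, 9.0)]

outdir = tempfile.mkdtemp()

# Unpartitioned write: a single output file.
with open(os.path.join(outdir, "all.csv"), "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Partitioned write (by year): one directory and file per distinct key.
# This extra bookkeeping is the write-side cost of partitionBy; the payoff
# is that later reads filtered on Year can skip whole directories.
by_year = defaultdict(list)
for r in rows:
    by_year[r[1]].append(r)

part_files = []
for year, group in sorted(by_year.items()):
    d = os.path.join(outdir, f"Year={year}")
    os.makedirs(d, exist_ok=True)
    path = os.path.join(d, "part-0.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(group)
    part_files.append(path)
```

Here three rows with two distinct years yield two partition files instead of one, so the write does more work while partition-filtered reads do less.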

lelima

This is correct

XiltroX

The answer is probably Y-N-Y. 1. Those are exactly the columns that are being read, so Yes. 2. Removing the partitionBy line would not result in any performance change, so No. 3. Adding inferSchema = true WILL add execution time, since it makes the engine go over the data twice (once to read the data, once to infer the schema), so Yes.

calvintcy

I took the test today. This question was included, but the option 'Removing the partition will reduce the execution time of the query' has been replaced by 'Will the Year column replace the OrderDate column?'. My answer was No.

wellingtonluis

After reading the whole file, the engine will select just some of the columns. But initially it reads the entire file.

stilferx

IMHO: 1. N, 2. N (arguable), 3. Y.
1. No, because it is CSV: it will be read in full (in contrast to Parquet).
2. No. Well, maybe ~0.5% slower due to creating new files, but effectively no.
3. Yes, because inferring the schema is an additional pass.

vish9

1. No. The CSV will be read in full and then filtered.
2. No. Using partitionBy with Spark's Delta format can affect write performance in several ways. Partitioning can potentially increase write throughput by distributing the write workload across multiple partitions; this parallelism lets Spark write to different partitions concurrently, improving overall write performance, especially for large datasets.
3. Yes. Inferring the schema will slow down the query.
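The "CSV will be read in full" point can be illustrated without Spark: CSV is row-oriented, so every field of every line must be tokenized even when only a couple of columns survive the select, unlike a columnar format such as Parquet where untouched columns are never read from disk. A rough pure-Python sketch (the column names follow the discussion; the data is made up):

```python
import csv
import io

RAW = ("SalesOrderNumber,OrderDate,CustomerName,UnitPrice\n"
       "SO1,2024-01-01,Contoso,3.5\n"
       "SO2,2024-01-02,Fabrikam,7.0\n")

fields_parsed = 0
selected = []
reader = csv.reader(io.StringIO(RAW))
header = next(reader)

# Keep only 2 of the 4 columns -- the "select" step.
keep = [header.index(c) for c in ("SalesOrderNumber", "UnitPrice")]

for row in reader:
    fields_parsed += len(row)          # every field was still tokenized
    selected.append([row[i] for i in keep])
```

Even though only two columns are kept, all eight fields of the two data rows had to be parsed first; the column projection happens after the read.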

Momoanwar

It's "read", not "red". This question is ambiguous; I would say No-No-Yes. For point 1: with case sensitivity, sales_raw is not Sales_raw.

dp600

I would go with N-Y-Y. CSV is a row format; I don't think you can separate it by columns before reading the entire content. Partitioning takes extra work, so it may slow down the process. inferSchema requires an extra scan of the document, or so I think, so I will go with Yes.

estrelle2008

Full of typos, this one. Anyhow, my guess: Y-N-N. inferSchema = true helps automatically determine column data types, but it needs an extra pass over the data, which comes with a slight query-performance cost. So the last statement = No.

Pegooli

I'm going with Y-N-N.

mnc_1997

Just tried it; it only writes the columns that were selected. Answer: Y-N-Y.

mnc_1997

Also, Spark takes a lazy-evaluation approach: it does not execute each method (read, load, option) into memory as it is called. The actual reading happens when an action is performed, such as display, show, or write. Spark builds a plan for how to execute the entire query and optimizes that plan for efficient execution.
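Spark's laziness can be mimicked with Python generators: chaining "transformations" builds a pipeline but touches no data, and only the terminal "action" drives execution. This is a loose analogy, not Spark itself; the sample rows are invented.

```python
reads = []

def read_rows():
    # Generator standing in for the "read": its body runs only when
    # something iterates over it (the "action"), not when it is wired up.
    for row in [("SO1", 3.5), ("SO2", 7.0), ("SO3", 9.0)]:
        reads.append(row)
        yield row

# "Transformation": compose the pipeline lazily -- no data is read yet.
pipeline = (price for _, price in read_rows() if price > 5)
assert reads == []            # nothing has executed so far

# "Action": iterating the pipeline finally pulls data through it.
result = list(pipeline)
```

Only at `list(pipeline)` do the reads happen, mirroring how Spark defers I/O until an action like write or show triggers the optimized plan.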

DilumD

1. Yes. Reason: the code selects specific columns from the DataFrame using the select method; the selected columns are "SalesOrderNumber", "OrderDate", "CustomerName", and "UnitPrice".
2. Yes. Reason: removing the partitionBy will simplify the process. Partitioning data involves some overhead in organizing the data into separate folders/files based on the partitioning column.
3. No. Reason: potentially slower. Enabling inferSchema generally results in a slightly slower initial read operation, because Spark needs an additional scan over a portion of the data to determine the data types before loading it.