
Certified Data Engineer Professional Exam - Question 30


A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.

Which code snippet completes this function definition?

def new_records():

Correct Answer: A

The function needs to return an object that can be used to manipulate new records that have not yet been processed. Since the ingest job appends to the 'bronze' table, using the readStream method on this table allows for reading new data incrementally. The option 'spark.readStream.table("bronze")' is appropriate for this purpose as it ensures the function can handle new records as they are added to the 'bronze' table.
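
For reference, a minimal sketch of what the completed function would look like under that reading; the bronze table name comes from the question, and nothing else is assumed:

def new_records():
    # Returns a streaming DataFrame over the bronze table. Structured Streaming
    # tracks which Delta table versions have already been processed (via the
    # checkpoint of the downstream write), so only unprocessed records flow on.
    return spark.readStream.table("bronze")

The downstream write would then record its progress with a checkpoint location, as shown in alaverdi's comment below.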

Discussion

17 comments
AzureDE2522 (Option: D)
Nov 12, 2023

# Not providing a starting version/timestamp will result in the latest snapshot being fetched first.
spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .table("myDeltaTable")

Please refer to: https://docs.databricks.com/en/delta/delta-change-data-feed.html

Laraujo2022 (Option: E)
Nov 16, 2023

In my opinion E is not correct because we do not see any parameters (year, month and day) passed to the function... the signature is def new_records():

alaverdi (Option: A)
Dec 13, 2023

In my opinion A is the correct answer. You read the Delta table as a stream and process only newly arrived records. This is maintained while writing the stream, with the state stored in the checkpoint location.

spark.readStream.table("bronze") \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/path/to/checkpoints/") \
    .toTable("silver")

DAN_H (Option: A)
Feb 1, 2024

A, because Structured Streaming incrementally reads Delta tables. While a streaming query is active against a Delta table, new records are processed idempotently as new table versions commit to the source table.
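
To make that incremental, idempotent behavior concrete: in a nightly job such a stream is often run with an availableNow trigger, which processes whatever has been committed since the last checkpoint and then stops. This is an illustration of the pattern, not the text of any answer choice; the silver target table and checkpoint path are assumptions.

# Sketch: process everything committed to bronze since the last checkpoint, then stop.
# The "silver" table name and checkpoint path are illustrative assumptions.
(spark.readStream.table("bronze")
    .writeStream
    .trigger(availableNow=True)
    .option("checkpointLocation", "/path/to/checkpoints/new_records")
    .toTable("silver"))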

vikram12apr (Option: E)
Mar 8, 2024

Please read the question again. It is asking to get the data from the bronze table into some downstream table. Since it is an append-only nightly job, the filter on the file name will give the new data available in the bronze table which has not yet flowed down the pipeline.

chokthewa (Option: E)
Oct 21, 2023

E is correct. D uses an invalid option; see the sample in https://docs.databricks.com/en/delta/delta-change-data-feed.html. A and B don't filter, so they will gather the whole table's data. E uses the known value to filter.

RafaelCFC (Option: E)
Jan 5, 2024

E addresses the desired filtering, while keeping with the logic of the first step being a batch job, and has no code errors.
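
The answer choices themselves are not reproduced on this page. Based on the commenters' descriptions, option E is a batch read of the bronze table filtered to the latest ingest date. The sketch below is a hypothetical reconstruction of that idea; the year/month/day partition columns and the current_date-based filter are assumptions, not the literal option text.

from pyspark.sql import functions as F

def new_records():
    # Hypothetical reconstruction of the batch-filter approach attributed to option E.
    # Assumes the nightly ingest writes year/month/day partition columns.
    return (spark.read.table("bronze")
            .filter((F.col("year") == F.year(F.current_date()))
                    & (F.col("month") == F.month(F.current_date()))
                    & (F.col("day") == F.dayofmonth(F.current_date()))))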

Jay_98_11 (Option: E)
Jan 13, 2024

Can't be D, since there is no such read option for CDF. https://docs.databricks.com/en/delta/delta-change-data-feed.html

mht3336
Jan 25, 2024

spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 0) \
    .option("endingVersion", 10) \
    .table("myDeltaTable")

adenis (Option: A)
Jan 31, 2024

A is Correct

guillesd
Feb 5, 2024

The problem here is that both A and E are correct. E just follows the previous filtering logic, while A uses the readStream method, which has to maintain a checkpoint. But both can work.

ojudz08 (Option: D)
Feb 14, 2024

The question here is how to manipulate new records that have not yet been processed to the next table. Since the data has been ingested into the bronze table, you need to check whether the daily ingested data is already present in the silver table, so I think the answer is D. Enabling change data feed allows tracking row-level changes between Delta table versions: https://docs.databricks.com/en/delta/delta-change-data-feed.html
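
For completeness, the change data feed approach the D voters describe only works once CDF has been enabled on the source table. A minimal sketch, assuming the bronze table from the question and an arbitrary starting version:

# Enable change data feed on the existing Delta table (one-time setup).
spark.sql("ALTER TABLE bronze SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read row-level changes incrementally; startingVersion 0 is an arbitrary example value.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 0)
           .table("bronze"))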

agreddy (Option: D)
Feb 21, 2024

D is correct. https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/ CDF can be enabled on a non-streaming Delta table; "delta" is the default table format.

alexvno (Option: E)
Mar 13, 2024

Probably E, but the filename is still not specified, only the folder path.

arik90 (Option: E)
Mar 27, 2024

Since the ingest_daily_batch function writes to the "bronze" table in batch mode using spark.read and write operations, we should not use readStream to read from it in the subsequent function.

imatheushenrique (Option: E)
Jun 4, 2024

Option E makes more sense because the whole partition would be filtered. It can't be the options that use CDF because there's no readChangeFeed option in the DataFrame read shown.

zhiva (Option: A)
Jun 27, 2024

Both E and A can be correct, but the function definition has no input parameters. This means we can't use them correctly in the return statement with only the information given in the question. This is why I vote for A.