
Certified Data Engineer Professional Exam - Question 30


A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.

Which code snippet completes this function definition?

def new_records():

Correct Answer: A

The function needs to return an object that can be used to manipulate new records that have not yet been processed. Since the ingest job appends to the 'bronze' table, using the readStream method on this table allows for reading new data incrementally. The option 'spark.readStream.table("bronze")' is appropriate for this purpose as it ensures the function can handle new records as they are added to the 'bronze' table.
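
For reference, a minimal sketch of what the completed function would look like under that reading; the bronze table name comes from the question, and nothing else is assumed:

def new_records():
    # Returns a streaming DataFrame over the bronze table. Structured Streaming
    # tracks which Delta table versions have already been processed (via the
    # checkpoint of the downstream write), so only unprocessed records flow on.
    return spark.readStream.table("bronze")

The downstream write would then record its progress with a checkpoint location, as shown in alaverdi's comment below.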

Discussion

17 comments
AzureDE2522 (Option: D)
Nov 12, 2023

# Not providing a starting version/timestamp will result in the latest snapshot being fetched first.
spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .table("myDeltaTable")

Please refer to: https://docs.databricks.com/en/delta/delta-change-data-feed.html

Laraujo2022 (Option: E)
Nov 16, 2023

In my opinion E is not correct because we do not see any parameters (year, month and day) passed to the function... the signature is def new_records():

alaverdi (Option: A)
Dec 13, 2023

In my opinion A is the correct answer. You read the Delta table as a stream and process only newly arrived records. This is maintained while writing the stream, with the state stored in the checkpoint location.

spark.readStream.table("bronze") \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/path/to/checkpoints/") \
    .toTable("silver")

DAN_H (Option: A)
Feb 1, 2024

A, because Structured Streaming incrementally reads Delta tables. While a streaming query is active against a Delta table, new records are processed idempotently as new table versions commit to the source table.
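
To make that incremental, idempotent behavior concrete: in a nightly job such a stream is often run with an availableNow trigger, which processes whatever has been committed since the last checkpoint and then stops. This is an illustration of the pattern, not the text of any answer choice; the silver target table and checkpoint path are assumptions.

# Sketch: process everything committed to bronze since the last checkpoint, then stop.
# The "silver" table name and checkpoint path are illustrative assumptions.
(spark.readStream.table("bronze")
    .writeStream
    .trigger(availableNow=True)
    .option("checkpointLocation", "/path/to/checkpoints/new_records")
    .toTable("silver"))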

vikram12apr (Option: E)
Mar 8, 2024

Please read the question again. It is asking to get the data from the bronze table into some downstream table. Since it is an append-only nightly job, the filter on the file name will give the new data available in the bronze table which has not yet flowed down the pipeline.

chokthewa (Option: E)
Oct 21, 2023

E is correct. D uses an invalid option; see the sample in https://docs.databricks.com/en/delta/delta-change-data-feed.html. A and B don't filter, so they will gather the whole table's data. E uses the known value to filter.

RafaelCFC (Option: E)
Jan 5, 2024

E addresses the desired filtering, while keeping with the logic of the first step being a batch job, and has no code errors.
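
The answer choices themselves are not reproduced on this page. Based on the commenters' descriptions, option E is a batch read of the bronze table filtered to the latest ingest date. The sketch below is a hypothetical reconstruction of that idea; the year/month/day partition columns and the current_date-based filter are assumptions, not the literal option text.

from pyspark.sql import functions as F

def new_records():
    # Hypothetical reconstruction of the batch-filter approach attributed to option E.
    # Assumes the nightly ingest writes year/month/day partition columns.
    return (spark.read.table("bronze")
            .filter((F.col("year") == F.year(F.current_date()))
                    & (F.col("month") == F.month(F.current_date()))
                    & (F.col("day") == F.dayofmonth(F.current_date()))))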

Jay_98_11 (Option: E)
Jan 13, 2024

Can't be D, since there is no such read option for CDF. https://docs.databricks.com/en/delta/delta-change-data-feed.html

mht3336
Jan 25, 2024

spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 0) \
    .option("endingVersion", 10) \
    .table("myDeltaTable")

adenis (Option: A)
Jan 31, 2024

A is Correct

guillesd
Feb 5, 2024

The problem here is that both A and E are correct. E just follows the previous filtering logic, while A uses the readStream method, which has to maintain a checkpoint. But both can work.

ojudz08 (Option: D)
Feb 14, 2024

The question here is how to manipulate new records that have not yet been processed to the next table. Since the data has been ingested into the bronze table, you need to check whether the daily ingested data is already present in the silver table, so I think the answer is D. Enabling change data feed allows tracking row-level changes between Delta table versions: https://docs.databricks.com/en/delta/delta-change-data-feed.html
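
For completeness, the change data feed approach the D voters describe only works once CDF has been enabled on the source table. A minimal sketch, assuming the bronze table from the question and an arbitrary starting version:

# Enable change data feed on the existing Delta table (one-time setup).
spark.sql("ALTER TABLE bronze SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read row-level changes incrementally; startingVersion 0 is an arbitrary example value.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 0)
           .table("bronze"))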

agreddy (Option: D)
Feb 21, 2024

D is correct. https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/ CDF can be enabled on a non-streaming Delta table; "delta" is the default table format.

alexvno (Option: E)
Mar 13, 2024

Probably E, but the filename is still not specified, only the folder path.

arik90 (Option: E)
Mar 27, 2024

Since the ingest_daily_batch function writes to the "bronze" table in batch mode using spark.read and write operations, we should not use readStream to read from it in the subsequent function.

imatheushenrique (Option: E)
Jun 4, 2024

Option E makes more sense because the whole partition would be filtered. It can't be the options that use CDF because there's no readChangeFeed option in the DataFrame read shown.

zhiva (Option: A)
Jun 27, 2024

Both E and A can be correct, but the function definition has no input parameters. This means we can't use them correctly in the return statement with only the information given in the question. This is why I vote for A.