Exam: Certified Data Engineer Professional
Question 30

A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.

Which code snippet completes this function definition?

def new_records():

    Correct Answer: A

    The function needs to return an object that can be used to manipulate new records that have not yet been processed. Since the ingest job appends to the 'bronze' table, using the readStream method on this table allows for reading new data incrementally. The option 'spark.readStream.table("bronze")' is appropriate for this purpose as it ensures the function can handle new records as they are added to the 'bronze' table.
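As a rough sketch of what this explanation describes (assuming only that the function returns the streaming DataFrame directly; the surrounding pipeline code is not shown in the question), the completed definition could look like:

    def new_records():
        # Incrementally read rows appended to the bronze table; a downstream
        # writeStream with a checkpoint tracks which records have already
        # been processed to the next table.
        return spark.readStream.table("bronze")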

Discussion
Laraujo2022 (Option: E)

In my opinion E is not correct because no parameters (year, month, and day) are passed to the function... the function definition is def new_records():

AzureDE2522 (Option: D)

    # Not providing a starting version/timestamp will result in the latest snapshot being fetched first
    spark.readStream.format("delta") \
        .option("readChangeFeed", "true") \
        .table("myDeltaTable")

Please refer to: https://docs.databricks.com/en/delta/delta-change-data-feed.html

DAN_H (Option: A)

A, because Structured Streaming reads Delta tables incrementally. While a streaming query is active against a Delta table, new records are processed idempotently as new table versions commit to the source table.

alaverdi (Option: A)

In my opinion A is the correct answer. You read the Delta table as a stream and process only newly arrived records. This is maintained while writing the stream, with the state stored in the checkpoint location.

    (spark.readStream.table("bronze")
        .writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/path/to/checkpoints/")
        .toTable("silver"))

vikram12apr (Option: E)

Please read the question again. It asks how to get the data from the bronze table into some downstream table. Since it is an append-only nightly job, the filter on the file name will give the new data available in the bronze table that has not yet flowed down the pipeline.
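
A hypothetical reconstruction of the filter-based approach these commenters attribute to option E (the actual option text is not reproduced in this discussion; the year/month/day parameters and partition columns are assumptions based on the comments):

    def new_records(year, month, day):
        # Hypothetical sketch: select only the partition the nightly batch
        # job just wrote, i.e. records not yet moved downstream.
        return (
            spark.read.table("bronze")
            .filter(f"year = {year} AND month = {month} AND day = {day}")
        )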

zhiva (Option: A)

Both E and A could be correct, but the function definition has no input parameters. This means we can't use them correctly in the return statement with only the information given in the question. This is why I vote for A.

imatheushenrique (Option: E)

Option E makes more sense because the whole partition would be filtered. It can't be one of the options that use CDF, because there is no readChangeFeed option in the DataFrame read.

arik90 (Option: E)

Since the ingest_daily_batch function writes to the "bronze" table in batch mode using spark.read and write operations, we should not use readStream to read from it in the subsequent function.

alexvno (Option: E)

Probably E, but the filename is still not specified, only the folder path.

agreddy (Option: D)

D is correct. https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/ CDF can be enabled on a non-streaming Delta table, and "delta" is the default table format.

ojudz08 (Option: D)

The question here is how to manipulate new records that have not yet been processed to the next table. Since the data has been ingested into the bronze table, you need to check whether the daily-ingested data is already in the silver table, so I think the answer is D. Enabling change data feed allows you to track row-level changes between Delta table versions: https://docs.databricks.com/en/delta/delta-change-data-feed.html
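
A minimal sketch of the CDF-based approach these comments describe, following the linked Databricks docs (the table name bronze comes from the question; the starting version is an arbitrary assumption). Change data feed must be enabled on the table before readChangeFeed reads return any changes:

    # Enable change data feed on the existing table (one-time table property)
    spark.sql("ALTER TABLE bronze SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

    # Batch-read row-level changes starting from an assumed table version
    changes = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 1)
        .table("bronze")
    )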

guillesd

The problem here is that both A and E are correct. E just follows the previous filtering logic, while A uses the readStream method, which will have to maintain a checkpoint. But both can work.

adenis (Option: A)

A is correct.

Jay_98_11 (Option: E)

It can't be D, since there is no such read option in CDF: https://docs.databricks.com/en/delta/delta-change-data-feed.html

mht3336

spark.read.format("delta") \ .option("readChangeFeed", "true") \ .option("startingVersion", 0) \ .option("endingVersion", 10) \ .table("myDeltaTable")

RafaelCFC (Option: E)

E addresses the desired filtering, while keeping with the logic of the first step being a batch job, and has no code errors.

chokthewa (Option: E)

E is correct. D uses an invalid option; refer to the sample in https://docs.databricks.com/en/delta/delta-change-data-feed.html. A and B don't filter, so they would gather the whole table's data. E uses the known values to filter.