Certified Data Engineer Professional Exam - Question 77

Question

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directly, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected.

The function is displayed below with a blank:

Which response correctly fills in the blank to meet the specified requirements?

Examice · Accepted Answer

To enable near real-time workloads using Databricks Auto Loader with automatic schema detection and evolution, the function should be set up to utilize streaming writes. This requires the use of .writeStream for continuous processing, specifying the checkpoint location for fault tolerance, and setting mergeSchema to True to allow automatic schema evolution. Option E accurately reflects this setup by including these necessary configurations.
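Since the function from the question is not reproduced on this page, here is a minimal sketch of what such a helper could look like. The function name, the parameter names, and passing `spark` in explicitly are assumptions made for illustration. The option keys (`cloudFiles.format`, `cloudFiles.schemaLocation`, `checkpointLocation`, `mergeSchema`) are the ones discussed in this thread and in the linked Auto Loader schema documentation.

```python
def auto_load_json(spark, source_path, checkpoint_path, target_table_path):
    """Hypothetical helper: incrementally ingest JSON files with Auto Loader.

    `spark` is assumed to be an active SparkSession on Databricks;
    all names here are illustrative, not the exam's original code.
    """
    return (
        spark.readStream
        .format("cloudFiles")                                  # Auto Loader source
        .option("cloudFiles.format", "json")                   # incoming files are JSON
        .option("cloudFiles.schemaLocation", checkpoint_path)  # where the inferred schema is tracked
        .load(source_path)
        .writeStream                                           # streaming write, not a batch .write
        .option("checkpointLocation", checkpoint_path)         # progress tracking and fault tolerance
        .option("mergeSchema", "true")                         # let new columns evolve the target table
        .start(target_table_path)
    )
```

No `outputMode` is set here because append is the default for streaming writes.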

AzureDE2522 · Answer

Please refer to: https://docs.databricks.com/en/ingestion/auto-loader/schema.html

mouad_attaqi · Answer

The correct answer is E. It is a streaming write, and the default outputMode is append, so specifying it is optional in this case.

hal2401me · Answer

https://notebooks.databricks.com/demos/auto-loader/01-Auto-loader-schema-evolution-Ingestion.html

vikram12apr · Answer

readStream and writeStream share the schema through the checkpoint location,
so cloudFiles.schemaLocation is set to the same path as checkpointLocation and we don't need to specify the schema manually.
Also, setting mergeSchema to True makes sure that any newly detected column is added to the target table.

https://docs.databricks.com/en/ingestion/auto-loader/schema.html

Freyr · Answer

Reference: https://docs.databricks.com/en/ingestion/auto-loader/schema.html

writeStream: Ensures real-time streaming write capabilities, which is essential for near real-time workloads.
checkpointLocation: Necessary for fault tolerance and tracking progress.
mergeSchema: Ensures automatic schema evolution, allowing new columns to be detected and added to the target table.

Why is Option C incorrect?
Uses write instead of writeStream, which is for batch processing, making it inappropriate for real-time streaming.

Why is Option B incorrect?
Although it includes checkpointLocation and mergeSchema, the addition of trigger(once=True) is not necessary in this context, and it is better suited for batch-like processing.
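To illustrate the difference the trigger makes, here is a rough sketch. The helper name `start_write` and the `run_once` flag are hypothetical; `stream_df` is assumed to be a streaming DataFrame produced by an Auto Loader `readStream`. Without an explicit trigger the query keeps running and picks up files as they arrive; with `trigger(once=True)` it processes what is available and then stops, which is closer to a scheduled batch job.

```python
def start_write(stream_df, checkpoint_path, target_table_path, run_once=False):
    """Hypothetical helper contrasting a continuously running query with a run-once query."""
    writer = (
        stream_df.writeStream
        .option("checkpointLocation", checkpoint_path)  # fault tolerance and progress tracking
        .option("mergeSchema", "true")                   # allow new columns in the target table
    )
    if run_once:
        # Batch-like: process the currently available data, then stop.
        writer = writer.trigger(once=True)
    # Default (no trigger): micro-batches keep running as new files arrive.
    return writer.start(target_table_path)
```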


sturcu · Answer

There is a typo in the statement. Is it schema or checkpoint?
The provided answer is not correct. It has to be a writeStream, with append mode.

Dileepvikram · Answer

The question does not say to write as a stream; it says to process incrementally, so option C looks correct to me.

aragorn_brego · Answer

This response correctly fills in the blank to meet the specified requirements of using Databricks Auto Loader for automatic schema detection and evolution in a near real-time streaming context.
