Certified Data Engineer Professional Exam - Question 77


In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directly, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected.

The function is displayed below with a blank:

Which response correctly fills in the blank to meet the specified requirements?

Correct Answer: CE

To enable near real-time workloads using Databricks Auto Loader with automatic schema detection and evolution, the function should be set up to utilize streaming writes. This requires the use of .writeStream for continuous processing, specifying the checkpoint location for fault tolerance, and setting mergeSchema to True to allow automatic schema evolution. Option E accurately reflects this setup by including these necessary configurations.
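
The original function body is not reproduced on this page, so the following is only a minimal sketch of what such a helper might look like, assuming PySpark on Databricks. The function name `auto_load_json` and its parameter names are hypothetical, and the same path is reused for `cloudFiles.schemaLocation` and `checkpointLocation` as an illustrative choice rather than a requirement.

```python
def auto_load_json(spark, source_path, checkpoint_path, table_name):
    """Incrementally ingest JSON files with Auto Loader into a table.

    Hypothetical helper: `spark` is an active SparkSession on a runtime
    that provides the "cloudFiles" (Auto Loader) source.
    """
    return (
        spark.readStream
        .format("cloudFiles")                       # Auto Loader source
        .option("cloudFiles.format", "json")        # incoming files are JSON
        # Where Auto Loader stores the inferred schema and tracks its changes.
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load(source_path)
        .writeStream                                # streaming write, not batch .write
        .option("checkpointLocation", checkpoint_path)  # progress tracking and recovery
        .option("mergeSchema", "true")              # let new columns be added to the table
        .toTable(table_name)                        # starts the stream, returns a StreamingQuery
    )
```

A call such as `auto_load_json(spark, "/path/to/source", "/path/to/checkpoint", "target_table")` (placeholder paths and table name) would start the stream. No output mode is set because append is the default for streaming writes.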

Discussion

8 comments
AzureDE2522 (Option: E)
Nov 10, 2023

Please refer: https://docs.databricks.com/en/ingestion/auto-loader/schema.html

mouad_attaqi (Option: E)
Oct 26, 2023

The correct answer is E. It is a streaming write, and the default outputMode is append, so specifying it is optional in this case.

hal2401me (Option: E)
Mar 5, 2024

https://notebooks.databricks.com/demos/auto-loader/01-Auto-loader-schema-evolution-Ingestion.html

vikram12apr (Option: E)
Mar 9, 2024

readStream and writeStream share the schema through the checkpoint location: cloudFiles.schemaLocation is set to the same path as checkpointLocation, so we don't need to specify the schema manually. Also, mergeSchema = True makes sure that any newly detected column is added to the target table. https://docs.databricks.com/en/ingestion/auto-loader/schema.html

Freyr (Option: E)
May 29, 2024

Reference: https://docs.databricks.com/en/ingestion/auto-loader/schema.html
writeStream: Ensures streaming write capabilities, which are essential for near real-time workloads.
checkpointLocation: Necessary for fault tolerance and tracking progress.
mergeSchema: Ensures automatic schema evolution, allowing new columns to be detected and added to the target table.
Why is Option C incorrect? It uses write instead of writeStream. write is for batch processing, which makes it inappropriate for real-time streaming.
Why is Option B incorrect? Although it includes checkpointLocation and mergeSchema, the added trigger(once=True) is not necessary in this context and is better suited to batch-like processing.

sturcu
Oct 25, 2023

There is a typo in the statement. Is it schema or checkpoint? The provided answer is not correct. It has to be a writeStream, with mode append.

Dileepvikram (Option: C)
Nov 9, 2023

The question does not say to write as a stream; it says to process incrementally, so option C looks correct to me.

aragorn_brego (Option: E)
Nov 21, 2023

This response correctly fills in the blank to meet the specified requirements of using Databricks Auto Loader for automatic schema detection and evolution in a near real-time streaming context.