Certified Data Engineer Associate Exam - Question 34

Question

A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run.

Which of the following tools can the data engineer use to solve this problem?

Examice · Accepted Answer

The data engineer can use Auto Loader. Auto Loader is a Databricks feature that incrementally processes new data files as they arrive in cloud storage. It records which files it has already ingested in the stream's checkpoint, so each run picks up only the files that arrived since the previous run. The source files are never moved, renamed, or modified, which fits a shared directory that other processes also depend on.
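For anyone who wants to see what this looks like in practice, here is a minimal sketch, not a definitive implementation. It assumes a Databricks runtime where `spark` is the active session, and the function name, paths, table name, and JSON file format are all hypothetical placeholders, not part of the question.

```python
def build_auto_loader_stream(spark, source_dir, checkpoint_dir, target_table,
                             file_format="json"):
    """Ingest only not-yet-processed files from source_dir into target_table.

    All argument values are illustrative assumptions; `spark` is expected to
    be the SparkSession that Databricks provides.
    """
    reader = (
        spark.readStream
        .format("cloudFiles")                                  # Auto Loader source
        .option("cloudFiles.format", file_format)              # format of the incoming files
        .option("cloudFiles.schemaLocation", checkpoint_dir)   # where the inferred schema is kept
        .load(source_dir)                                      # files are read in place, never moved
    )
    return (
        reader.writeStream
        .option("checkpointLocation", checkpoint_dir)  # remembers which files were already ingested
        .trigger(availableNow=True)                    # process everything new, then stop
        .toTable(target_table)
    )
```

Calling something like `build_auto_loader_stream(spark, "/mnt/shared/landing", "/mnt/checkpoints/landing", "bronze.events")` once per pipeline run would, under these assumptions, ingest only the files added since the last run, because the checkpoint carries the list of processed files between runs.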

XiltroX · Answer

E is the correct answer.

surrabhi_4 · Answer

Option E.

AndreFR · Answer

Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup.

https://docs.databricks.com/en/ingestion/auto-loader/index.html

DavidRou · Answer

Auto Loader can help if you want to ingest data incrementally.

Huroye · Answer

The data engineer needs to identify which files are new since the previous run. This seems to be an analysis effort. If that is the case, and I might be wrong, then DB SQL is the correct answer.

SerGrey · Answer

E is correct

benni_ale · Answer

E is correct
