Certified Data Engineer Professional Exam - Question 106


A nightly batch job is configured to ingest all data files from a cloud object storage container where records are stored in a nested directory structure YYYY/MM/DD. The data for each date represents all records that were processed by the source system on that date, noting that some records may be delayed as they await moderator approval. Each entry represents a user review of a product and has the following schema:

user_id STRING, review_id BIGINT, product_id BIGINT, review_timestamp TIMESTAMP, review_text STRING

The ingestion job is configured to append all data for the previous date to a target table reviews_raw with an identical schema to the source system. The next step in the pipeline is a batch write to propagate all new records inserted into reviews_raw to a table where data is fully deduplicated, validated, and enriched.
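The "previous date" directory that the nightly job reads can be derived from the run date. A minimal sketch in plain Python, assuming a hypothetical base prefix named reviews (the question does not give the container name or path):

```python
from datetime import date, timedelta


def previous_day_prefix(run_date: date, base: str = "reviews") -> str:
    """Return the YYYY/MM/DD prefix for the day before run_date.

    `base` is a hypothetical container prefix, not taken from the question.
    """
    prev = run_date - timedelta(days=1)
    # date objects accept strftime codes inside an f-string format spec
    return f"{base.rstrip('/')}/{prev:%Y/%m/%d}"


print(previous_day_prefix(date(2024, 1, 1)))  # reviews/2023/12/31
```

A job that runs on 2024-01-01 would therefore read only the 2023/12/31 directory, which is why each append to reviews_raw holds one day of source data.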

Which solution minimizes the compute costs to propagate this batch of data?

Correct Answer: C

Using Delta Lake version history to compute the difference between the latest version of reviews_raw and the previous version minimizes compute costs. This approach processes only the data added since the last batch, so it avoids reprocessing the entire dataset or maintaining a streaming setup, which makes it cost-effective for this batch job.
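The idea of diffing two table versions can be illustrated with a toy model in plain Python. This is not the Delta Lake API: in Delta Lake the two snapshots would be read with time travel (for example the versionAsOf read option), whereas here each snapshot is just a list of tuples with made-up rows.

```python
from collections import Counter

# Toy snapshots of reviews_raw; rows are (user_id, review_id, product_id).
version_prev = [("u1", 1, 10), ("u2", 2, 11)]
# An append-only commit adds rows on top of the previous snapshot.
# The last row is a late duplicate of a review that was already ingested.
version_latest = version_prev + [("u3", 3, 10), ("u1", 1, 10)]


def snapshot_diff(latest, previous):
    """Rows in `latest` not accounted for by `previous` (multiset difference)."""
    remaining = Counter(latest) - Counter(previous)
    return sorted(remaining.elements())


new_rows = snapshot_diff(version_latest, version_prev)
print(new_rows)  # [('u1', 1, 10), ('u3', 3, 10)]
```

In this toy run the diff returns exactly the appended rows, including the duplicate of review 1, so deduplication against the target table is still a separate step.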

Discussion

5 comments
divingbell17 (Option: B)
Jan 2, 2024

B should be correct. https://www.databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html

Istiaque
Jan 3, 2024

It is a batch process.

bacckom (Option: A)
Jan 11, 2024

Should we consider deduplication? I don't think time travel can be used to deduplicate the target table.

ranith (Option: B)
Jan 26, 2024

B should be correct from a cost-minimization standpoint. A batch read would scan the whole reviews_raw table, which is unnecessary because historical data does not change. Even if a review is delayed while awaiting moderator approval, it is still inserted as a new record, so capturing only the new data is sufficient.

alexvno (Option: A)
Mar 14, 2024

Deduplication is required, so use an insert-only merge.
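The semantics of an insert-only merge can be sketched in plain Python. This is a toy model, not Delta Lake code: the Delta counterpart would be a MERGE with only a WHEN NOT MATCHED THEN INSERT clause. Using review_id as the deduplication key is an assumption, since the question does not name one. The sketch also drops duplicates inside the batch, which in Delta Lake would normally be done on the source before merging.

```python
def insert_only_merge(target, batch, key="review_id"):
    """Insert batch rows whose key is not already present; never update."""
    seen = {row[key] for row in target}
    merged = list(target)  # leave the caller's target list untouched
    for row in batch:
        if row[key] not in seen:
            merged.append(row)
            seen.add(row[key])  # also suppresses duplicates within the batch
    return merged


target = [{"review_id": 1, "review_text": "great"}]
batch = [
    {"review_id": 1, "review_text": "great"},  # already in target
    {"review_id": 2, "review_text": "ok"},     # new
    {"review_id": 2, "review_text": "ok"},     # duplicate within the batch
]
result = insert_only_merge(target, batch)
print([r["review_id"] for r in result])  # [1, 2]
```

Existing rows are never rewritten, so a review that arrives twice is stored once regardless of which batch delivered it.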

spaceexplorer (Option: B)
Jan 27, 2024

B is correct