Certified Data Engineer Professional Exam Questions

Certified Data Engineer Professional Exam - Question 129


An hourly batch job is configured to ingest data files from a cloud object storage container, where each batch represents all records produced by the source system in a given hour. The batch job that processes these records into the Lakehouse is delayed long enough to ensure no late-arriving data is missed. The user_id field is a unique key for the data, which has the following schema:

user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT

All new records are ingested into a table named account_history, which maintains a full record of the data in the same schema as the source. The next table in the system, account_current, is implemented as a Type 1 table representing the most recent value for each unique user_id.

Which implementation can be used to efficiently update the described account_current table as part of each hourly batch job, assuming there are millions of user accounts and tens of thousands of records processed hourly?

Correct Answer: D

The efficient approach is to filter account_history on the last_updated field for the most recent hour processed, reduce that batch to a single row per user_id (for example, by taking the max last_login per user_id), and then run a merge statement that updates or inserts the most recent value for each user_id. Because only the small hourly batch of tens of thousands of records is scanned and deduplicated before the merge, the table with millions of accounts is touched only for the keys that changed, and account_current always reflects the most recent data for each user.
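A minimal PySpark sketch of this pattern is shown below, assuming account_history and account_current are existing Delta tables with the schema above; the epoch bounds batch_start and batch_end for the hour just processed are hypothetical parameters, not part of the question.

# Sketch only: Delta tables account_history and account_current are assumed to exist;
# batch_start and batch_end are hypothetical epoch-second bounds for the processed hour.
from pyspark.sql import SparkSession, Window, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

batch_start, batch_end = 1735689600, 1735693200  # hypothetical hour boundaries

# 1. Restrict account_history to records ingested during the most recent hour.
recent = (
    spark.table("account_history")
         .filter((F.col("last_updated") >= batch_start) & (F.col("last_updated") < batch_end))
)

# 2. Keep only the latest record per user_id within the batch.
w = Window.partitionBy("user_id").orderBy(F.col("last_updated").desc(), F.col("last_login").desc())
latest = (
    recent.withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .drop("rn")
)

# 3. Merge the deduplicated batch into the Type 1 account_current table.
(
    DeltaTable.forName(spark, "account_current")
              .alias("t")
              .merge(latest.alias("s"), "t.user_id = s.user_id")
              .whenMatchedUpdateAll()
              .whenNotMatchedInsertAll()
              .execute()
)

The window deduplication keeps the merge source free of duplicate user_id values, which Delta Lake requires for a deterministic MERGE.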

Discussion

1 comment
Freyr (Option: D)
Jun 1, 2024

Correct Answer: D. Similar to Option A, but built specifically around user_id, which is the primary key. This approach ensures that account_current is always up to date with the latest information per user based on that key, reducing the risk of duplicate rows and preserving the integrity of the data with respect to the unique identifier.