Certified Data Engineer Professional Exam
Question 14

An hourly batch job is configured to ingest data files from a cloud object storage container, where each batch represents all records produced by the source system in a given hour. The batch job that processes these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema:

    user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT

All new records are ingested into a table named account_history, which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.

Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?

    Correct Answer: C

To update the account_current table efficiently, filter account_history on the last_updated field so that only the records from the most recent hour's batch are considered, rather than the millions of historical rows. A merge statement keyed on user_id, the table's unique key, then updates existing rows and inserts new ones, so account_current always reflects the latest data for each user. This matches Type 1 behavior: updates replace old values with new ones, and no historical versions are kept.
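As a minimal sketch of this pattern in Spark SQL, assuming the hourly job supplies the batch window as parameters (the ${batch_start_ts} and ${batch_end_ts} placeholders below are hypothetical, not part of the question):

    MERGE INTO account_current AS tgt
    USING (
      -- only rows written by the most recent hourly batch;
      -- the window bounds are hypothetical job parameters
      SELECT user_id, username, user_utc, user_region,
             last_login, auto_pay, last_updated
      FROM account_history
      WHERE last_updated >= ${batch_start_ts}
        AND last_updated <  ${batch_end_ts}
    ) AS src
    ON tgt.user_id = src.user_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *

The filter limits the source side of the merge to the tens of thousands of rows from the latest batch instead of the millions of accounts, and Delta Lake's UPDATE SET * / INSERT * shorthand works here because account_history and account_current share the same schema.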

Discussion
terrku (Option: B)

Type 1 table means the behavior is overwriting.

RafaelCFC (Option: C)

My reasoning is thus: the application is based on batch processes, so A is wrong. Overwriting the table would destroy the Type 1 SCD behavior, so B is wrong. Comparing versions of account_history would not be efficient, as the whole dataset would be scanned, so D is wrong. 'username' is not a key column, so we have no guarantee that it is unique; deduplicating by it can yield wrongly grouped sets of rows, so E is not a safe bet given the information we have. C is the best option.

thxsgod (Option: C)

Correct

DAN_H (Option: C)

answer is C

sturcu (Option: E)

We need to filter on the last hour's records and deduplicate them, then merge. D is not correct; filtering on the max login date makes no sense.
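As a side note on the deduplicate-then-merge idea: if the same user_id can appear more than once within a single hourly batch, the source fed into the merge must first be reduced to one row per key, or Delta Lake will reject the merge with a multiple-source-rows-matched error. A minimal sketch, keyed on user_id rather than username and reusing the same hypothetical ${batch_start_ts} placeholder as above:

    WITH latest_batch AS (
      SELECT *,
             ROW_NUMBER() OVER (
               PARTITION BY user_id           -- the true unique key
               ORDER BY last_updated DESC     -- keep the newest record
             ) AS rn
      FROM account_history
      WHERE last_updated >= ${batch_start_ts}
    )
    SELECT user_id, username, user_utc, user_region,
           last_login, auto_pay, last_updated
    FROM latest_batch
    WHERE rn = 1

This deduplicated set can then be used as the USING clause of the merge shown earlier.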

sturcu

If the "las log in" is the column that shows the lates version of the record then answer c is correct

petrv

Deduplication on username does not make sense; username is not the PK.

Karunakaran_R (Option: B)

I think B; a Type 1 table must overwrite the data.

Freyr (Option: C)

C is correct. A Type 1 table performs an "upsert" based on the merge condition, without maintaining history: new records are inserted and existing records are updated in place, so no historical versions are retained.

PrashantTiwari (Option: C)

C is correct

spaceexplorer (Option: C)

answer is C

kz_data (Option: C)

Correct answer is C

ATLTennis (Option: D)

D is the most efficient way to identify the changes from the last data refresh.

Eertyy (Option: C)

correct answer is C