Professional Data Engineer Exam - Question 21

Question

Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud. Transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about a transmission, the system re-transmits the data. How should you deduplicate the data most efficiency?

Examice · Accepted Answer

The most efficient way to deduplicate data in this context is to compute the hash value of each data entry and compare it with all historical data. This ensures that duplicate entries, even with different timestamps, are identified and eliminated based on the contents of the payload alone. Using a hash value for comparison is computationally efficient and requires less storage compared to other methods such as storing entire payloads or using GUIDs. Additionally, maintaining just the hash value circumvents issues with variations in timestamps while still ensuring data integrity.

dg63 · Answer

The best answer is "A". 
Answer "D" is not as efficient or error-proof due to two reasons
1. You need to calculate hash at sender as well as at receiver end to do the comparison. Waste of computing power.
2. Even if we discount the computing power, we should note that the system is sending inventory information. Two messages sent at different can denote same inventory level (and thus have same hash). Adding sender time stamp to hash will defeat the purpose of using hash as now retried messages will have different timestamp and a different hash. 
if timestamp is used as message creation timestamp than that can also be used as a UUID.

[Removed] · Answer

Answer: D
Description: Using Hash values we can remove duplicate values from a database. Hashvalues will be same for duplicate data and thus can be easily rejected.

juliobs · Answer

Hard question.
It's a *proprietary* system. Who guarantees we can even add a GUID?
But if you can, it's definitely more efficient than calculating hashes (ignoring timestamp).

Hungry_guy · Answer

Answer is B - although the time stamp is diff for each transmission - the hash value is computed for the payload, not for the timestamp - which is just an added field for transmission. So, has val remains the same for all transmissions of the same data - which is what we can use for comparision.

So, much more efficient to just directly compare the hash values with the historical data - to check and remove duplicates - instead of again wasting space storing stuff - in option D

JustQ · Answer

B. Compute the hash value of each data entry, and compare it with all historical data.

Explanation:

Efficiency: Hashing is a fast and efficient operation, and comparing hash values is generally quicker than comparing the entire payload or using other methods.

Space Efficiency: Storing hash values requires less storage space compared to storing entire payloads or using global unique identifiers (GUIDs).

Deduplication: By computing the hash value of each data entry and comparing it with historical data, you can easily identify duplicate transmissions. If the hash value matches an existing one, it indicates that the payload is the same.

tibuenoc · Answer

As Dg63 wrote.

boca_2022 · Answer

A is best choice. D doesn't make sense.

Mark_86 · Answer

This question is formulated very badly.
From the way that A is formulated, you would not deduplicate but rather the duplicates would have the same GUID.
Then we have D, which is storing the information (assuming the hash is created without the timestamp). B is doing it right away. D only alludes to the actual deduplication. But it would be more efficient.

alihabib · Answer

Why not D ? Generate a Hash for payload entry and maintain the value as metadata. Do the validation check on Dataflow..... A GUID will generate 2 different entries for same payload entry, it will not tackle duplication check

steghe · Answer

I though the answer was A 'cos it's more efficient. But I read the answer with more attention: GUID is given "at each data entry". It's not said that GUID was given from publisher. If GUID is given in data entry (subscriber), two equal messages can have different GUID.
 D is not complete 'cos it's not so precise about hash field that are used.
I'm in doubt on this answer :-(

musumusu · Answer

Answer B:
Option A: GUIDs can deduplicate the data but is expensive and good for multiple data processing. 
Option B: Using hash function to authenticate the unique rows, this function can be applied directly in bigquery. 
Option D, is complex and more expensive.

``
`CREATE TEMP FUNCTION hashValue(input STRING) AS (
  CAST(FARM_FINGERPRINT(input) AS STRING)
);
``

AshokPalle · Answer

Just asked Chatgpt, it gave me option D

Melampos · Answer

you cannot deduplicate data adding a random guid, with guid row is distinct than others

rtcpost · Answer

D. Maintain a database table to store the hash value and other metadata for each data entry.

Storing a database table with hash values and metadata is an efficient way to deduplicate data. When new data is transmitted, you can calculate the hash of the payload and check whether it already exists in the database. This approach allows for efficient duplicate detection without the need to compare the new data with all historical data. It's a common and scalable technique used to ensure data consistency and avoid processing the same data multiple times.

Options A (assigning GUIDs to each data entry) and C (storing each data entry as the primary key) can work, but they might be less efficient than using hash values when dealing with a large volume of data. Option B (computing the hash value of each data entry and comparing it with all historical data) can be computationally expensive and slow, especially if there's a significant amount of historical data to compare against. Storing hash values in a table allows for fast and efficient deduplication.

rocky48 · Answer

Answer : A
"D" is not as efficient or error-proof due to two reasons
1. You need to calculate hash at sender as well as at receiver end to do the comparison. Waste of computing power.
2. Even if we discount the computing power, we should note that the system is sending inventory information. Two messages sent at different can denote same inventory level (and thus have same hash). Adding sender time stamp to hash will defeat the purpose of using hash as now retried messages will have different timestamp and a different hash.
if timestamp is used as message creation timestamp than that can also be used as a UUID.

TVH_Data_Engineer · Answer

To deduplicate the data most efficiently, especially in a cloud environment where the data is sent periodically and re-transmissions can occur, the recommended approach would be:

D. Maintain a database table to store the hash value and other metadata for each data entry.

This approach allows you to quickly check if an incoming data entry is a duplicate by comparing hash values, which is much faster than comparing all fields of a data entry. The metadata, which includes the timestamp and possibly other relevant information, can help resolve any ambiguities that may arise if the hash function ever produces collisions.

vbrege · Answer

1. My original vote was 'B'. I chose it over 'D' because option 'D' does not explicitly say anything about how that table will be used for deduplication. In hindsight, explicit usage of table should not be given much weightage so after review and seeing other comments, I thought of 'D' as the correct answer.

2. Now looking more clearly at option 'D' (and 'B' also), it's a little ambiguous of what keys will be used to create the hash. So, if you use the payload PLUS the timestamp, the hash is of no use. This is a little confusing

3. Finally, although I never thought this is the right option, 'A' seems to be the correct option. The GUID is created at Data entry, NOT at the transmission stage. So, the GUID should be representative of the payload only and NOT the timestamp which will make it unique per payload, not per transmission of the same payload. So, in the end, I feel like 'A' is the correct choice.

Professional Data Engineer Exam - Question 21

Discussion