Certified Data Engineer Professional Exam - Question 83

Question

All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:

key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG

There are 5 unique topics being ingested. Only the "registration" topic contains Personally Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to retain records containing PII in this table for only 14 days after initial ingestion. Non-PII records, however, should be retained indefinitely.

Which of the following solutions meets the requirements?

Examice · Accepted Answer

Data should be partitioned by the topic field because this allows for access control lists (ACLs) and delete statements to be applied effectively. By partitioning by the topic, you can isolate the 'registration' topic, which contains PII, and set retention policies to delete records after 14 days. Non-PII data can remain in other partitions indefinitely, thus meeting both the privacy and data retention requirements.
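As a rough illustration of the approach described above, the following Python sketch builds the two Spark SQL statements involved: a table definition partitioned by topic, and a delete that targets only expired "registration" records. The table name `kafka_raw` is hypothetical, and the sketch assumes the `timestamp` column holds epoch milliseconds and is close enough to ingestion time to drive the 14-day rule; the question states neither.

```python
from datetime import datetime, timedelta, timezone

TABLE = "kafka_raw"          # hypothetical table name, not given in the question
PII_TOPIC = "registration"   # the only topic that contains PII
RETENTION_DAYS = 14          # retention window for PII records


def create_table_sql(table: str = TABLE) -> str:
    """Build a DDL statement for a Delta table partitioned by topic."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        "  key BINARY, value BINARY, topic STRING,\n"
        "  `partition` LONG, `offset` LONG, `timestamp` LONG\n"
        ") USING DELTA\n"
        "PARTITIONED BY (topic)"
    )


def pii_cutoff_ms(now: datetime, days: int = RETENTION_DAYS) -> int:
    """Return the cutoff as epoch milliseconds (assumed unit of `timestamp`)."""
    return int((now - timedelta(days=days)).timestamp() * 1000)


def delete_expired_pii_sql(now: datetime, table: str = TABLE) -> str:
    """Build a DELETE that touches only the PII topic's partition."""
    cutoff = pii_cutoff_ms(now)
    return (
        f"DELETE FROM {table} "
        f"WHERE topic = '{PII_TOPIC}' AND `timestamp` < {cutoff}"
    )


if __name__ == "__main__":
    print(create_table_sql())
    print(delete_expired_pii_sql(datetime.now(timezone.utc)))
```

Because the delete predicate filters on the partition column, only files under the "registration" partition need to be rewritten, and the other four topics are left untouched. Note that a Delta `DELETE` only removes rows from the current table version; the underlying data files remain until `VACUUM` removes them.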

mouad_attaqi · Answer

I think answer E is correct: by default, partitioning by a column creates a separate folder for each subset of the data that shares a partition value.
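To make the folder-per-partition point concrete, here is a small Python sketch of the conventional Hive-style layout (`column=value` directories). The table root and every topic name other than "registration" are hypothetical. This layout is a convention rather than a guarantee: the Delta transaction log, not the directory names, is the source of truth for which files belong to which partition.

```python
from pathlib import PurePosixPath


def partition_dir(table_root: str, column: str, value: str) -> str:
    """Return the conventional Hive-style directory for one partition value.

    Assumes a simple value that needs no escaping in a path.
    """
    return str(PurePosixPath(table_root) / f"{column}={value}")


if __name__ == "__main__":
    root = "/mnt/lake/kafka_raw"  # hypothetical table location
    # Only "registration" comes from the question; the rest are placeholders.
    topics = ["registration", "topic_b", "topic_c", "topic_d", "topic_e"]
    for topic in topics:
        print(partition_dir(root, "topic", topic))
```

Under this layout, the PII records would sit under a single directory such as `/mnt/lake/kafka_raw/topic=registration`, which is what makes a directory-level ACL feasible.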

Dileepvikram · Answer

I think answer is E

aragorn_brego · Answer

Partitioning data by the topic field would allow the data engineering team to apply access control lists (ACLs) to restrict access to the partition containing the "registration" topic, which holds PII. Furthermore, the team can set up automated deletion policies that specifically target the partition with PII data to delete records after 14 days, without affecting the data in other partitions. This approach meets both the privacy requirements for PII and the data retention goals for non-PII information.

ervinshang · Answer

E is correct

sturcu · Answer

Correct

[Removed] · Answer

The solution that meets the requirements is: B. Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory.

Partitioning the data by the registration field allows the directory containing PII records to be isolated and access restricted via ACLs. Additionally, the data retention requirements can be met by setting up a separate job or process to remove PII records that are 14 days old. For non-PII records, they can be retained indefinitely utilizing Delta Lake's time travel functionality.

spaceexplorer · Answer

E is correct

ojudz08 · Answer

I think it's best to isolate the storage to avoid mistakenly deleting tables in the same storage, so I go with D.

Discussion