
Certified Data Engineer Professional Exam - Question 83


All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:

key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG

There are 5 unique topics being ingested. Only the "registration" topic contains Personally Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to retain records containing PII in this table for only 14 days after initial ingestion. Non-PII records, however, should be retained indefinitely.

Which of the following solutions meets the requirements?

Correct Answer: DE

Data should be partitioned by the topic field because this allows for access control lists (ACLs) and delete statements to be applied effectively. By partitioning by the topic, you can isolate the 'registration' topic, which contains PII, and set retention policies to delete records after 14 days. Non-PII data can remain in other partitions indefinitely, thus meeting both the privacy and data retention requirements.
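
The 14-day delete described above can be sketched as a small helper that builds the `DELETE` statement for the PII topic. This is a minimal sketch under stated assumptions, not a definitive implementation: the table name `kafka_raw` and the function name are hypothetical, and the `timestamp` column is assumed to hold the Kafka record timestamp in epoch milliseconds, which only approximates the time of initial ingestion. The resulting string would be passed to something like `spark.sql(...)` in a scheduled job.

```python
from datetime import datetime, timedelta, timezone

# Retention window for PII records, taken from the question text.
RETENTION_DAYS = 14


def pii_delete_statement(table: str, now: datetime, topic: str = "registration") -> str:
    """Build a DELETE statement for PII records older than the retention window.

    Assumption: `timestamp` is stored as epoch milliseconds (LONG).
    `table` and `topic` are expected to be trusted constants, not user input.
    """
    cutoff = now - timedelta(days=RETENTION_DAYS)
    cutoff_ms = int(cutoff.timestamp() * 1000)
    # Filtering on `topic` limits the delete to the PII topic's partition
    # when the table is partitioned by that column.
    return (
        f"DELETE FROM {table} "
        f"WHERE topic = '{topic}' AND timestamp < {cutoff_ms}"
    )


if __name__ == "__main__":
    # Hypothetical table name, for illustration only.
    print(pii_delete_statement("kafka_raw", datetime.now(timezone.utc)))
```

Note that a Delta Lake `DELETE` only removes the rows from the current table version; the underlying data files remain reachable through time travel until they are vacuumed.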

Discussion

8 comments
mouad_attaqi (Option: E)
Oct 28, 2023

I think answer E is correct: by default, partitioning by a column creates a separate folder for each subset of the data belonging to a partition value.

Dileepvikram (Option: E)
Nov 9, 2023

I think the answer is E.

aragorn_brego (Option: E)
Nov 21, 2023

Partitioning data by the topic field would allow the data engineering team to apply access control lists (ACLs) to restrict access to the partition containing the "registration" topic, which holds PII. Furthermore, the team can set up automated deletion policies that specifically target the partition with PII data to delete records after 14 days, without affecting the data in other partitions. This approach meets both the privacy requirements for PII and the data retention goals for non-PII information.

ervinshang (Option: E)
Dec 25, 2023

E is correct

sturcu (Option: D)
Oct 25, 2023

Correct

sturcu
Oct 30, 2023

https://docs.databricks.com/en/data-governance/table-acls/object-privileges.html#securable-objects

[Removed] (Option: B)
Oct 29, 2023

The solution that meets the requirements is: B. Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory. Partitioning the data by the registration field allows the directory containing PII records to be isolated and access restricted via ACLs. Additionally, the data retention requirements can be met by setting up a separate job or process to remove PII records that are 14 days old. For non-PII records, they can be retained indefinitely utilizing Delta Lake's time travel functionality.

mouad_attaqi
Oct 29, 2023

There is no such thing as a registration field; "registration" is one of the distinct topics.

sturcu
Oct 30, 2023

You cannot restrict privileges with ACLs on a partition. The documentation states that the securable objects in the Hive metastore are databases, tables, views, and functions: https://docs.databricks.com/en/data-governance/table-acls/object-privileges.html#securable-objects

spaceexplorer (Option: E)
Jan 25, 2024

E is correct

ojudz08 (Option: D)
Feb 15, 2024

I think it's best to isolate the storage to avoid mistakenly deleting tables in the same storage, so I go with D.