
Professional Data Engineer Exam - Question 250


Your company's data platform ingests CSV file dumps of booking and user profile data from upstream sources into Cloud Storage. The data analyst team wants to join these datasets on the email field, which is available in both, to perform analysis. However, personally identifiable information (PII) should not be accessible to the analysts. You need to de-identify the email field in both datasets before loading them into BigQuery for the analysts. What should you do?

Correct Answer: B

To de-identify the email field in both datasets before loading them into BigQuery for analyst use, while still allowing joins on that field, format-preserving encryption (FPE-FFX) with Cloud DLP is the optimal solution. Because FPE-FFX encrypts deterministically, the same email address is consistently encrypted to the same unique value in both datasets, so the encrypted columns can still be joined while the actual email addresses remain hidden from analysts. This meets both the technical requirement of performing joins and the compliance requirement of protecting PII.
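As a rough illustration of what this pre-load step could look like, the sketch below uses the google-cloud-dlp Python client to apply FPE-FFX to the email column of a CSV-derived table. It is only a sketch under assumptions: the project ID, KMS key name, wrapped key bytes, custom alphabet, and the "email"/"booking_id" column names are placeholders, not values from the question.

```python
# Minimal sketch: pseudonymize the "email" column of a CSV-derived table with
# Cloud DLP format-preserving encryption (FPE-FFX) before loading into BigQuery.
# PROJECT_ID, KMS_KEY_NAME, WRAPPED_KEY, and column names are placeholders.
import google.cloud.dlp_v2 as dlp_v2

PROJECT_ID = "my-project"  # assumption
KMS_KEY_NAME = (           # assumption: KMS key used to wrap the DLP data key
    "projects/my-project/locations/global/keyRings/dlp/cryptoKeys/ffx-key"
)
WRAPPED_KEY = b"..."       # assumption: AES key wrapped with the KMS key above

client = dlp_v2.DlpServiceClient()

deidentify_config = {
    "record_transformations": {
        "field_transformations": [
            {
                "fields": [{"name": "email"}],
                "primitive_transformation": {
                    "crypto_replace_ffx_fpe_config": {
                        "crypto_key": {
                            "kms_wrapped": {
                                "wrapped_key": WRAPPED_KEY,
                                "crypto_key_name": KMS_KEY_NAME,
                            }
                        },
                        # The alphabet must cover every character the field can contain.
                        "custom_alphabet": "abcdefghijklmnopqrstuvwxyz0123456789@._-",
                    }
                },
            }
        ]
    }
}

# Each CSV chunk is sent as a DLP table; the same key yields the same token for
# the same email in both the booking and user profile files, so joins still work.
item = {
    "table": {
        "headers": [{"name": "email"}, {"name": "booking_id"}],
        "rows": [
            {"values": [{"string_value": "alice@example.com"},
                        {"string_value": "B-1001"}]},
        ],
    }
}

response = client.deidentify_content(
    request={
        "parent": f"projects/{PROJECT_ID}/locations/global",
        "deidentify_config": deidentify_config,
        "item": item,
    }
)
print(response.item.table)  # de-identified rows, ready to load into BigQuery
```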

Discussion

13 comments
lipa31 (Option: B)
Jan 25, 2024

Format-preserving encryption (FPE) with FFX in Cloud DLP is a strong choice for de-identifying PII like email addresses. FPE maintains the format of the data and ensures that the same input results in the same encrypted output consistently. This means the email fields in both datasets can be encrypted to the same value, allowing for accurate joins in BigQuery while keeping the actual email addresses hidden.
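For illustration, once both files are loaded with the encrypted email values, the analysts' join could look like the following sketch with the google-cloud-bigquery client; the project, dataset, table, and column names (email_token, etc.) are made-up placeholders, not names from the question.

```python
# Sketch of an analyst-side join on the consistently encrypted email column.
# Dataset/table/column names (bookings, user_profiles, email_token) are
# illustrative placeholders.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT b.booking_id, p.signup_date
    FROM `my-project.analytics.bookings` AS b
    JOIN `my-project.analytics.user_profiles` AS p
      ON b.email_token = p.email_token  -- same plaintext -> same token
"""

for row in client.query(query).result():
    print(row.booking_id, row.signup_date)
```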

Smakyel79 (Option: B)
Jan 7, 2024

Given that it states "You need to de-identify the email field in both the datasets before loading them into BigQuery for analysts," data masking should not be an option, as the data would otherwise be stored unmasked in BigQuery.

task_7 (Option: B)
Jan 11, 2024

A wouldn't preserve the email format. For C and D, maskedReader roles still grant access to the underlying values. The only option is B.

alfguemat
Jan 12, 2024

I don't know why preserving the email format is necessary to perform the join. A could be valid.

dduenas
Feb 6, 2024

Masking only replaces values with specific surrogate characters, which makes the field non-unique and not suitable for joins.

raaad (Option: C)
Jan 5, 2024

- The reason option C works well is that dynamic data masking in BigQuery allows the underlying data to remain unaltered (thus preserving the ability to join on this field), while also preventing analysts from viewing the actual PII.
- The analysts can query and join the data as needed for their analysis, but when they access the data, the email field will be masked according to the policy tag, and they will only see the masked version.

Jordan18 (Option: B)
Jan 6, 2024

why not B?

ML6 (Option: B)
Feb 17, 2024

A) Masking replaces values with a surrogate character like # or *, so the output is not unique and joins cannot be applied. C and D) The question specifies de-identifying BEFORE loading into BQ, whereas these options perform dynamic masking IN BigQuery. Therefore, the only valid option is B.

JyoGCP (Option: B)
Feb 20, 2024

Option B https://cloud.google.com/sensitive-data-protection/docs/pseudonymization

Anudeep58 (Option: B)
Jun 14, 2024

Option A (masking): simple masking might not preserve the uniqueness and joinability of the email field, making it difficult to perform accurate joins between datasets. Options C and D (dynamic data masking): these mask the email field dynamically within BigQuery, which does not address the requirement to de-identify data before loading into BigQuery. Additionally, dynamic masking does not prevent access to the actual email data before it is loaded into BigQuery, potentially exposing PII during the data ingestion process.

scaenruy (Option: D)
Jan 3, 2024

D.
1. Load the CSV files from Cloud Storage into a BigQuery table, and enable dynamic data masking.
2. Create a policy tag with the default masking value as the data masking rule.
3. Assign the policy to the email field in both tables.
4. Assign the Identity and Access Management bigquerydatapolicy.maskedReader role for the BigQuery tables to the analysts.
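For context, the dynamic data masking setup that options C and D describe could be sketched roughly as below with the Data Catalog and BigQuery Data Policy Python clients. All resource names, the table, and the masking rule are placeholder assumptions, and note that with this approach the raw email values are still stored in BigQuery, which is the main objection raised against C and D in this thread.

```python
# Rough sketch of the dynamic data masking setup that options C/D describe.
# All resource names and the analyst principal are placeholders. The
# underlying (unmasked) emails remain stored in BigQuery with this approach.
from google.cloud import bigquery, datacatalog_v1, bigquery_datapolicies_v1

PROJECT = "my-project"  # assumption
LOCATION = "us"         # assumption

# 1. Create a taxonomy and a policy tag for the email column.
ptm = datacatalog_v1.PolicyTagManagerClient()
taxonomy = ptm.create_taxonomy(
    parent=f"projects/{PROJECT}/locations/{LOCATION}",
    taxonomy=datacatalog_v1.Taxonomy(
        display_name="pii",
        activated_policy_types=[
            datacatalog_v1.Taxonomy.PolicyType.FINE_GRAINED_ACCESS_CONTROL
        ],
    ),
)
email_tag = ptm.create_policy_tag(
    parent=taxonomy.name,
    policy_tag=datacatalog_v1.PolicyTag(display_name="email"),
)

# 2. Attach a masking rule (data policy) to the policy tag.
dps = bigquery_datapolicies_v1.DataPolicyServiceClient()
dps.create_data_policy(
    parent=f"projects/{PROJECT}/locations/{LOCATION}",
    data_policy=bigquery_datapolicies_v1.DataPolicy(
        data_policy_id="email_default_mask",
        data_policy_type=bigquery_datapolicies_v1.DataPolicy.DataPolicyType.DATA_MASKING_POLICY,
        policy_tag=email_tag.name,
        data_masking_policy=bigquery_datapolicies_v1.DataMaskingPolicy(
            predefined_expression=bigquery_datapolicies_v1.DataMaskingPolicy.PredefinedExpression.DEFAULT_MASKING_VALUE
        ),
    ),
)

# 3. Assign the policy tag to the email column of an existing table.
bq = bigquery.Client()
table = bq.get_table(f"{PROJECT}.analytics.bookings")  # placeholder table
table.schema = [
    bigquery.SchemaField(
        f.name, f.field_type, mode=f.mode,
        policy_tags=bigquery.PolicyTagList(names=[email_tag.name])
        if f.name == "email" else f.policy_tags,
    )
    for f in table.schema
]
bq.update_table(table, ["schema"])

# 4. Analysts then need roles/bigquerydatapolicy.maskedReader on the data
#    policy (grantable via its IAM policy) to query the masked column.
```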

GCP001 (Option: C)
Jan 9, 2024

Data masking in BigQuery with the email masking rule is correct. Ref: https://cloud.google.com/bigquery/docs/column-data-masking-intro

tibuenoc
Feb 7, 2024

It should be correct if they want to access tables, but it's not valid for datasets.

Sofiia98 (Option: C)
Jan 9, 2024

I will go for C, because there is a separate type of masking for emails, so why use the default? https://cloud.google.com/bigquery/docs/column-data-masking-intro#masking_options

Matt_108 (Option: C)
Jan 13, 2024

Option C. The need is just to mask the data for analysts, without modifying the underlying data. Moreover, it's stored in two separate tables and the analysts need to be able to perform joins based on the masked data. Dynamic masking is the right module, and the right masking rule is the email mask (https://cloud.google.com/bigquery/docs/column-data-masking-intro#masking_options), which guarantees the join capability.

chrissamharris (Option: B)
May 16, 2024

Format-preserving encryption with FFX is required, as the analysts want to perform JOINs.