
Professional Data Engineer Exam - Question 250


Your company's data platform ingests CSV file dumps of booking and user profile data from upstream sources into Cloud Storage. The data analyst team wants to join these datasets on the email field, which is available in both, to perform analysis. However, personally identifiable information (PII) should not be accessible to the analysts. You need to de-identify the email field in both datasets before loading them into BigQuery for the analysts. What should you do?

Correct Answer: B

To de-identify the email field in both datasets before loading them into BigQuery for analyst use, while still allowing joins on that field, format-preserving encryption (FPE-FFX) with Cloud DLP is the optimal solution. Because FPE-FFX encrypts deterministically, the same email address is consistently encrypted to the same unique value in both datasets, so the encrypted columns can still be joined while the actual email addresses remain hidden from analysts. This meets both the technical requirement of performing joins and the compliance requirement of protecting PII.
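As a rough illustration of what this pre-load step could look like, the sketch below uses the google-cloud-dlp Python client to apply FPE-FFX to the email column of a CSV-derived table. It is only a sketch under assumptions: the project ID, KMS key name, wrapped key bytes, custom alphabet, and the "email"/"booking_id" column names are placeholders, not values from the question.

```python
# Minimal sketch: pseudonymize the "email" column of a CSV-derived table with
# Cloud DLP format-preserving encryption (FPE-FFX) before loading into BigQuery.
# PROJECT_ID, KMS_KEY_NAME, WRAPPED_KEY, and column names are placeholders.
import google.cloud.dlp_v2 as dlp_v2

PROJECT_ID = "my-project"  # assumption
KMS_KEY_NAME = (           # assumption: KMS key used to wrap the DLP data key
    "projects/my-project/locations/global/keyRings/dlp/cryptoKeys/ffx-key"
)
WRAPPED_KEY = b"..."       # assumption: AES key wrapped with the KMS key above

client = dlp_v2.DlpServiceClient()

deidentify_config = {
    "record_transformations": {
        "field_transformations": [
            {
                "fields": [{"name": "email"}],
                "primitive_transformation": {
                    "crypto_replace_ffx_fpe_config": {
                        "crypto_key": {
                            "kms_wrapped": {
                                "wrapped_key": WRAPPED_KEY,
                                "crypto_key_name": KMS_KEY_NAME,
                            }
                        },
                        # The alphabet must cover every character the field can contain.
                        "custom_alphabet": "abcdefghijklmnopqrstuvwxyz0123456789@._-",
                    }
                },
            }
        ]
    }
}

# Each CSV chunk is sent as a DLP table; the same key yields the same token for
# the same email in both the booking and user profile files, so joins still work.
item = {
    "table": {
        "headers": [{"name": "email"}, {"name": "booking_id"}],
        "rows": [
            {"values": [{"string_value": "alice@example.com"},
                        {"string_value": "B-1001"}]},
        ],
    }
}

response = client.deidentify_content(
    request={
        "parent": f"projects/{PROJECT_ID}/locations/global",
        "deidentify_config": deidentify_config,
        "item": item,
    }
)
print(response.item.table)  # de-identified rows, ready to load into BigQuery
```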

Discussion

13 comments
lipa31 (Option: B)
Jan 25, 2024

Format-preserving encryption (FPE) with FFX in Cloud DLP is a strong choice for de-identifying PII like email addresses. FPE maintains the format of the data and ensures that the same input results in the same encrypted output consistently. This means the email fields in both datasets can be encrypted to the same value, allowing for accurate joins in BigQuery while keeping the actual email addresses hidden.
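For illustration, once both files are loaded with the encrypted email values, the analysts' join could look like the following sketch with the google-cloud-bigquery client; the project, dataset, table, and column names (email_token, etc.) are made-up placeholders, not names from the question.

```python
# Sketch of an analyst-side join on the consistently encrypted email column.
# Dataset/table/column names (bookings, user_profiles, email_token) are
# illustrative placeholders.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT b.booking_id, p.signup_date
    FROM `my-project.analytics.bookings` AS b
    JOIN `my-project.analytics.user_profiles` AS p
      ON b.email_token = p.email_token  -- same plaintext -> same token
"""

for row in client.query(query).result():
    print(row.booking_id, row.signup_date)
```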

Smakyel79 (Option: B)
Jan 7, 2024

Given that it states "You need to de-identify the email field in both the datasets before loading them into BigQuery for analysts," data masking should not be an option, as the data would otherwise be stored unmasked in BigQuery.

task_7 (Option: B)
Jan 11, 2024

A wouldn't preserve the email format. For C and D, maskedReader roles still grant access to the underlying values. The only option is B.

alfguemat
Jan 12, 2024

I don't know why preserving the email format is necessary to perform the join. A could be valid.

dduenas
Feb 6, 2024

Masking only replaces values with specific surrogate characters, which makes the field non-unique and not suitable for joins.

raaad (Option: C)
Jan 5, 2024

- The reason option C works well is that dynamic data masking in BigQuery allows the underlying data to remain unaltered (thus preserving the ability to join on this field), while also preventing analysts from viewing the actual PII.
- The analysts can query and join the data as needed for their analysis, but when they access the data, the email field will be masked according to the policy tag, and they will only see the masked version.

Jordan18 (Option: B)
Jan 6, 2024

why not B?

ML6 (Option: B)
Feb 17, 2024

A) Masking replaces values with a surrogate character like # or *, so the output is not unique and joins cannot be applied. C and D) The question specifies de-identifying BEFORE loading into BQ, whereas these options perform dynamic masking IN BigQuery. Therefore, the only valid option is B.

JyoGCP (Option: B)
Feb 20, 2024

Option B https://cloud.google.com/sensitive-data-protection/docs/pseudonymization

Anudeep58 (Option: B)
Jun 14, 2024

Option A (masking): simple masking might not preserve the uniqueness and joinability of the email field, making it difficult to perform accurate joins between datasets. Options C and D (dynamic data masking): these mask the email field dynamically within BigQuery, which does not address the requirement to de-identify data before loading into BigQuery. Additionally, dynamic masking does not prevent access to the actual email data before it is loaded into BigQuery, potentially exposing PII during the data ingestion process.

scaenruy (Option: D)
Jan 3, 2024

D.
1. Load the CSV files from Cloud Storage into a BigQuery table, and enable dynamic data masking.
2. Create a policy tag with the default masking value as the data masking rule.
3. Assign the policy to the email field in both tables.
4. Assign the Identity and Access Management bigquerydatapolicy.maskedReader role for the BigQuery tables to the analysts.
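For context, the dynamic data masking setup that options C and D describe could be sketched roughly as below with the Data Catalog and BigQuery Data Policy Python clients. All resource names, the table, and the masking rule are placeholder assumptions, and note that with this approach the raw email values are still stored in BigQuery, which is the main objection raised against C and D in this thread.

```python
# Rough sketch of the dynamic data masking setup that options C/D describe.
# All resource names and the analyst principal are placeholders. The
# underlying (unmasked) emails remain stored in BigQuery with this approach.
from google.cloud import bigquery, datacatalog_v1, bigquery_datapolicies_v1

PROJECT = "my-project"  # assumption
LOCATION = "us"         # assumption

# 1. Create a taxonomy and a policy tag for the email column.
ptm = datacatalog_v1.PolicyTagManagerClient()
taxonomy = ptm.create_taxonomy(
    parent=f"projects/{PROJECT}/locations/{LOCATION}",
    taxonomy=datacatalog_v1.Taxonomy(
        display_name="pii",
        activated_policy_types=[
            datacatalog_v1.Taxonomy.PolicyType.FINE_GRAINED_ACCESS_CONTROL
        ],
    ),
)
email_tag = ptm.create_policy_tag(
    parent=taxonomy.name,
    policy_tag=datacatalog_v1.PolicyTag(display_name="email"),
)

# 2. Attach a masking rule (data policy) to the policy tag.
dps = bigquery_datapolicies_v1.DataPolicyServiceClient()
dps.create_data_policy(
    parent=f"projects/{PROJECT}/locations/{LOCATION}",
    data_policy=bigquery_datapolicies_v1.DataPolicy(
        data_policy_id="email_default_mask",
        data_policy_type=bigquery_datapolicies_v1.DataPolicy.DataPolicyType.DATA_MASKING_POLICY,
        policy_tag=email_tag.name,
        data_masking_policy=bigquery_datapolicies_v1.DataMaskingPolicy(
            predefined_expression=bigquery_datapolicies_v1.DataMaskingPolicy.PredefinedExpression.DEFAULT_MASKING_VALUE
        ),
    ),
)

# 3. Assign the policy tag to the email column of an existing table.
bq = bigquery.Client()
table = bq.get_table(f"{PROJECT}.analytics.bookings")  # placeholder table
table.schema = [
    bigquery.SchemaField(
        f.name, f.field_type, mode=f.mode,
        policy_tags=bigquery.PolicyTagList(names=[email_tag.name])
        if f.name == "email" else f.policy_tags,
    )
    for f in table.schema
]
bq.update_table(table, ["schema"])

# 4. Analysts then need roles/bigquerydatapolicy.maskedReader on the data
#    policy (grantable via its IAM policy) to query the masked column.
```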

GCP001 (Option: C)
Jan 9, 2024

Data masking in BigQuery with the email masking rule is correct. Ref: https://cloud.google.com/bigquery/docs/column-data-masking-intro

tibuenoc
Feb 7, 2024

It should be correct if they want to access tables, but it's not valid for datasets.

Sofiia98 (Option: C)
Jan 9, 2024

I will go for C, because there is a separate type of masking for emails, so why use the default? https://cloud.google.com/bigquery/docs/column-data-masking-intro#masking_options

Matt_108 (Option: C)
Jan 13, 2024

Option C. The need is just to mask the data for analysts, without modifying the underlying data. Moreover, it's stored in two separate tables and the analysts need to be able to perform joins based on the masked data. Dynamic masking is the right module, and the right masking rule is the email mask (https://cloud.google.com/bigquery/docs/column-data-masking-intro#masking_options), which guarantees the join capability.

chrissamharris (Option: B)
May 16, 2024

Format-preserving encryption with FFX is required, as the analysts want to perform JOINs.