Associate Data Practitioner Exam Questions

Associate Data Practitioner Exam - Question 18


You are working with a large dataset of customer reviews stored in Cloud Storage. The dataset contains several inconsistencies, such as missing values, incorrect data types, and duplicate entries. You need to clean the data to ensure that it is accurate and consistent before using it for analysis. What should you do?

Correct Answer:

Discussion

2 comments
SaquibHerman Option: A
Feb 18, 2025

PythonOperator lets you use Python libraries (e.g., Pandas, PySpark) for robust data cleaning: handling missing values (e.g., imputation, filtering), fixing incorrect data types (e.g., string-to-date conversions), and removing duplicates (e.g., with deduplication logic).
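For illustration, here is a minimal sketch of the kind of Pandas function a PythonOperator task could call to do those three things. The column names (review_id, rating, review_date, review_text) and the sample rows are hypothetical, not taken from the question, and the Airflow wiring itself is left out.

```python
import pandas as pd


def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Clean a reviews table. All column names are hypothetical."""
    out = df.copy()
    # Fix incorrect data types: unparseable values become NaN/NaT instead of raising.
    out["rating"] = pd.to_numeric(out["rating"], errors="coerce")
    out["review_date"] = pd.to_datetime(out["review_date"], errors="coerce")
    # Handle missing values: impute empty text, then filter rows with no usable rating.
    out["review_text"] = out["review_text"].fillna("")
    out = out.dropna(subset=["rating"])
    # Remove duplicates: keep the first row seen for each review_id.
    out = out.drop_duplicates(subset=["review_id"], keep="first")
    return out.reset_index(drop=True)


# Made-up sample data: one duplicate, one bad rating, one missing text.
raw = pd.DataFrame(
    {
        "review_id": [1, 1, 2, 3],
        "rating": ["5", "5", "not a number", "3"],
        "review_date": ["2025-01-10", "2025-01-10", "2025-01-11", "2025-01-12"],
        "review_text": ["Great", "Great", "Bad", None],
    }
)
cleaned = clean_reviews(raw)
print(cleaned)
```

Whether to drop or impute a missing value depends on the analysis; the choices above are only one reasonable default.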

n2183712847 Option: B
Feb 27, 2025

The best option is B: batch load the data into BigQuery and use SQL for cleaning and analysis. This gives the best balance of efficiency and simplicity for cleaning large datasets before analysis, because BigQuery's scalable processing handles both the load and the transformation.

Option A (Cloud Composer + PythonOperator) adds workflow orchestration and external processing before the load, which is unnecessary complexity here. Option C (Storage Transfer Service + Cloud Run) adds extra data movement and event-driven functions, making it a less direct route to cleaning the data. Option D (Cloud Run functions) is less efficient for large-scale cleaning than BigQuery SQL's parallel processing, and it adds complexity before the data is in BigQuery for analysis.
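As a rough sketch of what the SQL side of option B could look like, the helper below assembles one BigQuery statement that casts types, fills missing text, and deduplicates. It assumes the reviews were first batch loaded into a raw table with STRING columns; the table names, column names, and date format are all hypothetical. The function only builds the SQL text, which would then be run in BigQuery (for example from the console or a client library).

```python
def build_cleaning_sql(source_table: str, target_table: str) -> str:
    """Return a BigQuery SQL statement. Table and column names are hypothetical."""
    return f"""
CREATE OR REPLACE TABLE `{target_table}` AS
SELECT
  review_id,
  SAFE_CAST(rating AS INT64) AS rating_int,                -- bad values become NULL
  SAFE.PARSE_DATE('%Y-%m-%d', review_date) AS review_dt,   -- string-to-date conversion
  COALESCE(review_text, '') AS review_text                 -- fill missing text
FROM `{source_table}`
WHERE review_id IS NOT NULL                                -- drop rows without a key
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY review_id ORDER BY review_date DESC
) = 1                                                      -- keep one row per review_id
""".strip()


sql = build_cleaning_sql(
    "my_project.reviews.raw_reviews",    # hypothetical raw table
    "my_project.reviews.clean_reviews",  # hypothetical cleaned table
)
print(sql)
```

Writing the result to a separate table keeps the raw load intact, so the cleaning rules can be revised and rerun without reloading from Cloud Storage.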