Professional Data Engineer Exam - Question 243


You are preparing data that your machine learning team will use to train a model using BigQuery ML. They want to predict the price per square foot of real estate. The training data has a column for the price and a column for the number of square feet. Another feature column called ‘feature1’ contains null values due to missing data. You want to replace the nulls with zeros to keep more data points. Which query should you use?

Correct Answer: A

Option A replaces the nulls in ‘feature1’ with zeros while keeping every other column. It selects all columns except ‘feature1’ and adds a new column, ‘feature1_cleaned’, in which the nulls are replaced by zeros. The price and square-feet columns needed to predict the price per square foot are retained, and the query performs no extra calculations or column exclusions.
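
A query of the shape described for option A might look like the sketch below. The option text itself is not reproduced on this page, so this is a reconstruction from the explanation; the table name `mydataset.training_data` is hypothetical.

```sql
-- Sketch only: reconstructed from the explanation, not the verbatim option A.
-- `mydataset.training_data` is a hypothetical table name.
SELECT
  * EXCEPT (feature1),                      -- keep every column other than feature1
  IFNULL(feature1, 0) AS feature1_cleaned   -- replace NULLs in feature1 with 0
FROM
  `mydataset.training_data`;
```

Because `* EXCEPT (feature1)` drops only the original ‘feature1’ column, price and square_feet still appear in the result.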

Discussion

12 comments
oleg25
Feb 17, 2024

I don't get why the task mentions the price and square feet columns. Just to irritate us? Do we need to do something with these columns, or only with the feature1 column?

d11379b
Mar 24, 2024

I think they just want us to build a “label” (target) column ourselves since there’s no direct value in the training set

d11379b
Mar 24, 2024

But I still prefer to choose A since the square_feet column itself may have influence on price, which shouldn’t be removed

raaad Option: A
Jan 4, 2024

Straightforward

datapassionate Option: C
Jan 16, 2024

The correct answer is C. It both replaces NULL with 0 and passes the price per square foot of real estate.

George_Zhu
Feb 13, 2024

Option C isn't good practice. If the square_feet column contains any 0 values, price / 0 will raise a division-by-zero error. Something like IF(IFNULL(square_feet, 0) = 0, 0, price / square_feet) would guard against that.
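
The guard suggested in this comment can be sketched in BigQuery SQL as follows. The table name `mydataset.training_data` is hypothetical, and the second column uses `SAFE_DIVIDE`, a swapped-in alternative that the comment does not mention; it returns NULL instead of an error when the divisor is 0.

```sql
-- Sketch only; `mydataset.training_data` is a hypothetical table name.
SELECT
  -- Guard from the comment: yields 0 when square_feet is NULL or 0.
  IF(IFNULL(square_feet, 0) = 0, 0, price / square_feet) AS price_per_sqft_zero,
  -- Alternative (not from the comment): yields NULL when square_feet is 0.
  SAFE_DIVIDE(price, square_feet) AS price_per_sqft_null
FROM
  `mydataset.training_data`;
```

Whether 0 or NULL is the better placeholder is a modeling choice: a label of 0 is treated as a real price per square foot, whereas a NULL marks the value as missing.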

52ed0e5 Option: A
Mar 11, 2024

Option A is the correct choice because it retains all the original columns and specifically addresses the issue of null values in ‘feature1’ by replacing them with zeros, without altering any other columns or performing unnecessary calculations. This makes the data ready for use in BigQueryML without losing any important information. Option C is not the best choice because it includes the EXCEPT clause for the price and square_feet columns, which would exclude these columns from the results. This is not desirable since you need these columns for the machine learning model to predict the price per square foot

JyoGCP Option: A
Feb 20, 2024

Option A

demoro86 Option: A
Feb 28, 2024

C is not a valid answer. It introduces a redundant variable, which could be acceptable, but it removes from the dataset two variables that directly influence the predictions you are trying to make.

Matt_108 Option: A
Jan 13, 2024

option A clearly

cuadradobertolinisebastiancami Option: C
Feb 23, 2024

It should be C. "They want to predict the price per square foot of real estate. The training data has a column for the price and a column for the number of square feet." You need to create the column the model is going to predict.

PetrSz Option: C
Feb 25, 2024

Option C not only handles the null values in feature1 by replacing them with zeros (using IFNULL(feature1, 0) as feature1_cleaned), but it also creates a new feature price_per_sqft by dividing the price by the number of square feet (price/square_feet as price_per_sqft). This new feature directly corresponds to what your team wants to predict (the price per square foot of real estate), and could therefore be very useful for the machine learning model.
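
For comparison, a query of the shape attributed to option C in this thread might look like the sketch below. It is pieced together from the expressions quoted in the comments (the `EXCEPT` on price and square_feet, `IFNULL(feature1, 0)`, and `price/square_feet`), not copied from the option itself; the table name `mydataset.training_data` is hypothetical.

```sql
-- Sketch only: assembled from the comments, not the verbatim option C.
-- `mydataset.training_data` is a hypothetical table name.
SELECT
  * EXCEPT (price, square_feet, feature1),  -- drops the two source columns and feature1
  price / square_feet AS price_per_sqft,    -- derived target; errors if square_feet is 0
  IFNULL(feature1, 0) AS feature1_cleaned   -- replace NULLs in feature1 with 0
FROM
  `mydataset.training_data`;
```

This is the trade-off the thread is debating: the result contains the derived price_per_sqft column but no longer contains price or square_feet.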

srinidutt
Apr 29, 2024

EXCEPT means it won't select that column.

47767f9 Option: C
Jul 3, 2024

Source: Claude 3.5 and GPT-4o. In theory it is better to keep fewer features, so price_per_sqft and feature1_cleaned is the best option.