While conducting an exploratory analysis of a dataset, you discover that categorical feature A has substantial predictive power, but it is sometimes missing. What should you do?
When dealing with a categorical feature that has substantial predictive power but contains missing values, the best approach is to add an additional class to the categorical feature for the missing values and create a new binary feature indicating whether the original feature is missing. This approach has several advantages: it preserves the predictive power of the original feature without discarding any valuable information, it explicitly captures the absence of data which can be informative, and it allows the model to recognize when and how to leverage the original feature depending on its availability. Simply dropping the feature or replacing missing values with the mode could either lose valuable information or introduce bias. Creating a binary indicator for missing values provides the model with flexibility to learn the appropriate use of the feature.
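To make this concrete, here is a minimal sketch assuming a pandas DataFrame with an illustrative categorical column "A" (the column and category values are made up for the example):

```python
import pandas as pd

# Illustrative data: a categorical feature with some missing values.
df = pd.DataFrame({"A": ["red", "blue", None, "red", None]})

# New binary feature: 1 where feature A is missing, 0 otherwise.
df["A_is_missing"] = df["A"].isna().astype(int)

# Additional class: replace NaN with an explicit "Missing" category,
# so no NaNs remain in the feature after the transformation.
df["A"] = df["A"].fillna("Missing")

print(df)
```

After the fillna step no NaNs remain in feature A, so downstream encoders (one-hot, embeddings, etc.) can treat "Missing" as an ordinary level.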
ans: D. A => no, you don't want to drop a feature with high predictive power. B => I think this could confuse the model. A better solution could be to fill missing values using an algorithm like Expectation-Maximization, but using the mode, I think, is a bad idea here: if you have a significant number of missing values (for example, >10%), this would distort the feature's distribution and hurt its predictive power. You don't want to lose the predictive power of a feature; you just want to guide the model to learn when to use that feature and when to ignore it. C => this doesn't make any sense to me; not sure why I would do that. D => I think this could be a really good approach, and I'm pretty sure it would work well with a lot of models. The model would learn that when "is_available_feat_A" == True it should use feature A, and whenever feature A is missing it should fall back on other features.
I guess I would go with D, but what confuses me is that option D doesn't say the NaN values are replaced (only that a new column is added), and leaving NaNs in could lead to problems like exploding gradients. Plus, Google encourages replacing missing values. https://developers.google.com/machine-learning/testing-debugging/common/data-errors Any thoughts on this?
B "For categorical variables, we can usually replace missing values with mean, median, or most frequent values" Dr. Logan Song - Journey to Become a Google Cloud Machine Learning Engineer - Page 48
While this approach may seem reasonable, it can introduce bias in the dataset by over-representing the mode, especially if the missing values are not missing at random.
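A quick toy illustration of that over-representation (the numbers are made up; assumes pandas):

```python
import pandas as pd

# Toy data: 2 of 5 values are missing, and the mode is "red".
s = pd.Series(["red", "red", "blue", None, None])

# Among observed values: red ~0.67, blue ~0.33.
print(s.value_counts(normalize=True))

# After mode imputation: red 0.80, blue 0.20 — the mode is inflated.
print(s.fillna(s.mode()[0]).value_counts(normalize=True))
```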
Mode is the way to go for categorical features. B, for me.
B. Because the important feature is already known. By using the mode, the contribution of other features will not be missed.
I think it's D. Option B, imputing the missing values of feature A with the mode, could be a reasonable approach if the mode is a good representation of the distribution of feature A. However, this method may lead to biased results if the mode is not representative of the missing values, which can happen when the missing values have a different distribution than the observed values.

When a categorical feature has substantial predictive power, it is important not to discard it. Instead, missing values can be handled by adding an additional class for missing values and creating a new binary feature that indicates whether feature A is missing. This retains the predictive power of feature A while accounting for the missingness. Replacing missing values with the mode of feature A may distort the feature's distribution and bias the analysis, and replacing them with values from another feature may introduce noise and lead to incorrect results.
Both B and D are possible, but the correct answer is D because of the feature's high predictive power.
By creating a new class for the missing values, you explicitly capture the absence of data, which can provide valuable information for predictive modeling. Additionally, creating a binary feature allows the model to distinguish between cases where feature A is present and cases where it is missing, which can be useful for identifying potential patterns or relationships in the data.
Definitely not D; it does not even solve the problem of NA values.
If our objective were to produce a complete dataset, then we might use some average value to fill in the gaps (option B), but in this case we want to predict an outcome, so inventing our own data is not going to help, in my view. Option D is the most sensible approach: it lets the model choose the best features.
Went with D
By imputing the missing values with the mode (the most frequent value), you retain the original feature's predictive power while handling the missing values
Agree with wish0035, answer should be D
Options B or D. But isn't there an inconsistency in option D? If you replace missing values with a new category ("missing"), why would you have to create an extra feature?
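One way to look at that apparent redundancy (a sketch, assuming pandas and one-hot encoding; the column names are illustrative):

```python
import pandas as pd

# After filling NaNs with a "Missing" class, the one-hot dummy for that
# class is identical to a separately built is-missing indicator.
df = pd.DataFrame({"A": ["red", None, "blue", None]})
df["A_is_missing"] = df["A"].isna().astype(int)
df["A"] = df["A"].fillna("Missing")

dummies = pd.get_dummies(df["A"])
print((dummies["Missing"].astype(int) == df["A_is_missing"]).all())  # True
```

So with one-hot encoding the extra flag does duplicate the "Missing" dummy; whether it adds anything depends on how feature A is encoded downstream (ordinal codes, embeddings, etc.), and the option likely includes both steps to stay encoder-agnostic.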
highly predictive
B and D are correct, but I decided to go with D.
Google encourages filling missing values, and using the mode is one of the examples given. D only tells us the obvious: the data is missing!
D) Good approach