MLS-C01 Exam - Question 34

Question

A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age.

Initial models have performed poorly. While reviewing the underlying data, the Data Scientist notices that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population.

How should the Data Scientist correct this issue?

Examice · Accepted Answer

The best approach to handle the issue of invalid age entries is to replace those entries with the mean or median value from the dataset. This method ensures that the dataset size remains intact and helps maintain the distribution of the age feature. Dropping the records would result in a loss of valuable data, and dropping the age feature altogether could negatively impact the model's performance since age is an important factor. Using k-means clustering to handle missing features is not appropriate in this situation as it is generally used for grouping data rather than imputing missing values.
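A minimal sketch of the imputation described above, using only the Python standard library. It assumes the 0 entries are treated as missing and excluded before the median is computed, since including them would pull the statistic down. The function name `impute_zero_ages` and the toy ages are hypothetical and not part of the question.

```python
from statistics import median

def impute_zero_ages(ages, invalid=0):
    """Replace invalid age entries with the median of the valid ones."""
    # Compute the median from valid ages only, so the zeros do not skew it.
    valid = [a for a in ages if a != invalid]
    if not valid:
        raise ValueError("no valid ages to compute a median from")
    fill = median(valid)
    return [fill if a == invalid else a for a in ages]

# Hypothetical toy sample; the real dataset has 4,000 rows, 450 of them with age 0.
ages = [67, 0, 72, 81, 0, 69, 75]
imputed = impute_zero_ages(ages)
print(imputed)
```

Every imputed row receives the same value here, which is what keeps the approach simple and also what the comments below criticize as a possible source of bias.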

rajs · Answer

Dropping the age feature is NOT AT ALL a good idea, as age plays a critical role in this disease according to the question.

Dropping roughly 11% of the data (450 of 4,000 records) is NOT a good idea considering that the number of observations is already low.

The mean or median are potential solutions.

But the question says the disease "is known to worsen with age", so there is likely a correlation between age and the other symptom-related features. That means that with unsupervised learning we can make a pretty good prediction of age.

So the answer is D: use k-means clustering.

vetal · Answer

Replacing the age with the mean or median might introduce bias into the dataset. Using k-means clustering to estimate the missing age from the other features might give better results. Removing 10% of the available data looks odd. Why not D?
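One possible reading of option D, sketched under stated assumptions: cluster the patients on their other features, then fill each invalid age with the mean valid age of its cluster. The question does not specify how k-means would be applied, so this is an illustration only. The helper names, the choice of k = 2, the first-k-points initialization, and the toy data are all hypothetical, and real features would normally be scaled first.

```python
from statistics import mean

def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; initial centroids are simply the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [mean(dim) for dim in zip(*members)]
    return labels

def impute_age_by_cluster(features, ages, k=2, invalid=0):
    """Fill invalid ages with the mean valid age of the row's cluster."""
    labels = kmeans(features, k)
    valid_all = [a for a in ages if a != invalid]
    filled = []
    for label, age in zip(labels, ages):
        if age != invalid:
            filled.append(age)
            continue
        peers = [a for l, a in zip(labels, ages) if l == label and a != invalid]
        # Fall back to the overall mean if the cluster has no valid ages.
        filled.append(mean(peers) if peers else mean(valid_all))
    return filled

# Hypothetical toy data: two made-up clinical features per patient.
features = [[1.0, 1.1], [5.0, 5.2], [1.2, 0.9], [5.1, 4.8], [0.9, 1.0], [4.9, 5.0]]
ages = [66, 84, 68, 0, 0, 86]
filled = impute_age_by_cluster(features, ages, k=2)
print(filled)
```

Age is deliberately left out of the clustering features, since it is the value being estimated.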

geoan13 · Answer

B is correct. K-means is unsupervised and used mainly for clustering. KNN would have been more accurate, since it can be used to predict a value. Since KNN is not among the options, I think the answer is the mean or median value.

jyrajan69 · Answer

How can it be D when there is a labelled outcome? That means this is supervised, and k-means is unsupervised. So the only possible answer should be B.

elvin_ml_qayiran25091992razor · Answer

B is correct, or KNN, but not k-means.

endeesa · Answer

Obviously B, why would you use a clustering algorithm to predict a value? D just doesn't make sense

nilmans · Answer

B is correct, K-NN could have helped instead of k-means
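For comparison, the k-NN alternative mentioned in several comments can be sketched as follows: each invalid age is replaced by the mean age of the k nearest patients (by the other features) that have a valid age. This is not one of the exam options. The function name, k = 3, and the single made-up feature are assumptions for illustration only.

```python
from statistics import mean

def knn_impute_age(features, ages, k=3, invalid=0):
    """Fill each invalid age with the mean age of the k nearest rows that have a valid age."""
    # Only rows with a valid age can act as donors.
    donors = [(f, a) for f, a in zip(features, ages) if a != invalid]
    filled = []
    for f, a in zip(features, ages):
        if a != invalid:
            filled.append(a)
            continue
        # Sort donors by squared Euclidean distance to the current row.
        nearest = sorted(
            donors,
            key=lambda d: sum((x - y) ** 2 for x, y in zip(f, d[0])),
        )[:k]
        filled.append(mean([age for _, age in nearest]))
    return filled

# Hypothetical toy data: one made-up feature per patient, last row has age 0.
features = [[1.0], [1.1], [0.9], [5.0], [5.2], [1.05]]
ages = [66, 68, 67, 85, 87, 0]
filled = knn_impute_age(features, ages, k=3)
print(filled)
```

In practice a library implementation such as scikit-learn's `KNNImputer` would be the usual choice; the hand-written version is only meant to show the idea.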

3eb0542 · Answer

Using k-means clustering to handle missing features is not directly applicable to this scenario. K-means clustering is a method for grouping data points into clusters based on similarity, and it's not typically used for imputing missing values.

kaike_reis · Answer

Both A and B are correct. But I noted that the end of the question mentions that all the other features look normal, so it is reasonable to do this simple imputation.

FloKo · Answer

k-means should give the best estimation of the age. Using mean would reduce the correlation between outcome and age for the model.

loict · Answer

A. NO - unless we want to lose 10% of the data
B. NO - age is predictive, so using the mean would introduce a bias
C. NO - age is predictive
D. YES - better quality than B; it is likely that other physiological values can help predict the age

Topg4u · Answer

mean or median is for outliers so D

kyuhuck · Answer

The best way to handle the missing values in the patient age feature is to replace them with the mean or median value from the dataset. This is a common technique for imputing missing values that preserves the overall distribution of the data and avoids introducing bias or reducing the sample size. Dropping the records or the feature would result in losing valuable information and reducing the accuracy of the model. Using k-means clustering would not be appropriate for handling missing values in a single feature, as it is a method for grouping similar data points based on multiple

kyuhuck · Answer

Why B? Replacing the age field value for records with a value of 0 with the mean or median value from the dataset is generally the best approach among the given options. It preserves the dataset size and makes use of the remaining correct data points, assuming age is a crucial predictor in this context. However, it is vital to perform this imputation carefully to avoid introducing bias. The median is often preferred in this scenario to mitigate the impact of outliers.

rookiee1111 · Answer

The question tries to mislead by adding information around the feature correlation. K-means clustering is not meant for imputing data. Hence answer should be B, that would be the right way of handling the missing value.

pn12345 · Answer

B-chatgpt

imymoco · Answer

B. Replace the age field value for records with a value of 0 with the mean or median value from the dataset: This method allows for retaining all patient records while addressing the anomaly. It is a standard approach for dealing with missing or incorrect values in a way that preserves the integrity of the dataset.

B. GPT answer
