Exam MLS-C01
Question 34

A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age.

Initial models have performed poorly. While reviewing the underlying data, the Data Scientist notices that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population.

How should the Data Scientist correct this issue?

    Correct Answer: B

    The best approach to handle the invalid age entries is to replace them with the mean or median of the valid ages in the dataset. This keeps the dataset size intact and leaves the central tendency of the age feature roughly unchanged, although it does shrink the feature's variance somewhat. Dropping the records would lose valuable data, and dropping the age feature altogether could hurt the model's performance, since age is an important factor. Using k-means clustering to handle missing features is not appropriate in this situation, as it is generally used for grouping data rather than imputing missing values.
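
As a rough illustration of option B, here is a minimal sketch of median imputation using only the Python standard library. The function name `impute_invalid_ages` and the sample ages are hypothetical, invented for this example; the one thing taken from the question is that an age of 0 marks an invalid entry. Note that the median is computed from the valid ages only, so the zeros do not drag it down.

```python
from statistics import median

def impute_invalid_ages(ages, invalid_value=0):
    """Replace invalid age entries with the median of the valid ones."""
    # Compute the median from valid entries only, never from the zeros.
    valid = [a for a in ages if a != invalid_value]
    if not valid:
        raise ValueError("no valid ages to compute a median from")
    fill = median(valid)
    return [fill if a == invalid_value else a for a in ages]

# Hypothetical sample: real patients are over 65, zeros are entry errors.
ages = [72, 0, 68, 81, 0, 77, 69]
imputed = impute_invalid_ages(ages)
print(imputed)
```

With pandas or scikit-learn the same idea is usually a one-liner, but the logic is the same: mark the zeros as missing, then fill them from the valid values.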

Discussion
rajs

Dropping the Age feature is NOT AT ALL a good idea, as age plays a critical role in this disease per the question. Dropping about 10% of the data is NOT a good idea either, considering that the number of observations is already low. The mean or median is a potential solution. But the question says that "Disease worsens after age 65 so there is a correlation between age and other symptoms related feature". That means that, using unsupervised learning, we can make a pretty good prediction of "Age". So the answer is D: use k-means clustering.

L2007

https://www.displayr.com/5-ways-deal-missing-data-cluster-analysis/ B is correct

vetal

Replacing the age with the mean or median might bring a bias into the dataset. Using k-means clustering to estimate the missing age based on other features might give better results. Removing 10% of the available data looks odd. Why not D?

geoan13

B is correct. K-means is unsupervised and used mainly for clustering. KNN would have been more accurate, since it can be used to predict a value. Since KNN is not among the options, I think the answer is the mean or median value.
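
For readers wondering what the KNN alternative mentioned here would look like, below is a minimal sketch under stated assumptions. KNN is not one of the answer options, so this is only an illustration of the idea. The function `knn_impute_age`, the record layout `(age, features)`, and the sample values are all hypothetical, and the features are assumed to be numeric and already on a comparable scale.

```python
import math

def knn_impute_age(records, k=3, invalid_age=0):
    """Fill invalid ages with the mean age of the k nearest valid records.

    Each record is (age, features), where features is a tuple of numbers.
    Distance is Euclidean over the features only, never over age.
    """
    valid = [(age, feats) for age, feats in records if age != invalid_age]
    if len(valid) < k:
        raise ValueError("not enough valid records for k neighbours")
    result = []
    for age, feats in records:
        if age != invalid_age:
            result.append((age, feats))
            continue
        # Rank valid records by feature distance and keep the k closest.
        nearest = sorted(valid, key=lambda r: math.dist(r[1], feats))[:k]
        estimate = sum(a for a, _ in nearest) / k
        result.append((estimate, feats))
    return result

# Hypothetical data: two scaled features per patient, age 0 means invalid.
records = [
    (66, (0.10, 0.20)), (68, (0.20, 0.10)), (70, (0.15, 0.25)),
    (85, (0.90, 0.80)), (88, (0.80, 0.90)), (90, (0.85, 0.95)),
    (0, (0.82, 0.88)), (0, (0.12, 0.18)),
]
filled = knn_impute_age(records, k=3)
print([round(age, 1) for age, _ in filled])
```

Unlike k-means, which only assigns each point to a cluster, this estimates a value per record from its neighbours. scikit-learn ships a `KNNImputer` class for the same purpose, which is likely the more practical choice on real data.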

endeesa - Option: B

Obviously B, why would you use a clustering algorithm to predict a value? D just doesn't make sense

elvin_ml_qayiran25091992razor - Option: B

B is correct, or KNN, but not k-means.

jyrajan69

How can it be D when there is a labelled outcome? That means this is supervised, and k-means is for unsupervised learning. So the only possible answer should be B.

3eb0542 - Option: B

Using k-means clustering to handle missing features is not directly applicable to this scenario. K-means clustering is a method for grouping data points into clusters based on similarity, and it's not typically used for imputing missing values.

nilmans

B is correct, K-NN could have helped instead of k-means

imymoco

B. Replace the age field value for records with a value of 0 with the mean or median value from the dataset: This method allows for retaining all patient records while addressing the anomaly. It is a standard approach for dealing with missing or incorrect values in a way that preserves the integrity of the dataset. B. GPT answer

pn12345

B-chatgpt

rookiee1111

The question tries to mislead by adding information around the feature correlation. K-means clustering is not meant for imputing data. Hence answer should be B, that would be the right way of handling the missing value.

kyuhuck - Option: B

Why B? Replacing the age field value for records with a value of 0 with the mean or median value from the dataset is generally the best approach among the given options. It preserves the dataset size and makes use of the remaining correct data points, assuming age is a crucial predictor in this context. However, it's vital to perform this imputation carefully to avoid introducing bias. The median is often preferred in this scenario to mitigate the impact of outliers.

kyuhuck - Option: B

The best way to handle the missing values in the patient age feature is to replace them with the mean or median value from the dataset. This is a common technique for imputing missing values that preserves the overall distribution of the data and avoids introducing bias or reducing the sample size. Dropping the records or the feature would result in losing valuable information and reducing the accuracy of the model. Using k-means clustering would not be appropriate for handling missing values in a single feature, as it is a method for grouping similar data points based on multiple

Topg4u

mean or median is for outliers so D

loict - Option: D

A. NO - unless we want to lose 10% of the data
B. NO - age is predictive, so using the mean would introduce a bias
C. NO - age is predictive
D. YES - better quality than B; it is likely that other physiological values can help predict the age

FloKo - Option: D

k-means should give the best estimation of the age. Using mean would reduce the correlation between outcome and age for the model.
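
The claim that mean imputation weakens the age-outcome correlation can be checked with a small simulation. This is only a sketch on synthetic data: the uniform age range, the linear outcome model, the noise level, and the random choice of which 450 records are corrupted are all assumptions made for illustration, not properties of the study in the question.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

random.seed(0)
n = 4000

# Synthetic assumption: ages over 65, outcome rises linearly with age.
true_ages = [random.uniform(65, 95) for _ in range(n)]
outcomes = [0.5 * a + random.gauss(0, 3) for a in true_ages]

# Corrupt 450 records at random, then fill them with the valid-age mean.
corrupted = set(random.sample(range(n), 450))
valid_mean = mean(a for i, a in enumerate(true_ages) if i not in corrupted)
imputed_ages = [valid_mean if i in corrupted else a
                for i, a in enumerate(true_ages)]

r_true = pearson(true_ages, outcomes)
r_imputed = pearson(imputed_ages, outcomes)
print(f"true ages: r = {r_true:.3f}, mean-imputed: r = {r_imputed:.3f}")
```

Under these assumptions the imputed correlation comes out lower than the true one, but only modestly. That supports the direction of the concern about mean imputation while also suggesting the effect may be small when about 11% of one feature is affected.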

kaike_reis - Option: B

Both A and B are correct. But I noted that the end of the question mentions that all other features are okay, so it is reasonable to do this simple imputation.