Professional Machine Learning Engineer Exam - Question 48

Question

You started working on a classification problem with time series data and achieved an area under the receiver operating characteristic curve (AUC ROC) value of

99% for training data after just a few experiments. You haven't explored using any sophisticated algorithms or spent any time on hyperparameter tuning. What should your next step be to identify and fix the problem?

Examice · Accepted Answer

When you achieve an exceptionally high performance metric on your training data without using sophisticated algorithms or hyperparameter tuning, it often indicates a data leakage issue. Data leakage occurs when information that would not be available in a real-world scenario is inadvertently included in the training data, leading to unrealistically high model performance. In time series data specifically, it is crucial to ensure future information is not used to predict past events. One common cause of data leakage is the inclusion of features that are highly correlated with the target value because these features might indirectly contain information from the target variable itself. Therefore, addressing data leakage by removing features highly correlated with the target value is a necessary step to ensure the validity of your model's performance. This involves a careful examination of your features to determine if they are providing information that would not be available at the prediction time in a real-world scenario.

Paul_Dirac · Answer

Ans: B (Ref: https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9)
(C) High correlation doesn't mean leakage. The question may suggest target leakage and the defining point of this leakage is the availability of data after the target is available.(https://www.kaggle.com/dansbecker/data-leakage)

John_Pongthorn · Answer

C: this is correct choice 1000000000%
This is data leakage issue on training data
https://cloud.google.com/automl-tables/docs/train#analyze
The question is from this content.
If a column's Correlation with Target value is high, make sure that is expected, and not an indication of target leakage.

Let 's explain on my owner way, sometime  the feature  used on training data use value to calculate something from  target value unintentionally,  it result in high correlation with each other. 
for instance , you predict stock price by using moving average, MACD , RSI  despite the fact that 3 features have been calculated from price (target).

hiromi · Answer

B
I agree with Paul_Dirac

tavva_prudhvi · Answer

Option C is a good step to avoid overfitting, but it's not necessarily the best approach to address data leakage.

Data leakage occurs when information from the validation or test data leaks into the training data, leading to overly optimistic performance metrics. In time-series data, it's important to avoid using future information to predict past events.

Removing features highly correlated with the target value may help to reduce overfitting, but it does not necessarily address data leakage.

Therefore, applying nested cross-validation during model training is a better approach to address data leakage in this scenario.

AnnaR · Answer

B: correct. 
considering c, but why should we remove a feature of highly predictive nature?? for me, this does not explain the problem of overfitting... a highly predictive feature is also useful for good performance evaluated on the test set. 
--> Decide for B!

M25 · Answer

Went with B

black_scissors · Answer

There can be a feature causing data leakage which might have been overlooked. In addition, when cross-validation is done randomly, the leakage can be even bigger.

Liting · Answer

Agree with Paul_Dirac. Also it is recommended to use nested-cross-validation to avoid data leakage in time series data.

Jarek7 · Answer

https://towardsdatascience.com/avoiding-data-leakage-in-timeseries-101-25ea13fcb15f
Directly says: "Dive straight into the MVP, cross-validate later!" 
MVP stands for Minimum Viable Product

atlas_lyon · Answer

B: If splits are done chronologically(as it is always advised), Nested CV should work
C: High correlation with target means we have to check if this is strong explanatory power or data leakage. dropping the features won't help us distinguish in those cases but may help reveal independence contribution of remaining features

pico · Answer

Option A: This option is a reasonable choice. Switching to a less complex algorithm can help reduce overfitting, and using k-fold cross-validation can provide a better estimate of how well the model will generalize to unseen data. It's essential to ensure that the high performance isn't solely due to overfitting.

Sum_Sum · Answer

I think its B. GPT4 makes a good argument about C:
 While this is a valid approach to handling data leakage, it might not be sufficient if the leakage is due to reasons other than high correlation, such as temporal leakage in time-series data.

b1a8fae · Answer

I initially went with B- however after reading this: https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/ I think C is right. Quoted from the link: "Nested cross-validation is an approach to model hyperparameter optimization and model selection that attempts to overcome the problem of overfitting the training dataset.". Overfitting is exactly our problem here. Correlated features in the dataset may be a sign of data leakage, but they are not necessarily.

gscharly · Answer

agree with Paul_Dirac

girgu · Answer

Nested cross validation will not work for time series data. Time series data require the expanding widow training data set. Seems most likely the issue is high correlation in columns.

PhilipKoku · Answer

C) Is the best answer

chirag2506 · Answer

B is the correct option

Professional Machine Learning Engineer Exam - Question 48

Discussion