Professional Machine Learning Engineer Exam Questions

Professional Machine Learning Engineer Exam - Question 48


You started working on a classification problem with time series data and achieved an area under the receiver operating characteristic curve (AUC ROC) value of 99% for training data after just a few experiments. You haven't explored using any sophisticated algorithms or spent any time on hyperparameter tuning. What should your next step be to identify and fix the problem?

A. Address the model overfitting by using a less complex algorithm and use k-fold cross-validation.
B. Address data leakage by applying nested cross-validation during model training.
C. Address data leakage by removing features highly correlated with the target value.
D. Address the model overfitting by tuning the hyperparameters to reduce the AUC ROC value.

Correct Answer: C

When you achieve an exceptionally high performance metric on your training data without using sophisticated algorithms or hyperparameter tuning, it often indicates a data leakage issue. Data leakage occurs when information that would not be available in a real-world scenario is inadvertently included in the training data, leading to unrealistically high model performance. In time series data specifically, it is crucial to ensure future information is not used to predict past events. One common cause of data leakage is the inclusion of features that are highly correlated with the target value because these features might indirectly contain information from the target variable itself. Therefore, addressing data leakage by removing features highly correlated with the target value is a necessary step to ensure the validity of your model's performance. This involves a careful examination of your features to determine if they are providing information that would not be available at the prediction time in a real-world scenario.
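A quick way to run the audit described above is to rank every feature by its absolute correlation with the target and review anything suspiciously high by hand. Below is a minimal sketch in Python with pandas, assuming a DataFrame named df with a numeric target column named "target" (both names are illustrative, not from the question):

import pandas as pd

def flag_suspicious_features(df: pd.DataFrame, target: str = "target",
                             threshold: float = 0.95) -> pd.Series:
    # Features whose |Pearson correlation| with the target exceeds the
    # threshold are candidates for a manual target-leakage review.
    features = df.drop(columns=[target]).select_dtypes("number")
    corr = features.corrwith(df[target]).abs()
    return corr[corr > threshold].sort_values(ascending=False)

# For each flagged feature, ask: would this value really be available at
# prediction time, or is it (directly or indirectly) derived from the target?
# suspects = flag_suspicious_features(df)

Note that a high correlation alone is not proof of leakage (several commenters below make exactly this point); the screen only tells you where to look.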

Discussion

17 comments
Paul_Dirac (Option: B)
Jun 27, 2021

Ans: B (Ref: https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9). On (C): high correlation doesn't mean leakage. The question may suggest target leakage, and the defining trait of that leakage is that the feature's data only becomes available after the target is available (https://www.kaggle.com/dansbecker/data-leakage).
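For readers unfamiliar with the reference, here is a minimal sketch of nested cross-validation with chronological splits in scikit-learn. The data, estimator, and parameter grid are placeholders, not part of the question:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # stand-in features, time-ordered
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # stand-in labels

inner_cv = TimeSeriesSplit(n_splits=3)  # inner loop: hyperparameter search
outer_cv = TimeSeriesSplit(n_splits=5)  # outer loop: performance estimate

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None]},
    scoring="roc_auc",
    cv=inner_cv,
)

# Each outer training window ends before its test window begins, so no
# future information reaches the model during training or tuning.
scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(scores.mean(), scores.std())

As Jarek7 notes in the reply below, this gives an honest performance estimate, but it will not repair leaky features by itself.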

Jarek7
Jul 9, 2023

This ref doesn't explain WHY we should use NCV in this case - it just explains HOW to use NCV when dealing with time series. Cross-validation, including nested cross-validation, is a powerful tool for model evaluation and hyperparameter tuning, but it does NOT DIRECTLY ADDRESS data leakage. Data leakage refers to a situation where information from the test dataset leaks into the training dataset, causing the model to have an unrealistically high performance. Nested cross-validation can indeed help provide a more accurate estimation of the model's performance on unseen data, but IT DOESN'T SOLVE the underlying issue of data leakage if it's already present.

John_Pongthorn (Option: C)
Mar 6, 2023

C: this is the correct choice, 1000000000%. This is a data leakage issue in the training data. https://cloud.google.com/automl-tables/docs/train#analyze The question is drawn from this content: "If a column's Correlation with Target value is high, make sure that is expected, and not an indication of target leakage." To explain it in my own way: sometimes a feature in the training data is unintentionally calculated from the target value, which results in a high correlation between the two. For instance, you predict a stock price using a moving average, MACD, and RSI, despite the fact that all three features are calculated from the price (the target).
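John_Pongthorn's stock example is easy to demonstrate. In the toy sketch below (hypothetical prices, Python/pandas), a moving average whose window includes today's price leaks the target, while shifting the series first restricts the feature to past values only:

import pandas as pd

price = pd.Series([100.0, 101.5, 99.8, 102.3, 103.1], name="price")

leaky_ma = price.rolling(window=3).mean()           # window includes price[t] itself
safe_ma = price.shift(1).rolling(window=3).mean()   # window sees only t-1, t-2, t-3

print(pd.DataFrame({"target": price, "leaky_ma": leaky_ma, "safe_ma": safe_ma}))

The leaky version correlates with the target partly because the target is one of its own inputs, which is exactly the pattern the AutoML Tables documentation warns about.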

black_scissors
Jun 2, 2023

I agree. Besides, when CV is done randomly (not split by time point), it can make things worse.
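The point about random splits is easy to see in a toy comparison (scikit-learn; the ten-row array is illustrative): a shuffled KFold trains on rows that come after the test rows, while TimeSeriesSplit keeps every training index strictly before every test index:

import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # ten time-ordered rows

for name, cv in [
    ("KFold (shuffled)", KFold(n_splits=5, shuffle=True, random_state=0)),
    ("TimeSeriesSplit", TimeSeriesSplit(n_splits=5)),
]:
    print(name)
    for train_idx, test_idx in cv.split(X):
        print("  train:", train_idx, "test:", test_idx)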

hiromi (Option: B)
Dec 18, 2022

B. I agree with Paul_Dirac.

tavva_prudhvi (Option: B)
Aug 8, 2023

Option C is a good step to avoid overfitting, but it's not necessarily the best approach to address data leakage. Data leakage occurs when information from the validation or test data leaks into the training data, leading to overly optimistic performance metrics. In time-series data, it's important to avoid using future information to predict past events. Removing features highly correlated with the target value may help to reduce overfitting, but it does not necessarily address data leakage. Therefore, applying nested cross-validation during model training is a better approach to address data leakage in this scenario.

AnnaR (Option: B)
Apr 26, 2024

B: correct. I considered C, but why should we remove a feature of a highly predictive nature? For me, this does not explain the problem of overfitting... a highly predictive feature is also useful for good performance evaluated on the test set. --> Deciding for B!

M25 (Option: B)
May 9, 2023

Went with B

black_scissors (Option: C)
Jun 2, 2023

There can be a feature causing data leakage which might have been overlooked. In addition, when cross-validation is done randomly, the leakage can be even bigger.

Liting (Option: B)
Jul 7, 2023

Agree with Paul_Dirac. Also, it is recommended to use nested cross-validation to avoid data leakage in time series data.

Jarek7 (Option: C)
Jul 9, 2023

https://towardsdatascience.com/avoiding-data-leakage-in-timeseries-101-25ea13fcb15f directly says: "Dive straight into the MVP, cross-validate later!" (MVP stands for Minimum Viable Product.)

atlas_lyon (Option: B)
Aug 24, 2023

B: If splits are done chronologically (as is always advised), nested CV should work. C: High correlation with the target means we have to check whether it reflects strong explanatory power or data leakage. Dropping the features won't help us distinguish between those cases, but it may help reveal the independent contribution of the remaining features.

pico (Option: A)
Sep 13, 2023

Option A: This option is a reasonable choice. Switching to a less complex algorithm can help reduce overfitting, and using k-fold cross-validation can provide a better estimate of how well the model will generalize to unseen data. It's essential to ensure that the high performance isn't solely due to overfitting.

pico
Sep 13, 2023

Option B: Nested cross-validation is primarily used to estimate model performance accurately and select the best model hyperparameters. While it's a good practice, it doesn't directly address the overfitting issue. It helps prevent over-optimistic model performance estimates but doesn't necessarily fix the overfitting problem.

Option C: Removing features highly correlated with the target value can be a valid step in feature selection or preprocessing. However, it doesn't directly address the overfitting issue or explain why the model is performing exceptionally well on the training data. It's a separate step from mitigating overfitting.

Option D: This option is incorrect. Tuning hyperparameters should aim to improve model performance on the validation set, not reduce it.

In summary, the most appropriate next step is Option A.

Sum_Sum (Option: B)
Nov 15, 2023

I think it's B. GPT-4 makes a good argument about C: while this is a valid approach to handling data leakage, it might not be sufficient if the leakage is due to reasons other than high correlation, such as temporal leakage in time-series data.

b1a8fae (Option: B)
Dec 29, 2023

I initially went with C - however, after reading this: https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/ I think B is right. Quoted from the link: "Nested cross-validation is an approach to model hyperparameter optimization and model selection that attempts to overcome the problem of overfitting the training dataset." Overfitting is exactly our problem here. Correlated features in the dataset may be a sign of data leakage, but not necessarily.

gscharly (Option: B)
Apr 21, 2024

Agree with Paul_Dirac.

girgu (Option: C)
May 26, 2024

Nested cross-validation will not work for time series data. Time series data require an expanding-window training set. It seems most likely that the issue is high correlation in the columns.

PhilipKoku (Option: C)
Jun 6, 2024

C) is the best answer.

chirag2506 (Option: B)
Jun 25, 2024

B is the correct option