
Professional Machine Learning Engineer Exam - Question 36


You are building a model to predict daily temperatures. You split the data randomly and then transformed the training and test datasets. Temperature data for model training is uploaded hourly. During testing, your model performed with 97% accuracy; however, after deploying to production, the model's accuracy dropped to 66%. How can you make your production model more accurate?

Correct Answer: B

In time series prediction, the training and test sets must be free of data leakage, which here means information from the future being used in training. A random split introduces leakage because data points that are later than some test points can end up in the training set, which artificially inflates the test accuracy. Splitting the data based on time prevents this by ensuring that only past data is used to predict future data, which gives a realistic evaluation of the model’s performance. A model selected and tuned against that realistic evaluation is the one that will hold up in production, so the fix is to split the training and test data based on time.
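A time-based split can be sketched as follows. This is a minimal illustration using only the Python standard library, not the exam's reference solution; the function name `time_based_split`, the `test_fraction` parameter, and the synthetic hourly readings are all assumptions made up for the example.

```python
from datetime import datetime, timedelta


def time_based_split(records, test_fraction=0.2):
    """Split (timestamp, value) records so that every test row
    is strictly later than every training row."""
    ordered = sorted(records, key=lambda r: r[0])
    n_test = round(len(ordered) * test_fraction)
    cutoff = len(ordered) - n_test
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical data: 10 days of hourly temperature readings.
start = datetime(2021, 1, 1)
readings = [
    (start + timedelta(hours=h), 10.0 + (h % 24) * 0.5)
    for h in range(240)
]

train, test = time_based_split(readings, test_fraction=0.2)

# The newest training timestamp precedes the oldest test timestamp,
# so nothing from the test period is visible during training.
print("train ends: ", train[-1][0])
print("test starts:", test[0][0])
```

The key design choice is that the cutoff is a point in time rather than a random row selection, which mirrors how the model is used in production: it only ever sees the past.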

Discussion

14 comments
maartenalexander (Option: B)
Jun 22, 2021

B. If you do time series prediction, you can't borrow information from the future to predict the future. If you do, you are artificially increasing your accuracy.

Danny2021 (Option: B)
Sep 9, 2021

B. D doesn't improve anything at all. Split and Transform is no different than Transform and Split if the transform logic is the same.

giaZ (Option: B)
Mar 8, 2022

If you do a random split on a time series, you risk that the training data will contain information about the target (the definition of leakage), while similar data won't be available when the model is used for prediction. Leakage causes the model to look accurate until you start making actual predictions with it.
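The effect described above can be reproduced with a toy experiment. This is only a sketch under made-up assumptions: the `temperature` signal is synthetic, and the "model" is a deliberately naive nearest-neighbor lookup on the timestamp, chosen because it makes the leakage easy to see. It is not meant to represent the model in the question.

```python
import math
import random


def temperature(hour):
    # Synthetic signal (an assumption for illustration):
    # a daily cycle plus a slow upward drift.
    return 15.0 + 8.0 * math.sin(2 * math.pi * hour / 24) + 0.02 * hour


def nearest_neighbor_mae(train_hours, test_hours):
    """Predict each test hour with the value of the closest
    training hour and return the mean absolute error."""
    errors = []
    for h in test_hours:
        nearest = min(train_hours, key=lambda t: abs(t - h))
        errors.append(abs(temperature(h) - temperature(nearest)))
    return sum(errors) / len(errors)


hours = list(range(480))  # 20 days of hourly readings

# Random split: each test hour is surrounded by training hours,
# so the model can simply copy a neighboring reading.
shuffled = hours[:]
random.Random(0).shuffle(shuffled)
random_mae = nearest_neighbor_mae(shuffled[:384], shuffled[384:])

# Time-based split: the test period lies entirely after training,
# which is the situation the model faces in production.
time_mae = nearest_neighbor_mae(hours[:384], hours[384:])

print(f"random split MAE: {random_mae:.2f}")
print(f"time split MAE:   {time_mae:.2f}")
```

With the random split the error looks small because neighboring hours leak the answer; the time-based split reports the larger error that would actually be seen on future data.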

David_ml (Option: B)
May 9, 2022

You don't split data randomly for time series prediction.

JobQ (Option: B)
Dec 20, 2021

I think it is B

xiaoF (Option: B)
Feb 1, 2022

agree B as well

mmona19 (Option: B)
Apr 14, 2022

B should be the answer. D is incorrect, as normalizing before the split causes data leakage: https://community.rapidminer.com/discussion/32592/normalising-data-before-data-split-or-after
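The normalization concern can be sketched like this: compute the scaling statistics on the training portion only, then reuse them for the test portion. The helper names `fit_standardizer` and `standardize` and the small series are hypothetical, made up for the example.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Return the mean and standard deviation of the given values."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero
    return mu, sigma


def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]


# Hypothetical series, already in time order; the last four
# readings form the (later) test period.
series = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 20.0, 21.0, 22.0, 23.0]
train_values, test_values = series[:6], series[6:]

# Fit on the training period only, then apply to both sets.
mu, sigma = fit_standardizer(train_values)
train_scaled = standardize(train_values, mu, sigma)
test_scaled = standardize(test_values, mu, sigma)

# For contrast: statistics fitted on the full series are shifted by
# the test period, i.e. they already "know" about future values.
leaky_mu, _ = fit_standardizer(series)
print("train-only mean:", mu)
print("full-series mean:", leaky_mu)
```

The same transform logic is applied to both sets, but its parameters come from the training data alone, which is the distinction between transforming after the split and fitting the transform before it.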

Mohamed_Mossad (Option: B)
Jun 13, 2022

Test accuracy 97%, production accuracy 66% ---> time series data ---> random split ---> causes leakage; answer is B

SergioRubiano (Option: D)
Mar 24, 2023

D is correct. cross-validate

Jijiji (Option: D)
Aug 26, 2021

seems like D

M25 (Option: B)
May 9, 2023

Went with B

Sum_Sum (Option: B)
Nov 15, 2023

They did not explicitly say forecasting, but splitting by time is the number one rule you learn.

fragkris (Option: B)
Dec 5, 2023

Definitely B

PhilipKoku (Option: B)
Jun 6, 2024

B) Time split to avoid leaking data.