Professional Machine Learning Engineer Exam - Question 36

Question

You are building a model to predict daily temperatures. You split the data randomly and then transformed the training and test datasets. Temperature data for model training is uploaded hourly. During testing, your model performed with 97% accuracy; however, after deploying to production, the model's accuracy dropped to 66%. How can you make your production model more accurate?

Examice · Accepted Answer

In time series prediction, it is crucial that the split between training and test sets does not introduce data leakage, which occurs when information from the future is used in training. A random split can introduce leakage because data points from later than the test period can end up in the training set, which artificially inflates the test accuracy. Splitting the data based on time prevents this by ensuring that only past data is used to predict future data, leading to a more realistic evaluation of the model’s performance. Therefore, the training and test data should be split based on time, so that the model is selected and tuned against an evaluation that reflects how it will be used in production.
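
As a rough illustration of the time-based split described above, here is a minimal Python sketch. The function name `time_based_split`, the `test_fraction` parameter, and the synthetic hourly readings are assumptions made up for this example; they are not part of the exam question.

```python
from datetime import datetime, timedelta


def time_based_split(records, test_fraction=0.2):
    """Split (timestamp, value) records chronologically.

    The earliest rows go to training and the latest rows go to testing,
    so no test row is earlier than any training row.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    ordered = sorted(records, key=lambda r: r[0])  # oldest first
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Hypothetical hourly temperature readings (synthetic values).
start = datetime(2024, 1, 1)
readings = [(start + timedelta(hours=h), 10.0 + 0.1 * h) for h in range(100)]

train, test = time_based_split(readings, test_fraction=0.2)
print(len(train), len(test))
```

With a random split, rows from the last hours could land in the training set; here the training set always ends before the test set begins.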

maartenalexander · Answer

B. If you do time series prediction, you can't borrow information from the future to predict the future. If you do, you are artificially increasing your accuracy.

Danny2021 · Answer

B. D doesn't improve anything at all. Split and Transform is no different than Transform and Split if the transform logic is the same.

giaZ · Answer

If you do a random split on a time series, you risk that the training data will contain information about the target (the definition of leakage), while similar data won't be available when the model is used for prediction. Leakage causes the model to look accurate until you start making actual predictions with it.

David_ml · Answer

You don't split data randomly for time series prediction.

JobQ · Answer

I think it is B

xiaoF · Answer

Agree, B as well

mmona19 · Answer

B should be the answer. D is incorrect, as normalizing before the split is going to cause data leakage: https://community.rapidminer.com/discussion/32592/normalising-data-before-data-split-or-after
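
To make the normalization point concrete, a minimal sketch of one way to avoid that kind of leak: compute the scaling statistics from the training rows only and reuse them for the test rows. The helper names `fit_standardizer` and `standardize` and the sample values are hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Compute mean and standard deviation from the training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # avoid dividing by zero on constant data
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply statistics that were fitted elsewhere (on the training data)."""
    return [(v - mu) / sigma for v in values]


# Hypothetical temperatures: the training period comes first, the test period after it.
train_values = [10.0, 12.0, 14.0, 16.0]
test_values = [18.0, 20.0]

mu, sigma = fit_standardizer(train_values)
train_scaled = standardize(train_values, mu, sigma)
test_scaled = standardize(test_values, mu, sigma)  # no test statistics are used
```

If the mean and standard deviation were computed over all six values before splitting, the test period would influence how the training data is scaled.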

Mohamed_Mossad · Answer

Test accuracy 97%, production accuracy 66% ---> time series data ---> random split ---> causes leakage. Answer is B.

SergioRubiano · Answer

D is correct. cross-validate

Jijiji · Answer

seems like D

M25 · Answer

Went with B

Sum_Sum · Answer

They did not explicitly say forecasting, but splitting by time is the number one rule you learn.

fragkris · Answer

Definitely B

PhilipKoku · Answer

B) Time split to avoid leaking data.
