Professional Machine Learning Engineer Exam - Question 53


Your team is working on an NLP research project to predict the political affiliation of authors based on articles they have written. You have a large training dataset that is structured like this:

[dataset table not reproduced here]

You followed the standard 80%-10%-10% data distribution across the training, testing, and evaluation subsets. How should you distribute the training examples across the train-test-eval subsets while maintaining the 80-10-10 proportion?

A. Distribute texts randomly across the train-test-eval subsets
B. Distribute authors randomly across the train-test-eval subsets
C. Distribute sentences randomly across the train-test-eval subsets
D. Distribute paragraphs of texts across the train-test-eval subsets

Correct Answer: B

To predict political affiliation of authors based on articles they have written, it's vital to avoid data leakage and ensure that the model generalizes well to unseen data. Distributing authors randomly across the train, test, and evaluation subsets is the appropriate approach. This strategy ensures that the model cannot simply memorize the style of certain authors and instead encourages the model to learn to associate the language in the texts themselves with political affiliation. By having distinct authors in each subset, the model's performance on new, unseen authors can be accurately evaluated, making the solution robust and reducing the chance of overfitting.
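
For concreteness, here is a minimal sketch of such an author-level split using scikit-learn's GroupShuffleSplit, which keeps all rows sharing a group key in the same subset. The file name and the "author" column are assumptions about the dataset schema, not something given in the question:

    import pandas as pd
    from sklearn.model_selection import GroupShuffleSplit

    # Hypothetical schema: one row per article, with "author", "text",
    # and "affiliation" columns.
    df = pd.read_csv("articles.csv")

    # Step 1: hold out 20% of the authors (not 20% of the rows).
    outer = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=42)
    train_idx, holdout_idx = next(outer.split(df, groups=df["author"]))
    train_df, holdout_df = df.iloc[train_idx], df.iloc[holdout_idx]

    # Step 2: split the held-out authors 50/50 into test and eval,
    # giving roughly 80/10/10 overall.
    inner = GroupShuffleSplit(n_splits=1, test_size=0.50, random_state=42)
    test_idx, eval_idx = next(inner.split(holdout_df, groups=holdout_df["author"]))
    test_df, eval_df = holdout_df.iloc[test_idx], holdout_df.iloc[eval_idx]

    # By construction, no author appears in more than one subset.
    assert not set(train_df["author"]) & set(holdout_df["author"])

Note that the proportions are over authors, so the row counts only approximate 80/10/10 unless authors contribute similar numbers of articles.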

Discussion

16 comments
rc380 · Option: B
Aug 15, 2021

I think since we are predicting the political leaning of authors, perhaps distributing authors makes more sense? (B)

sensev
Aug 16, 2021

Agree, it should be B. Since every author has his/her distinct style, splitting different texts from the same author across different sets could result in label leakage.

dxxdd7
Sep 4, 2021

I don't agree, as we want to know the political affiliation from the text, not based on the author. I think A is better.

jk73
Sep 21, 2021

It is the political affiliation from a text, but to whom does that text belong? The statement clearly says: predict the political affiliation of authors based on articles they have written. Hence the political affiliation is for each author, according to the texts he wrote.

jk73
Sep 21, 2021

Exactly! I also consider it B. Check this out: if we just put texts, paragraphs, or sentences randomly into the training, validation, and test sets, the model will have the ability to learn specific qualities about an author's use of language beyond just his own articles, so the model will mix up different opinions. If instead we divide things up at the author level, so that a given author appears only in the training data, or only in the test data, or only in the validation data, the model will find it harder to get high accuracy on test and validation (which is correct and makes more sense!), because it will need to really learn from the language of the articles rather than derive a single political affiliation from a bunch of mixed articles by different authors. https://developers.google.com/machine-learning/crash-course/18th-century-literature
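
A quick way to make the "each author lives in exactly one subset" property concrete is a disjointness check. A small sketch, assuming pandas DataFrames with an "author" column (all names here are illustrative):

    import pandas as pd

    def assert_no_author_overlap(*subsets: pd.DataFrame, col: str = "author") -> None:
        """Fail loudly if any author appears in more than one subset."""
        seen = set()
        for subset in subsets:
            authors = set(subset[col])
            overlap = authors & seen
            assert not overlap, f"authors in multiple subsets: {overlap}"
            seen |= authors

    # Usage, given train/test/eval DataFrames from an author-level split:
    # assert_no_author_overlap(train_df, test_df, eval_df)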

inder0007 · Option: A
Jul 6, 2021

Should be A; we are trying to get a label on the entire text, so only A makes sense.

GogoG
Oct 11, 2021

Correct answer is B - https://developers.google.com/machine-learning/crash-course/18th-century-literature

Dunnoth
Feb 15, 2023

This is a known study. If you use A, the moment a new author appears in the test set, the accuracy is way lower than what your metrics might suggest. To have realistic evaluation results it should be B. Also note that the label is for the "author", not a text.

ggorzki · Option: B
Jan 19, 2022

https://developers.google.com/machine-learning/crash-course/18th-century-literature Split by authors, otherwise there will be data leakage: the model will get the ability to learn author-specific use of language.

Macgogo · Option: B
Sep 18, 2021

I think it is B. From the AutoML Tables docs: "Your test data includes data from populations that will not be represented in production. For example, suppose you are training a model with purchase data from a number of stores. You know, however, that the model will be used primarily to make predictions for stores that are not in the training data. To ensure that the model can generalize to unseen stores, you should segregate your data sets by stores. In other words, your test set should include only stores different from the evaluation set, and the evaluation set should include only stores different from the training set." https://cloud.google.com/automl-tables/docs/prepare#ml-use

tavva_prudhvi · Option: B
Jul 3, 2023

This is the best approach as it ensures that the data is distributed in a way that is representative of the overall population. By randomly distributing authors across the subsets, we ensure that each subset has a similar distribution of political affiliations. This helps to minimize bias and increases the likelihood that our model will generalize well to new data. Distributing texts randomly or by sentences or paragraphs may result in subsets that are biased towards a particular political affiliation. This could lead to overfitting and poor generalization performance. Therefore, it is important to distribute the data in a way that maintains the overall distribution of political affiliations across the subsets.
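
If you also want each subset to preserve the label distribution, as this comment suggests, scikit-learn (1.0+) provides StratifiedGroupKFold, which stratifies on the label while keeping each group (here, each author) inside a single fold. A sketch under the same assumed column names as above:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import StratifiedGroupKFold

    df = pd.read_csv("articles.csv")  # assumed columns: author, text, affiliation

    # 10 folds of roughly 10% of authors each; take one fold as test,
    # one as eval, and the remaining ~80% as train.
    sgkf = StratifiedGroupKFold(n_splits=10, shuffle=True, random_state=42)
    folds = list(sgkf.split(df, y=df["affiliation"], groups=df["author"]))

    _, test_idx = folds[0]
    _, eval_idx = folds[1]
    train_idx = np.setdiff1d(np.arange(len(df)), np.concatenate([test_idx, eval_idx]))

    train_df, test_df, eval_df = df.iloc[train_idx], df.iloc[test_idx], df.iloc[eval_idx]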

JobQ · Option: A
Dec 20, 2021

I already saw the video at https://developers.google.com/machine-learning/crash-course/18th-century-literature and based on it I concluded that the answer is A. What answer B is saying is that you will have author B's texts in the training set, author A's texts in the testing set, and author C's texts in the validation set. According to the video, B is incorrect. We want to have texts from author A in the training, testing, and validation sets. So A is correct. I think most people are choosing B because of the word "author", but let's be careful.

giaZ
Mar 9, 2022

I thought the same initially, but no. We'd want texts from author A in the training, testing, and validation sets if the task was to predict the author from a text (meaning, if the label was the author, right? You'd train the model to learn the style of a text and connect it to an author, and you'd need new texts from the same author in the test and validation sets to see if the model is able to recognize him/her). HERE, the task is to predict political affiliation from a text of an author. The author is given. In the test and validation sets you need new authors, to see whether the model is able to guess their political affiliation. So you would do 80 authors (and corresponding texts) for training, 10 different authors for validation, and 10 different ones for test.
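
giaZ's 80/10/10-authors scheme can also be written directly, without any scikit-learn helpers. A minimal sketch, again assuming an "author" column:

    import pandas as pd

    df = pd.read_csv("articles.csv")  # hypothetical file

    # Shuffle the unique authors, then slice the author list 80/10/10.
    authors = df["author"].drop_duplicates().sample(frac=1.0, random_state=42)
    n = len(authors)
    train_authors = set(authors.iloc[: int(0.8 * n)])
    val_authors = set(authors.iloc[int(0.8 * n) : int(0.9 * n)])
    test_authors = set(authors.iloc[int(0.9 * n) :])

    # Every article follows its author into exactly one subset.
    train_df = df[df["author"].isin(train_authors)]
    val_df = df[df["author"].isin(val_authors)]
    test_df = df[df["author"].isin(test_authors)]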

Danny2021 · Option: D
Sep 9, 2021

Should be D. Please see the dataset provided; it is based on the text/paragraphs.

george_ognyanov
Oct 14, 2021

Have a look at the link the others have already provided twice. Splitting sentence by sentence is literally mentioned in that video as a bad example and something we should not do in this case.

pddddd · Option: B
Sep 27, 2021

Partition by author; there is an actual example in the Coursera 'Production ML Systems' course.

NamitSehgal · Option: B
Jan 4, 2022

B. I agree.

suresh_vn · Option: B
Aug 11, 2022

IMO, B is correct; A, C, and D leak labels.

bL357A · Option: A
Sep 5, 2022

The label is the party; the feature is the text.

enghabeth · Option: B
Feb 8, 2023

Answer: B. The model is to predict which political party the author belongs to, not which political party the text belongs to... You do not have the political party of each text; you are assuming that the texts are associated with the political party of the author.

John_Pongthorn · Option: B
Feb 16, 2023

https://cloud.google.com/automl-tables/docs/prepare#split https://developers.google.com/machine-learning/crash-course/18th-century-literature

M25 · Option: B
May 9, 2023

Went with B

girgu · Option: B
May 26, 2024

We have to divide/split at the author level. Otherwise the model will use the text-to-author relationship, but we want to find the text-to-political-affiliation relationship. At prediction time we already know the text-to-author relation; what we want is the text-to-affiliation relation (and therefore the author-to-affiliation relation is implied).

PhilipKoku · Option: B
Jun 6, 2024

B) Authors