Professional Machine Learning Engineer Exam - Question 53


Your team is working on an NLP research project to predict the political affiliation of authors based on articles they have written. You have a large training dataset that is structured like this:

[dataset table not reproduced here]

You followed the standard 80%-10%-10% data distribution across the training, testing, and evaluation subsets. How should you distribute the training examples across the train-test-eval subsets while maintaining the 80-10-10 proportion?

A. Distribute texts randomly across the train-test-eval subsets
B. Distribute authors randomly across the train-test-eval subsets
C. Distribute sentences randomly across the train-test-eval subsets
D. Distribute paragraphs of texts across the train-test-eval subsets

Correct Answer: B

To predict political affiliation of authors based on articles they have written, it's vital to avoid data leakage and ensure that the model generalizes well to unseen data. Distributing authors randomly across the train, test, and evaluation subsets is the appropriate approach. This strategy ensures that the model cannot simply memorize the style of certain authors and instead encourages the model to learn to associate the language in the texts themselves with political affiliation. By having distinct authors in each subset, the model's performance on new, unseen authors can be accurately evaluated, making the solution robust and reducing the chance of overfitting.
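
For concreteness, here is a minimal sketch of such an author-level split using scikit-learn's GroupShuffleSplit, which keeps all rows sharing a group key in the same subset. The file name and the "author" column are assumptions about the dataset schema, not something given in the question:

    import pandas as pd
    from sklearn.model_selection import GroupShuffleSplit

    # Hypothetical schema: one row per article, with "author", "text",
    # and "affiliation" columns.
    df = pd.read_csv("articles.csv")

    # Step 1: hold out 20% of the authors (not 20% of the rows).
    outer = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=42)
    train_idx, holdout_idx = next(outer.split(df, groups=df["author"]))
    train_df, holdout_df = df.iloc[train_idx], df.iloc[holdout_idx]

    # Step 2: split the held-out authors 50/50 into test and eval,
    # giving roughly 80/10/10 overall.
    inner = GroupShuffleSplit(n_splits=1, test_size=0.50, random_state=42)
    test_idx, eval_idx = next(inner.split(holdout_df, groups=holdout_df["author"]))
    test_df, eval_df = holdout_df.iloc[test_idx], holdout_df.iloc[eval_idx]

    # By construction, no author appears in more than one subset.
    assert not set(train_df["author"]) & set(holdout_df["author"])

Note that the proportions are over authors, so the row counts only approximate 80/10/10 unless authors contribute similar numbers of articles.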

Discussion

16 comments
rc380 · Option: B
Aug 15, 2021

I think since we are predicting the political leaning of authors, perhaps distributing authors makes more sense? (B)

sensev
Aug 16, 2021

Agree, it should be B. Since every author has his/her distinct style, splitting different texts from the same author across different sets could result in label leakage.

dxxdd7
Sep 4, 2021

I don't agree, as we want to know the political affiliation from the text, not based on the author. I think A is better.

jk73
Sep 21, 2021

It is the political affiliation from a text, but to whom does that text belong? The statement clearly says: predict the political affiliation of authors based on articles they have written. Hence the political affiliation is for each author, according to the texts he wrote.

jk73
Sep 21, 2021

Exactly! I also consider it B. Check this out: if we just put texts, paragraphs, or sentences randomly into the training, validation, and test sets, the model will have the ability to learn specific qualities about an author's use of language beyond just his own articles, so the model will mix up different opinions. If instead we divide things up at the author level, so that a given author appears only in the training data, or only in the test data, or only in the validation data, the model will find it harder to get high accuracy on test and validation (which is correct and makes more sense!), because it will need to really learn from the language of the articles rather than derive a single political affiliation from a bunch of mixed articles by different authors. https://developers.google.com/machine-learning/crash-course/18th-century-literature
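
A quick way to make the "each author lives in exactly one subset" property concrete is a disjointness check. A small sketch, assuming pandas DataFrames with an "author" column (all names here are illustrative):

    import pandas as pd

    def assert_no_author_overlap(*subsets: pd.DataFrame, col: str = "author") -> None:
        """Fail loudly if any author appears in more than one subset."""
        seen = set()
        for subset in subsets:
            authors = set(subset[col])
            overlap = authors & seen
            assert not overlap, f"authors in multiple subsets: {overlap}"
            seen |= authors

    # Usage, given train/test/eval DataFrames from an author-level split:
    # assert_no_author_overlap(train_df, test_df, eval_df)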

inder0007 · Option: A
Jul 6, 2021

Should be A; we are trying to get a label on the entire text, so only A makes sense.

GogoG
Oct 11, 2021

Correct answer is B - https://developers.google.com/machine-learning/crash-course/18th-century-literature

Dunnoth
Feb 15, 2023

This is a known study. If you use A, the moment a new author appears in the test set, the accuracy is way lower than what your metrics might suggest. To have realistic evaluation results it should be B. Also note that the label is for the "author", not a text.

ggorzki · Option: B
Jan 19, 2022

https://developers.google.com/machine-learning/crash-course/18th-century-literature Split by authors, otherwise there will be data leakage: the model will get the ability to learn author-specific use of language.

Macgogo · Option: B
Sep 18, 2021

I think it is B. From the AutoML Tables docs: "Your test data includes data from populations that will not be represented in production. For example, suppose you are training a model with purchase data from a number of stores. You know, however, that the model will be used primarily to make predictions for stores that are not in the training data. To ensure that the model can generalize to unseen stores, you should segregate your data sets by stores. In other words, your test set should include only stores different from the evaluation set, and the evaluation set should include only stores different from the training set." https://cloud.google.com/automl-tables/docs/prepare#ml-use

tavva_prudhvi · Option: B
Jul 3, 2023

This is the best approach as it ensures that the data is distributed in a way that is representative of the overall population. By randomly distributing authors across the subsets, we ensure that each subset has a similar distribution of political affiliations. This helps to minimize bias and increases the likelihood that our model will generalize well to new data. Distributing texts randomly or by sentences or paragraphs may result in subsets that are biased towards a particular political affiliation. This could lead to overfitting and poor generalization performance. Therefore, it is important to distribute the data in a way that maintains the overall distribution of political affiliations across the subsets.
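
If you also want each subset to preserve the label distribution, as this comment suggests, scikit-learn (1.0+) provides StratifiedGroupKFold, which stratifies on the label while keeping each group (here, each author) inside a single fold. A sketch under the same assumed column names as above:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import StratifiedGroupKFold

    df = pd.read_csv("articles.csv")  # assumed columns: author, text, affiliation

    # 10 folds of roughly 10% of authors each; take one fold as test,
    # one as eval, and the remaining ~80% as train.
    sgkf = StratifiedGroupKFold(n_splits=10, shuffle=True, random_state=42)
    folds = list(sgkf.split(df, y=df["affiliation"], groups=df["author"]))

    _, test_idx = folds[0]
    _, eval_idx = folds[1]
    train_idx = np.setdiff1d(np.arange(len(df)), np.concatenate([test_idx, eval_idx]))

    train_df, test_df, eval_df = df.iloc[train_idx], df.iloc[test_idx], df.iloc[eval_idx]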

JobQ · Option: A
Dec 20, 2021

I already saw the video at https://developers.google.com/machine-learning/crash-course/18th-century-literature and based on it I concluded that the answer is A. What answer B is saying is that you will have author B's texts in the training set, author A's texts in the testing set, and author C's texts in the validation set. According to the video, B is incorrect. We want to have texts from author A in the training, testing, and validation sets. So A is correct. I think most people are choosing B because of the word "author", but let's be careful.

giaZ
Mar 9, 2022

I thought the same initially, but no. We'd want texts from author A in the training, testing, and validation sets if the task was to predict the author from a text (meaning, if the label was the author, right? You'd train the model to learn the style of a text and connect it to an author, and you'd need new texts from the same author in the test and validation sets to see if the model is able to recognize him/her). HERE, the task is to predict political affiliation from a text of an author. The author is given. In the test and validation sets you need new authors, to see whether the model is able to guess their political affiliation. So you would do 80 authors (and corresponding texts) for training, 10 different authors for validation, and 10 different ones for test.
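
giaZ's 80/10/10-authors scheme can also be written directly, without any scikit-learn helpers. A minimal sketch, again assuming an "author" column:

    import pandas as pd

    df = pd.read_csv("articles.csv")  # hypothetical file

    # Shuffle the unique authors, then slice the author list 80/10/10.
    authors = df["author"].drop_duplicates().sample(frac=1.0, random_state=42)
    n = len(authors)
    train_authors = set(authors.iloc[: int(0.8 * n)])
    val_authors = set(authors.iloc[int(0.8 * n) : int(0.9 * n)])
    test_authors = set(authors.iloc[int(0.9 * n) :])

    # Every article follows its author into exactly one subset.
    train_df = df[df["author"].isin(train_authors)]
    val_df = df[df["author"].isin(val_authors)]
    test_df = df[df["author"].isin(test_authors)]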

Danny2021 · Option: D
Sep 9, 2021

Should be D. Please see the dataset provided; it is based on the text/paragraphs.

george_ognyanov
Oct 14, 2021

Have a look at the link the others have already provided twice. Splitting sentence by sentence is literally mentioned in that video as a bad example and something we should not do in this case.

pddddd · Option: B
Sep 27, 2021

Partition by author; there is an actual example in the Coursera 'Production ML Systems' course.

NamitSehgal · Option: B
Jan 4, 2022

B. I agree.

suresh_vn · Option: B
Aug 11, 2022

IMO, B is correct; A, C, and D leak labels.

bL357A · Option: A
Sep 5, 2022

The label is the party; the feature is the text.

enghabeth · Option: B
Feb 8, 2023

Answer: B. The model is to predict which political party the author belongs to, not which political party the text belongs to... You do not have the political party of each text; you are assuming that the texts are associated with the political party of the author.

John_Pongthorn · Option: B
Feb 16, 2023

https://cloud.google.com/automl-tables/docs/prepare#split https://developers.google.com/machine-learning/crash-course/18th-century-literature

M25 · Option: B
May 9, 2023

Went with B

girgu · Option: B
May 26, 2024

We have to divide/split at the author level. Otherwise the model will use the text-to-author relationship, but we want to find the text-to-political-affiliation relationship. At prediction time we already know the text-to-author relation; what we want is the text-to-affiliation relation (and therefore the author-to-affiliation relation is implied).

PhilipKoku · Option: B
Jun 6, 2024

B) Authors