
Professional Machine Learning Engineer Exam - Question 78


You are creating a deep neural network classification model using a dataset with categorical input values. Certain columns have a cardinality greater than 10,000 unique values. How should you encode these categorical values as input into the model?

Correct Answer: B

For categorical columns with high cardinality, plain one-hot encoding is inefficient because it creates one binary column per unique value, which here means more than 10,000 sparse input columns per feature. A better approach is to use one-hot hash buckets: each categorical value is hashed into one of a fixed number of buckets, and the bucket index is then one-hot encoded. This caps the input dimensionality at the number of buckets. The trade-off is that different categories can collide in the same bucket, so the representation is not guaranteed to be unique per category. The encoding is cheap to compute, needs no stored vocabulary, and is well suited to large datasets.
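The hashing step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the method used by any particular Google Cloud product: the helper names `hash_bucket` and `one_hot_hash`, the choice of MD5 as a stable hash, and the bucket count of 100 are all hypothetical.

```python
import hashlib
from typing import List


def hash_bucket(value: str, num_buckets: int) -> int:
    """Map a category string to a stable bucket index in [0, num_buckets)."""
    # Python's built-in hash() is salted per process, so a cryptographic
    # digest is used here purely to get a reproducible integer.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def one_hot_hash(value: str, num_buckets: int) -> List[int]:
    """One-hot encode the hash bucket of a category value."""
    vector = [0] * num_buckets
    vector[hash_bucket(value, num_buckets)] = 1
    return vector


# Hypothetical usage: 10,000 distinct IDs squeezed into 100 input columns.
NUM_BUCKETS = 100  # assumed value, chosen only for illustration
encoded = one_hot_hash("user_123", NUM_BUCKETS)  # length-100 vector with a single 1
used = {hash_bucket(f"id_{i}", NUM_BUCKETS) for i in range(10_000)}
# len(used) can never exceed NUM_BUCKETS, so many IDs necessarily share a bucket.
```

The vector length is fixed by the bucket count rather than by the number of categories, which is the point of the technique; the cost is that colliding categories become indistinguishable to the model.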

Discussion

14 comments
hiromi (Option: B)
Dec 18, 2022

I think B is correct. Refs:
- https://cloud.google.com/ai-platform/training/docs/algorithms/xgboost
- https://stackoverflow.com/questions/26473233/in-preprocessing-data-with-high-cardinality-do-you-hash-first-or-one-hot-encode

hiromi
Dec 23, 2022

- https://cloud.google.com/ai-platform/training/docs/algorithms/xgboost#analysis

LearnSodas (Option: A)
Dec 11, 2022

Answer A, since with 10,000 unique values one-hot shouldn't be a good solution: https://machinelearningmastery.com/how-to-prepare-categorical-data-for-deep-learning-in-python/

etienne0
Mar 2, 2024

I agree with A

CloudKida (Option: B)
May 8, 2023

https://cloud.google.com/ai-platform/training/docs/algorithms/wide-and-deep
"If the column is categorical with high cardinality, then the column is treated with hashing, where the number of hash buckets equals to the square root of the number of unique values in the column."

MithunDesai (Option: C)
Dec 22, 2022

I think C, as it has 10,000 categorical values.

John_Pongthorn (Option: B)
Jan 26, 2023

B, unconditionally. https://cloud.google.com/ai-platform/training/docs/algorithms/xgboost#analysis
"If the column is categorical with high cardinality, then the column is treated with hashing, where the number of hash buckets equals to the square root of the number of unique values in the column. A categorical column is considered to have high cardinality if the number of unique values is greater than the square root of the number of rows in the dataset."
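The two rules quoted from the documentation reduce to simple arithmetic. The sketch below is only a reading of that quoted text: the function names are hypothetical, and rounding the square root to the nearest integer is an assumption, since the quote does not say how a non-integer result is handled.

```python
import math


def is_high_cardinality(num_unique: int, num_rows: int) -> bool:
    """Quoted rule: high cardinality if unique values exceed sqrt(number of rows)."""
    return num_unique > math.sqrt(num_rows)


def num_hash_buckets(num_unique: int) -> int:
    """Quoted rule: bucket count equals sqrt(number of unique values).

    Rounding to the nearest integer (minimum 1) is an assumption made here.
    """
    return max(1, round(math.sqrt(num_unique)))


# Hypothetical example: a column with 10,000 unique values in a 1,000,000-row dataset.
high = is_high_cardinality(10_000, 1_000_000)  # 10,000 > 1,000
buckets = num_hash_buckets(10_000)             # sqrt(10,000) = 100
```

Under these assumptions, a column with 10,000 unique values in a dataset of one million rows counts as high cardinality and would be hashed into 100 buckets.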

M25 (Option: B)
May 9, 2023

Went with B

ares81 (Option: B)
Dec 11, 2022

It should be B

seifou (Option: B)
Dec 11, 2022

B is correct

JeanEl (Option: B)
Dec 13, 2022

Ans : B

mil_spyro (Option: B)
Dec 17, 2022

Answer is B. When the cardinality of a categorical column is very large, the best choice is binary encoding; however, it is not offered here, hence the one-hot hash option.

mil_spyro
Dec 17, 2022

https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/

enghabeth (Option: B)
Feb 9, 2023

https://towardsdatascience.com/getting-deeper-into-categorical-encodings-for-machine-learning-2312acd347c8
When you have millions of unique values, try hash encoding.

JamesDoe (Option: B)
Mar 28, 2023

B. The other options solve nada.

etienne0 (Option: A)
Mar 2, 2024

went with A

PhilipKoku (Option: B)
Jun 7, 2024

B) Hash buckets