You are creating a deep neural network classification model using a dataset with categorical input values. Certain columns have a cardinality greater than 10,000 unique values. How should you encode these categorical values as input into the model?
For categorical columns with high cardinality, traditional one-hot encoding is inefficient because it would create a very large number of binary columns. A better approach is to use one-hot hash buckets: the categorical values are hashed into a fixed number of buckets, which reduces dimensionality while still giving each category a usable (if not strictly unique) representation. This encoding is computationally efficient and well suited to large datasets.
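As a minimal sketch of the hash-bucket idea (function names and the bucket count are illustrative, not from any specific library):

```python
import hashlib

def hash_bucket(value: str, num_buckets: int) -> int:
    # Use a deterministic hash; Python's built-in hash() is salted per process,
    # so it would map the same value to different buckets across runs.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def one_hot_hash(value: str, num_buckets: int) -> list:
    # One-hot vector over the hash buckets instead of over all raw categories.
    vec = [0] * num_buckets
    vec[hash_bucket(value, num_buckets)] = 1
    return vec

# A column with ~10,000 unique values can be mapped into, e.g., 100 buckets
# (the square root of the unique-value count, per the Google docs cited below),
# instead of 10,000 one-hot columns.
vec = one_hot_hash("category_12345", 100)
```

Note that distinct categories can collide in the same bucket; that is the accepted trade-off for the reduced dimensionality.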
I think B is correct. Refs:
- https://cloud.google.com/ai-platform/training/docs/algorithms/xgboost
- https://stackoverflow.com/questions/26473233/in-preprocessing-data-with-high-cardinality-do-you-hash-first-or-one-hot-encode
- https://cloud.google.com/ai-platform/training/docs/algorithms/xgboost#analysis
Answer A, since with 10,000 unique values one-hot wouldn't be a good solution. https://machinelearningmastery.com/how-to-prepare-categorical-data-for-deep-learning-in-python/
I agree with A
https://cloud.google.com/ai-platform/training/docs/algorithms/wide-and-deep If the column is categorical with high cardinality, then the column is treated with hashing, where the number of hash buckets equals the square root of the number of unique values in the column.
I think C, as it has 10,000 categorical values.
B, unconditionally. https://cloud.google.com/ai-platform/training/docs/algorithms/xgboost#analysis If the column is categorical with high cardinality, then the column is treated with hashing, where the number of hash buckets equals the square root of the number of unique values in the column. A categorical column is considered to have high cardinality if the number of unique values is greater than the square root of the number of rows in the dataset.
Went with B
It should be B
B is correct
Ans : B
Answer is B. When the cardinality of a categorical column is very large, binary encoding would be the best choice, but it isn't offered here, hence the one-hot hash option.
https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/
https://towardsdatascience.com/getting-deeper-into-categorical-encodings-for-machine-learning-2312acd347c8 When you have millions of unique values, try hash encoding.
B. The other options solve nothing.
went with A
B) Hash buckets