You are creating a deep neural network classification model using a dataset with categorical input values. Certain columns have a cardinality greater than 10,000 unique values. How should you encode these categorical values as input into the model?
For categorical columns with high cardinality, traditional one-hot encoding is inefficient because it would create a very large number of binary columns. A better approach is to use one-hot hash buckets: the categorical values are hashed into a fixed number of buckets, which reduces dimensionality while still giving each category a usable (if not strictly unique) representation. This encoding is computationally efficient and well suited to large datasets.
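As a minimal sketch of the hash-bucket idea (function names and the bucket count are illustrative, not from any specific library):

```python
import hashlib

def hash_bucket(value: str, num_buckets: int) -> int:
    # Use a deterministic hash; Python's built-in hash() is salted per process,
    # so it would map the same value to different buckets across runs.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def one_hot_hash(value: str, num_buckets: int) -> list:
    # One-hot vector over the hash buckets instead of over all raw categories.
    vec = [0] * num_buckets
    vec[hash_bucket(value, num_buckets)] = 1
    return vec

# A column with ~10,000 unique values can be mapped into, e.g., 100 buckets
# (the square root of the unique-value count, per the Google docs cited below),
# instead of 10,000 one-hot columns.
vec = one_hot_hash("category_12345", 100)
```

Note that distinct categories can collide in the same bucket; that is the accepted trade-off for the reduced dimensionality.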
I think B is correct. Refs:
- https://cloud.google.com/ai-platform/training/docs/algorithms/xgboost
- https://stackoverflow.com/questions/26473233/in-preprocessing-data-with-high-cardinality-do-you-hash-first-or-one-hot-encode
- https://cloud.google.com/ai-platform/training/docs/algorithms/xgboost#analysis
Answer A, since with 10,000 unique values one-hot wouldn't be a good solution. https://machinelearningmastery.com/how-to-prepare-categorical-data-for-deep-learning-in-python/
I agree with A
https://cloud.google.com/ai-platform/training/docs/algorithms/wide-and-deep If the column is categorical with high cardinality, then the column is treated with hashing, where the number of hash buckets equals the square root of the number of unique values in the column.
I think C, as it has 10,000 categorical values.
B, unconditionally. https://cloud.google.com/ai-platform/training/docs/algorithms/xgboost#analysis If the column is categorical with high cardinality, then the column is treated with hashing, where the number of hash buckets equals the square root of the number of unique values in the column. A categorical column is considered to have high cardinality if the number of unique values is greater than the square root of the number of rows in the dataset.
Went with B
It should be B
B is correct
Ans : B
Answer is B. When the cardinality of a categorical column is very large, binary encoding would be the best choice, but it isn't offered here, hence the one-hot hash option.
https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/
https://towardsdatascience.com/getting-deeper-into-categorical-encodings-for-machine-learning-2312acd347c8 When you have millions of unique values, try hash encoding.
B. The other options solve nothing.
went with A
B) Hash buckets