Aggregating target-encoded array-like categorical features?
I am trying to find commonly used techniques for dealing with high-cardinality, multi-valued categorical variables in machine learning classification problems.
One-hot encoding leads to very high dimensionality, so the approach I've landed on is target encoding (mean encoding). I understand how to use it when the categorical feature is a single choice (e.g. current zip code). But when the feature can take on multiple values from a large list (e.g. favorite hobbies, illness symptoms, university coursework), I am not sure how to combine the values.
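For the single-valued case, here is a minimal sketch of what I am currently doing (the column names, toy data, and smoothing constant are only illustrative):

    import pandas as pd

    # Toy frame: one single-valued categorical ("zip") and a binary target.
    df = pd.DataFrame({
        "zip":    ["10001", "10001", "94105", "94105", "94105", "60601"],
        "target": [1, 0, 1, 1, 0, 0],
    })

    # Smoothed target (mean) encoding: blend each category's target mean
    # with the global mean, weighted by how often the category appears.
    global_mean = df["target"].mean()
    stats = df.groupby("zip")["target"].agg(["mean", "count"])
    smoothing = 10  # pseudo-count; tuned by validation in practice
    encoding = (stats["count"] * stats["mean"] + smoothing * global_mean) / (stats["count"] + smoothing)

    df["zip_te"] = df["zip"].map(encoding)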
My intuition says the wrong approach would be to treat each unique combination of values as its own category and encode that, as it would lead to overfitting. The alternatives that come to mind are simple aggregations of the per-value encodings, such as sum, mean, product, or variance; a sketch of what I mean follows.
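Here is roughly what that aggregation would look like (the hobby values, the out-of-vocabulary fallback, and the particular statistics are placeholders, not a settled design):

    import pandas as pd

    # Multi-valued categorical: each row lists several hobbies.
    df = pd.DataFrame({
        "hobbies": [["chess", "running"], ["running"], ["chess", "piano", "running"]],
        "target":  [1, 0, 1],
    })

    # Per-value target means, computed on the "exploded" long format.
    long_form = df[["hobbies", "target"]].explode("hobbies")
    global_mean = df["target"].mean()
    value_means = long_form.groupby("hobbies")["target"].mean()

    # Aggregate the per-value encodings back to one number per row;
    # values unseen during fitting fall back to the global mean.
    def encode_row(values, how):
        encoded = pd.Series([value_means.get(v, global_mean) for v in values])
        return getattr(encoded, how)()

    df["hobbies_te_mean"] = df["hobbies"].apply(encode_row, how="mean")
    df["hobbies_te_max"] = df["hobbies"].apply(encode_row, how="max")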
How should target encoded values be combined?
machine-learning feature-engineering encoding
asked Apr 9 at 18:41 by user4446237