Strategies for handling unlabeled data which is slightly different from the labeled data
Suppose you have a dataset with the following properties:
- The number of samples is fairly large (~100K samples).
- There are ~150 contextual features and 1 feature consisting of a text string (which can, of course, be split into any number of features depending on the pre-processing of the text). The text string is expected to have strong predictive power.
- Samples are divided into 3 categories (prior to you receiving the data) based on a few of the contextual features, with category A containing ~5% of the samples, category B containing ~20%, and category C containing the remaining ~75%.
- Category A is entirely labeled, category B is mostly labeled (only a small proportion is unlabeled), and category C is entirely unlabeled.
- The features used to categorize the samples are likely to influence the probability of a sample belonging to class 0 or class 1.
- The samples are not completely different between categories (that is, we're not talking cats versus dogs). For example, two very similar samples might end up in different categories based on a very small difference in a numerical feature with a large range.

The goal is to build a classifier that correctly classifies the samples. This might look like a semi-supervised learning problem, but I am worried about the structural differences between the categories. Hence my question: which strategies could be employed to build a classifier that performs well on all of the samples?

Of course, I could just be conservative and only deal with the labeled data, but there is great value in also being able to predict the unlabeled data (e.g. the 75% of the data in category C). That's why I'd like to pick your brains for creative solutions!

machine-learning python classification
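The semi-supervised framing mentioned above can be sketched with self-training (pseudo-labeling): fit a base classifier on the labeled rows, then iteratively label the unlabeled rows whose predicted probability clears a confidence threshold. A minimal, hedged sketch on synthetic placeholder data (the feature matrix, threshold, and base model are all illustrative, not the asker's actual setup):

```python
# Sketch of self-training: scikit-learn's SelfTrainingClassifier
# pseudo-labels unlabeled rows (marked y = -1) whose predicted
# probability exceeds a threshold, then refits. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))           # stand-in for the ~150 contextual features
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)

y = y_true.copy()
y[250:] = -1                              # -1 marks unlabeled rows (e.g. category C)

base = LogisticRegression(max_iter=1000)
model = SelfTrainingClassifier(base, threshold=0.9)
model.fit(X, y)                           # pseudo-labels confident unlabeled rows

preds = model.predict(X)                  # every row now gets a class-0/1 prediction
```

Whether this is safe here depends on exactly the asker's worry: self-training assumes the labeled and unlabeled distributions are similar enough that confident predictions on category C can be trusted.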
– n1tk (Apr 5 at 4:10): Have you considered clustering? Another approach I can think of is to train the model on the labeled data and treat new incoming data (category C) as anomalies. Without a sample of the data I can't think of a proof of concept other than the approach used in cybersecurity: train on "true" data and then predict the "anomaly".
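The anomaly idea from this comment can be sketched roughly as follows: fit a novelty detector on the labeled portion and flag category-C rows that fall outside that distribution, so the supervised classifier is only trusted on in-distribution rows. This is a hedged sketch on synthetic data; the detector choice (IsolationForest) and contamination rate are assumptions, not anything stated in the thread:

```python
# Sketch of the anomaly-detection idea: fit IsolationForest on the
# labeled rows, then flag category-C rows that look out-of-distribution
# before trusting a classifier's predictions on them. Synthetic data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X_labeled = rng.normal(loc=0.0, size=(800, 10))     # categories A/B (labeled)
X_category_c = rng.normal(loc=0.5, size=(200, 10))  # slightly shifted category C

iso = IsolationForest(contamination=0.05, random_state=1).fit(X_labeled)
flags = iso.predict(X_category_c)                   # +1 = in-distribution, -1 = anomaly

in_dist = X_category_c[flags == 1]                  # rows safe to score with the classifier
```

The fraction flagged as anomalous gives a cheap read on how far category C actually sits from the labeled data.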
– Mathias (Apr 6 at 10:16): @n1tk Would you mind elaborating on your thoughts on using clustering? I'm struggling to find a way to actually build (and evaluate) a classifier using clustering. And yes, your other idea about the anomaly approach is pretty much what I might end up doing if nothing better comes up.
– Pedro Henrique Monforte (Apr 9 at 22:43): What do you mean by unlabeled? Isn't it already labeled as category C? If so, what is the problem? Could you provide us with data examples (even fake ones, as long as they have the same structure as your data)?

– Mathias (Apr 11 at 7:10): @PedroHenriqueMonforte Sorry for not making myself clear. The task is to predict class 0 or class 1. All of the data is categorized into A, B, and C, but the problem is that only part of the data is labeled in terms of class 0 or class 1.
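The question's worry about structural differences between categories can also be probed directly with a domain classifier: if a model can easily distinguish labeled from unlabeled rows, the distributions differ, and its probabilities yield importance weights that up-weight labeled rows resembling category C. A minimal sketch on synthetic data (the shift size and model are illustrative assumptions):

```python
# Sketch of domain-classifier importance weighting: train a model to
# separate labeled from unlabeled rows; its odds ratio reweights the
# labeled data toward the unlabeled distribution. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X_lab = rng.normal(loc=0.0, size=(500, 10))   # labeled rows (A and most of B)
X_unl = rng.normal(loc=0.3, size=(500, 10))   # unlabeled rows (category C)

# Domain labels: 0 = labeled, 1 = unlabeled.
X_dom = np.vstack([X_lab, X_unl])
y_dom = np.concatenate([np.zeros(500), np.ones(500)])
dom = LogisticRegression(max_iter=1000).fit(X_dom, y_dom)

# Weight p(unlabeled|x) / p(labeled|x) up-weights labeled rows that look
# like category C; pass it as sample_weight when fitting the final classifier.
p = dom.predict_proba(X_lab)[:, 1]
weights = p / (1 - p)
```

The domain classifier's accuracy itself is informative: near 50% suggests the categories are interchangeable; near 100% suggests the semi-supervised assumption is badly violated.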
asked Apr 4 at 9:37 by Mathias
0 Answers