Strategies for handling unlabeled data which is slightly different from the labeled data

Suppose you have a dataset with the following properties:

The number of samples is fairly large (~100K samples)

There are ~150 contextual features and one text-string feature (which can, of course, be split into any number of features depending on how the text is pre-processed). The text string is expected to have strong predictive power

Samples are divided into 3 categories (prior to you receiving the data) based on a few of the contextual features: category A contains ~5% of the samples, category B ~20%, and category C the remaining ~75%

Category A is entirely labeled, category B is partly labeled (with only a small proportion being unlabeled), and category C is entirely unlabeled

The features used to categorize the samples are likely to influence the probability of a sample belonging to class 0 or class 1.

The samples are not completely different between categories (that is to say, we're not talking cats versus dogs). For example, two very similar samples might end up in different categories because of a very small difference in a numerical feature with a large range
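To make the structure above concrete, here is a minimal sketch that generates fake samples with the same shape. Everything in it is an assumption made for illustration: the feature name `amount`, the thresholds 50 and 250, the 90% labeling rate in category B, and the way `amount` shifts the class probability. The text string and the other contextual features are left out.

```python
import random


def make_synthetic_samples(n=10_000, seed=0):
    """Generate fake samples that mimic the structure described above."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Hypothetical numerical feature with a large range that drives the A/B/C split.
        amount = rng.uniform(0, 1000)
        if amount < 50:
            category = "A"  # roughly 5% of the samples
        elif amount < 250:
            category = "B"  # roughly 20% of the samples
        else:
            category = "C"  # roughly 75% of the samples

        # Assumption: the categorizing feature also shifts the probability of class 1.
        p_class_1 = 0.2 + 0.6 * amount / 1000
        true_class = 1 if rng.random() < p_class_1 else 0

        # A is fully labeled, B is mostly labeled, C is not labeled at all.
        if category == "A":
            label = true_class
        elif category == "B":
            label = true_class if rng.random() < 0.9 else None
        else:
            label = None
        samples.append({"amount": amount, "category": category, "label": label})
    return samples


samples = make_synthetic_samples()
```

Note that two samples with `amount` just below and just above 250 are nearly identical, yet one lands in B (probably labeled) and the other in C (never labeled).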

The goal is to build a binary classifier (class 0 vs. class 1) that classifies the samples correctly. This might look like a semi-supervised learning problem, but I am worried about the structural differences between the categories. Hence my question: which strategies could be employed to build a classifier that performs well on all of the samples?

Of course I could just be conservative and only deal with the labeled data, but there is great value in also being able to predict the unlabeled data (e.g. the 75% of the data in category C). That's why I'll try to pick your brains for creative solutions!
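One way to put a number on the "structural differences" worry is to compare, feature by feature, how the labeled and unlabeled pools are distributed. The sketch below uses a two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs), written with the standard library only. The feature values are made up, and the variable names are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap occurs at a data point.
    for v in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, v) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Made-up values of a feature that was used to define the categories.
labeled_amounts = [12.0, 30.5, 48.0, 120.0, 200.0]
unlabeled_amounts = [260.0, 400.0, 610.0, 880.0, 990.0]

# Made-up values of a contextual feature that was not used for the split.
labeled_other = [0.1, 0.4, 0.5, 0.7, 0.9]
unlabeled_other = [0.2, 0.3, 0.5, 0.8, 0.9]

print(ks_statistic(labeled_amounts, unlabeled_amounts))  # 1.0: the ranges do not overlap
print(ks_statistic(labeled_other, unlabeled_other))
```

A statistic near 1 marks a feature on which a model trained on A and B would have to extrapolate to reach C; a statistic near 0 suggests the feature looks alike in both pools. The features used to define the categories will score high by construction, so the interesting part is how the remaining features and the text behave.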

asked Apr 4 at 9:37

Mathias

Have you considered clustering? Another approach I can think of is to train the model on the labeled data and treat new incoming data (category C) as anomalies. Without a sample of the data I can't think of a proof of concept other than the approach used in cybersecurity: train on "true" data and then predict the "anomaly".
– n1tk
Apr 5 at 4:10

@n1tk Would you mind elaborating on your thoughts on using clustering? I'm struggling to find a way to actually build (and evaluate) a classifier using clustering. And yeah, your other idea, the anomaly approach, is pretty much what I might end up doing if nothing better pops up.
– Mathias
Apr 6 at 10:16

What do you mean by unlabeled? Is it not already labeled as category C? If so, what is the problem? Category? Could you provide us with data examples (even if they are fake but hold the same structure as your data)?
– Pedro Henrique Monforte
Apr 9 at 22:43

@PedroHenriqueMonforte Sorry for not making myself clear. The task is to predict class 0 or class 1. All of the data is categorized into categories A, B, and C, but the problem is that only part of the data is labeled in terms of class 0 or class 1.
– Mathias
Apr 11 at 7:10
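One possible reading of the anomaly idea from the comments is to measure how far each unlabeled sample sits from the labeled data and to trust predictions only for samples that look familiar. The sketch below is a deliberately simple stand-in that uses per-feature z-scores; the rows, the threshold of 3.0, and the function names are all hypothetical, and a dedicated novelty detector would likely replace it in practice.

```python
from statistics import mean, pstdev


def fit_reference(rows):
    """Per-feature mean and standard deviation of the labeled rows."""
    columns = list(zip(*rows))
    # Fall back to 1.0 when a feature is constant, to avoid dividing by zero.
    return [(mean(col), pstdev(col) or 1.0) for col in columns]


def novelty_score(row, reference):
    """Largest absolute z-score of a row relative to the labeled data."""
    return max(abs(x - mu) / sigma for x, (mu, sigma) in zip(row, reference))


# Made-up numeric features of labeled samples (categories A and B).
labeled_rows = [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0), (2.0, 9.0)]
reference = fit_reference(labeled_rows)

# Made-up unlabeled samples (category C): one familiar, one far outside the labeled range.
new_rows = [(2.0, 10.5), (9.0, 10.0)]
flags = [novelty_score(row, reference) > 3.0 for row in new_rows]
print(flags)  # [False, True]
```

Flagged samples could be held back or sent for manual labeling, while unflagged ones are scored by the classifier trained on A and B. This does not evaluate accuracy on category C; it only limits how far the model is asked to extrapolate.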

machine-learning python classification
