Strategies for handling unlabeled data which is slightly different from the labeled data

Suppose you have a dataset with the following properties:

  1. The number of samples is fairly large (~100K samples).

  2. There are ~150 contextual features and 1 feature consisting of a text string (which can, of course, be split into any number of features depending on how the text is pre-processed). The text string is expected to have strong predictive power.

  3. Samples are divided into 3 categories (prior to you receiving the data) based on a few of the contextual features, with category A containing ~5% of the samples, category B containing ~20%, and category C containing the remaining 75%.

  4. Category A is entirely labeled, category B is partly labeled (with only a small proportion being unlabeled), and category C is entirely unlabeled.

  5. The features used to categorize the samples are likely to influence the probability of a sample belonging to class 0 or class 1.

  6. The samples are not completely different between categories (that is to say, we're not talking cats versus dogs). For example, two very similar samples might end up in different categories because of a very small difference in a numerical feature with a large range.

The goal is to build a classifier that correctly assigns each sample to class 0 or class 1. This might look like a semi-supervised learning problem, but I am worried about the structural differences between the categories. Hence my question: which strategies could be employed to build a classifier that performs well on all of the samples?

Of course, I could be conservative and only work with the labeled data, but there is great value in also being able to make predictions for the unlabeled data (e.g. the 75% of the data in category C). That's why I'd like to pick your brains for creative solutions!
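
Since the post notes that this might look like a semi-supervised learning problem, here is a minimal pseudo-labelling sketch (not part of the original question) using scikit-learn's SelfTrainingClassifier. The file name and the column names ("ctx_*", "text", "label", "category"), as well as the choice of logistic regression as the base model, are illustrative placeholders rather than details of the actual dataset.

    # Pseudo-labelling sketch: mark every sample without a class-0/1 label as
    # unlabeled (-1) and let the base model propagate labels from the labeled
    # categories (A and most of B) to the unlabeled ones (the rest of B and C).
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.semi_supervised import SelfTrainingClassifier

    df = pd.read_csv("samples.csv")                             # hypothetical file, ~100K rows
    ctx_cols = [c for c in df.columns if c.startswith("ctx_")]  # the ~150 contextual features

    # SelfTrainingClassifier expects unlabeled samples to carry the label -1.
    y = df["label"].fillna(-1).astype(int).to_numpy()
    X = df[ctx_cols + ["text"]]

    features = ColumnTransformer([
        ("ctx", StandardScaler(), ctx_cols),                    # numeric contextual features
        ("txt", TfidfVectorizer(max_features=20_000), "text"),  # bag-of-words for the text string
    ])

    model = Pipeline([
        ("features", features),
        # Only pseudo-label a sample once the base model is fairly confident about it.
        ("clf", SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)),
    ])

    model.fit(X, y)
    category_c_predictions = model.predict(X[df["category"] == "C"])

One way to sanity-check such a baseline is to hold out part of the labeled category-B data, hide its labels during training, and compare the pseudo-labels against the held-out truth before trusting any predictions on category C.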










machine-learning python classification

asked Apr 4 at 9:37 by Mathias (344 reputation)

  • Have you considered clustering? Another approach I can think of is to train a model on the labeled data and treat new incoming data (category C) as anomalies ... Without a sample of the data I can't think of a proof of concept other than the approach used in cybersecurity of training on "true" data and then predicting the "anomaly". – n1tk, Apr 5 at 4:10

  • @n1tk Would you mind elaborating on your thoughts on using clustering? I'm struggling to find a way to actually build (and evaluate) a classifier using clustering. And yes, your other idea about the anomaly approach is pretty much what I might end up doing if nothing better pops up. – Mathias, Apr 6 at 10:16

  • What do you mean by unlabeled? Isn't it already labeled as category C? If so, what is the problem? Could you provide us with data examples (even fake ones, as long as they have the same structure as your data)? – Pedro Henrique Monforte, Apr 9 at 22:43

  • @PedroHenriqueMonforte Sorry for not making myself clear. The task is to predict class 0 or class 1. All of the data is categorized into A, B, and C, but the problem is that only part of the data is labeled in terms of class 0 or class 1. – Mathias, Apr 11 at 7:10
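
The anomaly-style idea from the first comment above could be prototyped roughly as follows. This is a sketch reusing the hypothetical column names from the earlier snippet ("ctx_*", "label", "category"); IsolationForest is only one possible detector.

    # Rough check of how "unusual" category C looks relative to the labeled data,
    # along the lines of the anomaly idea from the comments.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    df = pd.read_csv("samples.csv")                             # hypothetical file
    ctx_cols = [c for c in df.columns if c.startswith("ctx_")]

    labeled = df[df["label"].notna()]                           # categories A and most of B
    category_c = df[df["category"] == "C"]                      # assumes a 'category' column

    iso = IsolationForest(n_estimators=200, random_state=0)
    iso.fit(labeled[ctx_cols])

    labeled_scores = iso.score_samples(labeled[ctx_cols])       # lower score = more anomalous
    c_scores = iso.score_samples(category_c[ctx_cols])

    # Use the 5th percentile of the labeled scores as a cut-off: if much more than
    # ~5% of category C falls below it, C looks structurally different from the
    # labeled data.
    cutoff = np.percentile(labeled_scores, 5)
    print(f"Fraction of category C below the labeled 5th percentile: "
          f"{(c_scores < cutoff).mean():.1%}")

If only a few percent of category C ends up below the cut-off, training on the labeled data and predicting C directly may be reasonable; a large fraction would support treating C as out-of-distribution, as the comment suggests.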











