Strategies for handling unlabeled data which is slightly different from the labeled data













Suppose you have a dataset with the following properties:

  1. The number of samples is fairly large (~100K samples).

  2. There are ~150 contextual features and one feature consisting of a text string (which can, of course, be split into any number of features depending on how the text is pre-processed). The text string is expected to have strong predictive power.

  3. The samples were divided into three categories (before you received the data) based on a few of the contextual features: category A contains ~5% of the samples, category B ~20%, and category C the remaining 75%.

  4. Category A is entirely labeled, category B is mostly labeled (only a small proportion is unlabeled), and category C is entirely unlabeled.

  5. The features used to categorize the samples are likely to influence the probability of a sample belonging to class 0 or class 1.

  6. The samples are not completely different between categories (that is to say, we're not talking cats versus dogs). For example, two very similar samples might end up in different categories because of a very small difference in a numerical feature with a large range.

The goal is to build a classifier that correctly classifies the samples. This might look like a semi-supervised learning problem, but I am worried about the structural differences between the categories. Hence my question: which strategies could be employed to build a classifier that performs well on all of the samples?

Of course I could be conservative and deal only with the labeled data, but there is great value in also being able to predict the unlabeled data (e.g. the 75% of the data in category C). That's why I'm trying to pick your brains for creative solutions!
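The semi-supervised route mentioned in the question is often implemented as self-training: fit a model on the labeled samples, pseudo-label only those unlabeled samples the model is confident about, and refit. The following is a minimal, dependency-free sketch under stated assumptions, not a recommended pipeline. The nearest-centroid model, the distance-based confidence score, the `threshold` value, and all function names are illustrative stand-ins, and the made-up 2-D rows stand in for the real ~150 contextual features plus text features.

```python
import math

def fit_centroids(X, y):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def proba_class_1(centroids, row):
    """Crude confidence score for class 1 from the distances to both centroids."""
    d0 = math.dist(row, centroids[0])
    d1 = math.dist(row, centroids[1])
    return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)

def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=10):
    """Repeatedly pseudo-label confident samples and refit.

    Returns the final model and the rows that were never pseudo-labeled.
    """
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        centroids = fit_centroids(X_lab, y_lab)
        confident, undecided = [], []
        for row in pool:
            p1 = proba_class_1(centroids, row)
            if max(p1, 1.0 - p1) >= threshold:
                confident.append((row, int(p1 >= 0.5)))
            else:
                undecided.append(row)
        if not confident:
            break  # no sample cleared the threshold, so stop
        for row, label in confident:
            X_lab.append(row)
            y_lab.append(label)
        pool = undecided
    return fit_centroids(X_lab, y_lab), pool

# Toy data (made up): labeled rows play the role of categories A/B,
# unlabeled rows play the role of category C.
X_lab = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
y_lab = [0, 0, 1, 1]
X_unlab = [[0.1, 0.0], [1.1, 0.9], [0.5, 0.55]]

model, leftover = self_train(X_lab, y_lab, X_unlab)
print(leftover)  # rows the model never became confident about
```

Given the worry about structural differences between the categories, pseudo-labels assigned to category C could be systematically wrong. It would probably be prudent to sanity-check the approach on labeled samples that sit close to category C (for instance the labeled part of category B) before trusting it.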







  • Have you considered clustering? Another approach I can think of is to train the model on the labeled data and treat new incoming data (category C) as anomalies. Without a sample of the data I can't think of a POC other than the approach used in cybersecurity, where you train on "true" data and then predict the "anomaly".
    – n1tk
    Apr 5 at 4:10











  • @n1tk Would you mind elaborating on your thoughts on using clustering? I'm struggling to find a way to actually build (and evaluate) a classifier using clustering. And yes, your other idea, the anomaly approach, is pretty much what I might end up doing if nothing better pops up.
    – Mathias
    Apr 6 at 10:16







  • What do you mean by unlabeled? Isn't the data already labeled as category C? If so, what is the problem? The category? Could you provide us with example data (even if it is fake, as long as it has the same structure as your data)?
    – Pedro Henrique Monforte
    Apr 9 at 22:43










  • @PedroHenriqueMonforte Sorry for not making myself clear. The task is to predict class 0 or class 1. All of the data is divided into categories A, B, and C, but only part of the data is labeled in terms of class 0 or class 1.
    – Mathias
    Apr 11 at 7:10
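The anomaly idea raised in the comments could be sketched as follows: summarize the labeled ("true") data, score each new sample by how far it falls from that summary, and only hand familiar-looking samples to the classifier. This is a minimal sketch under assumptions, not n1tk's actual method. The per-feature z-score, the `threshold` of 3.0, the function names, and the toy numbers are all hypothetical, and in practice a dedicated novelty-detection model would likely be used instead.

```python
import statistics

def fit_reference(X_labeled):
    """Per-feature mean and standard deviation of the labeled ("true") data."""
    columns = list(zip(*X_labeled))
    means = [statistics.fmean(col) for col in columns]
    # Guard against constant features, whose standard deviation is 0.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def novelty_score(row, means, stds):
    """Largest absolute z-score across features; higher means less familiar."""
    return max(abs(v - m) / s for v, m, s in zip(row, means, stds))

def split_by_novelty(X_new, means, stds, threshold=3.0):
    """Separate rows that resemble the labeled data from rows that do not."""
    familiar, novel = [], []
    for row in X_new:
        if novelty_score(row, means, stds) > threshold:
            novel.append(row)
        else:
            familiar.append(row)
    return familiar, novel

# Toy data (made up): two numerical features.
X_labeled = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 11.0]]
X_new = [[2.5, 11.5], [9.0, 11.0]]

means, stds = fit_reference(X_labeled)
familiar, novel = split_by_novelty(X_new, means, stds)
print(familiar, novel)
```

Samples that end up in `novel` are the ones for which a classifier trained only on labeled data is extrapolating, so they are natural candidates for abstaining or for manual labeling.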

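On the question in the comments of how a classifier could be built and evaluated using clustering, one common pattern is "cluster, then label": cluster labeled and unlabeled samples together, give each cluster the majority label of its labeled members, and measure accuracy on labeled samples that were held out. The sketch below assumes this pattern, which is not necessarily what n1tk had in mind. The tiny k-means, the hand-picked initial centroids, the function names, and the toy data are all illustrative.

```python
import math
from collections import Counter

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def kmeans(points, init_centroids, n_iter=20):
    """Plain k-means with fixed initial centroids; returns the final centroids."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster received no points.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

def label_clusters(centroids, X_lab, y_lab):
    """Map each cluster to the majority label among its labeled members."""
    votes = {}
    for row, label in zip(X_lab, y_lab):
        votes.setdefault(nearest(row, centroids), Counter())[label] += 1
    return {k: counter.most_common(1)[0][0] for k, counter in votes.items()}

def predict(row, centroids, cluster_to_label):
    """Label of the row's cluster, or None if that cluster has no labeled member."""
    return cluster_to_label.get(nearest(row, centroids))

# Toy data (made up): a few labeled rows, some unlabeled rows, and a labeled hold-out set.
X_train_lab, y_train_lab = [[0.0, 0.0], [1.0, 1.0]], [0, 1]
X_unlab = [[0.1, 0.2], [0.2, 0.0], [0.9, 1.1], [1.2, 0.9]]
X_holdout, y_holdout = [[0.1, 0.1], [1.1, 1.0]], [0, 1]

centroids = kmeans(X_train_lab + X_unlab, init_centroids=[[0.0, 0.0], [1.0, 1.0]])
cluster_to_label = label_clusters(centroids, X_train_lab, y_train_lab)
predictions = [predict(row, centroids, cluster_to_label) for row in X_holdout]
accuracy = sum(p == t for p, t in zip(predictions, y_holdout)) / len(y_holdout)
print(cluster_to_label, accuracy)
```

Because the hold-out set can only be drawn from labeled data (categories A and B here), the resulting accuracy says little about category C, which is exactly the concern raised in the question.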














machine-learning python classification
















asked Apr 4 at 9:37









Mathias






