Train classifier on balanced dataset and apply on imbalanced dataset? Unicorn Meta Zoo #1: Why another podcast? Announcing the arrival of Valued Associate #679: Cesar Manara 2019 Moderator Election Q&A - Questionnaire 2019 Community Moderator Election ResultsWhy real-world output of my classifier has similar label ratio to training data?Python : How to use Multinomial Logistic Regression using SKlearnBinary Classification on small dataset < 200 samplesImbalanced classification data with a top decile conversion metricWhich Classification Metrics Are Appropriate For Each Class Distribution Scenario?Creating Balanced Dataset Using ScikitsUnbalanced multi-class : distribution might change as more data come inIs there any way how to make samples balanced?Classifying on imbalanced datasetCreate an API from EDA or ML outcome?Multilabel classifcation in sklearn with soft (fuzzy) labels
Putting Ant-Man on house arrest
What is the best way to deal with NPC-NPC combat?
Contradiction proof for inequality of P and NP?
Intern got a job offer for same salary than a long term team member
Unable to completely uninstall Zoom meeting app
Arriving in Atlanta after US Preclearance in Dublin. Will I go through TSA security in Atlanta to transfer to a connecting flight?
Protagonist's race is hidden - should I reveal it?
My bank got bought out, am I now going to have to start filing tax returns in a different state?
Why did C use the -> operator instead of reusing the . operator?
Scheduling based problem
Do I need to watch Ant-Man and the Wasp and Captain Marvel before watching Avengers: Endgame?
Has a Nobel Peace laureate ever been accused of war crimes?
What does a straight horizontal line above a few notes, after a changed tempo mean?
"Whatever a Russian does, they end up making the Kalashnikov gun"? Are there any similar proverbs in English?
What is the term for a person whose job is to place products on shelves in stores?
How would this chord from "Rocket Man" be analyzed?
Why must Chinese maps be obfuscated?
How to open locks without disable device?
As an international instructor, should I openly talk about my accent?
Is there metaphorical meaning of "aus der Haft entlassen"?
How much of a wave function must reside inside event horizon for it to be consumed by the black hole?
"My boss was furious with me and I have been fired" vs. "My boss was furious with me and I was fired"
What makes accurate emulation of old systems a difficult task?
Does Feeblemind produce an ongoing magical effect that can be dispelled?
Train classifier on balanced dataset and apply on imbalanced dataset?
Unicorn Meta Zoo #1: Why another podcast?
Announcing the arrival of Valued Associate #679: Cesar Manara
2019 Moderator Election Q&A - Questionnaire
2019 Community Moderator Election ResultsWhy real-world output of my classifier has similar label ratio to training data?Python : How to use Multinomial Logistic Regression using SKlearnBinary Classification on small dataset < 200 samplesImbalanced classification data with a top decile conversion metricWhich Classification Metrics Are Appropriate For Each Class Distribution Scenario?Creating Balanced Dataset Using ScikitsUnbalanced multi-class : distribution might change as more data come inIs there any way how to make samples balanced?Classifying on imbalanced datasetCreate an API from EDA or ML outcome?Multilabel classifcation in sklearn with soft (fuzzy) labels
$begingroup$
I have a labelled training dataset DS1 with 1000 entries. The targets (True/False) are nearly balanced. With sklearn, I have tried several algorithms, of which the GradientBoostingClassifier works best with F-Score ~0.83.
Now, I have to apply the trained classifier on an unlabelled dataset DS2 with ~ 5 million entries (and same features). However, for DS2, the target distribution is expected to be highly unbalanced.
Is this a problem? Will the model reproduce the trained target distribution from DS1 when applied on DS2?
If yes, would another algorithm be more robust?
scikit-learn data class-imbalance gbm
$endgroup$
add a comment |
$begingroup$
I have a labelled training dataset DS1 with 1000 entries. The targets (True/False) are nearly balanced. With sklearn, I have tried several algorithms, of which the GradientBoostingClassifier works best with F-Score ~0.83.
Now, I have to apply the trained classifier on an unlabelled dataset DS2 with ~ 5 million entries (and same features). However, for DS2, the target distribution is expected to be highly unbalanced.
Is this a problem? Will the model reproduce the trained target distribution from DS1 when applied on DS2?
If yes, would another algorithm be more robust?
scikit-learn data class-imbalance gbm
$endgroup$
2
$begingroup$
I think this may depend quite a bit on why your training data was balanced but your test data is not.
$endgroup$
– Ben Reiniger
Mar 5 at 22:07
$begingroup$
@BenReiniger, could you elaborate on that?
$endgroup$
– user3240855
Mar 7 at 6:58
$begingroup$
If you know the real distribution (or have an idea of it) you can use it weight training samples. (You could try using an Inner Product Detector or a Kernel Inner Product Detector if you have the probabilities of the classes, I can provide you with the module but I never used it outside of Computer Vision applications, actually no one ever did, KIPD was published 3 months ago)
$endgroup$
– Pedro Henrique Monforte
Apr 6 at 22:12
add a comment |
$begingroup$
I have a labelled training dataset DS1 with 1000 entries. The targets (True/False) are nearly balanced. With sklearn, I have tried several algorithms, of which the GradientBoostingClassifier works best with F-Score ~0.83.
Now, I have to apply the trained classifier on an unlabelled dataset DS2 with ~ 5 million entries (and same features). However, for DS2, the target distribution is expected to be highly unbalanced.
Is this a problem? Will the model reproduce the trained target distribution from DS1 when applied on DS2?
If yes, would another algorithm be more robust?
scikit-learn data class-imbalance gbm
$endgroup$
I have a labelled training dataset DS1 with 1000 entries. The targets (True/False) are nearly balanced. With sklearn, I have tried several algorithms, of which the GradientBoostingClassifier works best with F-Score ~0.83.
Now, I have to apply the trained classifier on an unlabelled dataset DS2 with ~ 5 million entries (and same features). However, for DS2, the target distribution is expected to be highly unbalanced.
Is this a problem? Will the model reproduce the trained target distribution from DS1 when applied on DS2?
If yes, would another algorithm be more robust?
scikit-learn data class-imbalance gbm
scikit-learn data class-imbalance gbm
edited Apr 7 at 17:27
Stephen Rauch♦
1,52551330
1,52551330
asked Mar 5 at 16:10
user3240855user3240855
162
162
2
$begingroup$
I think this may depend quite a bit on why your training data was balanced but your test data is not.
$endgroup$
– Ben Reiniger
Mar 5 at 22:07
$begingroup$
@BenReiniger, could you elaborate on that?
$endgroup$
– user3240855
Mar 7 at 6:58
$begingroup$
If you know the real distribution (or have an idea of it) you can use it weight training samples. (You could try using an Inner Product Detector or a Kernel Inner Product Detector if you have the probabilities of the classes, I can provide you with the module but I never used it outside of Computer Vision applications, actually no one ever did, KIPD was published 3 months ago)
$endgroup$
– Pedro Henrique Monforte
Apr 6 at 22:12
add a comment |
2
$begingroup$
I think this may depend quite a bit on why your training data was balanced but your test data is not.
$endgroup$
– Ben Reiniger
Mar 5 at 22:07
$begingroup$
@BenReiniger, could you elaborate on that?
$endgroup$
– user3240855
Mar 7 at 6:58
$begingroup$
If you know the real distribution (or have an idea of it) you can use it weight training samples. (You could try using an Inner Product Detector or a Kernel Inner Product Detector if you have the probabilities of the classes, I can provide you with the module but I never used it outside of Computer Vision applications, actually no one ever did, KIPD was published 3 months ago)
$endgroup$
– Pedro Henrique Monforte
Apr 6 at 22:12
2
2
$begingroup$
I think this may depend quite a bit on why your training data was balanced but your test data is not.
$endgroup$
– Ben Reiniger
Mar 5 at 22:07
$begingroup$
I think this may depend quite a bit on why your training data was balanced but your test data is not.
$endgroup$
– Ben Reiniger
Mar 5 at 22:07
$begingroup$
@BenReiniger, could you elaborate on that?
$endgroup$
– user3240855
Mar 7 at 6:58
$begingroup$
@BenReiniger, could you elaborate on that?
$endgroup$
– user3240855
Mar 7 at 6:58
$begingroup$
If you know the real distribution (or have an idea of it) you can use it weight training samples. (You could try using an Inner Product Detector or a Kernel Inner Product Detector if you have the probabilities of the classes, I can provide you with the module but I never used it outside of Computer Vision applications, actually no one ever did, KIPD was published 3 months ago)
$endgroup$
– Pedro Henrique Monforte
Apr 6 at 22:12
$begingroup$
If you know the real distribution (or have an idea of it) you can use it weight training samples. (You could try using an Inner Product Detector or a Kernel Inner Product Detector if you have the probabilities of the classes, I can provide you with the module but I never used it outside of Computer Vision applications, actually no one ever did, KIPD was published 3 months ago)
$endgroup$
– Pedro Henrique Monforte
Apr 6 at 22:12
add a comment |
3 Answers
3
active
oldest
votes
$begingroup$
Is this a problem?
No. not at all.
Will the model reproduce the trained target distribution from DS1 when
applied on DS2?
No, not necessarily. If
- Balanced set DS1 is a good representative of imbalanced (target) set DS2, and
- Classes are well-separated (pointed out by @BenReiniger), which holds easier in higher dimensions,
then model will generate labels with a ratio close to imbalanced DS2 not balanced DS1. Here is a visual example (drawn by myself):

As you see, in the good training set, predictions resemble the real-world ratio, even though model is trained on a balanced set, of course, if classifier does a good job of finding the decision boundaries.
$endgroup$
1
$begingroup$
As I said elsewhere, I really like this image, but I think it also here illustrates my suggestion about cleanly separable data. Note OP's comment on georg_un's answer: the model really is building in the average response rate.
$endgroup$
– Ben Reiniger
Apr 10 at 21:23
$begingroup$
@BenReiniger Thanks for your comment! it helped me think more critically. From OP's comment I could think of the 2nd case where either the training data is not a good representative or classifier is not doing well. About "clearly separable" I agree with you. However, I think the condition becomes easier to satisfy on higher dimensions, which then, classes can be sketched as well-separated areas. For example, there is too much data in image of dogs and cats to tell them apart compared to 1D data.
$endgroup$
– Esmailian
Apr 10 at 22:43
add a comment |
$begingroup$
For prediction, the GradientBoostingClassifier will only take those features in account that you fed it during training and it will then classify each observation on its own. That means that usually you don't have to worry about the target-distribution of your prediction-dataset, as long you trained your model on a sufficiently extensive training dataset.
$endgroup$
1
$begingroup$
Hm, my training data has 35% positive target ratio. If applied to DS2, the model finds ~35% positive ratio there as well, even though I'd expect it to be <5% ...
$endgroup$
– user3240855
Mar 7 at 6:58
$begingroup$
A balanced training dataset has a more or less equal distribution of the contained classes. In your case, this would be a distribution of about 50% positive. There are a lot of methods of how to generate an equal distribution by downsampling or oversampling (see this link)
$endgroup$
– georg_un
Mar 7 at 11:25
$begingroup$
At the moment there are two possibilities: either your expectation regarding the prediction data set is wrong or the classifier is inaccurate. For this reason, it is common practice to divide the labeled data into a training and a test data set (e.g. in the ratio 70:30 or 80:20) and to examine the performance of the classifier on the test data set. I would recommend doing so and then investigating your results further. Don't forget that you can print out feature importances of the GradientBoostingClassifier withfeature_importances_.
$endgroup$
– georg_un
Mar 7 at 11:26
add a comment |
$begingroup$
A GBM will ultimately try to split your data into rectangular regions and assign each one a constant predicted probability, the proportion of positive training examples in that region. So yes, on the whole the model has baked in the training sample's average response rate.
I think that effect will be lessened if your data is particularly cleanly separable: if each rectangular region is pure, and your test data just happens to be more heavily inclined toward the negative regions, then it will naturally get closer to "the right" answer.
I'm not sure about other models that would be more robust in this way...an SVM probably, not being naturally probabilistic in the first place.
If your context is downsampling, logistic regression has a well-known adjustment for exactly this problem. The same adjustment (to log-odds) seems likely to help in the GBM context as well, though I'm not aware of any analysis to back it up.
$endgroup$
add a comment |
Your Answer
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f46709%2ftrain-classifier-on-balanced-dataset-and-apply-on-imbalanced-dataset%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
Is this a problem?
No. not at all.
Will the model reproduce the trained target distribution from DS1 when
applied on DS2?
No, not necessarily. If
- Balanced set DS1 is a good representative of imbalanced (target) set DS2, and
- Classes are well-separated (pointed out by @BenReiniger), which holds easier in higher dimensions,
then model will generate labels with a ratio close to imbalanced DS2 not balanced DS1. Here is a visual example (drawn by myself):

As you see, in the good training set, predictions resemble the real-world ratio, even though model is trained on a balanced set, of course, if classifier does a good job of finding the decision boundaries.
$endgroup$
1
$begingroup$
As I said elsewhere, I really like this image, but I think it also here illustrates my suggestion about cleanly separable data. Note OP's comment on georg_un's answer: the model really is building in the average response rate.
$endgroup$
– Ben Reiniger
Apr 10 at 21:23
$begingroup$
@BenReiniger Thanks for your comment! it helped me think more critically. From OP's comment I could think of the 2nd case where either the training data is not a good representative or classifier is not doing well. About "clearly separable" I agree with you. However, I think the condition becomes easier to satisfy on higher dimensions, which then, classes can be sketched as well-separated areas. For example, there is too much data in image of dogs and cats to tell them apart compared to 1D data.
$endgroup$
– Esmailian
Apr 10 at 22:43
add a comment |
$begingroup$
Is this a problem?
No. not at all.
Will the model reproduce the trained target distribution from DS1 when
applied on DS2?
No, not necessarily. If
- Balanced set DS1 is a good representative of imbalanced (target) set DS2, and
- Classes are well-separated (pointed out by @BenReiniger), which holds easier in higher dimensions,
then model will generate labels with a ratio close to imbalanced DS2 not balanced DS1. Here is a visual example (drawn by myself):

As you see, in the good training set, predictions resemble the real-world ratio, even though model is trained on a balanced set, of course, if classifier does a good job of finding the decision boundaries.
$endgroup$
1
$begingroup$
As I said elsewhere, I really like this image, but I think it also here illustrates my suggestion about cleanly separable data. Note OP's comment on georg_un's answer: the model really is building in the average response rate.
$endgroup$
– Ben Reiniger
Apr 10 at 21:23
$begingroup$
@BenReiniger Thanks for your comment! it helped me think more critically. From OP's comment I could think of the 2nd case where either the training data is not a good representative or classifier is not doing well. About "clearly separable" I agree with you. However, I think the condition becomes easier to satisfy on higher dimensions, which then, classes can be sketched as well-separated areas. For example, there is too much data in image of dogs and cats to tell them apart compared to 1D data.
$endgroup$
– Esmailian
Apr 10 at 22:43
add a comment |
$begingroup$
Is this a problem?
No. not at all.
Will the model reproduce the trained target distribution from DS1 when
applied on DS2?
No, not necessarily. If
- Balanced set DS1 is a good representative of imbalanced (target) set DS2, and
- Classes are well-separated (pointed out by @BenReiniger), which holds easier in higher dimensions,
then model will generate labels with a ratio close to imbalanced DS2 not balanced DS1. Here is a visual example (drawn by myself):

As you see, in the good training set, predictions resemble the real-world ratio, even though model is trained on a balanced set, of course, if classifier does a good job of finding the decision boundaries.
$endgroup$
Is this a problem?
No. not at all.
Will the model reproduce the trained target distribution from DS1 when
applied on DS2?
No, not necessarily. If
- Balanced set DS1 is a good representative of imbalanced (target) set DS2, and
- Classes are well-separated (pointed out by @BenReiniger), which holds easier in higher dimensions,
then model will generate labels with a ratio close to imbalanced DS2 not balanced DS1. Here is a visual example (drawn by myself):

As you see, in the good training set, predictions resemble the real-world ratio, even though model is trained on a balanced set, of course, if classifier does a good job of finding the decision boundaries.
edited Apr 11 at 14:29
answered Apr 7 at 18:53
EsmailianEsmailian
3,771420
3,771420
1
$begingroup$
As I said elsewhere, I really like this image, but I think it also here illustrates my suggestion about cleanly separable data. Note OP's comment on georg_un's answer: the model really is building in the average response rate.
$endgroup$
– Ben Reiniger
Apr 10 at 21:23
$begingroup$
@BenReiniger Thanks for your comment! it helped me think more critically. From OP's comment I could think of the 2nd case where either the training data is not a good representative or classifier is not doing well. About "clearly separable" I agree with you. However, I think the condition becomes easier to satisfy on higher dimensions, which then, classes can be sketched as well-separated areas. For example, there is too much data in image of dogs and cats to tell them apart compared to 1D data.
$endgroup$
– Esmailian
Apr 10 at 22:43
add a comment |
1
$begingroup$
As I said elsewhere, I really like this image, but I think it also here illustrates my suggestion about cleanly separable data. Note OP's comment on georg_un's answer: the model really is building in the average response rate.
$endgroup$
– Ben Reiniger
Apr 10 at 21:23
$begingroup$
@BenReiniger Thanks for your comment! it helped me think more critically. From OP's comment I could think of the 2nd case where either the training data is not a good representative or classifier is not doing well. About "clearly separable" I agree with you. However, I think the condition becomes easier to satisfy on higher dimensions, which then, classes can be sketched as well-separated areas. For example, there is too much data in image of dogs and cats to tell them apart compared to 1D data.
$endgroup$
– Esmailian
Apr 10 at 22:43
1
1
$begingroup$
As I said elsewhere, I really like this image, but I think it also here illustrates my suggestion about cleanly separable data. Note OP's comment on georg_un's answer: the model really is building in the average response rate.
$endgroup$
– Ben Reiniger
Apr 10 at 21:23
$begingroup$
As I said elsewhere, I really like this image, but I think it also here illustrates my suggestion about cleanly separable data. Note OP's comment on georg_un's answer: the model really is building in the average response rate.
$endgroup$
– Ben Reiniger
Apr 10 at 21:23
$begingroup$
@BenReiniger Thanks for your comment! it helped me think more critically. From OP's comment I could think of the 2nd case where either the training data is not a good representative or classifier is not doing well. About "clearly separable" I agree with you. However, I think the condition becomes easier to satisfy on higher dimensions, which then, classes can be sketched as well-separated areas. For example, there is too much data in image of dogs and cats to tell them apart compared to 1D data.
$endgroup$
– Esmailian
Apr 10 at 22:43
$begingroup$
@BenReiniger Thanks for your comment! it helped me think more critically. From OP's comment I could think of the 2nd case where either the training data is not a good representative or classifier is not doing well. About "clearly separable" I agree with you. However, I think the condition becomes easier to satisfy on higher dimensions, which then, classes can be sketched as well-separated areas. For example, there is too much data in image of dogs and cats to tell them apart compared to 1D data.
$endgroup$
– Esmailian
Apr 10 at 22:43
add a comment |
$begingroup$
For prediction, the GradientBoostingClassifier will only take those features in account that you fed it during training and it will then classify each observation on its own. That means that usually you don't have to worry about the target-distribution of your prediction-dataset, as long you trained your model on a sufficiently extensive training dataset.
$endgroup$
1
$begingroup$
Hm, my training data has 35% positive target ratio. If applied to DS2, the model finds ~35% positive ratio there as well, even though I'd expect it to be <5% ...
$endgroup$
– user3240855
Mar 7 at 6:58
$begingroup$
A balanced training dataset has a more or less equal distribution of the contained classes. In your case, this would be a distribution of about 50% positive. There are a lot of methods of how to generate an equal distribution by downsampling or oversampling (see this link)
$endgroup$
– georg_un
Mar 7 at 11:25
$begingroup$
At the moment there are two possibilities: either your expectation regarding the prediction data set is wrong or the classifier is inaccurate. For this reason, it is common practice to divide the labeled data into a training and a test data set (e.g. in the ratio 70:30 or 80:20) and to examine the performance of the classifier on the test data set. I would recommend doing so and then investigating your results further. Don't forget that you can print out feature importances of the GradientBoostingClassifier withfeature_importances_.
$endgroup$
– georg_un
Mar 7 at 11:26
add a comment |
$begingroup$
For prediction, the GradientBoostingClassifier will only take those features in account that you fed it during training and it will then classify each observation on its own. That means that usually you don't have to worry about the target-distribution of your prediction-dataset, as long you trained your model on a sufficiently extensive training dataset.
$endgroup$
1
$begingroup$
Hm, my training data has 35% positive target ratio. If applied to DS2, the model finds ~35% positive ratio there as well, even though I'd expect it to be <5% ...
$endgroup$
– user3240855
Mar 7 at 6:58
$begingroup$
A balanced training dataset has a more or less equal distribution of the contained classes. In your case, this would be a distribution of about 50% positive. There are a lot of methods of how to generate an equal distribution by downsampling or oversampling (see this link)
$endgroup$
– georg_un
Mar 7 at 11:25
$begingroup$
At the moment there are two possibilities: either your expectation regarding the prediction data set is wrong or the classifier is inaccurate. For this reason, it is common practice to divide the labeled data into a training and a test data set (e.g. in the ratio 70:30 or 80:20) and to examine the performance of the classifier on the test data set. I would recommend doing so and then investigating your results further. Don't forget that you can print out feature importances of the GradientBoostingClassifier withfeature_importances_.
$endgroup$
– georg_un
Mar 7 at 11:26
add a comment |
$begingroup$
For prediction, the GradientBoostingClassifier will only take those features in account that you fed it during training and it will then classify each observation on its own. That means that usually you don't have to worry about the target-distribution of your prediction-dataset, as long you trained your model on a sufficiently extensive training dataset.
$endgroup$
For prediction, the GradientBoostingClassifier will only take those features in account that you fed it during training and it will then classify each observation on its own. That means that usually you don't have to worry about the target-distribution of your prediction-dataset, as long you trained your model on a sufficiently extensive training dataset.
answered Mar 5 at 17:21
georg_ungeorg_un
318111
318111
1
$begingroup$
Hm, my training data has 35% positive target ratio. If applied to DS2, the model finds ~35% positive ratio there as well, even though I'd expect it to be <5% ...
$endgroup$
– user3240855
Mar 7 at 6:58
$begingroup$
A balanced training dataset has a more or less equal distribution of the contained classes. In your case, this would be a distribution of about 50% positive. There are a lot of methods of how to generate an equal distribution by downsampling or oversampling (see this link)
$endgroup$
– georg_un
Mar 7 at 11:25
$begingroup$
At the moment there are two possibilities: either your expectation regarding the prediction data set is wrong or the classifier is inaccurate. For this reason, it is common practice to divide the labeled data into a training and a test data set (e.g. in the ratio 70:30 or 80:20) and to examine the performance of the classifier on the test data set. I would recommend doing so and then investigating your results further. Don't forget that you can print out feature importances of the GradientBoostingClassifier withfeature_importances_.
$endgroup$
– georg_un
Mar 7 at 11:26
add a comment |
1
$begingroup$
Hm, my training data has 35% positive target ratio. If applied to DS2, the model finds ~35% positive ratio there as well, even though I'd expect it to be <5% ...
$endgroup$
– user3240855
Mar 7 at 6:58
$begingroup$
A balanced training dataset has a more or less equal distribution of the contained classes. In your case, this would be a distribution of about 50% positive. There are a lot of methods of how to generate an equal distribution by downsampling or oversampling (see this link)
$endgroup$
– georg_un
Mar 7 at 11:25
$begingroup$
At the moment there are two possibilities: either your expectation regarding the prediction data set is wrong or the classifier is inaccurate. For this reason, it is common practice to divide the labeled data into a training and a test data set (e.g. in the ratio 70:30 or 80:20) and to examine the performance of the classifier on the test data set. I would recommend doing so and then investigating your results further. Don't forget that you can print out feature importances of the GradientBoostingClassifier withfeature_importances_.
$endgroup$
– georg_un
Mar 7 at 11:26
1
1
$begingroup$
Hm, my training data has 35% positive target ratio. If applied to DS2, the model finds ~35% positive ratio there as well, even though I'd expect it to be <5% ...
$endgroup$
– user3240855
Mar 7 at 6:58
$begingroup$
Hm, my training data has 35% positive target ratio. If applied to DS2, the model finds ~35% positive ratio there as well, even though I'd expect it to be <5% ...
$endgroup$
– user3240855
Mar 7 at 6:58
$begingroup$
A balanced training dataset has a more or less equal distribution of the contained classes. In your case, this would be a distribution of about 50% positive. There are a lot of methods of how to generate an equal distribution by downsampling or oversampling (see this link)
$endgroup$
– georg_un
Mar 7 at 11:25
$begingroup$
A balanced training dataset has a more or less equal distribution of the contained classes. In your case, this would be a distribution of about 50% positive. There are a lot of methods of how to generate an equal distribution by downsampling or oversampling (see this link)
$endgroup$
– georg_un
Mar 7 at 11:25
$begingroup$
At the moment there are two possibilities: either your expectation regarding the prediction data set is wrong or the classifier is inaccurate. For this reason, it is common practice to divide the labeled data into a training and a test data set (e.g. in the ratio 70:30 or 80:20) and to examine the performance of the classifier on the test data set. I would recommend doing so and then investigating your results further. Don't forget that you can print out feature importances of the GradientBoostingClassifier with
feature_importances_.$endgroup$
– georg_un
Mar 7 at 11:26
$begingroup$
At the moment there are two possibilities: either your expectation regarding the prediction data set is wrong or the classifier is inaccurate. For this reason, it is common practice to divide the labeled data into a training and a test data set (e.g. in the ratio 70:30 or 80:20) and to examine the performance of the classifier on the test data set. I would recommend doing so and then investigating your results further. Don't forget that you can print out feature importances of the GradientBoostingClassifier with
feature_importances_.$endgroup$
– georg_un
Mar 7 at 11:26
add a comment |
$begingroup$
A GBM will ultimately try to split your data into rectangular regions and assign each one a constant predicted probability, the proportion of positive training examples in that region. So yes, on the whole the model has baked in the training sample's average response rate.
I think that effect will be lessened if your data is particularly cleanly separable: if each rectangular region is pure, and your test data just happens to be more heavily inclined toward the negative regions, then it will naturally get closer to "the right" answer.
I'm not sure about other models that would be more robust in this way...an SVM probably, not being naturally probabilistic in the first place.
If your context is downsampling, logistic regression has a well-known adjustment for exactly this problem. The same adjustment (to log-odds) seems likely to help in the GBM context as well, though I'm not aware of any analysis to back it up.
$endgroup$
add a comment |
$begingroup$
A GBM will ultimately try to split your data into rectangular regions and assign each one a constant predicted probability, the proportion of positive training examples in that region. So yes, on the whole the model has baked in the training sample's average response rate.
I think that effect will be lessened if your data is particularly cleanly separable: if each rectangular region is pure, and your test data just happens to be more heavily inclined toward the negative regions, then it will naturally get closer to "the right" answer.
I'm not sure about other models that would be more robust in this way...an SVM probably, not being naturally probabilistic in the first place.
If your context is downsampling, logistic regression has a well-known adjustment for exactly this problem. The same adjustment (to log-odds) seems likely to help in the GBM context as well, though I'm not aware of any analysis to back it up.
$endgroup$
add a comment |
$begingroup$
A GBM will ultimately try to split your data into rectangular regions and assign each one a constant predicted probability, the proportion of positive training examples in that region. So yes, on the whole the model has baked in the training sample's average response rate.
I think that effect will be lessened if your data is particularly cleanly separable: if each rectangular region is pure, and your test data just happens to be more heavily inclined toward the negative regions, then it will naturally get closer to "the right" answer.
I'm not sure about other models that would be more robust in this way...an SVM probably, not being naturally probabilistic in the first place.
If your context is downsampling, logistic regression has a well-known adjustment for exactly this problem. The same adjustment (to log-odds) seems likely to help in the GBM context as well, though I'm not aware of any analysis to back it up.
$endgroup$
A GBM will ultimately try to split your data into rectangular regions and assign each one a constant predicted probability, the proportion of positive training examples in that region. So yes, on the whole the model has baked in the training sample's average response rate.
I think that effect will be lessened if your data is particularly cleanly separable: if each rectangular region is pure, and your test data just happens to be more heavily inclined toward the negative regions, then it will naturally get closer to "the right" answer.
I'm not sure about other models that would be more robust in this way...an SVM probably, not being naturally probabilistic in the first place.
If your context is downsampling, logistic regression has a well-known adjustment for exactly this problem. The same adjustment (to log-odds) seems likely to help in the GBM context as well, though I'm not aware of any analysis to back it up.
answered Mar 7 at 21:23
Ben ReinigerBen Reiniger
458212
458212
add a comment |
add a comment |
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f46709%2ftrain-classifier-on-balanced-dataset-and-apply-on-imbalanced-dataset%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
$begingroup$
I think this may depend quite a bit on why your training data was balanced but your test data is not.
$endgroup$
– Ben Reiniger
Mar 5 at 22:07
$begingroup$
@BenReiniger, could you elaborate on that?
$endgroup$
– user3240855
Mar 7 at 6:58
$begingroup$
If you know the real distribution (or have an idea of it) you can use it weight training samples. (You could try using an Inner Product Detector or a Kernel Inner Product Detector if you have the probabilities of the classes, I can provide you with the module but I never used it outside of Computer Vision applications, actually no one ever did, KIPD was published 3 months ago)
$endgroup$
– Pedro Henrique Monforte
Apr 6 at 22:12