Imbalanced data causing mis-classification on multiclass dataset
I am working on text classification where I have 39 categories/classes and 8.5 million records (both the data and the number of categories will grow in the future).
The structure of my data is as follows:
| product_title                                | Key_value_pairs                                                                              | taxonomy_id |
|----------------------------------------------|----------------------------------------------------------------------------------------------|-------------|
| Samsung S7 Edge                              | Color:black, Display Size:5.5 inch, Internal Storage:128 GB, RAM:4 GB, Primary Camera:12 MP   | 211         |
| Case cover Honor 8                           | Color:transparent, Height:15 mm, Width:22 mm                                                  | 212         |
| Ruggers Men's T-Shirt                        | Size:L, Ideal for:men, Fit:regular, Sleeve:half sleeve                                        | 111         |
| Optimum Nutrition Gold Standard Whey Protein | Flavor:chocolate, Form:powder, Size:34 gm                                                     | 311         |
Data distribution is not normal; it is highly imbalanced:
-------------------------
| taxonomy_id | count |
-------------------------
111 | 851750
112 | 355592
113 | 379433
114 | 23138
115 | 117735
116 | 145757
117 | 1339471
121 | 394026
122 | 193433
123 | 78299
124 | 111962
131 | 1776
132 | 4425
133 | 908
134 | 23062
141 | 22713
142 | 42073
211 | 7892
212 | 1574744
221 | 1047
222 | 397515
223 | 53009
231 | 1227
232 | 7683
251 | 739
252 | 327
253 | 38974
254 | 25
311 | 2901
321 | 7126
412 | 856
421 | 697802
422 | 414855
423 | 17750
425 | 1240
427 | 658
429 | 1058
431 | 20760
441 | 257
As you can see, the classes are highly imbalanced, and this leads to misclassifications.

Steps I have performed so far:

1) Merged the product_title and key_value_pairs columns, then removed stop words and special characters and performed stemming.

2) Built a pipeline with TfidfVectorizer() and LinearSVC():
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

vectorizerPipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('classification', OneVsRestClassifier(LinearSVC(penalty='l2', loss='hinge'))),
])
After this I fit the pipeline and stored the classifier with pickle:

prd = vectorizerPipe.fit(df.loc[:, 'description'], df.loc[:, 'taxonomy_id'])

On the testing side I repeat step 1 as above, then load the pickle and call predict:

preds = clf.predict([testData])  # 'clf' is the unpickled pipeline; renamed from 'pd', which shadows the pandas alias
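For reference, here is the whole flow above as one runnable sketch (the cleanup/stemming details and the file paths are illustrative, not my exact code):

import pickle
import re

import pandas as pd
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv('products.csv')  # illustrative path

# Step 1: merge title and attributes, strip special characters, stem.
stemmer = PorterStemmer()
def clean(text):
    tokens = re.sub(r'[^a-z0-9 ]', ' ', text.lower()).split()
    return ' '.join(stemmer.stem(t) for t in tokens)

df['description'] = (df['product_title'] + ' ' + df['Key_value_pairs']).apply(clean)

# Step 2: TF-IDF features into a one-vs-rest linear SVM.
vectorizerPipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('classification', OneVsRestClassifier(LinearSVC(penalty='l2', loss='hinge'))),
])
vectorizerPipe.fit(df['description'], df['taxonomy_id'])

# Persist, reload, predict.
with open('classifier.pkl', 'wb') as f:
    pickle.dump(vectorizerPipe, f)
with open('classifier.pkl', 'rb') as f:
    clf = pickle.load(f)
preds = clf.predict([clean('Ultimate Nutrition Prostar 100% Whey Protein')])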
Issues I am facing:

A lot of products are being misclassified into other categories.

Example: Ultimate Nutrition Prostar 100% Whey Protein should be classified into category 311, but my classifier classifies it as 222, which is completely wrong.

I am not sure whether to use TfidfVectorizer() or HashingVectorizer(). Can you help me select one of them, along with suitable parameters?

The algorithm I am using is LinearSVC. Is it a good choice for multiclass classification problems with a large amount of data, or should I use a different algorithm?

As my data is highly imbalanced I tried random undersampling. The results improved, but they were still not up to the mark.
Also, I am not sure whether this is the right way to perform random undersampling:

from imblearn.pipeline import make_pipeline as make_pipeline_imb
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

pipe = make_pipeline_imb(
    HashingVectorizer(lowercase=True),
    RandomUnderSampler(  # 'ratio' is called 'sampling_strategy' in newer imblearn versions
        ratio={111: 405805, 112: 170431, 113: 241709, 114: 8341, 115: 50328,
               116: 89445, 117: 650020, 121: 320803, 122: 162557, 123: 66156,
               124: 36276, 131: 1196, 132: 3365, 133: 818, 134: 15001,
               141: 6145, 142: 31783, 211: 24728, 212: 100000, 221: 791,
               222: 8000, 223: 35406, 231: 785, 232: 3000, 251: 477,
               252: 127, 253: 29563, 254: 33, 311: 2072, 321: 5370,
               412: 652, 421: 520973, 422: 99171, 423: 16786, 425: 730,
               427: 198, 429: 1249, 431: 13793, 441: 160},
        random_state=1),
    OneVsRestClassifier(LinearSVC(penalty='l2', loss='hinge')))

I am new to machine learning, so I have used this approach for text classification. If my approach is wrong, please correct me with the right one.
(It would be great if you could give suggestions or solutions with examples, as that will help me understand better.)
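One alternative I have come across is to let the classifier weight the classes instead of resampling; a sketch using scikit-learn's built-in class_weight option (whether this would work better here is part of what I am asking):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# class_weight='balanced' re-weights each class by n_samples / (n_classes * count),
# so rare taxonomy_ids contribute more to the hinge loss than frequent ones.
weighted_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('clf', LinearSVC(class_weight='balanced')),
])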
EDIT 1

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

RndmFrst = RandomForestClassifier(n_estimators=100, max_depth=20, max_features=5000, n_jobs=-1)
LogReg = LogisticRegression()
voting = VotingClassifier(estimators=[('LogReg', LogReg), ('RndmFrst', RndmFrst)], voting='soft', n_jobs=-1)
pipe = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1, 4), max_features=50000)), ('clf', voting)])
pipe = pipe.fit(df.loc[:, 'description'], df.loc[:, 'taxonomy_id'])
Preds = pipe.predict(test_data)
machine-learning python classification scikit-learn multiclass-classification
asked Feb 16 '18 at 11:09 by outlier
edited Mar 19 at 17:00 by Blenzus
I just saw that you tried under-sampling. Just FYI, stratified k-fold cross-validation in scikit-learn also takes the class distribution into account.
– Kasra Manshaei
Feb 16 '18 at 12:20
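(For context, a minimal sketch of what this comment refers to; stratified folds keep each class's proportions roughly constant across splits. The pipeline and column names are the ones from the question:)

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Every fold mirrors the overall class distribution, so rare classes
# are represented in both the training and validation parts.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(vectorizerPipe, df['description'], df['taxonomy_id'],
                         cv=cv, scoring='f1_macro')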
1 Answer
Nice question!
Some Remarks
For imbalanced data you have different approaches. The most well-established one is resampling (oversampling the small classes / undersampling the large classes). The other is to make your classification hierarchical, i.e. classify the large classes against all others in a first step, and then classify the small classes in a second step (the two classifiers are not supposed to be the same; try model-selection strategies to find the best).
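A minimal sketch of the hierarchical idea, assuming X_train/y_train are the vectorized features and taxonomy labels (the 50,000-sample threshold and the estimators are illustrative choices, not a prescription):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stage 1: the large classes compete against a single catch-all "other" label.
values, counts = np.unique(y_train, return_counts=True)
large = values[counts >= 50000]
y_stage1 = np.where(np.isin(y_train, large), y_train, -1)  # -1 = "other"
stage1 = LogisticRegression(max_iter=1000).fit(X_train, y_stage1)

# Stage 2: a separate classifier trained only on the small classes.
small_mask = ~np.isin(y_train, large)
stage2 = LogisticRegression(max_iter=1000).fit(X_train[small_mask], y_train[small_mask])

def predict_hierarchical(X):
    pred = stage1.predict(X)
    other = pred == -1
    if other.any():
        pred[other] = stage2.predict(X[other])  # route "other" rows to stage 2
    return pred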
Practical Answer
I have gotten acceptable results without resampling the data! So try that first, but later improve it using resampling methods (statistically they are pretty much a must).

TF-IDF is good for such a problem. Classifiers should be selected through model selection, but my experience shows that Logistic Regression and Random Forest work well on this specific problem (however, that is just practical experience).
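If you want to do that model selection systematically, a small grid search over the vectorizer and classifier is one way (a sketch; the grid values are illustrative, and train_texts/train_labels stand for your data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('clf', LogisticRegression(max_iter=1000))])
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 4)],
    'tfidf__max_features': [20000, 50000],
    'clf__C': [0.1, 1.0, 10.0],
}
# Macro-F1 weighs every class equally, which matters with this imbalance.
search = GridSearchCV(pipe, param_grid, cv=3, scoring='f1_macro', n_jobs=-1)
search.fit(train_texts, train_labels)
print(search.best_params_)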
You may follow the code below, as it worked quite well; then you can modify it to try to improve your results:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.read_csv(...)
test = pd.read_csv(...)

# TF-IDF bag-of-words model for the text corpus: up to 4-grams and 50k features
vec = TfidfVectorizer(ngram_range=(1, 4), max_features=50000)
TrainX = vec.fit_transform(train)
TestX = vec.transform(test)

# Initializing base estimators
clf1 = LogisticRegression()
clf2 = RandomForestClassifier(n_estimators=100, max_depth=20, max_features=5000, n_jobs=-1)

# Soft voting classifier
clf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2)], voting='soft', n_jobs=-1)
clf = clf.fit(TrainX, TrainY)
preds = clf.predict(TestX)  # class labels; with 39 classes, don't slice predict_proba at [:, 1]
Please note that the code is abstract, so TrainX, TrainY, TestX, etc. should be properly defined by you.
Hints

Be careful about what counts as a stop word. In practice many people (including myself!) make the mistake of removing stop words according to pre-defined lists. That is not right!

Stop words are corpus-sensitive, so you need to remove them according to information-theoretic criteria. To keep it simple: TF-IDF largely ignores your corpus-specific stop words anyway, because terms that appear in almost every document receive a near-minimal IDF weight. (If you need more explanation, let me know and I will update my answer.)
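You can see this directly in scikit-learn: terms that occur in nearly every document get the minimum IDF, and max_df lets you drop them outright (toy corpus for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['color black size large',
        'color white size small',
        'color red form powder']
vec = TfidfVectorizer().fit(docs)
# 'color' appears in every document, so its IDF is the minimum possible.
for term in ['color', 'powder']:
    print(term, vec.idf_[vec.vocabulary_[term]])

# Or discard corpus-specific stop words explicitly: ignore terms in >90% of docs.
vec_strict = TfidfVectorizer(max_df=0.9)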
VotingClassifier is a meta-learning strategy in the family of ensemble methods. It takes advantage of several different classifiers. Try them, as they work pretty well in practice.

With soft voting, the scheme averages the class probabilities predicted by the individual classifiers and returns the class with the highest average probability. So it is kind of a democratic approach, as opposed to a dictatorship ;)
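A toy illustration of the soft-voting arithmetic (numbers made up):

import numpy as np

# Two base models' predicted probabilities for one sample over 3 classes.
p_logreg = np.array([0.2, 0.5, 0.3])
p_forest = np.array([0.4, 0.3, 0.3])

avg = (p_logreg + p_forest) / 2   # soft voting: mean class probability
print(avg.argmax())               # class 1 wins, with mean probability 0.4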
Hope it helps!
answered Feb 16 '18 at 12:19 by Kasra Manshaei
edited Mar 20 at 11:48 by Blenzus
Welcome! For intuitive resampling you may refer to the link I put for resampling; there are step-by-step instructions.
– Kasra Manshaei
Feb 16 '18 at 12:50

I am trying your solution; if I get stuck anywhere, or in case of any doubts, I will post in the comment section. I hope that will be fine with you!
– outlier
Feb 16 '18 at 12:57

Sure my friend ... good luck!
– Kasra Manshaei
Feb 16 '18 at 12:57

If it worked, then you may accept the answer :)
– Kasra Manshaei
Feb 16 '18 at 12:58

@outlier since the answer has addressed your issue, kindly accept (and possibly upvote) it; answers take up valuable time for (volunteer) respondents.
– desertnaut
Feb 17 '18 at 18:59