


Imbalanced data causing mis-classification on multiclass dataset



























$begingroup$


I am working on a text classification problem with 39 categories (classes) and 8.5 million records. (Both the data and the number of categories will grow in the future.)



The data is structured as follows:



----------------------------------------------------------------------------------------
| product_title          | Key_value_pairs                               | taxonomy_id |
----------------------------------------------------------------------------------------
| Samsung S7 Edge        | Color:black,Display Size:5.5 inch,Internal    | 211         |
|                        | Storage:128 GB, RAM:4 GB,Primary Camera:12 MP |             |
| Case cover Honor 8     | Color:transparent,Height:15 mm,width:22 mm    | 212         |
| Ruggers Men's T-Shirt  | Size:L,ideal for:men,fit:regular,             | 111         |
|                        | sleeve:half sleeve                            |             |
| Optimum Nutrition Gold | Flavor:chocolate,form:powder,size:34 gm       | 311         |
| Standard Whey Protein  |                                               |             |
----------------------------------------------------------------------------------------


The class distribution is far from uniform; it is highly imbalanced:



-------------------------
| taxonomy_id | count |
-------------------------
111 | 851750
112 | 355592
113 | 379433
114 | 23138
115 | 117735
116 | 145757
117 | 1339471
121 | 394026
122 | 193433
123 | 78299
124 | 111962
131 | 1776
132 | 4425
133 | 908
134 | 23062
141 | 22713
142 | 42073
211 | 7892
212 | 1574744
221 | 1047
222 | 397515
223 | 53009
231 | 1227
232 | 7683
251 | 739
252 | 327
253 | 38974
254 | 25
311 | 2901
321 | 7126
412 | 856
421 | 697802
422 | 414855
423 | 17750
425 | 1240
427 | 658
429 | 1058
431 | 20760
441 | 257


As you can see, the classes are highly imbalanced, which leads to misclassifications.



Steps I have performed so far



1) Merge the product_title and Key_value_pairs columns, remove stop words and special characters, and perform stemming.



2) Build a pipeline of TfidfVectorizer() and LinearSVC():



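from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
vectorizerPipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('classification', OneVsRestClassifier(LinearSVC(penalty='l2', loss='hinge'))),
])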


After this, I fit the pipeline and stored the classifier in a pickle file:



prd = vectorizerPipe.fit(df.loc[:, 'description'], df.loc[:, 'taxonomy_id'])


On the testing side, I repeat step 1 as described above, then load the pickle and call the predict function:



pd = cl.predict([testData])


Issues I am facing




  1. A lot of products are being mis-classified into some other categories



    Example: Ultimate Nutrition Prostar 100% Whey Protein should be classified into category 311 but my classifier is classifying it as 222 which is completely wrong.



  2. I am not sure whether to use TfidfVectorizer() or HashingVectorizer(). Can you help me choose one of these, along with its parameters?


  3. The algorithm I am using is LinearSVC. Is it a good choice for multi-class classification problems with a large amount of data, or should I use a different algorithm?



  4. As my data is highly imbalanced, I tried random undersampling. The results improved, but they are still not up to the mark.
    I am also not sure whether this is the right way to perform random undersampling:



    pipe = make_pipeline_imb(
    HashingVectorizer(lowercase=True),
    RandomUnderSampler(ratio={111: 405805, 112: 170431, 113: 241709, 114: 8341, 115: 50328, 116: 89445, 117: 650020, 121: 320803, 122: 162557, 123: 66156, 124: 36276, 131: 1196, 132: 3365, 133: 818, 134: 15001, 141: 6145, 142: 31783, 211: 24728, 212: 100000, 221: 791, 222: 8000, 223: 35406, 231: 785, 232: 3000, 251: 477, 252: 127, 253: 29563, 254: 33, 311: 2072, 321: 5370, 412: 652, 421: 520973, 422: 99171, 423: 16786, 425: 730, 427: 198, 429: 1249, 431: 13793, 441: 160}, random_state=1),
    OneVsRestClassifier(LinearSVC(penalty='l2', loss='hinge')))


  5. I am new to machine learning, so this is the approach I came up with for text classification. If my approach is wrong, please correct me and point me to the right one.


(It would be great if you could give suggestions or solutions with examples, as that will help me understand better.)



***EDIT-1****



RndmFrst = RandomForestClassifier(n_estimators=100, max_depth=20, max_features=5000,n_jobs=-1)
LogReg = LogisticRegression()
voting = VotingClassifier(estimators=[('LogReg ', LogReg), ('RndmFrst', RndmFrst)], voting='soft', n_jobs=-1)

pipe = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1,4), max_features=50000)), ('clf', voting)])

pipe = pipe.fit(df.loc[:,'description'], df.loc[:,'taxonomy_id'])
Preds = pipe.predict(test_data)




















$endgroup$











  • $begingroup$
    I just saw that you tried under-sampling. Just FYI, stratified K-fold cross-validation in scikit-learn also takes the class distribution into account.
    $endgroup$
    – Kasra Manshaei
    Feb 16 '18 at 12:20
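
To make the stratification idea in the comment above concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation (in practice you would use its StratifiedKFold class); the helper name `stratified_folds` and the toy labels are hypothetical, chosen only for illustration. Each class is dealt round-robin across the folds, so every fold keeps roughly the original class mix.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold so every fold gets a similar class mix."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        # Deal the indices of each class round-robin across the folds
        for pos, idx in enumerate(by_class[label]):
            folds[pos % n_splits].append(idx)
    return folds

# Toy labels: class 212 is three times as frequent as class 311
labels = [212] * 9 + [311] * 3
folds = stratified_folds(labels, n_splits=3)
for k, fold in enumerate(folds):
    print(k, Counter(labels[i] for i in fold))
```

With a plain (non-stratified) split, a rare class such as 254 in the question could easily be missing from some folds entirely, which makes the per-class scores on those folds meaningless.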






































machine-learning python classification scikit-learn multiclass-classification














edited Mar 19 at 17:00 by Blenzus

asked Feb 16 '18 at 11:09 by outlier





















1 Answer






























$begingroup$

Nice question!



Some Remarks



There are several approaches to imbalanced data. The most well-established one is resampling (oversampling small classes or undersampling large classes). Another is to make your classification hierarchical, i.e. first classify the large classes against all others, and then classify the small classes in a second step (the classifiers need not be the same; try model selection strategies to find the best ones).
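
As a rough illustration of the resampling idea, here is a minimal pure-Python sketch that undersamples large classes and oversamples small ones toward a common per-class target. The function name `resample_to_target` and the toy texts are hypothetical, made up for this example; on real feature matrices a library such as imbalanced-learn would normally do this job.

```python
import random
from collections import Counter, defaultdict

def resample_to_target(texts, labels, target, seed=0):
    """Return (texts, labels) with exactly `target` rows per class.

    Classes larger than `target` are randomly undersampled (without
    replacement); smaller classes are randomly oversampled (with replacement).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)

    out_texts, out_labels = [], []
    for label, rows in sorted(by_class.items()):
        if len(rows) >= target:
            chosen = rng.sample(rows, target)
        else:
            extra = [rng.choice(rows) for _ in range(target - len(rows))]
            chosen = rows + extra
        out_texts.extend(chosen)
        out_labels.extend([label] * target)
    return out_texts, out_labels

# Toy data: class 212 dominates, class 311 is rare (IDs borrowed from the question)
texts = [f"case cover {i}" for i in range(8)] + ["whey protein", "protein powder"]
labels = [212] * 8 + [311] * 2

new_texts, new_labels = resample_to_target(texts, labels, target=4)
print(Counter(new_labels))
```

Resample only the training split; the test data should keep its original distribution so that the evaluation reflects what the classifier will see in production.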



Practical Answer



I have gotten acceptable results without resampling the data, so try that first, then improve on it with resampling methods (statistically, they are more or less a must).



TF-IDF is a good fit for this kind of problem. Classifiers should be chosen through model selection, but in my experience Logistic Regression and Random Forest work well on this specific problem (this is only practical experience, though).



You may follow the code below, which worked well for me, and then try modifying it to improve your results:



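import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.read_csv(...)
test = pd.read_csv(...)

# Column names follow the question: text in 'description', label in 'taxonomy_id'
TrainY = train['taxonomy_id']

# TF-IDF bag-of-words model for the text corpus: up to 4-grams and 50k features
vec = TfidfVectorizer(ngram_range=(1, 4), max_features=50000)
TrainX = vec.fit_transform(train['description'])
TestX = vec.transform(test['description'])

# Initializing base estimators
clf1 = LogisticRegression()
clf2 = RandomForestClassifier(n_estimators=100, max_depth=20, max_features=5000, n_jobs=-1)

# Soft voting classifier over both base estimators
clf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2)], voting='soft', n_jobs=-1)
clf = clf.fit(TrainX, TrainY)

# Predicted class label for each test row (use predict_proba for per-class probabilities)
preds = clf.predict(TestX)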


Please note that the code is abstract, so TrainX, TrainY, TestX, etc. should be properly defined for your own data.



Hints



Be careful about what counts as a stop word. In practice, many people (including myself!) have made the mistake of removing stop words according to pre-defined lists. That is not right!



Stop words are corpus-sensitive, so you need to remove them according to information-theoretic concepts. (To keep it simple: TF-IDF largely down-weights your corpus-specific stop words on its own. If you need more explanation, please let me know and I will update my answer.)
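
To see why TF-IDF takes care of corpus-specific stop words, here is a small pure-Python sketch of the inverse document frequency part. The smoothed formula used, idf = ln((1 + n) / (1 + df)) + 1, is, as far as I know, what scikit-learn's TfidfVectorizer applies by default; the helper name `smoothed_idf` and the four toy product descriptions are hypothetical. Note that "buy" and "online" would not be on a typical English stop word list, yet they carry no information in this corpus.

```python
import math
from collections import Counter

def smoothed_idf(docs):
    """IDF per term using idf = ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document (document frequency)
        df.update(set(doc.lower().split()))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

docs = [
    "buy samsung phone online",
    "buy whey protein online",
    "buy men t-shirt online",
    "buy honor case cover online",
]
idf = smoothed_idf(docs)

# Terms sorted from least to most informative
for term in sorted(idf, key=idf.get):
    print(f"{term:10s} {idf[term]:.3f}")
```

Terms that occur in every document get the minimum weight, while terms that occur in a single document get the maximum, so the corpus itself tells you which words behave like stop words.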



VotingClassifier is a meta-learning strategy in the family of ensemble methods, which combine the strengths of different classifiers. Try them, as they work pretty well in practice.



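A voting scheme simply combines the results of the different classifiers: with soft voting, as used above, the predicted class probabilities are averaged across classifiers and the class with the highest average probability is returned. So it is a kind of democratic approach, as opposed to dictatorship ;)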
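
The following pure-Python sketch shows what soft voting amounts to for a single product: average the per-class probabilities of the base classifiers, then take the class with the highest average. The function name `soft_vote` and the probability values are hypothetical; they stand in for the predict_proba outputs of the Logistic Regression and Random Forest models.

```python
def soft_vote(prob_rows, classes):
    """Average per-class probabilities across classifiers and pick the best class."""
    n_models = len(prob_rows)
    avg = [sum(row[i] for row in prob_rows) / n_models for i in range(len(classes))]
    best = max(range(len(classes)), key=lambda i: avg[i])
    return classes[best], avg

classes = [212, 222, 311]
# Hypothetical per-class probabilities from two classifiers for one product
lr_probs = [0.10, 0.50, 0.40]  # on its own, this model would pick 222
rf_probs = [0.10, 0.20, 0.70]  # on its own, this model would pick 311

label, avg = soft_vote([lr_probs, rf_probs], classes)
print(label, avg)
```

Here the two models disagree, and the more confident one tips the average toward class 311. This is why soft voting needs base estimators that expose probabilities, and why it tends to help most when those probabilities are reasonably well calibrated.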



Hope it helps!

















$endgroup$












  • $begingroup$
    Welcome! For intuitive resampling, you may refer to the link I put for resampling. There are step-by-step instructions.
    $endgroup$
    – Kasra Manshaei
    Feb 16 '18 at 12:50










  • $begingroup$
    I am trying your solution. If I get stuck anywhere or have any doubts, I will post in the comment section. Hope that is fine with you!
    $endgroup$
    – outlier
    Feb 16 '18 at 12:57










  • $begingroup$
    sure my friend ... good luck!
    $endgroup$
    – Kasra Manshaei
    Feb 16 '18 at 12:57






  • $begingroup$
    if it worked then you may accept the answer :)
    $endgroup$
    – Kasra Manshaei
    Feb 16 '18 at 12:58










  • $begingroup$
    @outlier since the answer has addressed your issue, kindly accept (and possibly upvote) it; answers take up valuable time for (volunteer) respondents
    $endgroup$
    – desertnaut
    Feb 17 '18 at 18:59


























































6












$begingroup$

Nice question!



Some Remarks



For imbalanced data you have different approaches. Most well-established one is resampling (Oversampling small classes /underssampling large classes). The other one is to make your classification hierarchical i.e. classify large classes against all others and then classify small classes in second step (classifiers are not supposed to be the same. Try model selection strategies to find the best).



Practical Answer



I have got acceptable results without resampling the data! So try it but later improve it using resampling methods (statistically they are kind of A MUST).



TFIDF is good for such a problem. Classifiers should be selected through model selection but my experience shows that Logistic Regression and Random Forest work well on this specific problem (however it's just a practical experience).



You may follow the code below, as it worked reasonably well; then try modifying it to improve your results:



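import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.read_csv(...)
test = pd.read_csv(...)

# The column names 'text' and 'label' are placeholders; use the ones in your data
TrainY = train['label']

# TF-IDF bag-of-words model for the text corpus: up to 4-grams and 50k features
vec = TfidfVectorizer(ngram_range=(1, 4), max_features=50000)
TrainX = vec.fit_transform(train['text'])
TestX = vec.transform(test['text'])

# Initializing base estimators
# (the random forest's max_features must not exceed the number of TF-IDF features)
clf1 = LogisticRegression()
clf2 = RandomForestClassifier(n_estimators=100, max_depth=20, max_features=5000, n_jobs=-1)

# Soft voting classifier: averages the class probabilities of the base estimators
clf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2)], voting='soft', n_jobs=-1)
clf = clf.fit(TrainX, TrainY)
probs = clf.predict_proba(TestX)  # one column per class, ordered as in clf.classes_
preds = clf.predict(TestX)        # most probable class for each test document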


Please note that the code is abstract, so the inputs (TrainX, TrainY, TestX, etc.) should be properly defined for your own data.



Hints



Be careful about what counts as a stop word. In practice many people (including myself!) have made the mistake of removing stop words according to pre-defined lists. That is not right!



Stop words are corpus-sensitive, so you need to remove them according to information-theoretic concepts. To keep it simple: TF-IDF more or less ignores your corpus-specific stop words, because terms that occur in almost every document get a low weight. If you need more explanation, please let me know and I will update my answer.
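To make the stop-word point concrete, here is a small sketch that computes inverse document frequencies by hand. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is what scikit-learn documents for TfidfVectorizer with its default settings; the three-document corpus and the helper name `smoothed_idf` are invented for illustration.

```python
import math

def smoothed_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = {}
    for doc in docs:
        # Count each term once per document (document frequency)
        for term in set(doc.lower().split()):
            df[term] = df.get(term, 0) + 1
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

# Hypothetical corpus: "invoice" appears in every document, so in this
# corpus it behaves like a stop word even though no generic list contains it
docs = [
    "invoice payment overdue",
    "invoice refund request",
    "invoice delivery delayed",
]
idf = smoothed_idf(docs)
for term in sorted(idf, key=idf.get):
    print(f"{term:10s} {idf[term]:.3f}")
```

The term that occurs everywhere receives the lowest possible weight, while the rarer, more discriminative terms are weighted higher, without any pre-defined stop-word list being involved.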



VotingClassifier is a meta-learning strategy from the family of ensemble methods, which benefit from combining different classifiers. Try them, as they work pretty well in practice.



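A voting scheme combines the results of different classifiers. With hard voting, the class predicted by most classifiers wins; with soft voting (voting='soft', as in the code), the predicted class probabilities are averaged and the class with the highest average wins. So it is a kind of democratic approach as opposed to dictatorship ;)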
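The arithmetic behind voting can be sketched by hand. The probabilities below are hypothetical outputs of three classifiers for a single document with three classes; the helper names `soft_vote` and `hard_vote` are mine, and sample weights are left out to keep it short.

```python
def soft_vote(prob_rows):
    """Average the class probabilities and pick the class with the highest mean."""
    n_classes = len(prob_rows[0])
    avg = [sum(row[k] for row in prob_rows) / len(prob_rows) for k in range(n_classes)]
    winner = max(range(n_classes), key=lambda k: avg[k])
    return winner, avg

def hard_vote(prob_rows):
    """Each classifier votes for its most probable class; the majority wins
    (ties go to the lowest class index)."""
    votes = [max(range(len(row)), key=lambda k: row[k]) for row in prob_rows]
    return max(sorted(set(votes)), key=votes.count)

# Hypothetical class probabilities from three classifiers for one document
probs = [
    [0.45, 0.55, 0.00],  # slightly prefers class 1
    [0.40, 0.60, 0.00],  # slightly prefers class 1
    [0.95, 0.05, 0.00],  # strongly prefers class 0
]
soft_winner, avg = soft_vote(probs)
hard_winner = hard_vote(probs)
print("soft:", soft_winner, [round(p, 3) for p in avg])
print("hard:", hard_winner)
```

Here the two schemes disagree: two lukewarm votes win under hard voting, while one very confident classifier tips the average under soft voting. This is why soft voting only makes sense when the base classifiers produce reasonably calibrated probabilities.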



Hope it helps!

















$endgroup$

































edited Mar 20 at 11:48 by Blenzus

answered Feb 16 '18 at 12:19 by Kasra Manshaei



























