Imbalanced data causing mis-classification on multiclass dataset
I am working on text classification where I have 39 categories/classes and 8.5 million records (both the data and the number of categories will grow in the future).
The structure of my data is as follows:
| product_title                                | Key_value_pairs                                                                              | taxonomy_id |
|----------------------------------------------|----------------------------------------------------------------------------------------------|-------------|
| Samsung S7 Edge                              | Color:black, Display Size:5.5 inch, Internal Storage:128 GB, RAM:4 GB, Primary Camera:12 MP   | 211         |
| Case cover Honor 8                           | Color:transparent, Height:15 mm, Width:22 mm                                                  | 212         |
| Ruggers Men's T-Shirt                        | Size:L, Ideal for:men, Fit:regular, Sleeve:half sleeve                                        | 111         |
| Optimum Nutrition Gold Standard Whey Protein | Flavor:chocolate, Form:powder, Size:34 gm                                                     | 311         |
Data distribution is not normal; it is highly imbalanced:
-------------------------
| taxonomy_id | count |
-------------------------
111 | 851750
112 | 355592
113 | 379433
114 | 23138
115 | 117735
116 | 145757
117 | 1339471
121 | 394026
122 | 193433
123 | 78299
124 | 111962
131 | 1776
132 | 4425
133 | 908
134 | 23062
141 | 22713
142 | 42073
211 | 7892
212 | 1574744
221 | 1047
222 | 397515
223 | 53009
231 | 1227
232 | 7683
251 | 739
252 | 327
253 | 38974
254 | 25
311 | 2901
321 | 7126
412 | 856
421 | 697802
422 | 414855
423 | 17750
425 | 1240
427 | 658
429 | 1058
431 | 20760
441 | 257
As you can see, the classes are highly imbalanced, and this leads to misclassifications.

Steps I have performed so far:

1) Merged the product_title and key_value_pairs columns, then removed stop words and special characters and performed stemming.

2) Built a pipeline with TfidfVectorizer() and LinearSVC():
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

vectorizerPipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('classification', OneVsRestClassifier(LinearSVC(penalty='l2', loss='hinge'))),
])
After this I fit the pipeline and stored the classifier with pickle:

prd = vectorizerPipe.fit(df.loc[:, 'description'], df.loc[:, 'taxonomy_id'])

On the testing side I repeat step 1 as above, then load the pickle and call predict:

preds = clf.predict([testData])  # 'clf' is the unpickled pipeline; renamed from 'pd', which shadows the pandas alias
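For reference, here is the whole flow above as one runnable sketch (the cleanup/stemming details and the file paths are illustrative, not my exact code):

import pickle
import re

import pandas as pd
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv('products.csv')  # illustrative path

# Step 1: merge title and attributes, strip special characters, stem.
stemmer = PorterStemmer()
def clean(text):
    tokens = re.sub(r'[^a-z0-9 ]', ' ', text.lower()).split()
    return ' '.join(stemmer.stem(t) for t in tokens)

df['description'] = (df['product_title'] + ' ' + df['Key_value_pairs']).apply(clean)

# Step 2: TF-IDF features into a one-vs-rest linear SVM.
vectorizerPipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('classification', OneVsRestClassifier(LinearSVC(penalty='l2', loss='hinge'))),
])
vectorizerPipe.fit(df['description'], df['taxonomy_id'])

# Persist, reload, predict.
with open('classifier.pkl', 'wb') as f:
    pickle.dump(vectorizerPipe, f)
with open('classifier.pkl', 'rb') as f:
    clf = pickle.load(f)
preds = clf.predict([clean('Ultimate Nutrition Prostar 100% Whey Protein')])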
Issues I am facing:

A lot of products are being misclassified into other categories.

Example: Ultimate Nutrition Prostar 100% Whey Protein should be classified into category 311, but my classifier classifies it as 222, which is completely wrong.

I am not sure whether to use TfidfVectorizer() or HashingVectorizer(). Can you help me select one of them, along with suitable parameters?

The algorithm I am using is LinearSVC. Is it a good choice for multiclass classification problems with a large amount of data, or should I use a different algorithm?

As my data is highly imbalanced I tried random undersampling. The results improved, but they were still not up to the mark.
Also, I am not sure whether this is the right way to perform random undersampling:

from imblearn.pipeline import make_pipeline as make_pipeline_imb
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

pipe = make_pipeline_imb(
    HashingVectorizer(lowercase=True),
    RandomUnderSampler(  # 'ratio' is called 'sampling_strategy' in newer imblearn versions
        ratio={111: 405805, 112: 170431, 113: 241709, 114: 8341, 115: 50328,
               116: 89445, 117: 650020, 121: 320803, 122: 162557, 123: 66156,
               124: 36276, 131: 1196, 132: 3365, 133: 818, 134: 15001,
               141: 6145, 142: 31783, 211: 24728, 212: 100000, 221: 791,
               222: 8000, 223: 35406, 231: 785, 232: 3000, 251: 477,
               252: 127, 253: 29563, 254: 33, 311: 2072, 321: 5370,
               412: 652, 421: 520973, 422: 99171, 423: 16786, 425: 730,
               427: 198, 429: 1249, 431: 13793, 441: 160},
        random_state=1),
    OneVsRestClassifier(LinearSVC(penalty='l2', loss='hinge')))

I am new to machine learning, so I have used this approach for text classification. If my approach is wrong, please correct me with the right one.
(It would be great if you could give suggestions or solutions with examples, as that will help me understand better.)
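One alternative I have come across is to let the classifier weight the classes instead of resampling; a sketch using scikit-learn's built-in class_weight option (whether this would work better here is part of what I am asking):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# class_weight='balanced' re-weights each class by n_samples / (n_classes * count),
# so rare taxonomy_ids contribute more to the hinge loss than frequent ones.
weighted_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('clf', LinearSVC(class_weight='balanced')),
])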
EDIT 1

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

RndmFrst = RandomForestClassifier(n_estimators=100, max_depth=20, max_features=5000, n_jobs=-1)
LogReg = LogisticRegression()
voting = VotingClassifier(estimators=[('LogReg', LogReg), ('RndmFrst', RndmFrst)], voting='soft', n_jobs=-1)
pipe = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1, 4), max_features=50000)), ('clf', voting)])
pipe = pipe.fit(df.loc[:, 'description'], df.loc[:, 'taxonomy_id'])
Preds = pipe.predict(test_data)
machine-learning python classification scikit-learn multiclass-classification
asked Feb 16 '18 at 11:09 by outlier
edited Mar 19 at 17:00 by Blenzus
I just saw that you tried under-sampling. Just FYI, stratified k-fold cross-validation in scikit-learn also takes the class distribution into account.
– Kasra Manshaei
Feb 16 '18 at 12:20
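(For context, a minimal sketch of what this comment refers to; stratified folds keep each class's proportions roughly constant across splits. The pipeline and column names are the ones from the question:)

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Every fold mirrors the overall class distribution, so rare classes
# are represented in both the training and validation parts.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(vectorizerPipe, df['description'], df['taxonomy_id'],
                         cv=cv, scoring='f1_macro')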
1 Answer
Nice question!
Some Remarks
For imbalanced data you have different approaches. The most well-established one is resampling (oversampling the small classes / undersampling the large classes). The other is to make your classification hierarchical, i.e. classify the large classes against all others in a first step, and then classify the small classes in a second step (the two classifiers are not supposed to be the same; try model-selection strategies to find the best).
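A minimal sketch of the hierarchical idea, assuming X_train/y_train are the vectorized features and taxonomy labels (the 50,000-sample threshold and the estimators are illustrative choices, not a prescription):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stage 1: the large classes compete against a single catch-all "other" label.
values, counts = np.unique(y_train, return_counts=True)
large = values[counts >= 50000]
y_stage1 = np.where(np.isin(y_train, large), y_train, -1)  # -1 = "other"
stage1 = LogisticRegression(max_iter=1000).fit(X_train, y_stage1)

# Stage 2: a separate classifier trained only on the small classes.
small_mask = ~np.isin(y_train, large)
stage2 = LogisticRegression(max_iter=1000).fit(X_train[small_mask], y_train[small_mask])

def predict_hierarchical(X):
    pred = stage1.predict(X)
    other = pred == -1
    if other.any():
        pred[other] = stage2.predict(X[other])  # route "other" rows to stage 2
    return pred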
Practical Answer
I have gotten acceptable results without resampling the data! So try that first, but later improve it using resampling methods (statistically they are pretty much a must).

TF-IDF is good for such a problem. Classifiers should be selected through model selection, but my experience shows that Logistic Regression and Random Forest work well on this specific problem (however, that is just practical experience).
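If you want to do that model selection systematically, a small grid search over the vectorizer and classifier is one way (a sketch; the grid values are illustrative, and train_texts/train_labels stand for your data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('clf', LogisticRegression(max_iter=1000))])
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 4)],
    'tfidf__max_features': [20000, 50000],
    'clf__C': [0.1, 1.0, 10.0],
}
# Macro-F1 weighs every class equally, which matters with this imbalance.
search = GridSearchCV(pipe, param_grid, cv=3, scoring='f1_macro', n_jobs=-1)
search.fit(train_texts, train_labels)
print(search.best_params_)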
You may follow the code below, as it worked quite well; then you can modify it to try to improve your results:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.read_csv(...)
test = pd.read_csv(...)

# TF-IDF bag-of-words model for the text corpus: up to 4-grams and 50k features
vec = TfidfVectorizer(ngram_range=(1, 4), max_features=50000)
TrainX = vec.fit_transform(train)
TestX = vec.transform(test)

# Initializing base estimators
clf1 = LogisticRegression()
clf2 = RandomForestClassifier(n_estimators=100, max_depth=20, max_features=5000, n_jobs=-1)

# Soft voting classifier
clf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2)], voting='soft', n_jobs=-1)
clf = clf.fit(TrainX, TrainY)
preds = clf.predict(TestX)  # class labels; with 39 classes, don't slice predict_proba at [:, 1]
Please note that the code is abstract, so TrainX, TrainY, TestX, etc. should be properly defined by you.
Hints

Be careful about what counts as a stop word. In practice many people (including myself!) make the mistake of removing stop words according to pre-defined lists. That is not right!

Stop words are corpus-sensitive, so you need to remove them according to information-theoretic criteria. To keep it simple: TF-IDF largely ignores your corpus-specific stop words anyway, because terms that appear in almost every document receive a near-minimal IDF weight. (If you need more explanation, let me know and I will update my answer.)
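You can see this directly in scikit-learn: terms that occur in nearly every document get the minimum IDF, and max_df lets you drop them outright (toy corpus for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['color black size large',
        'color white size small',
        'color red form powder']
vec = TfidfVectorizer().fit(docs)
# 'color' appears in every document, so its IDF is the minimum possible.
for term in ['color', 'powder']:
    print(term, vec.idf_[vec.vocabulary_[term]])

# Or discard corpus-specific stop words explicitly: ignore terms in >90% of docs.
vec_strict = TfidfVectorizer(max_df=0.9)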
VotingClassifier is a meta-learning strategy in the family of ensemble methods. It takes advantage of several different classifiers. Try them, as they work pretty well in practice.

With soft voting, the scheme averages the class probabilities predicted by the individual classifiers and returns the class with the highest average probability. So it is kind of a democratic approach, as opposed to a dictatorship ;)
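A toy illustration of the soft-voting arithmetic (numbers made up):

import numpy as np

# Two base models' predicted probabilities for one sample over 3 classes.
p_logreg = np.array([0.2, 0.5, 0.3])
p_forest = np.array([0.4, 0.3, 0.3])

avg = (p_logreg + p_forest) / 2   # soft voting: mean class probability
print(avg.argmax())               # class 1 wins, with mean probability 0.4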
Hope it helps!
answered Feb 16 '18 at 12:19 by Kasra Manshaei
edited Mar 20 at 11:48 by Blenzus
Welcome! For intuitive resampling you may refer to the link I put for resampling; there are step-by-step instructions.
– Kasra Manshaei
Feb 16 '18 at 12:50

I am trying your solution; if I get stuck anywhere, or in case of any doubts, I will post in the comment section. I hope that will be fine with you!
– outlier
Feb 16 '18 at 12:57

Sure my friend ... good luck!
– Kasra Manshaei
Feb 16 '18 at 12:57

If it worked, then you may accept the answer :)
– Kasra Manshaei
Feb 16 '18 at 12:58

@outlier since the answer has addressed your issue, kindly accept (and possibly upvote) it; answers take up valuable time for (volunteer) respondents.
– desertnaut
Feb 17 '18 at 18:59