Very large number of classes for text classification in Keras (and not multi-label classification)

I am trying to do text classification with Keras. Previously, I used Random Forest and LightGBM, with an accuracy of around 65%. I then made some first attempts with neural networks to improve my score, but the results are really bad (accuracy below 20%).

After searching, I found approaches for multi-label classification, which predict multiple labels for each input text, such as the following:

multi-label text classification

But this is not my case at all!

My dataset is not extremely unbalanced but I have hundreds of classes to predict. I want to predict only one class per text.
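To put the accuracy figures in context, it can help to compare them with trivial baselines. The following is a minimal sketch, not code from my project: the function name `class_balance_summary` and the toy labels are hypothetical stand-ins for the real target column. It counts the classes and computes the accuracy of always predicting the most frequent class and of guessing uniformly at random.

```python
from collections import Counter


def class_balance_summary(labels):
    """Return the number of classes plus two trivial baseline accuracies."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Accuracy of always predicting the most frequent class.
    majority_accuracy = counts.most_common(1)[0][1] / n_samples
    # Accuracy expected from guessing uniformly at random among the classes.
    uniform_accuracy = 1.0 / n_classes
    return n_classes, majority_accuracy, uniform_accuracy


# Toy labels standing in for the real target column (hypothetical data).
toy_labels = ["maj", "maj", "min", "dim", "maj", "min", "aug", "maj"]
n_classes, majority_acc, uniform_acc = class_balance_summary(toy_labels)
print(n_classes, majority_acc, uniform_acc)
```

With hundreds of roughly balanced classes, both baselines are far below 20%, so a score in that range would mean the network has learned something, though much less than the tree-based models.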

After many attempts to tune the neural network, I wonder whether neural networks simply do not perform well in this case. Why does the Random Forest outperform them with minimal hyperparameter tuning? Should I perhaps transform my problem into a multi-label binary classification?
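To make the distinction between the two formulations concrete, here is a small sketch in plain Python. The logits are made-up values, not outputs of my model. A softmax output layer produces one probability distribution over mutually exclusive classes, whereas a multi-label setup scores every class independently with a sigmoid.

```python
import math


def softmax(logits):
    """Turn logits into one distribution over mutually exclusive classes."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Score each class independently, as in a multi-label setup."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical raw scores for four classes
single_label = softmax(logits)
multi_label = sigmoid(logits)

# Softmax probabilities compete and sum to 1; sigmoid scores need not.
print(round(sum(single_label), 6))
print(round(sum(multi_label), 6))

# When only one class is wanted, both reduce to picking the top score.
predicted = max(range(len(logits)), key=lambda i: single_label[i])
print(predicted)
```

Since I want exactly one class per text, the softmax formulation already encodes that constraint. Switching to independent sigmoids would change the loss but not the class that gets picked for a given set of logits.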

This is a code example, taken from kaggle:

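# Imports assume standalone Keras 2.x; with TensorFlow, import from tensorflow.keras instead.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense
from keras.callbacks import EarlyStopping

data = pd.read_csv('multiClassExample.csv', delimiter=';', encoding="latin-1")

# Encode the string labels as integers, then one-hot encode them.
enc = LabelEncoder()
data['typeChord'] = enc.fit_transform(data['typeChord'])
n_classes = data['typeChord'].nunique()
labels = to_categorical(data['typeChord'], num_classes=n_classes)

# Tokenize the texts, keeping only the most frequent words.
n_most_common_words = 8000
max_len = 130
tokenizer = Tokenizer(num_words=n_most_common_words, lower=True)
tokenizer.fit_on_texts(data['text'].values)
sequences = tokenizer.texts_to_sequences(data['text'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Pad or truncate every sequence to max_len and hold out 25% for testing.
X = pad_sequences(sequences, maxlen=max_len)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)

epochs = 20
emb_dim = 128
batch_size = 256
print(labels[:2])

print((X_train.shape, y_train.shape, X_test.shape, y_test.shape))

# Embedding -> LSTM -> softmax over all classes (one class per text).
model = Sequential()
model.add(Embedding(n_most_common_words, emb_dim, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.7))
model.add(LSTM(64, dropout=0.7, recurrent_dropout=0.7))
model.add(Dense(n_classes, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

# Train with early stopping on the validation loss.
early_stopping = EarlyStopping(monitor='val_loss', patience=7, min_delta=0.0001)
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                    validation_split=0.2, callbacks=[early_stopping])

# Evaluate on the held-out test set.
accr = model.evaluate(X_test, y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0], accr[1]))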
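For readers unfamiliar with the label preparation step, the sketch below mimics in plain Python what `LabelEncoder` followed by `to_categorical` produces: each distinct label gets an integer index in sorted order, and each index becomes a one-hot row. The helper `encode_labels` and the sample labels are hypothetical and only illustrate the shapes involved.

```python
def encode_labels(raw_labels):
    """Map each distinct label to an integer index (sorted order), then one-hot encode."""
    classes = sorted(set(raw_labels))
    index_of = {label: i for i, label in enumerate(classes)}
    integer_labels = [index_of[label] for label in raw_labels]
    one_hot = [
        [1.0 if i == idx else 0.0 for i in range(len(classes))]
        for idx in integer_labels
    ]
    return classes, integer_labels, one_hot


raw = ["min", "maj", "dim", "maj"]  # hypothetical label values
classes, integer_labels, one_hot = encode_labels(raw)
print(classes)
print(integer_labels)
print(one_hot[0])
```

With hundreds of classes, each one-hot row has hundreds of columns. Keras also provides the `sparse_categorical_crossentropy` loss, which accepts the integer labels directly, so the one-hot step is optional.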

asked Mar 26 at 16:26

user2307229

neural-network keras random-forest multiclass-classification
