Why does all of NLP literature use Noise contrastive estimation loss for negative sampling instead of sampled softmax loss?



A sampled softmax function is like a regular softmax but randomly selects a given number of 'negative' samples.



This is different from NCE loss, which doesn't use a softmax at all; it uses a binary logistic classifier for the context/label pairs. In NLP, 'negative sampling' basically refers to the NCE-based approach.



More details here: https://www.tensorflow.org/extras/candidate_sampling.pdf



I have tested both and they give pretty much the same results. But the word-embedding literature always uses NCE loss, never sampled softmax.
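For reference, here is a minimal sketch of the kind of comparison I mean, using the two candidate-sampling ops described in the PDF linked above (the shapes and variable names are illustrative only):

    import tensorflow as tf

    vocab_size, embed_dim, batch_size, num_sampled = 10000, 128, 64, 5

    # Output-side parameters shared by both losses.
    weights = tf.Variable(tf.random.normal([vocab_size, embed_dim]))
    biases = tf.Variable(tf.zeros([vocab_size]))

    # Stand-ins for one batch of hidden vectors and target word ids.
    inputs = tf.random.normal([batch_size, embed_dim])
    labels = tf.random.uniform([batch_size, 1], maxval=vocab_size, dtype=tf.int64)

    # NCE: num_sampled independent binary (logistic) decisions per example.
    nce_loss = tf.reduce_mean(tf.nn.nce_loss(
        weights=weights, biases=biases, labels=labels, inputs=inputs,
        num_sampled=num_sampled, num_classes=vocab_size))

    # Sampled softmax: a softmax over the true class plus the sampled classes.
    ssm_loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
        weights=weights, biases=biases, labels=labels, inputs=inputs,
        num_sampled=num_sampled, num_classes=vocab_size))

Swapping one call for the other is the only change; everything else in the model stays the same.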



Is there a reason for this? Sampled softmax seems like the more obvious way to avoid applying a softmax over all the classes, so I imagine there must be some good reason to prefer NCE loss.










machine-learning nlp word2vec word-embeddings






asked Mar 28 at 20:44 by SantoshGupta7




















1 Answer






Both negative sampling (derived from NCE) and sampled SoftMax use a few samples to bypass the calculation of the full SoftMax.
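To see what is being avoided, write the full SoftMax for a score $s(w, c)$ of word $w$ in context $c$ over a vocabulary $V$ (my notation):

$$p(w \mid c) = \frac{\exp\big(s(w, c)\big)}{\sum_{w' \in V} \exp\big(s(w', c)\big)}$$

Every gradient step on this loss touches all $|V|$ output rows through the normalizer, whereas both sampled losses touch only the target plus $k \ll |V|$ sampled words.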



The main point comes from this comment in the linked PDF:




          Sampled Softmax



          (A faster way to train a softmax classifier)




which is used only for sampled SoftMax, even though negative sampling is just as fast for the same reason: it works with only a few samples. If their performance is at the same level, this could be why researchers are not convinced to switch over to sampled SoftMax. In academia, older methods are almost always preferred over new but equally competent methods, for the sake of credibility.



Negative sampling is NCE minus the logistic classifier; roughly speaking, it only borrows the term "F(target) + sum of F(negative sample)s" (written out more precisely below). Negative sampling was most prominently introduced in the Word2Vec paper in 2013 (as of now with 11K citations) and is backed by the mathematically rigorous NCE paper (2012). Sampled SoftMax, on the other hand, was introduced in a 2015 paper for a task-specific (machine translation) and biased approximation:




          In this paper, we propose an approximate training algorithm based on
          (biased) importance sampling that allows us to train an NMT model with
          a much larger target vocabulary




Note that negative sampling also allows us to train "with a much larger target vocabulary".
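For reference, the "F(target) + sum of F(negative sample)s" term above is the per-target negative-sampling objective of the Word2Vec paper, written here in my own transcription for an input word $w_I$, target word $w_O$, and $k$ noise samples drawn from $P_n(w)$:

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$

Compared with full NCE, the $\log\big(k\,P_n(w)\big)$ correction inside the sigmoid is dropped, which is the sense in which it is "NCE minus the logistic classifier".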






answered Mar 29 at 14:26 by Esmailian (edited Mar 29 at 16:26)
