Why does all of the NLP literature use noise contrastive estimation (NCE) loss for negative sampling instead of sampled softmax loss?
A sampled softmax function is like a regular softmax, but it randomly selects only a given number of 'negative' classes to include in the normalization.
This is different from NCE loss, which doesn't use a softmax at all; instead, it uses a binary logistic classifier to distinguish true context/label pairs from sampled noise. In NLP, 'negative sampling' basically refers to this NCE-based approach.
More details here: https://www.tensorflow.org/extras/candidate_sampling.pdf
I have tested both, and they give pretty much the same results. But the word-embedding literature always uses NCE loss, never sampled softmax.
Is there any reason for this? Sampled softmax seems like the more obvious way to avoid applying a softmax over all the classes, so I imagine there must be a good reason for preferring NCE loss.
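For reference, here is a minimal sketch of the two losses side by side in TensorFlow (shapes, names, and random inputs are illustrative only, not my actual training code):

```python
# Minimal comparison of NCE loss vs. sampled softmax loss in TensorFlow 2.x.
# All shapes and tensors below are illustrative placeholders.
import tensorflow as tf

vocab_size, embed_dim, num_sampled, batch_size = 10000, 128, 64, 32

# Output-side parameters shared by both losses: one weight row and one bias per class.
out_weights = tf.Variable(tf.random.truncated_normal([vocab_size, embed_dim], stddev=0.1))
out_biases = tf.Variable(tf.zeros([vocab_size]))

hidden = tf.random.normal([batch_size, embed_dim])  # stand-in for the input-word embeddings
labels = tf.random.uniform([batch_size, 1], maxval=vocab_size, dtype=tf.int64)  # true classes

# NCE: binary logistic classification of true pairs vs. sampled noise classes.
nce_loss = tf.reduce_mean(tf.nn.nce_loss(
    weights=out_weights, biases=out_biases,
    labels=labels, inputs=hidden,
    num_sampled=num_sampled, num_classes=vocab_size))

# Sampled softmax: a softmax restricted to the true class plus the sampled classes.
ssm_loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=out_weights, biases=out_biases,
    labels=labels, inputs=hidden,
    num_sampled=num_sampled, num_classes=vocab_size))
```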
machine-learning nlp word2vec word-embeddings
asked Mar 28 at 20:44 by SantoshGupta7
1 Answer
Both negative sampling (derived from NCE) and sampled softmax use only a few samples to bypass computing the full softmax.
The main point comes from this comment in the linked PDF:
Sampled Softmax (A faster way to train a softmax classifier)
which is attached only to sampled softmax, although negative sampling is just as fast, for the same reason: it works with only a few samples. If their performance is at the same level, this could be why researchers are not convinced to switch over to sampled softmax. In academia, older methods are almost always preferred over newer but equally competent ones for the sake of credibility.
Negative sampling is a simplified form of NCE: roughly speaking, it keeps only the term "F(target) + sum of F(negative sample)s" and drops the noise-distribution correction (see the sketch at the end of this answer). Negative sampling was most prominently introduced in the word2vec paper in 2013 (around 11K citations as of now) and is backed by the mathematically rigorous NCE paper (2012). Sampled softmax, on the other hand, was introduced in a 2015 paper as a task-specific (machine translation) and biased approximation:
In this paper, we propose an approximate training algorithm based on (biased) importance sampling that allows us to train an NMT model with a much larger target vocabulary.
Note that negative sampling also allows us to train "with a much larger target vocabulary".
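To make the relationship concrete, here is a rough sketch of the two objectives for one (input, target) pair with $k$ noise samples drawn from a noise distribution $P_n$ (notation is mine; $s(w)$ denotes the model score of word $w$ given the input):
NCE: $\log \sigma\big(s(w_{target}) - \log k P_n(w_{target})\big) + \sum_{i=1}^{k} \log \sigma\big(-\,(s(w_i) - \log k P_n(w_i))\big)$
Negative sampling: $\log \sigma\big(s(w_{target})\big) + \sum_{i=1}^{k} \log \sigma\big(-s(w_i)\big)$
That is, negative sampling simply drops the $\log k P_n(\cdot)$ correction term; that term is what gives NCE its probabilistic guarantees as an estimator of the softmax model, while dropping it makes negative sampling the simpler and cheaper of the two.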
answered Mar 29 at 14:26 (edited Mar 29 at 16:26) by Esmailian