Why does all of NLP literature use Noise contrastive estimation loss for negative sampling instead of sampled softmax loss?



A sampled softmax function is like a regular softmax but randomly selects a given number of 'negative' samples.



This is different from NCE loss, which doesn't use a softmax at all; it uses a binary logistic classifier for the context/label pairs. In NLP, 'negative sampling' basically refers to the NCE-based approach.



More details here: https://www.tensorflow.org/extras/candidate_sampling.pdf



I have tested both and they give pretty much the same results. But the word-embedding literature always uses NCE loss, never sampled softmax.
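For reference, here is a minimal sketch of the kind of comparison I mean, using the two candidate-sampling ops described in the PDF linked above (the shapes and variable names are illustrative only):

    import tensorflow as tf

    vocab_size, embed_dim, batch_size, num_sampled = 10000, 128, 64, 5

    # Output-side parameters shared by both losses.
    weights = tf.Variable(tf.random.normal([vocab_size, embed_dim]))
    biases = tf.Variable(tf.zeros([vocab_size]))

    # Stand-ins for one batch of hidden vectors and target word ids.
    inputs = tf.random.normal([batch_size, embed_dim])
    labels = tf.random.uniform([batch_size, 1], maxval=vocab_size, dtype=tf.int64)

    # NCE: num_sampled independent binary (logistic) decisions per example.
    nce_loss = tf.reduce_mean(tf.nn.nce_loss(
        weights=weights, biases=biases, labels=labels, inputs=inputs,
        num_sampled=num_sampled, num_classes=vocab_size))

    # Sampled softmax: a softmax over the true class plus the sampled classes.
    ssm_loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
        weights=weights, biases=biases, labels=labels, inputs=inputs,
        num_sampled=num_sampled, num_classes=vocab_size))

Swapping one call for the other is the only change; everything else in the model stays the same.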



Is there a reason for this? Sampled softmax seems like the more obvious way to avoid applying a softmax over all the classes, so I imagine there must be some good reason to prefer NCE loss.










machine-learning nlp word2vec word-embeddings






asked Mar 28 at 20:44 by SantoshGupta7




















1 Answer






Both negative sampling (derived from NCE) and sampled SoftMax use a few samples to bypass the calculation of the full SoftMax.
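To see what is being avoided, write the full SoftMax for a score $s(w, c)$ of word $w$ in context $c$ over a vocabulary $V$ (my notation):

$$p(w \mid c) = \frac{\exp\big(s(w, c)\big)}{\sum_{w' \in V} \exp\big(s(w', c)\big)}$$

Every gradient step on this loss touches all $|V|$ output rows through the normalizer, whereas both sampled losses touch only the target plus $k \ll |V|$ sampled words.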



The main point comes from this comment in the linked PDF:




          Sampled Softmax



          (A faster way to train a softmax classifier)




which is used only for sampled SoftMax, even though negative sampling is just as fast for the same reason: it works with only a few samples. If their performance is at the same level, this could be why researchers are not convinced to switch over to sampled SoftMax. In academia, older methods are almost always preferred over new but equally competent methods, for the sake of credibility.



Negative sampling is NCE minus the logistic classifier; roughly speaking, it only borrows the term "F(target) + sum of F(negative sample)s" (written out more precisely below). Negative sampling was most prominently introduced in the Word2Vec paper in 2013 (as of now with 11K citations) and is backed by the mathematically rigorous NCE paper (2012). Sampled SoftMax, on the other hand, was introduced in a 2015 paper for a task-specific (machine translation) and biased approximation:




          In this paper, we propose an approximate training algorithm based on
          (biased) importance sampling that allows us to train an NMT model with
          a much larger target vocabulary




Note that negative sampling also allows us to train "with a much larger target vocabulary".
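For reference, the "F(target) + sum of F(negative sample)s" term above is the per-target negative-sampling objective of the Word2Vec paper, written here in my own transcription for an input word $w_I$, target word $w_O$, and $k$ noise samples drawn from $P_n(w)$:

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$

Compared with full NCE, the $\log\big(k\,P_n(w)\big)$ correction inside the sigmoid is dropped, which is the sense in which it is "NCE minus the logistic classifier".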






answered Mar 29 at 14:26 by Esmailian (edited Mar 29 at 16:26)
