How does Phrases in Gensim work?



I am using Gensim Phrases to detect n-grams in my text, so I am interested in the mechanism that Phrases uses to detect these n-grams. Can someone explain the mechanism used in Phrases in simple terms?










Tags: nlp word2vec gensim






asked Dec 10 '17 at 2:06 by Volka

2 Answers

The gensim tool cites the well-known paper by Mikolov et al., "Distributed Representations of Words and Phrases...", on which it is based. Section 4 of that paper, "Learning Phrases", gives a nice explanation of how n-grams are scored (Equation 6):



$$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$$



So, if we want to count bigrams, this formula is straightforward: $\mathrm{score}(w_i, w_j)$ is the score for any two words occurring together, and $\delta$ is a discounting coefficient that prevents too many phrases consisting of very infrequent words. When counting trigrams, $w_i$ will be a bigram and $w_j$ a word, and the same follows for any higher number of grams.
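
For concreteness, here is a minimal sketch of how this is typically driven from gensim, including the stacking trick for trigrams. The corpus and parameter values are made up for illustration; only the Phrases/Phraser API itself is gensim's.

    from gensim.models.phrases import Phrases, Phraser

    # Toy corpus: a list of tokenized sentences (made-up data),
    # repeated so that the counts clear min_count.
    sentences = [
        ["new", "york", "is", "a", "big", "city"],
        ["i", "love", "new", "york"],
        ["new", "york", "has", "many", "museums"],
    ] * 20

    # First pass: learn bigrams whose score exceeds `threshold`.
    # (On a tiny toy corpus the default threshold of 10.0 is rarely
    # reached with the default scorer, so a lower value is used here.)
    bigram = Phrases(sentences, min_count=5, threshold=0.1)
    bigram_phraser = Phraser(bigram)  # frozen, lighter-weight version

    # Word pairs scoring above the threshold are joined with '_':
    print(bigram_phraser[["i", "love", "new", "york"]])  # e.g. [..., 'new_york']

    # Second pass over the bigrammed corpus yields trigrams
    # ("new_york" + word), i.e. bigram detection applied twice.
    trigram = Phrases(bigram_phraser[sentences], min_count=5, threshold=0.1)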






answered May 29 '18 at 2:09 by flyingDope

            Gensim detects a bigram if a scoring function for two words exceeds a threshold (which is a parameter for Phrases).



The default scoring function is the one in the answer by flyingDope, but multiplied by the vocabulary size (use help(Phraser) or see gensim's GitHub repository, gensim/models/phrases.py):



    def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
        # ...
        """
        worda_count : int
            Number of occurrences for first word.
        wordb_count : int
            Number of occurrences for second word.
        bigram_count : int
            Number of co-occurrences for phrase "worda_wordb".
        len_vocab : int
            Size of vocabulary.
        min_count : int
            Minimum collocation count threshold.
        corpus_word_count : int
            Not used in this particular scoring technique.
        """
        # ...
        return (bigram_count - min_count) / worda_count / wordb_count * len_vocab
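
For instance, with made-up counts worda_count = 100, wordb_count = 50, bigram_count = 20, len_vocab = 10000 and min_count = 5, the score is (20 − 5) / 100 / 50 × 10000 = 30, which exceeds the default threshold of 10.0, so the pair would be joined into a bigram.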


Another implemented scoring function is npmi_scorer, based on a paper by G. Bouma.



            I think n-grams for n>2 are done by applying bigram detection n-1 times.



If min_count (i.e. $\delta$) were zero, and if we multiplied by corpus_word_count instead of len_vocab, then original_scorer would essentially compute the ratio of the probability of seeing wordb immediately after worda to the unconditional probability of seeing wordb at a random position; in other words, how many times the presence of worda increases the probability of seeing wordb in the next position.



            I cannot understand why gensim chose to use len_vocab here, but perhaps they had some reason to. You can pass your own scoring function as well.
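
To illustrate that last point, here is a sketch of such a custom scorer implementing the corpus_word_count variant described above. This function is hypothetical, not part of gensim; it only follows the same six-argument signature as original_scorer, and the usage line assumes a gensim version whose `scoring` parameter accepts a callable.

    # Hypothetical scorer: ratio of the observed co-occurrence probability
    # of (worda, wordb) to what independence would predict,
    # i.e. P(wordb | worda) / P(wordb). Not part of gensim.
    def pmi_ratio_scorer(worda_count, wordb_count, bigram_count,
                         len_vocab, min_count, corpus_word_count):
        return bigram_count / worda_count / wordb_count * corpus_word_count

    # Assuming `scoring` accepts a callable with this signature:
    # phrases = Phrases(sentences, min_count=5, threshold=2.0,
    #                   scoring=pmi_ratio_scorer)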






answered Oct 26 '18 at 12:43 by Valentas (edited Oct 26 '18 at 13:05)

              Your Answer





              StackExchange.ifUsing("editor", function ()
              return StackExchange.using("mathjaxEditing", function ()
              StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
              StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
              );
              );
              , "mathjax-editing");

              StackExchange.ready(function()
              var channelOptions =
              tags: "".split(" "),
              id: "557"
              ;
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function()
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled)
              StackExchange.using("snippets", function()
              createEditor();
              );

              else
              createEditor();

              );

              function createEditor()
              StackExchange.prepareEditor(
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: false,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: null,
              bindNavPrevention: true,
              postfix: "",
              imageUploader:
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              ,
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              );



              );













              draft saved

              draft discarded


















              StackExchange.ready(
              function ()
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f25524%2fhow-does-phrases-in-gensim-work%23new-answer', 'question_page');

              );

              Post as a guest















              Required, but never shown

























              2 Answers
              2






              active

              oldest

              votes








              2 Answers
              2






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              0












              $begingroup$

              As the gensim tool cites the very famous paper by Mikolov - "Distributed Representations of Words and Phrases..." using which it is implemented. In the paper if you look at the section "4 Learning Phrases" they give a nice explanation of how n-grams are calculated (Equation 6).



              enter image description here



              So, if want to count bigrams this formula is straight-forward; score(wi, wj) is the score between any two words occuring together. But when counting trigrams, 'wi' will be a bigram and 'wj' will be a word. And same follows for any number of grams.






              share|improve this answer









              $endgroup$

















                0












                $begingroup$

                As the gensim tool cites the very famous paper by Mikolov - "Distributed Representations of Words and Phrases..." using which it is implemented. In the paper if you look at the section "4 Learning Phrases" they give a nice explanation of how n-grams are calculated (Equation 6).



                enter image description here



                So, if want to count bigrams this formula is straight-forward; score(wi, wj) is the score between any two words occuring together. But when counting trigrams, 'wi' will be a bigram and 'wj' will be a word. And same follows for any number of grams.






                share|improve this answer









                $endgroup$















                  0












                  0








                  0





                  $begingroup$

                  As the gensim tool cites the very famous paper by Mikolov - "Distributed Representations of Words and Phrases..." using which it is implemented. In the paper if you look at the section "4 Learning Phrases" they give a nice explanation of how n-grams are calculated (Equation 6).



                  enter image description here



                  So, if want to count bigrams this formula is straight-forward; score(wi, wj) is the score between any two words occuring together. But when counting trigrams, 'wi' will be a bigram and 'wj' will be a word. And same follows for any number of grams.






                  share|improve this answer









                  $endgroup$



                  As the gensim tool cites the very famous paper by Mikolov - "Distributed Representations of Words and Phrases..." using which it is implemented. In the paper if you look at the section "4 Learning Phrases" they give a nice explanation of how n-grams are calculated (Equation 6).



                  enter image description here



                  So, if want to count bigrams this formula is straight-forward; score(wi, wj) is the score between any two words occuring together. But when counting trigrams, 'wi' will be a bigram and 'wj' will be a word. And same follows for any number of grams.







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered May 29 '18 at 2:09









                  flyingDopeflyingDope

                  368128




                  368128





















                      0












                      $begingroup$

                      Gensim detects a bigram if a scoring function for two words exceeds a threshold (which is a parameter for Phrases).



                      The default scoring function is what is in the answer by flyingDope, but multiplied by vocabulary size (use help(Phraser) or see the gensim's Github repository (gensim/models/phrases.py)):



                      def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
                      #...
                      """
                      worda_count : int
                      Number of occurrences for first word.
                      wordb_count : int
                      Number of occurrences for second word.
                      bigram_count : int
                      Number of co-occurrences for phrase "worda_wordb".
                      len_vocab : int
                      Size of vocabulary.
                      min_count: int
                      Minimum collocation count threshold.
                      corpus_word_count : int
                      Not used in this particular scoring technique.
                      """
                      #...

                      return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


                      Another implemented score function is npmi_scorer based on a paper by G. Bouma.



                      I think n-grams for n>2 are done by applying bigram detection n-1 times.



                      If min_count (i.e. $delta$) was zero and if instead len_vocab we multiplied by corpus_word_count, then the result of original_scorer would be essentially the ratio of the probability to see wordb following worda and the unconditional probability to see wordb at a random position, that is how many times the presence of worda increases the probability to see wordb in the next position.



                      I cannot understand why gensim chose to use len_vocab here, but perhaps they had some reason to. You can pass your own scoring function as well.






                      share|improve this answer











                      $endgroup$

















                        0












                        $begingroup$

                        Gensim detects a bigram if a scoring function for two words exceeds a threshold (which is a parameter for Phrases).



                        The default scoring function is what is in the answer by flyingDope, but multiplied by vocabulary size (use help(Phraser) or see the gensim's Github repository (gensim/models/phrases.py)):



                        def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
                        #...
                        """
                        worda_count : int
                        Number of occurrences for first word.
                        wordb_count : int
                        Number of occurrences for second word.
                        bigram_count : int
                        Number of co-occurrences for phrase "worda_wordb".
                        len_vocab : int
                        Size of vocabulary.
                        min_count: int
                        Minimum collocation count threshold.
                        corpus_word_count : int
                        Not used in this particular scoring technique.
                        """
                        #...

                        return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


                        Another implemented score function is npmi_scorer based on a paper by G. Bouma.



                        I think n-grams for n>2 are done by applying bigram detection n-1 times.



                        If min_count (i.e. $delta$) was zero and if instead len_vocab we multiplied by corpus_word_count, then the result of original_scorer would be essentially the ratio of the probability to see wordb following worda and the unconditional probability to see wordb at a random position, that is how many times the presence of worda increases the probability to see wordb in the next position.



                        I cannot understand why gensim chose to use len_vocab here, but perhaps they had some reason to. You can pass your own scoring function as well.






                        share|improve this answer











                        $endgroup$















                          0












                          0








                          0





                          $begingroup$

                          Gensim detects a bigram if a scoring function for two words exceeds a threshold (which is a parameter for Phrases).



                          The default scoring function is what is in the answer by flyingDope, but multiplied by vocabulary size (use help(Phraser) or see the gensim's Github repository (gensim/models/phrases.py)):



                          def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
                          #...
                          """
                          worda_count : int
                          Number of occurrences for first word.
                          wordb_count : int
                          Number of occurrences for second word.
                          bigram_count : int
                          Number of co-occurrences for phrase "worda_wordb".
                          len_vocab : int
                          Size of vocabulary.
                          min_count: int
                          Minimum collocation count threshold.
                          corpus_word_count : int
                          Not used in this particular scoring technique.
                          """
                          #...

                          return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


                          Another implemented score function is npmi_scorer based on a paper by G. Bouma.



                          I think n-grams for n>2 are done by applying bigram detection n-1 times.



                          If min_count (i.e. $delta$) was zero and if instead len_vocab we multiplied by corpus_word_count, then the result of original_scorer would be essentially the ratio of the probability to see wordb following worda and the unconditional probability to see wordb at a random position, that is how many times the presence of worda increases the probability to see wordb in the next position.



                          I cannot understand why gensim chose to use len_vocab here, but perhaps they had some reason to. You can pass your own scoring function as well.






                          share|improve this answer











                          $endgroup$



                          Gensim detects a bigram if a scoring function for two words exceeds a threshold (which is a parameter for Phrases).



                          The default scoring function is what is in the answer by flyingDope, but multiplied by vocabulary size (use help(Phraser) or see the gensim's Github repository (gensim/models/phrases.py)):



                          def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
                          #...
                          """
                          worda_count : int
                          Number of occurrences for first word.
                          wordb_count : int
                          Number of occurrences for second word.
                          bigram_count : int
                          Number of co-occurrences for phrase "worda_wordb".
                          len_vocab : int
                          Size of vocabulary.
                          min_count: int
                          Minimum collocation count threshold.
                          corpus_word_count : int
                          Not used in this particular scoring technique.
                          """
                          #...

                          return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


                          Another implemented score function is npmi_scorer based on a paper by G. Bouma.



                          I think n-grams for n>2 are done by applying bigram detection n-1 times.



                          If min_count (i.e. $delta$) was zero and if instead len_vocab we multiplied by corpus_word_count, then the result of original_scorer would be essentially the ratio of the probability to see wordb following worda and the unconditional probability to see wordb at a random position, that is how many times the presence of worda increases the probability to see wordb in the next position.



                          I cannot understand why gensim chose to use len_vocab here, but perhaps they had some reason to. You can pass your own scoring function as well.







                          share|improve this answer














                          share|improve this answer



                          share|improve this answer








                          edited Oct 26 '18 at 13:05

























                          answered Oct 26 '18 at 12:43









                          ValentasValentas

                          382314




                          382314



























                              draft saved

                              draft discarded
















































                              Thanks for contributing an answer to Data Science Stack Exchange!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid


                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.

                              Use MathJax to format equations. MathJax reference.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function ()
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f25524%2fhow-does-phrases-in-gensim-work%23new-answer', 'question_page');

                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              Adding axes to figuresAdding axes labels to LaTeX figuresLaTeX equivalent of ConTeXt buffersRotate a node but not its content: the case of the ellipse decorationHow to define the default vertical distance between nodes?TikZ scaling graphic and adjust node position and keep font sizeNumerical conditional within tikz keys?adding axes to shapesAlign axes across subfiguresAdding figures with a certain orderLine up nested tikz enviroments or how to get rid of themAdding axes labels to LaTeX figures

                              Luettelo Yhdysvaltain laivaston lentotukialuksista Lähteet | Navigointivalikko

                              Gary (muusikko) Sisällysluettelo Historia | Rockin' High | Lähteet | Aiheesta muualla | NavigointivalikkoInfobox OKTuomas "Gary" Keskinen Ancaran kitaristiksiProjekti Rockin' High