How does Phrases in Gensim work?



I am using Gensim Phrases to detect n-grams in my text, so I am interested in the mechanism that Phrases uses to detect these n-grams. Can someone explain the mechanism used in Phrases in simple terms?










Tags: nlp word2vec gensim






asked Dec 10 '17 at 2:06 by Volka

2 Answers

The gensim tool cites the well-known paper by Mikolov et al., "Distributed Representations of Words and Phrases...", on which it is based. Section 4 of that paper, "Learning Phrases", gives a nice explanation of how n-grams are scored (Equation 6):



$$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$$



So, if we want to count bigrams, this formula is straightforward: $\mathrm{score}(w_i, w_j)$ is the score for any two words occurring together, and $\delta$ is a discounting coefficient that prevents too many phrases consisting of very infrequent words. When counting trigrams, $w_i$ will be a bigram and $w_j$ a word, and the same follows for any higher number of grams.
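
For concreteness, here is a minimal sketch of how this is typically driven from gensim, including the stacking trick for trigrams. The corpus and parameter values are made up for illustration; only the Phrases/Phraser API itself is gensim's.

    from gensim.models.phrases import Phrases, Phraser

    # Toy corpus: a list of tokenized sentences (made-up data),
    # repeated so that the counts clear min_count.
    sentences = [
        ["new", "york", "is", "a", "big", "city"],
        ["i", "love", "new", "york"],
        ["new", "york", "has", "many", "museums"],
    ] * 20

    # First pass: learn bigrams whose score exceeds `threshold`.
    # (On a tiny toy corpus the default threshold of 10.0 is rarely
    # reached with the default scorer, so a lower value is used here.)
    bigram = Phrases(sentences, min_count=5, threshold=0.1)
    bigram_phraser = Phraser(bigram)  # frozen, lighter-weight version

    # Word pairs scoring above the threshold are joined with '_':
    print(bigram_phraser[["i", "love", "new", "york"]])  # e.g. [..., 'new_york']

    # Second pass over the bigrammed corpus yields trigrams
    # ("new_york" + word), i.e. bigram detection applied twice.
    trigram = Phrases(bigram_phraser[sentences], min_count=5, threshold=0.1)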






answered May 29 '18 at 2:09 by flyingDope

            Gensim detects a bigram if a scoring function for two words exceeds a threshold (which is a parameter for Phrases).



The default scoring function is the one in the answer by flyingDope, but multiplied by the vocabulary size (use help(Phraser) or see gensim's GitHub repository, gensim/models/phrases.py):



    def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
        # ...
        """
        worda_count : int
            Number of occurrences for first word.
        wordb_count : int
            Number of occurrences for second word.
        bigram_count : int
            Number of co-occurrences for phrase "worda_wordb".
        len_vocab : int
            Size of vocabulary.
        min_count : int
            Minimum collocation count threshold.
        corpus_word_count : int
            Not used in this particular scoring technique.
        """
        # ...
        return (bigram_count - min_count) / worda_count / wordb_count * len_vocab
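
For instance, with made-up counts worda_count = 100, wordb_count = 50, bigram_count = 20, len_vocab = 10000 and min_count = 5, the score is (20 − 5) / 100 / 50 × 10000 = 30, which exceeds the default threshold of 10.0, so the pair would be joined into a bigram.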


Another implemented scoring function is npmi_scorer, based on a paper by G. Bouma.



            I think n-grams for n>2 are done by applying bigram detection n-1 times.



If min_count (i.e. $\delta$) were zero, and if we multiplied by corpus_word_count instead of len_vocab, then original_scorer would essentially compute the ratio of the probability of seeing wordb immediately after worda to the unconditional probability of seeing wordb at a random position; in other words, how many times the presence of worda increases the probability of seeing wordb in the next position.



            I cannot understand why gensim chose to use len_vocab here, but perhaps they had some reason to. You can pass your own scoring function as well.
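
To illustrate that last point, here is a sketch of such a custom scorer implementing the corpus_word_count variant described above. This function is hypothetical, not part of gensim; it only follows the same six-argument signature as original_scorer, and the usage line assumes a gensim version whose `scoring` parameter accepts a callable.

    # Hypothetical scorer: ratio of the observed co-occurrence probability
    # of (worda, wordb) to what independence would predict,
    # i.e. P(wordb | worda) / P(wordb). Not part of gensim.
    def pmi_ratio_scorer(worda_count, wordb_count, bigram_count,
                         len_vocab, min_count, corpus_word_count):
        return bigram_count / worda_count / wordb_count * corpus_word_count

    # Assuming `scoring` accepts a callable with this signature:
    # phrases = Phrases(sentences, min_count=5, threshold=2.0,
    #                   scoring=pmi_ratio_scorer)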






answered Oct 26 '18 at 12:43 by Valentas (edited Oct 26 '18 at 13:05)

              Your Answer





              StackExchange.ifUsing("editor", function ()
              return StackExchange.using("mathjaxEditing", function ()
              StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
              StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
              );
              );
              , "mathjax-editing");

              StackExchange.ready(function()
              var channelOptions =
              tags: "".split(" "),
              id: "557"
              ;
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function()
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled)
              StackExchange.using("snippets", function()
              createEditor();
              );

              else
              createEditor();

              );

              function createEditor()
              StackExchange.prepareEditor(
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: false,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: null,
              bindNavPrevention: true,
              postfix: "",
              imageUploader:
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              ,
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              );



              );













              draft saved

              draft discarded


















              StackExchange.ready(
              function ()
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f25524%2fhow-does-phrases-in-gensim-work%23new-answer', 'question_page');

              );

              Post as a guest















              Required, but never shown

























              2 Answers
              2






              active

              oldest

              votes








              2 Answers
              2






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              0












              $begingroup$

              As the gensim tool cites the very famous paper by Mikolov - "Distributed Representations of Words and Phrases..." using which it is implemented. In the paper if you look at the section "4 Learning Phrases" they give a nice explanation of how n-grams are calculated (Equation 6).



              enter image description here



              So, if want to count bigrams this formula is straight-forward; score(wi, wj) is the score between any two words occuring together. But when counting trigrams, 'wi' will be a bigram and 'wj' will be a word. And same follows for any number of grams.






              share|improve this answer









              $endgroup$

















                0












                $begingroup$

                As the gensim tool cites the very famous paper by Mikolov - "Distributed Representations of Words and Phrases..." using which it is implemented. In the paper if you look at the section "4 Learning Phrases" they give a nice explanation of how n-grams are calculated (Equation 6).



                enter image description here



                So, if want to count bigrams this formula is straight-forward; score(wi, wj) is the score between any two words occuring together. But when counting trigrams, 'wi' will be a bigram and 'wj' will be a word. And same follows for any number of grams.






                share|improve this answer









                $endgroup$















                  0












                  0








                  0





                  $begingroup$

                  As the gensim tool cites the very famous paper by Mikolov - "Distributed Representations of Words and Phrases..." using which it is implemented. In the paper if you look at the section "4 Learning Phrases" they give a nice explanation of how n-grams are calculated (Equation 6).



                  enter image description here



                  So, if want to count bigrams this formula is straight-forward; score(wi, wj) is the score between any two words occuring together. But when counting trigrams, 'wi' will be a bigram and 'wj' will be a word. And same follows for any number of grams.






                  share|improve this answer









                  $endgroup$



                  As the gensim tool cites the very famous paper by Mikolov - "Distributed Representations of Words and Phrases..." using which it is implemented. In the paper if you look at the section "4 Learning Phrases" they give a nice explanation of how n-grams are calculated (Equation 6).



                  enter image description here



                  So, if want to count bigrams this formula is straight-forward; score(wi, wj) is the score between any two words occuring together. But when counting trigrams, 'wi' will be a bigram and 'wj' will be a word. And same follows for any number of grams.







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered May 29 '18 at 2:09









                  flyingDopeflyingDope

                  368128




                  368128





















                      0












                      $begingroup$

                      Gensim detects a bigram if a scoring function for two words exceeds a threshold (which is a parameter for Phrases).



                      The default scoring function is what is in the answer by flyingDope, but multiplied by vocabulary size (use help(Phraser) or see the gensim's Github repository (gensim/models/phrases.py)):



                      def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
                      #...
                      """
                      worda_count : int
                      Number of occurrences for first word.
                      wordb_count : int
                      Number of occurrences for second word.
                      bigram_count : int
                      Number of co-occurrences for phrase "worda_wordb".
                      len_vocab : int
                      Size of vocabulary.
                      min_count: int
                      Minimum collocation count threshold.
                      corpus_word_count : int
                      Not used in this particular scoring technique.
                      """
                      #...

                      return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


                      Another implemented score function is npmi_scorer based on a paper by G. Bouma.



                      I think n-grams for n>2 are done by applying bigram detection n-1 times.



                      If min_count (i.e. $delta$) was zero and if instead len_vocab we multiplied by corpus_word_count, then the result of original_scorer would be essentially the ratio of the probability to see wordb following worda and the unconditional probability to see wordb at a random position, that is how many times the presence of worda increases the probability to see wordb in the next position.



                      I cannot understand why gensim chose to use len_vocab here, but perhaps they had some reason to. You can pass your own scoring function as well.






                      share|improve this answer











                      $endgroup$

















                        0












                        $begingroup$

                        Gensim detects a bigram if a scoring function for two words exceeds a threshold (which is a parameter for Phrases).



                        The default scoring function is what is in the answer by flyingDope, but multiplied by vocabulary size (use help(Phraser) or see the gensim's Github repository (gensim/models/phrases.py)):



                        def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
                        #...
                        """
                        worda_count : int
                        Number of occurrences for first word.
                        wordb_count : int
                        Number of occurrences for second word.
                        bigram_count : int
                        Number of co-occurrences for phrase "worda_wordb".
                        len_vocab : int
                        Size of vocabulary.
                        min_count: int
                        Minimum collocation count threshold.
                        corpus_word_count : int
                        Not used in this particular scoring technique.
                        """
                        #...

                        return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


                        Another implemented score function is npmi_scorer based on a paper by G. Bouma.



                        I think n-grams for n>2 are done by applying bigram detection n-1 times.



                        If min_count (i.e. $delta$) was zero and if instead len_vocab we multiplied by corpus_word_count, then the result of original_scorer would be essentially the ratio of the probability to see wordb following worda and the unconditional probability to see wordb at a random position, that is how many times the presence of worda increases the probability to see wordb in the next position.



                        I cannot understand why gensim chose to use len_vocab here, but perhaps they had some reason to. You can pass your own scoring function as well.






                        share|improve this answer











                        $endgroup$















                          0












                          0








                          0





                          $begingroup$

                          Gensim detects a bigram if a scoring function for two words exceeds a threshold (which is a parameter for Phrases).



                          The default scoring function is what is in the answer by flyingDope, but multiplied by vocabulary size (use help(Phraser) or see the gensim's Github repository (gensim/models/phrases.py)):



                          def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
                          #...
                          """
                          worda_count : int
                          Number of occurrences for first word.
                          wordb_count : int
                          Number of occurrences for second word.
                          bigram_count : int
                          Number of co-occurrences for phrase "worda_wordb".
                          len_vocab : int
                          Size of vocabulary.
                          min_count: int
                          Minimum collocation count threshold.
                          corpus_word_count : int
                          Not used in this particular scoring technique.
                          """
                          #...

                          return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


                          Another implemented score function is npmi_scorer based on a paper by G. Bouma.



                          I think n-grams for n>2 are done by applying bigram detection n-1 times.



                          If min_count (i.e. $delta$) was zero and if instead len_vocab we multiplied by corpus_word_count, then the result of original_scorer would be essentially the ratio of the probability to see wordb following worda and the unconditional probability to see wordb at a random position, that is how many times the presence of worda increases the probability to see wordb in the next position.



                          I cannot understand why gensim chose to use len_vocab here, but perhaps they had some reason to. You can pass your own scoring function as well.






                          share|improve this answer











                          $endgroup$



                          Gensim detects a bigram if a scoring function for two words exceeds a threshold (which is a parameter for Phrases).



                          The default scoring function is what is in the answer by flyingDope, but multiplied by vocabulary size (use help(Phraser) or see the gensim's Github repository (gensim/models/phrases.py)):



                          def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
                          #...
                          """
                          worda_count : int
                          Number of occurrences for first word.
                          wordb_count : int
                          Number of occurrences for second word.
                          bigram_count : int
                          Number of co-occurrences for phrase "worda_wordb".
                          len_vocab : int
                          Size of vocabulary.
                          min_count: int
                          Minimum collocation count threshold.
                          corpus_word_count : int
                          Not used in this particular scoring technique.
                          """
                          #...

                          return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


                          Another implemented score function is npmi_scorer based on a paper by G. Bouma.



                          I think n-grams for n>2 are done by applying bigram detection n-1 times.



                          If min_count (i.e. $delta$) was zero and if instead len_vocab we multiplied by corpus_word_count, then the result of original_scorer would be essentially the ratio of the probability to see wordb following worda and the unconditional probability to see wordb at a random position, that is how many times the presence of worda increases the probability to see wordb in the next position.



                          I cannot understand why gensim chose to use len_vocab here, but perhaps they had some reason to. You can pass your own scoring function as well.







                          share|improve this answer














                          share|improve this answer



                          share|improve this answer








                          edited Oct 26 '18 at 13:05

























                          answered Oct 26 '18 at 12:43









                          ValentasValentas

                          382314




                          382314



























                              draft saved

                              draft discarded
















































                              Thanks for contributing an answer to Data Science Stack Exchange!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid


                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.

                              Use MathJax to format equations. MathJax reference.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function ()
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f25524%2fhow-does-phrases-in-gensim-work%23new-answer', 'question_page');

                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              Adding axes to figuresAdding axes labels to LaTeX figuresLaTeX equivalent of ConTeXt buffersRotate a node but not its content: the case of the ellipse decorationHow to define the default vertical distance between nodes?TikZ scaling graphic and adjust node position and keep font sizeNumerical conditional within tikz keys?adding axes to shapesAlign axes across subfiguresAdding figures with a certain orderLine up nested tikz enviroments or how to get rid of themAdding axes labels to LaTeX figures

                              Luettelo Yhdysvaltain laivaston lentotukialuksista Lähteet | Navigointivalikko

                              Gary (muusikko) Sisällysluettelo Historia | Rockin' High | Lähteet | Aiheesta muualla | NavigointivalikkoInfobox OKTuomas "Gary" Keskinen Ancaran kitaristiksiProjekti Rockin' High