How to create word2vec for phrases and then calculate cosine similarity
I have just started using word2vec, and I have no idea how to create vectors (using word2vec) for two lists, each containing a set of words and phrases, or how to calculate the cosine similarity between these two lists.

For example:

list1 = ['blogs', 'vmware', 'server', 'virtual', 'oracle update', 'virtualization', 'application', 'infrastructure', 'management']
list2 = ['microsoft visual studio', 'desktop virtualization', 'microsoft exchange server', 'cloud computing', 'windows server 2008']

Any help would be appreciated.

python word2vec data-analysis cosine-distance
asked Apr 5 at 7:49 by user3778289 · edited Apr 5 at 9:06 by timleathart
2 Answers
Vector representations of phrases (called term vectors) are used in projects such as search-result optimization and question answering. A textbook example is "Chinese river" ~ "Yangtze_River", "Qiantang_River" (https://code.google.com/archive/p/word2vec/).

That example identifies phrases based on nouns listed in the Freebase DB. There are alternatives, such as:

- Identify nouns and other phrases via POS tagging
- Identify all bi-grams and tri-grams

Then filter the candidate list by usage, e.g. only retain terms that occur at least 500 times in a large corpus such as Wikipedia.

Once terms have been identified, the word-vector algorithm works as is:

- Train a word-vector model
- Concatenate phrases into single tokens and retrain the model
- Merge the two models

The following patent from Google has more details:
https://patents.google.com/patent/CN106776713A/en

Other papers with examples of domains where term vectors have been evaluated or used:
https://arxiv.org/ftp/arxiv/papers/1801/1801.01884.pdf
https://www.cs.cmu.edu/~lbing/pub/KBS2018_bing.pdf
https://www.sciencedirect.com/science/article/pii/S1532046417302769
http://resources.mpi-inf.mpg.de/departments/d5/teaching/ws15_16/irdm/slides/irdm2015-ch13-part2-handout.pdf

answered Apr 5 at 10:06 by Shamit Verma
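The bigram-identification and token-merging steps above can be sketched in plain Python. This is a minimal illustration only: in practice a library such as gensim's `Phrases` does this with a proper scoring threshold, and the toy corpus and `min_count=2` below are stand-ins for a large corpus and a much higher count cutoff.

```python
# Minimal sketch: count bigrams in a toy corpus and merge frequent
# ones into single tokens (e.g. "cloud computing" -> "cloud_computing")
# before training a word-vector model on the merged sentences.
from collections import Counter

def merge_frequent_bigrams(sentences, min_count=2):
    # Count every adjacent word pair across all sentences.
    bigram_counts = Counter(
        (a, b) for s in sentences for a, b in zip(s, s[1:])
    )
    frequent = {bg for bg, c in bigram_counts.items() if c >= min_count}
    merged = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            # Greedily join a frequent bigram into one token.
            if i + 1 < len(s) and (s[i], s[i + 1]) in frequent:
                out.append(s[i] + "_" + s[i + 1])
                i += 2
            else:
                out.append(s[i])
                i += 1
        merged.append(out)
    return merged

corpus = [
    ["we", "use", "cloud", "computing"],
    ["cloud", "computing", "is", "growing"],
    ["a", "cloud", "in", "the", "sky"],
]
print(merge_frequent_bigrams(corpus))
```

The merged sentences can then be fed to any word2vec trainer, which will learn a single vector for `cloud_computing` distinct from `cloud`.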
You cannot apply word2vec to multiple words directly. You could use something like doc2vec, which gives a vector for each phrase:

phrase = model.infer_vector(['microsoft', 'visual', 'studio'])

You can also average or sum the vectors of the words (from word2vec) in each phrase, e.g.

phrase = w2v('microsoft') + w2v('visual') + w2v('studio')

This way a phrase vector has the same length as a word vector, so the two can be compared. Still, methods like doc2vec work better than a simple average or sum. You could then compare each word in the first list to every phrase in the second list and find the closest phrase.

Note that a phrase like "cloud computing" has a completely different meaning than the word "cloud". Such phrases, especially frequent ones, are better treated as single tokens, e.g.

phrase = w2v('cloud_computing')

Extra directions:

- Here is an answer by Astariul on Stack Overflow that uses a function from the word2vec package to calculate similarity between two sets of words.
- Take a look at fastText, which works better when there are many misspelled or out-of-vocabulary words.

answered Apr 5 at 8:38, edited Apr 5 at 17:41, by Esmailian

Comments:

– user3778289 (Apr 5 at 8:46): @Esmailian I want to find the phrase in list2 that is most similar to all the words in list1. Can't I do this with word2vec?
– user3778289 (Apr 5 at 11:24): Are you saying I can't use word2vec for phrases?
– Esmailian (Apr 5 at 11:27): @user3778289 If the phrase is treated as a single unique word, yes; as a composition of words, no, unless you use the average or sum I put into the answer.
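The average-and-compare approach from this answer can be sketched as follows. The tiny `w2v` lookup table and its 3-dimensional vectors are purely hypothetical stand-ins for a trained model (in practice these would be lookups into e.g. gensim `KeyedVectors`, with hundreds of dimensions):

```python
# Sketch: average word vectors into a phrase vector, then compare
# phrases with cosine similarity. The embedding table is a toy
# placeholder, not real word2vec output.
import math

w2v = {
    "microsoft": [0.9, 0.1, 0.0],
    "visual":    [0.7, 0.3, 0.1],
    "studio":    [0.8, 0.2, 0.1],
    "cloud":     [0.1, 0.9, 0.2],
    "computing": [0.2, 0.8, 0.3],
}

def phrase_vector(phrase):
    """Average the vectors of all in-vocabulary words in the phrase."""
    vecs = [w2v[w] for w in phrase.split() if w in w2v]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = phrase_vector("microsoft visual studio")
b = phrase_vector("cloud computing")
print(cosine(a, a))  # identical phrases score close to 1.0
print(cosine(a, b))  # dissimilar phrases score lower
```

To answer the question in the comments, the same idea extends to comparing all of list1 against each phrase in list2: average list1's word vectors into one vector and pick the list2 phrase whose vector has the highest cosine similarity (gensim exposes this set-to-set comparison directly as `n_similarity`).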