How to create word2vec for phrases and then calculate cosine similarity
I have just started using word2vec, and I have no idea how to create vectors (using word2vec) for two lists, each containing a set of words and phrases, or how to calculate the cosine similarity between these two lists.

For example:

list1 = ['blogs', 'vmware', 'server', 'virtual', 'oracle update', 'virtualization', 'application', 'infrastructure', 'management']
list2 = ['microsoft visual studio', 'desktop virtualization', 'microsoft exchange server', 'cloud computing', 'windows server 2008']

Any help would be appreciated.

python word2vec data-analysis cosine-distance
asked Apr 5 at 7:49 by user3778289 · edited Apr 5 at 9:06 by timleathart
2 Answers
Vector representations of phrases (called term vectors) are used in projects such as search-result optimization and question answering. A textbook example is "Chinese river" ~ "Yangtze_River", "Qiantang_River" (https://code.google.com/archive/p/word2vec/).

That example identifies phrases based on nouns listed in the Freebase DB. There are alternatives, such as:

- Identify nouns and other phrases via POS tagging
- Identify all bi-grams and tri-grams

Then filter the candidate list by usage, e.g. only retain terms that occur at least 500 times in a large corpus such as Wikipedia.

Once terms have been identified, the word-vector algorithm works as is:

- Train a word-vector model
- Concatenate phrases into single tokens and retrain the model
- Merge the two models

The following patent from Google has more details:
https://patents.google.com/patent/CN106776713A/en

Other papers with examples of domains where term vectors have been evaluated or used:
https://arxiv.org/ftp/arxiv/papers/1801/1801.01884.pdf
https://www.cs.cmu.edu/~lbing/pub/KBS2018_bing.pdf
https://www.sciencedirect.com/science/article/pii/S1532046417302769
http://resources.mpi-inf.mpg.de/departments/d5/teaching/ws15_16/irdm/slides/irdm2015-ch13-part2-handout.pdf

answered Apr 5 at 10:06 by Shamit Verma
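The bigram-identification and token-merging steps above can be sketched in plain Python. This is a minimal illustration only: in practice a library such as gensim's `Phrases` does this with a proper scoring threshold, and the toy corpus and `min_count=2` below are stand-ins for a large corpus and a much higher count cutoff.

```python
# Minimal sketch: count bigrams in a toy corpus and merge frequent
# ones into single tokens (e.g. "cloud computing" -> "cloud_computing")
# before training a word-vector model on the merged sentences.
from collections import Counter

def merge_frequent_bigrams(sentences, min_count=2):
    # Count every adjacent word pair across all sentences.
    bigram_counts = Counter(
        (a, b) for s in sentences for a, b in zip(s, s[1:])
    )
    frequent = {bg for bg, c in bigram_counts.items() if c >= min_count}
    merged = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            # Greedily join a frequent bigram into one token.
            if i + 1 < len(s) and (s[i], s[i + 1]) in frequent:
                out.append(s[i] + "_" + s[i + 1])
                i += 2
            else:
                out.append(s[i])
                i += 1
        merged.append(out)
    return merged

corpus = [
    ["we", "use", "cloud", "computing"],
    ["cloud", "computing", "is", "growing"],
    ["a", "cloud", "in", "the", "sky"],
]
print(merge_frequent_bigrams(corpus))
```

The merged sentences can then be fed to any word2vec trainer, which will learn a single vector for `cloud_computing` distinct from `cloud`.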
You cannot apply word2vec to multiple words directly. You could use something like doc2vec, which gives a vector for each phrase:

phrase = model.infer_vector(['microsoft', 'visual', 'studio'])

You can also average or sum the vectors of the words (from word2vec) in each phrase, e.g.

phrase = w2v('microsoft') + w2v('visual') + w2v('studio')

This way a phrase vector has the same length as a word vector, so the two can be compared. Still, methods like doc2vec work better than a simple average or sum. You could then compare each word in the first list to every phrase in the second list and find the closest phrase.

Note that a phrase like "cloud computing" has a completely different meaning than the word "cloud". Such phrases, especially frequent ones, are better treated as single tokens, e.g.

phrase = w2v('cloud_computing')

Extra directions:

- Here is an answer by Astariul on Stack Overflow that uses a function from the word2vec package to calculate similarity between two sets of words.
- Take a look at fastText, which works better when there are many misspelled or out-of-vocabulary words.

answered Apr 5 at 8:38, edited Apr 5 at 17:41, by Esmailian

Comments:

– user3778289 (Apr 5 at 8:46): @Esmailian I want to find the phrase in list2 that is most similar to all the words in list1. Can't I do this with word2vec?
– user3778289 (Apr 5 at 11:24): Are you saying I can't use word2vec for phrases?
– Esmailian (Apr 5 at 11:27): @user3778289 If the phrase is treated as a single unique word, yes; as a composition of words, no, unless you use the average or sum I put into the answer.
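The average-and-compare approach from this answer can be sketched as follows. The tiny `w2v` lookup table and its 3-dimensional vectors are purely hypothetical stand-ins for a trained model (in practice these would be lookups into e.g. gensim `KeyedVectors`, with hundreds of dimensions):

```python
# Sketch: average word vectors into a phrase vector, then compare
# phrases with cosine similarity. The embedding table is a toy
# placeholder, not real word2vec output.
import math

w2v = {
    "microsoft": [0.9, 0.1, 0.0],
    "visual":    [0.7, 0.3, 0.1],
    "studio":    [0.8, 0.2, 0.1],
    "cloud":     [0.1, 0.9, 0.2],
    "computing": [0.2, 0.8, 0.3],
}

def phrase_vector(phrase):
    """Average the vectors of all in-vocabulary words in the phrase."""
    vecs = [w2v[w] for w in phrase.split() if w in w2v]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = phrase_vector("microsoft visual studio")
b = phrase_vector("cloud computing")
print(cosine(a, a))  # identical phrases score close to 1.0
print(cosine(a, b))  # dissimilar phrases score lower
```

To answer the question in the comments, the same idea extends to comparing all of list1 against each phrase in list2: average list1's word vectors into one vector and pick the list2 phrase whose vector has the highest cosine similarity (gensim exposes this set-to-set comparison directly as `n_similarity`).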