How can I output tokens from MWE Tokenizer?

How to output the tokens produced using MWE Tokenizer?

NLTK's multi-word expression tokenizer (MWETokenizer) provides a method/function add_mwe() that allows the user to enter multiple word expressions prior to using the tokenizer on the text.

Currently, I have a file consisting of phrases / multi-word expressions that I want to use with the tokenizer. My concern is that I am not presenting the phrases to the function correctly, and so not getting the desired set of tokens when tokenizing the incoming text.

So this leads me to ask if anyone knows how to output the token generated by add_mwe() so that I can verify that I am correctly passing the phrase to the function?

asked Mar 18 at 19:36 by Paul (new contributor); edited yesterday by tuomastik

nlp nltk tokenization

1 Answer
You can check the exact input and output parameters of the add_mwe method in NLTK's documentation for the MWETokenizer class.

This is the expected input:

>>> tokenizer.add_mwe(('in', 'spite', 'of'))

So each phrase must simply be a tuple containing the words of that phrase. If you provide that input, you should get the output you expect (in_spite_of). I've added a full snippet of working code at the end for convenience; it shows how to use the class as intended.
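Since the question mentions a file of phrases, here is a minimal sketch of turning such a file into the tuples that add_mwe expects. The file layout (one expression per line, words separated by whitespace) and the sample contents are assumptions made for illustration, and an in-memory io.StringIO stands in for the real file.

```python
import io

from nltk.tokenize.mwe import MWETokenizer

# Hypothetical stand-in for a phrase file: one expression per line.
phrase_file = io.StringIO("in spite of\nmulti word expression\n")

mwe = MWETokenizer()
for line in phrase_file:
    words = tuple(line.split())  # 'in spite of' -> ('in', 'spite', 'of')
    if words:                    # skip blank lines
        mwe.add_mwe(words)

# tokenize() expects a list of tokens, not a raw string.
tokens = "He went out in spite of the rain".split()
result = mwe.tokenize(tokens)
print(result)
```

Splitting each line into a tuple matters: a plain string passed to add_mwe would be treated as a sequence of single characters rather than a sequence of words.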

Regarding the output of add_mwe: every time you call the method it adds a new expression to the tokenizer's internal dictionary, and all the expressions are stored in the instance's _mwes attribute. So, given mwe = MWETokenizer(), you can inspect its contents (e.g. with print(mwe._mwes)) to see what the class actually stores.

As stated in the documentation, it is actually a trie containing all the terms, so it won't look exactly like the phrases you added (it is a more efficient representation of them). The documentation mentioned above has more details on that.
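To read the stored expressions back in their original form, the trie can be walked recursively. The sketch below assumes the trie is a nested dict in which the key True marks the end of a phrase, which is how NLTK's Trie class represents leaves; stored_phrases is a hypothetical helper, not part of NLTK, and _mwes is a private attribute that may change between versions.

```python
from nltk.tokenize.mwe import MWETokenizer


def stored_phrases(trie, prefix=()):
    """Yield every phrase stored in the nested-dict trie as a tuple of words."""
    for key, child in trie.items():
        if key is True:
            # Leaf marker: the words collected so far form a complete phrase.
            yield prefix
        else:
            yield from stored_phrases(child, prefix + (key,))


mwe = MWETokenizer()
mwe.add_mwe(('in', 'spite', 'of'))
mwe.add_mwe(('in', 'front', 'of'))

print(mwe._mwes)  # raw nested-dict view of the trie

phrases = sorted(stored_phrases(mwe._mwes))
print(phrases)    # the phrases as tuples, in sorted order
```

Comparing this list with the phrases you intended to add is a quick way to confirm that each one was passed in the right shape.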

Hope this helps!

from nltk import (
    sent_tokenize as splitter,       # splits text into sentences (needs NLTK's Punkt data)
    wordpunct_tokenize as tokenizer  # splits a sentence into word and punctuation tokens
)

from nltk.tokenize.mwe import MWETokenizer

test = """Anyone know how to output the tokens produced using MWE Tokenizer?

For a clearer explanation of what I am asking for those who did not understand my original brief question.

The multi-word expression tokenizer (MWETokenizer) provides a method/function (add_mwe()) that allows the user to enter multiple word expressions prior to using the tokenizer on text. Currently I have a file consisting of phrases / multi-word expression I want to use with the tokenizer. My concern is that the manner in which I am presenting the phrases to the function correctly and so not resulting in the desired set of tokens to be used in tokenizing the incoming text. So this leads me to ask if anyone knows how to output the token generated by this method/function so that I can verify that I am correctly passing the phrase to the function (add_mwe()).?"""

mwe = MWETokenizer()

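# Each multi-word expression is a tuple of already-tokenized words.
phrases = [
    ('multi', '-', 'word'),
    ('expression', 'tokenizer'),
    ('word', 'expressions'),
    ('multi', '-', 'word', 'expression')
]

# Register every phrase with the tokenizer.
for phrase in phrases:
    mwe.add_mwe(phrase)


# Print each sentence before and after the multi-word expressions are merged.
for sent in splitter(test):
    tokens = tokenizer(sent)
    print(' '.join(tokens))
    print(' '.join(mwe.tokenize(tokens)))
    print('---')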



# Expected output:
#
# Anyone know how to output the tokens produced using MWE Tokenizer ?
# Anyone know how to output the tokens produced using MWE Tokenizer ?
# ---
# For a clearer explanation of what I am asking for those who did not understand my original brief question .
# For a clearer explanation of what I am asking for those who did not understand my original brief question .
# ---
# The multi - word expression tokenizer ( MWETokenizer ) provides a method / function ( add_mwe ()) that allows the user to enter multiple word expressions prior to using the tokenizer on text .
# The multi_-_word_expression tokenizer ( MWETokenizer ) provides a method / function ( add_mwe ()) that allows the user to enter multiple word_expressions prior to using the tokenizer on text .
# ---
# Currently I have a file consisting of phrases / multi - word expression I want to use with the tokenizer .
# Currently I have a file consisting of phrases / multi_-_word_expression I want to use with the tokenizer .
# ---
# ...

answered yesterday by JordiCarrera (edited yesterday)

Thank you! Exactly what I was hoping to learn.
– Paul, yesterday
