Automatic labelling of text data based on predefined entities

I'm new to NLP.
I have a folder of .txt files containing legal documents of a specific kind. I want to label all of these files based on four predefined labels. How can I do that automatically?

nlp data labels

asked Mar 25 at 13:41 by GiuliaC.

  • What kinds of labels? Do you want to put the entire text into categories, or do you want to find where in the text a category can be found? Please give some details on what you are trying to achieve.
    – Simon Larsson
    Mar 25 at 13:49

  • The labels are: 1) money, 2) judge, 3) tribunal, 4) state of the sentence (a binary: rejected or accepted). I want to label every text file with these labels, so ideally I want to put the files and their corresponding labels in the same folder.
    – GiuliaC.
    Mar 25 at 14:07

1 Answer

The task you have is called named-entity recognition (NER). From Wikipedia:




Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.




Since this is a common NLP task, there are libraries that do NER out of the box. One such library is spaCy, which can do NER as well as many other NLP tasks in Python.
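
As an aside (not part of the original answer), spaCy's pretrained pipelines can already tag generic entities such as MONEY, DATE and ORG out of the box. A minimal sketch, assuming the en_core_web_sm model has been installed (python -m spacy download en_core_web_sm):

import spacy

# Load a pretrained English pipeline (assumed to be installed beforehand).
nlp = spacy.load("en_core_web_sm")

doc = nlp("The tribunal ordered the defendant to pay $5,000 on 12 March 2019.")
for ent in doc.ents:
    # Prints each detected span with its built-in label, e.g. MONEY or DATE.
    print(ent.text, ent.label_)

The built-in label set does not include custom categories like Judge or Tribunal, which is why the next step is to train the entity recognizer on your own annotations.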



You will not be able to perform NER without first training a model on your custom labels/entities. You need to have some labelled data to train on; maybe you already have this, or you can label it manually. spaCy wants you to have the data labelled with the location of each entity, in the following format:



[("legal text here", "entities": [(Start index, End index, "Money"), 
(Start index, End index, "Judge"),
(Start index, End index, "Tribunal"),
(Start index, End index, "State")]),
("legal text here", "entities": [(Start index, End index, "Money"),
(Start index, End index, "Judge"),
(Start index, End index, "Tribunal"),
(Start index, End index, "State")])
...]
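
For illustration (this example is not from the original post), a single hypothetical training entry could look like the following; the offsets are character positions within the text, with the start index inclusive and the end index exclusive:

# Hypothetical training example with explicit character offsets.
text = "Judge Smith of the Milan Tribunal awarded 5000 EUR."

TRAIN_DATA = [
    (text, {"entities": [(0, 11, "Judge"),      # "Judge Smith"
                         (19, 33, "Tribunal"),  # "Milan Tribunal"
                         (42, 50, "Money")]}),  # "5000 EUR"
]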


An example of how to train a spaCy model for NER (taken from the spaCy docs):



from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

# training data
TRAIN_DATA = []  # insert your labelled training data here, in the format shown above


@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, output_dir=None, n_iter=100):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # reset and initialize the weights randomly – but only if we're
        # training a new model
        if model is None:
            nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)

    # test the trained model
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA:
            doc = nlp2(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])


if __name__ == "__main__":
    plac.call(main)


Then, once you have a trained model, you can use it to extract your entities:



doc = nlp('put legal text to test your model here')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)





answered Mar 25 at 14:38, edited Mar 25 at 16:26 – Simon Larsson

  • I think OP is talking about document classification rather than NER.
    – Esmailian
    Mar 25 at 17:00

  • For me the labels only made sense with NER. Money, judge, tribunal, and state of the sentence do not really seem like categories to sort entire documents into. But the question is quite vague, so my assumption might be off. :)
    – Simon Larsson
    Mar 25 at 17:07
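
If, as the first comment above suggests, the goal is actually whole-document classification rather than NER, a minimal baseline sketch could use TF-IDF features with a linear classifier in scikit-learn. This is not part of the original answer, and the texts and labels below are hypothetical placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder data: one string per .txt file and one label per file.
texts = ["full text of the first legal document ...",
         "full text of the second legal document ..."]
labels = ["accepted", "rejected"]  # e.g. the binary "state of the sentence"

# Bag-of-words baseline: TF-IDF features fed into a logistic regression.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

# Predict the label of an unseen document.
print(clf.predict(["text of a new legal document ..."]))

In practice you would need many labelled documents per class, plus a held-out set to estimate accuracy, before trusting such a classifier.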










