What algorithms should I use to perform job classification based on resume data?


Note that I am doing everything in R.

The problem is as follows:

I have a list of resumes (CVs). Some candidates have prior work experience and some do not. The goal is to classify candidates into job sectors based on the text of their CVs. I am particularly interested in candidates who have no experience or are still students: for them, I want to predict which job sector they will most likely belong to after graduation.

Question 1: I know machine learning algorithms. However, I have never done NLP before. I came across Latent Dirichlet allocation on the internet. However, I am not sure if this is the best approach to tackle my problem.

My original idea: make this a supervised learning problem.
Suppose we already have a large amount of labelled data, meaning that the job sectors for a list of candidates are correctly labelled. We train a model using ML algorithms (e.g. nearest neighbor) and then feed in the unlabelled data, i.e. candidates who have no work experience or are students, to predict which job sector they will belong to.

Update
Question 2: Would it be a good idea to extract everything in a resume into a text file, so that each resume is associated with one text file containing unstructured strings, and then apply text mining techniques to these files to structure the data, or even to create a frequency matrix of the terms used? For example, the text file may look something like this:

I deployed ML algorithm in this project and... Skills: Java, Python, c++ ...

This is what I meant by 'unstructured', i.e. collapsing everything into a single line string.

Is this approach wrong? Please correct me if you think it is.

Question 3: The tricky part is how to identify and extract the keywords. Should I use the tm package in R? What algorithm is the tm package based on? Should I use NLP algorithms? If so, which ones should I look at? Please point me to some good resources as well.

Any ideas would be great.

edited Jul 9 '14 at 0:19 by Stephane Rolland; asked Jul 3 '14 at 16:11 by user1769197


machine-learning classification nlp text-mining


4 Answers

Check out this link.

Here, they will take you through loading unstructured text to creating a wordcloud. You can adapt this strategy and instead of creating a wordcloud, you can create a frequency matrix of terms used. The idea is to take the unstructured text and structure it somehow. You change everything to lowercase (or uppercase), remove stop words, and find frequent terms for each job function, via Document Term Matrices. You also have the option of stemming the words. If you stem words you will be able to detect different forms of words as the same word. For example, 'programmed' and 'programming' could be stemmed to 'program'. You can possibly add the occurrence of these frequent terms as a weighted feature in your ML model training.

You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function.

Example:

1) Load libraries and build the example data

library(tm)
library(SnowballC)

doc1 = "I am highly skilled in Java Programming. I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
job1 = "Software Engineer"
doc2 = "Tested new software releases for major program enhancements. Designed and executed test procedures and worked with relational databases. I helped organize and lead meetings and work independently and in a group setting."
job2 = "Quality Assurance"
doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers. Perform database design and debugging of current releases."
job3 = "Software Engineer"
jobInfo = data.frame("text" = c(doc1,doc2,doc3),
 "job" = c(job1,job2,job3),
 stringsAsFactors = FALSE) # keep text as character rather than factor (matters on R < 4.0)

2) Now we do some text structuring. I am positive there are quicker/shorter ways to do the following.

# Convert to lowercase
jobInfo$text = sapply(jobInfo$text,tolower)

# Remove Punctuation
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[[:punct:]]"," ",x))

# Remove extra white space
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[ ]+"," ",x))

# Remove stop words (filter with %in%; setdiff would also drop repeated words and distort frequencies)
jobInfo$text = sapply(jobInfo$text, function(x) {
 words = strsplit(x," ")[[1]]
 paste(words[!words %in% stopwords()],collapse=" ")
})

# Stem words (Also try without stemming?)
jobInfo$text = sapply(jobInfo$text, function(x) {
 stems = wordStem(strsplit(x," ")[[1]])
 paste(stems[stems != ""],collapse=" ")
})

3) Make a corpus source and document term matrix.

# Create Corpus Source
jobCorpus = Corpus(VectorSource(jobInfo$text))

# Create Document Term Matrix
jobDTM = DocumentTermMatrix(jobCorpus)

# Create Term Frequency Matrix
jobFreq = as.matrix(jobDTM)

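Now we have the frequency matrix, jobFreq, which is a 3-by-x matrix: 3 documents (rows) and x distinct terms (columns). Note that DocumentTermMatrix by default ignores terms shorter than three characters, so a skill such as "c" is dropped unless you pass control = list(wordLengths = c(1, Inf)).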

Where you go from here is up to you. You can keep only specific (more common) words and use them as features in your model. Another way is to keep it simple and have a percentage of words used in each job description, say "java" would have 80% occurrence in 'software engineer' and only 50% occurrence in 'quality assurance'.
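As a rough illustration of the "percentage of occurrence per job" idea, here is a minimal base-R sketch. The small matrix below is made-up toy data standing in for jobFreq, and the variable names present, counts and occurrence are only assumptions for this example.

```r
# Toy term-frequency matrix (rows = resumes, columns = terms); values are illustrative only
jobFreq <- matrix(
  c(2, 0, 1,
    0, 3, 1,
    1, 0, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("java", "test", "releas"))
)
job <- c("Software Engineer", "Quality Assurance", "Software Engineer")

# 1 if the term appears in the resume at all, 0 otherwise
present <- (jobFreq > 0) * 1

# Number of resumes per job class that mention each term
counts <- rowsum(present, group = job)

# Share of resumes in each job class that mention each term
occurrence <- counts / as.vector(table(job)[rownames(counts)])
print(occurrence)
```

Each row of occurrence can then be read as a profile of a job class, or the columns with the largest differences between classes can be kept as features.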

Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.

edited Apr 2 at 2:47 by Stephen Rauch♦; answered Jul 3 '14 at 17:06 by nfmcclure

I would love to see your example. – user1769197 Jul 3 '14 at 22:03

Updated with quick example. – nfmcclure Jul 3 '14 at 22:52

Just extract keywords and train a classifier on them. That's all, really.

Most of the text in CVs is not actually related to skills. E.g. consider the sentence "I'm experienced and highly efficient in Java". Here only 1 out of 7 words is a skill name; the rest is just noise that is going to drag your classification accuracy down.

Most CVs are not really structured. Or they are structured too freely. Or they use unusual names for sections. Or they come in file formats that don't preserve structure when converted to text. I have experience extracting dates, times, names, addresses and even people's intents from unstructured text, but not a list of skills (or universities, or anything similar), not even close.

So just tokenize (and possibly stem) your CVs, select only words from a predefined list (you can use LinkedIn or something similar to grab this list), create a feature vector and try out a couple of classifiers (say, SVM and Naive Bayes).
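The steps above could be sketched roughly as follows in base R. Everything here is an assumption made for illustration: the skills vocabulary, the toy CVs, and the hand-written Bernoulli Naive Bayes (train_nb, predict_nb), which stands in for whatever classifier package you would actually use.

```r
# Hypothetical skill vocabulary; in practice this would come from an external list
skills <- c("java", "python", "sql", "testing", "selenium")

# Toy labelled CVs (illustrative text, not real data)
cvs <- c("I'm experienced and highly efficient in Java and SQL",
         "Wrote Python and Java services backed by SQL databases",
         "Manual testing and Selenium test automation",
         "Regression testing with Selenium and some SQL")
sector <- c("dev", "dev", "qa", "qa")

# Tokenize: lowercase, then split on anything that is not a letter, digit, + or #
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z0-9+#]+")[[1]]
  tokens[tokens != ""]
}

# Binary feature vector: does the CV mention each skill?
featurize <- function(text) as.integer(skills %in% tokenize(text))

X <- t(sapply(cvs, featurize, USE.NAMES = FALSE))
colnames(X) <- skills

# Bernoulli Naive Bayes with add-one smoothing
train_nb <- function(X, y) {
  classes <- sort(unique(y))
  prior <- sapply(classes, function(k) mean(y == k))
  # P(skill present | class)
  cond <- t(sapply(classes, function(k) {
    (colSums(X[y == k, , drop = FALSE]) + 1) / (sum(y == k) + 2)
  }))
  list(classes = classes, prior = prior, cond = cond)
}

predict_nb <- function(model, x) {
  log_post <- log(model$prior) +
    as.vector(log(model$cond) %*% x + log(1 - model$cond) %*% (1 - x))
  model$classes[which.max(log_post)]
}

model <- train_nb(X, sector)
print(predict_nb(model, featurize("Student project in Python and Java")))
```

Restricting the features to the vocabulary is what removes the noise words; the classifier itself is interchangeable.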

(Note: I used a similar approach to classify LinkedIn profiles into more than 50 classes with accuracy > 90%, so I'm pretty sure even naive implementation will work well.)

edited May 23 '17 at 12:38 by Community♦; answered Jul 4 '14 at 22:46 by ffriend

Say I am analyzing LinkedIn data: do you think it would be a good idea to merge the previous work experience, education, recommendations and skills of one profile into one text file and extract keywords from it? – user1769197 Jul 5 '14 at 14:46

LinkedIn now has skill tags that people assign themselves and other users can endorse, so basically there's no need to extract keywords manually. But in the case of less structured data, yes, it may be helpful to merge everything and then retrieve keywords. However, remember the main rule: try it out. Theory is good, but only practical experiments with different approaches will reveal the best one. – ffriend Jul 5 '14 at 20:33

@ffriend, how do we get that keyword list? – NG_21 Jan 28 '16 at 9:53

@ffriend What is the best way to extract "experience" = '5 years', "Language" = 'C' from the following sentence: "I have spent 5 years developing bug-tracking systems and creating data managing system applications in C". I used RAKE with NLTK and it just removed the stopwords and punctuation, but from the above sentence I don't need words like developing, bug-tracking, systems, creating, data etc. Thanks – Khalid Usman Oct 4 '16 at 13:27

@KhalidUsman: since you already work with NLTK, take a look at named entity recognition tools, especially the "Chunking with Regular Expressions" section. In general, you would want to use a dictionary of keywords (e.g. "years", "C", etc.) and a simple set of rules (like "contains 'C'" or "<number> years") to extract named entities out of free-form text. – ffriend Oct 4 '16 at 18:41
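Since the question is about R, the dictionary-plus-rules idea from the last comment might look something like this in base R (not NLTK). The languages list and the variable names are assumptions for the example; the sentence is the one quoted in the comments.

```r
sentence <- "I have spent 5 years developing bug-tracking systems and creating data managing system applications in C"

# Rule: a number followed by "year" or "years"; the parentheses capture the number
m <- regmatches(sentence, regexec("([0-9]+)[[:space:]]+years?", sentence))[[1]]
experience_years <- if (length(m) > 0) as.integer(m[2]) else NA_integer_

# Dictionary rule: keep only tokens that appear in a (hypothetical) language list.
# Matching is case-sensitive so that "C" is not confused with ordinary words.
languages <- c("C", "C++", "Java", "Python", "R")
tokens <- strsplit(sentence, "[^A-Za-z0-9+#]+")[[1]]
found_languages <- intersect(languages, tokens)

print(experience_years)
print(found_languages)
```

Rules like these are brittle on real CVs, so they are best treated as a first pass whose output is checked by hand.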


This is a tricky problem, and there are many ways to handle it. I think resumes can be treated as semi-structured documents, and it is sometimes beneficial to have some minimal structure in the documents. In resumes you will often see some tabular data, which you might want to treat as attribute-value pairs. For example, you would get a list of terms for the attribute "Skill set".

The key idea is to manually configure a list of key phrases such as "skill", "education", "publication" etc. The next step is to extract terms which pertain to these key phrases, either by exploiting the structure in some way (such as tables) or by utilizing the proximity of terms around these key phrases, e.g. the fact that the word "Java" is in close proximity to the term "skill" might indicate that the person is skilled in Java.

After you extract this information, the next step could be to build up a feature vector for each of these key phrases. You can then represent a document as a vector with different fields (one for each key phrase). For example, consider the following two resumes represented with two fields, namely project and education.

Doc1: project: (java, 3) (c, 4), education: (computer, 2), (physics, 1)

Doc2: project: (java, 3) (python, 2), education: (maths, 3), (computer, 2)

In the above example, each term is shown with its frequency. Of course, while extracting the terms you need to stem and remove stop-words. It is clear from the examples that the person whose resume is Doc1 is more skilled in C than the person whose resume is Doc2. Implementation-wise, it's very easy to represent documents as field vectors in Lucene.

Now, the next step is to retrieve a ranked list of resumes given a job specification. In fact, that's fairly straightforward if you represent queries (job specs) as field vectors as well. You just need to retrieve a ranked list of candidates (resumes) using Lucene from a collection of indexed resumes.
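To make the field-vector idea concrete without Lucene, here is a minimal base-R sketch that ranks the two example resumes against a job specification. The job_spec contents, the use of cosine similarity, and the averaging over fields are all assumptions chosen for illustration; Lucene's own scoring works differently.

```r
# Field vectors from the example above: named term-frequency vectors per field
doc1 <- list(project = c(java = 3, c = 4), education = c(computer = 2, physics = 1))
doc2 <- list(project = c(java = 3, python = 2), education = c(maths = 3, computer = 2))

# Hypothetical job specification expressed with the same fields
job_spec <- list(project = c(java = 1, python = 1), education = c(computer = 1))

# Expand a named vector to a common set of terms, filling missing terms with 0
align <- function(v, terms) {
  out <- setNames(numeric(length(terms)), terms)
  out[names(v)] <- v
  out
}

# Cosine similarity between two named term-frequency vectors
cosine <- function(a, b) {
  terms <- union(names(a), names(b))
  va <- align(a, terms)
  vb <- align(b, terms)
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

# Score a resume as the average of its per-field similarities to the job spec
score <- function(doc, spec) {
  mean(sapply(names(spec), function(f) cosine(doc[[f]], spec[[f]])))
}

resumes <- list(Doc1 = doc1, Doc2 = doc2)
ranking <- sort(sapply(resumes, score, spec = job_spec), decreasing = TRUE)
print(ranking)
```

With this job spec, Doc2 ranks above Doc1 because it matches both Java and Python in the project field. Weighting fields differently would be a natural next step.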

answered Jul 3 '14 at 20:47 by Debasis

Algorithm-wise: what would you recommend? – user1769197 Jul 3 '14 at 21:59

You mean an algorithm for computing the most similar resume vectors given a query job vector? You can use any standard algorithm such as BM25 or a Language Model... – Debasis Jul 4 '14 at 11:35

I have never heard of these algorithms at all. Are these NLP algorithms or ML algorithms? – user1769197 Jul 4 '14 at 13:32

These are standard retrieval models... A retrieval model defines how to compute the similarity between a document (a resume in your case) and a query (a job in your case). – Debasis Jul 4 '14 at 15:23

I have no knowledge of information retrieval. Do you think machine learning algorithms like clustering / nearest neighbour will also work in my case? – user1769197 Jul 4 '14 at 16:37
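For readers who, like the asker, have not met retrieval models before, the following is a small base-R sketch of BM25, one of the models named in the comments above. The toy resumes, the query, and the parameter values k1 = 1.2 and b = 0.75 are assumptions for illustration, and the IDF form used is one common variant; a search engine's implementation may differ in details.

```r
# Toy indexed resumes, already tokenized (illustrative data)
resumes <- list(
  r1 = c("java", "java", "sql", "spring"),
  r2 = c("python", "pandas", "sql"),
  r3 = c("testing", "selenium", "java")
)
query <- c("java", "sql")  # a job spec reduced to its terms

bm25 <- function(query, docs, k1 = 1.2, b = 0.75) {
  N <- length(docs)
  avg_len <- mean(sapply(docs, length))
  sapply(docs, function(doc) {
    sum(sapply(query, function(term) {
      n_t <- sum(sapply(docs, function(d) term %in% d))  # documents containing the term
      idf <- log((N - n_t + 0.5) / (n_t + 0.5) + 1)      # rarer terms weigh more
      f <- sum(doc == term)                              # term frequency in this document
      # Term frequency saturates with k1 and is normalized by document length via b
      idf * f * (k1 + 1) / (f + k1 * (1 - b + b * length(doc) / avg_len))
    }))
  })
}

scores <- sort(bm25(query, resumes), decreasing = TRUE)
print(scores)
```

Here r1 scores highest because it contains both query terms, while r2 and r3 each match one term in a document of the same length and therefore tie.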


I work for an online jobs site and we build solutions to recommend jobs based on resumes. Our approach takes a person's job title (or desired job title, if the person is a student and it is known), along with skills we extract from their resume and their location (which is very important to most people), and finds matching jobs based on that.

In terms of document classification, I would take a similar approach. I would recommend computing tf-idf weights for each resume as a standard bag-of-words model, extracting just the person's job title and skills (for which you will need to define a list of skills to look for), and feeding that into an ML algorithm. I would recommend trying kNN and an SVM; the latter works very well with high-dimensional text data. Linear SVMs tend to do better than non-linear ones (e.g. using RBF kernels). Once that is outputting reasonable results, I would then experiment with extracting features using a natural language parser/chunker, and also some custom-built phrases matched by regexes.
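A minimal base-R sketch of the tf-idf plus kNN part of this recommendation is shown below. The toy documents, the labels, and the choice of cosine similarity with k = 1 are assumptions for illustration; an SVM would need an additional package and is not shown.

```r
# Toy resumes reduced to job-title and skill terms (illustrative data)
docs <- list(
  c("developer", "java", "sql"),
  c("developer", "python", "sql"),
  c("tester", "selenium", "testing"),
  c("tester", "testing", "sql")
)
labels <- c("software", "software", "qa", "qa")

vocab <- sort(unique(unlist(docs)))

# Count how often each vocabulary term occurs in a tokenized document
count_terms <- function(d) as.numeric(table(factor(d, levels = vocab)))

# Term-frequency matrix: one row per resume, one column per term
tf <- t(vapply(docs, count_terms, numeric(length(vocab))))
colnames(tf) <- vocab

# Inverse document frequency and tf-idf weights
idf <- log(length(docs) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# k-nearest neighbours by cosine similarity, majority vote among the neighbours
knn_predict <- function(x, train, labels, k = 1) {
  sims <- apply(train, 1, cosine, b = x)
  nearest <- order(sims, decreasing = TRUE)[seq_len(k)]
  names(which.max(table(labels[nearest])))
}

# New resume: a student who lists Java and SQL but has no job title yet
new_doc <- c("java", "sql")
prediction <- knn_predict(count_terms(new_doc) * idf, tfidf, labels, k = 1)
print(prediction)
```

Terms outside the training vocabulary are silently ignored here, which is usually what you want when the vocabulary is a curated skill list.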

answered Jul 7 '14 at 18:36 by Simon

Do you still use SVM when you have 3 or more classes? And what features do you want to extract using a natural language parser? For what purpose? – user1769197 Jul 8 '14 at 15:40

You can train n SVMs for n classes using a one-vs-the-rest strategy. scikit-learn has code to do that automatically. Technically you need n-1 classifiers, but I've found having n works better. – Simon May 18 '15 at 19:54

@Simon Can you write the complete steps for this recommendation system? I have a little experience in ML (I implemented my MS thesis), but I am totally new to the IR field. Now I am working on this system and I wrote the following steps: 1. Use NLTK to extract keywords, 2. Calculate scores for keywords and phrases, 3. Stemmer, 4. Categorization (the most challenging task) and 5. Frequency matrix, tf-idf or BM25 algorithm. Am I on the right track? Thanks – Khalid Usman Oct 4 '16 at 13:17

@KhalidUsman I can't tell you exactly how it works, as that may get me in trouble. The easiest solution would be to put the data into Solr or Elasticsearch and use their MLT recommender implementations. A more sophisticated approach is to extract key words and phrases, push the docs through LSA, and do k-NN on the resulting vectors. Then you may wish to use other signals such as collaborative filtering and overall popularity. – Simon Oct 16 '16 at 19:27

@Simon, thanks for your guidance. I am applying the second way: I have extracted keywords/keyphrases using RAKE+NLTK, and after that I was planning to apply tf-idf or BM25. Am I right? Can you please elaborate on the k-NN way a little bit? I mean, how do I apply k-NN to keywords; should I use the keywords as features? Thanks – Khalid Usman Oct 17 '16 at 11:03
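As a rough sketch of the "LSA, then k-NN" idea from the comments above, the keywords can indeed be used as features: build a resume-by-keyword count matrix, reduce it with a truncated SVD, and compare resumes in the reduced space. The toy matrix, the choice of k = 2 latent dimensions, and the helper names project and cosine are assumptions for illustration; this is not a description of any production system.

```r
# Toy keyword counts (rows = resumes, columns = keywords); illustrative data
X <- rbind(
  r1 = c(java = 2, sql = 1, selenium = 0, testing = 0),
  r2 = c(java = 1, sql = 2, selenium = 0, testing = 0),
  r3 = c(java = 0, sql = 0, selenium = 2, testing = 1),
  r4 = c(java = 0, sql = 1, selenium = 1, testing = 2)
)

# LSA: singular value decomposition, keeping only the k largest components
k <- 2
s <- svd(X)
doc_vecs <- s$u[, 1:k] %*% diag(s$d[1:k], nrow = k)  # resumes in the latent space
rownames(doc_vecs) <- rownames(X)

# Project a new keyword vector into the same latent space (x times V_k)
project <- function(x) as.vector(x %*% s$v[, 1:k])

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# A new resume that mentions only Java
new_resume <- c(java = 1, sql = 0, selenium = 0, testing = 0)
sims <- apply(doc_vecs, 1, cosine, b = project(new_resume))

# The two nearest neighbours in the latent space
nearest <- names(sort(sims, decreasing = TRUE))[1:2]
print(nearest)
```

The labels (or recommended jobs) of the nearest neighbours can then be used for the new resume. In practice the counts would usually be tf-idf weighted before the SVD.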


Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f662%2fwhat-algorithms-should-i-use-to-perform-job-classification-based-on-resume-data%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

4 Answers
4

active

oldest

votes

4 Answers
4

active

oldest

votes

Check out this link.

You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function.

Example:

1) Load libraries and build the example data

library(tm)
library(SnowballC)

doc1 = "I am highly skilled in Java Programming. I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
job1 = "Software Engineer"
doc2 = "Tested new software releases for major program enhancements. Designed and executed test procedures and worked with relational databases. I helped organize and lead meetings and work independently and in a group setting."
job2 = "Quality Assurance"
doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers. Perform database design and debugging of current releases."
job3 = "Software Engineer"
jobInfo = data.frame("text" = c(doc1,doc2,doc3),
 "job" = c(job1,job2,job3))

2) Now we do some text structuring. I am positive there are quicker/shorter ways to do the following.

# Convert to lowercase
jobInfo$text = sapply(jobInfo$text,tolower)

# Remove Punctuation
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[[:punct:]]"," ",x))

# Remove extra white space
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[ ]+"," ",x))

# Remove stop words
jobInfo$text = sapply(jobInfo$text, function(x)
 paste(setdiff(strsplit(x," ")[[1]],stopwords()),collapse=" ")
)

# Stem words (Also try without stemming?)
jobInfo$text = sapply(jobInfo$text, function(x) 
 paste(setdiff(wordStem(strsplit(x," ")[[1]]),""),collapse=" ")
)

3) Make a corpus source and document term matrix.

# Create Corpus Source
jobCorpus = Corpus(VectorSource(jobInfo$text))

# Create Document Term Matrix
jobDTM = DocumentTermMatrix(jobCorpus)

# Create Term Frequency Matrix
jobFreq = as.matrix(jobDTM)

Now we have the frequency matrix, jobFreq, that is a (3 by x) matrix, 3 entries and X number of words.

Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.

edited Apr 2 at 2:47

Stephen Rauch♦

1,52551330

answered Jul 3 '14 at 17:06

nfmcclure

463310

$begingroup$
I would love to see your example.
$endgroup$
– user1769197
Jul 3 '14 at 22:03

$begingroup$
Updated with quick example.
$endgroup$
– nfmcclure
Jul 3 '14 at 22:52

add a comment |

Check out this link.

You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function.

Example:

1) Load libraries and build the example data

library(tm)
library(SnowballC)

doc1 = "I am highly skilled in Java Programming. I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
job1 = "Software Engineer"
doc2 = "Tested new software releases for major program enhancements. Designed and executed test procedures and worked with relational databases. I helped organize and lead meetings and work independently and in a group setting."
job2 = "Quality Assurance"
doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers. Perform database design and debugging of current releases."
job3 = "Software Engineer"
jobInfo = data.frame("text" = c(doc1,doc2,doc3),
 "job" = c(job1,job2,job3))

2) Now we do some text structuring. I am positive there are quicker/shorter ways to do the following.

# Convert to lowercase
jobInfo$text = sapply(jobInfo$text,tolower)

# Remove Punctuation
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[[:punct:]]"," ",x))

# Remove extra white space
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[ ]+"," ",x))

# Remove stop words
jobInfo$text = sapply(jobInfo$text, function(x)
 paste(setdiff(strsplit(x," ")[[1]],stopwords()),collapse=" ")
)

# Stem words (Also try without stemming?)
jobInfo$text = sapply(jobInfo$text, function(x) 
 paste(setdiff(wordStem(strsplit(x," ")[[1]]),""),collapse=" ")
)

3) Make a corpus source and document term matrix.

# Create Corpus Source
jobCorpus = Corpus(VectorSource(jobInfo$text))

# Create Document Term Matrix
jobDTM = DocumentTermMatrix(jobCorpus)

# Create Term Frequency Matrix
jobFreq = as.matrix(jobDTM)

Now we have the frequency matrix, jobFreq, that is a (3 by x) matrix, 3 entries and X number of words.

Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.

edited Apr 2 at 2:47

Stephen Rauch♦

1,52551330

answered Jul 3 '14 at 17:06

nfmcclure

463310

$begingroup$
I would love to see your example.
$endgroup$
– user1769197
Jul 3 '14 at 22:03

$begingroup$
Updated with quick example.
$endgroup$
– nfmcclure
Jul 3 '14 at 22:52

add a comment |

Check out this link.

You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function.

Example:

1) Load libraries and build the example data

library(tm)
library(SnowballC)

doc1 = "I am highly skilled in Java Programming. I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
job1 = "Software Engineer"
doc2 = "Tested new software releases for major program enhancements. Designed and executed test procedures and worked with relational databases. I helped organize and lead meetings and work independently and in a group setting."
job2 = "Quality Assurance"
doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers. Perform database design and debugging of current releases."
job3 = "Software Engineer"
jobInfo = data.frame("text" = c(doc1,doc2,doc3),
 "job" = c(job1,job2,job3))

2) Now we do some text structuring. I am positive there are quicker/shorter ways to do the following.

# Convert to lowercase
jobInfo$text = sapply(jobInfo$text,tolower)

# Remove Punctuation
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[[:punct:]]"," ",x))

# Remove extra white space
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[ ]+"," ",x))

# Remove stop words
jobInfo$text = sapply(jobInfo$text, function(x)
 paste(setdiff(strsplit(x," ")[[1]],stopwords()),collapse=" ")
)

# Stem words (Also try without stemming?)
jobInfo$text = sapply(jobInfo$text, function(x) 
 paste(setdiff(wordStem(strsplit(x," ")[[1]]),""),collapse=" ")
)

3) Make a corpus source and document term matrix.

# Create Corpus Source
jobCorpus = Corpus(VectorSource(jobInfo$text))

# Create Document Term Matrix
jobDTM = DocumentTermMatrix(jobCorpus)

# Create Term Frequency Matrix
jobFreq = as.matrix(jobDTM)

Now we have the frequency matrix, jobFreq, that is a (3 by x) matrix, 3 entries and X number of words.

Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.

edited Apr 2 at 2:47

Stephen Rauch♦

1,52551330

answered Jul 3 '14 at 17:06

nfmcclure

463310

Check out this link.

You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function.

Example:

1) Load libraries and build the example data

library(tm)
library(SnowballC)

doc1 = "I am highly skilled in Java Programming. I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
job1 = "Software Engineer"
doc2 = "Tested new software releases for major program enhancements. Designed and executed test procedures and worked with relational databases. I helped organize and lead meetings and work independently and in a group setting."
job2 = "Quality Assurance"
doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers. Perform database design and debugging of current releases."
job3 = "Software Engineer"
jobInfo = data.frame("text" = c(doc1,doc2,doc3),
 "job" = c(job1,job2,job3))

2) Now we do some text structuring. I am positive there are quicker/shorter ways to do the following.

# Convert to lowercase
jobInfo$text = sapply(jobInfo$text,tolower)

# Remove Punctuation
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[[:punct:]]"," ",x))

# Remove extra white space
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[ ]+"," ",x))

# Remove stop words
jobInfo$text = sapply(jobInfo$text, function(x)
 paste(setdiff(strsplit(x," ")[[1]],stopwords()),collapse=" ")
)

# Stem words (Also try without stemming?)
jobInfo$text = sapply(jobInfo$text, function(x) 
 paste(setdiff(wordStem(strsplit(x," ")[[1]]),""),collapse=" ")
)

3) Make a corpus source and document term matrix.

# Create Corpus Source
jobCorpus = Corpus(VectorSource(jobInfo$text))

# Create Document Term Matrix
jobDTM = DocumentTermMatrix(jobCorpus)

# Create Term Frequency Matrix
jobFreq = as.matrix(jobDTM)

Now we have the frequency matrix, jobFreq, that is a (3 by x) matrix, 3 entries and X number of words.

Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.

edited Apr 2 at 2:47

Stephen Rauch♦

1,52551330

answered Jul 3 '14 at 17:06

nfmcclure

463310

edited Apr 2 at 2:47

Stephen Rauch♦

1,52551330

edited Apr 2 at 2:47

Stephen Rauch♦

1,52551330

edited Apr 2 at 2:47

Stephen Rauch♦

1,52551330

answered Jul 3 '14 at 17:06

nfmcclure

463310

answered Jul 3 '14 at 17:06

nfmcclure

463310

answered Jul 3 '14 at 17:06

nfmcclure

463310

$begingroup$
I would love to see your example.
$endgroup$
– user1769197
Jul 3 '14 at 22:03

$begingroup$
Updated with quick example.
$endgroup$
– nfmcclure
Jul 3 '14 at 22:52

add a comment |

$begingroup$
I would love to see your example.
$endgroup$
– user1769197
Jul 3 '14 at 22:03

$begingroup$
Updated with quick example.
$endgroup$
– nfmcclure
Jul 3 '14 at 22:52

I would love to see your example.

– user1769197
Jul 3 '14 at 22:03

Updated with quick example.

– nfmcclure
Jul 3 '14 at 22:52

add a comment |

Just extract keywords and train a classifier on them. That's all, really.

(Note: I used a similar approach to classify LinkedIn profiles into more than 50 classes with accuracy > 90%, so I'm pretty sure even naive implementation will work well.)

edited May 23 '17 at 12:38

Community♦

answered Jul 4 '14 at 22:46

ffriend

2,4911016

Say I am analyzing LinkedIn data. Do you think it would be a good idea to merge the previous work experience, education, recommendations and skills of one profile into one text file and extract keywords from it?
– user1769197
Jul 5 '14 at 14:46

LinkedIn now has skill tags that people assign themselves and other users can endorse, so basically there's no need to extract keywords manually. But in the case of less structured data, yes, it may be helpful to merge everything and then retrieve keywords. However, remember the main rule: try it out. Theory is good, but only practical experiments with different approaches will reveal the best one.
– ffriend
Jul 5 '14 at 20:33

@ffriend, how do we get that keyword list?
– NG_21
Jan 28 '16 at 9:53

@ffriend What is the best way to extract "experience" = '5 years', "Language" = 'C' from the following sentence: "I have spent 5 years developing bug-tracking systems and creating data managing system applications in C"? I used RAKE with NLTK and it just removed the stopwords and punctuation, but from the above sentence I don't need words like developing, bug-tracking, systems, creating, data, etc. Thanks
– Khalid Usman
Oct 4 '16 at 13:27

@KhalidUsman: since you already work with NLTK, take a look at named entity recognition tools, especially the "Chunking with Regular Expressions" section. In general, you would want to use a dictionary of keywords (e.g. "years", "C", etc.) and a simple set of rules (like "contains 'C'" or "<number> years") to extract named entities out of free-form text.
– ffriend
Oct 4 '16 at 18:41
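The dictionary-plus-rules idea from the last comment can be sketched with nothing but the standard `re` module. This is a rough illustration rather than NLTK's chunker: the `LANGUAGES` set, the pattern, and the function name `extract_entities` are made up for the example.

```python
import re

# Hypothetical dictionary of language keywords; extend as needed.
LANGUAGES = {"C", "C++", "Java", "Python"}

# Rule for "<number> years" (also matches the singular "year").
YEARS_PATTERN = re.compile(r"\b\d+\s+years?\b", re.IGNORECASE)


def extract_entities(sentence):
    """Return experience and language mentions found by simple rules."""
    entities = {}
    match = YEARS_PATTERN.search(sentence)
    if match:
        entities["experience"] = match.group(0)
    # Split on whitespace and strip surrounding punctuation so "C." matches "C".
    # The comparison is case-sensitive, so a lowercase "c" is not a match.
    tokens = [tok.strip(".,;:()") for tok in sentence.split()]
    found = [tok for tok in tokens if tok in LANGUAGES]
    if found:
        entities["language"] = found
    return entities


sentence = ("I have spent 5 years developing bug-tracking systems and "
            "creating data managing system applications in C")
print(extract_entities(sentence))
# {'experience': '5 years', 'language': ['C']}
```

Rules like these are brittle (for example, "five years" in words is missed), so treat them as a starting point to refine against real resumes.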


Doc1: project: (java, 3), (c, 4); education: (computer, 2), (physics, 1)

Doc2: project: (java, 3), (python, 2); education: (maths, 3), (computer, 2)
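One possible way to work with field-structured term counts like Doc1 and Doc2 is to compare documents field by field and then combine the per-field scores. The sketch below uses cosine similarity and a weighted sum; both the similarity measure and the `FIELD_WEIGHTS` values are assumptions for illustration, not something stated in the answer.

```python
import math

# The two example documents, as nested {field: {term: count}} dictionaries.
doc1 = {"project": {"java": 3, "c": 4},
        "education": {"computer": 2, "physics": 1}}
doc2 = {"project": {"java": 3, "python": 2},
        "education": {"maths": 3, "computer": 2}}

# Hypothetical field weights; tune them on real data.
FIELD_WEIGHTS = {"project": 0.7, "education": 0.3}


def cosine(a, b):
    """Cosine similarity between two sparse term-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def field_similarity(x, y, weights=FIELD_WEIGHTS):
    """Weighted sum of per-field cosine similarities."""
    return sum(w * cosine(x.get(f, {}), y.get(f, {}))
               for f, w in weights.items())


print(round(field_similarity(doc1, doc2), 3))
```

Keeping the fields separate means that "java" in a project section and "java" in an education section are not treated as the same evidence.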

answered Jul 3 '14 at 20:47 by Debasis

Algorithm-wise: what would you recommend?
– user1769197
Jul 3 '14 at 21:59

You mean an algorithm for computing the most similar resume vectors given a query job vector? You can use any standard retrieval model such as BM25 or a language model...
– Debasis
Jul 4 '14 at 11:35

I have never heard of these algorithms at all. Are these NLP algorithms or ML algorithms?
– user1769197
Jul 4 '14 at 13:32

These are standard retrieval models... a retrieval model defines how to compute the similarity between a document (a resume in your case) and a query (a job in your case).
– Debasis
Jul 4 '14 at 15:23

I have no knowledge of information retrieval. Do you think machine learning algorithms like clustering / nearest neighbour will also work in my case?
– user1769197
Jul 4 '14 at 16:37
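For readers who, like the asker, have not met BM25 before, here is a small self-contained sketch of Okapi BM25 scoring, with resumes as documents and a job description as the query. The toy token lists are invented, `k1=1.5` and `b=0.75` are commonly used defaults rather than tuned values, and the IDF uses the non-negative `log(1 + ...)` variant. It assumes at least one document is supplied.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long documents are penalised via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy resumes and job query, invented for this example.
resumes = [
    "java spring sql backend services".split(),
    "python pandas statistics machine learning".split(),
    "accounting audit excel reporting".split(),
]
job = "python machine learning engineer".split()

scores = bm25_scores(job, resumes)
best = max(range(len(resumes)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Search engines such as Solr and Elasticsearch ship their own BM25 implementations, so in practice you would rarely hand-roll this.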


answered Jul 7 '14 at 18:36 by Simon

Do you still use SVM when you have 3 or more classes? And what features do you want to extract using a natural language parser? For what purpose?
– user1769197
Jul 8 '14 at 15:40

You can train n SVMs for n classes using a one-vs-the-rest strategy. scikit-learn has code to do that automatically. Technically you need n-1 classifiers, but I've found having n works better.
– Simon
May 18 '15 at 19:54

@Simon Can you write the complete steps for this recommendation system? I have a little experience in ML (from implementing my MS thesis), but I am totally new to the IR field. I am now working on this system and wrote the following steps: 1. Use NLTK to extract keywords, 2. Calculate scores for keywords and phrases, 3. Stemming, 4. Categorization (the most challenging task), and 5. Frequency matrix, tf-idf or BM25. Am I on the right track? Thanks
– Khalid Usman
Oct 4 '16 at 13:17

@KhalidUsman I can't tell you exactly how it works; that may get me in trouble. The easiest solution would be to put the data into Solr or Elasticsearch and use their More Like This (MLT) recommender implementations. A more sophisticated approach is to extract keywords and phrases, push the docs through LSA, and do k-NN on the resulting vectors. Then you may wish to use other signals such as collaborative filtering and overall popularity.
– Simon
Oct 16 '16 at 19:27

@Simon, thanks for your guidance. I am applying the second approach: I have extracted keywords/keyphrases using RAKE+NLTK, and after that I was planning to apply tf-idf or BM25. Am I right? Can you please elaborate on the k-NN approach a little? I mean, how do I apply k-NN to keywords; should I use the keywords as features? Thanks
– Khalid Usman
Oct 17 '16 at 11:03
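On the last question, one way to read "k-NN on keywords" is to treat each extracted keyword as a feature, weight it with tf-idf, and rank stored documents by cosine similarity to the query. The sketch below does exactly that in plain Python. It deliberately skips the LSA step mentioned above, and the keyword lists, the smoothed IDF formula, and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn keyword lists into sparse tf-idf dictionaries; also return the idf table."""
    n_docs = len(docs)
    df = Counter(term for d in docs for term in set(d))
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(d).items()} for d in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def nearest(query_keywords, vectors, idf, k=2):
    """Return the indices of the k stored vectors most similar to the query."""
    # Keywords never seen in the stored documents are ignored.
    q = {t: c * idf[t] for t, c in Counter(query_keywords).items() if t in idf}
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(q, vectors[i]), reverse=True)
    return ranked[:k]


# Toy keyword lists, one per document, invented for this example.
keyword_docs = [
    ["java", "spring", "sql"],
    ["python", "pandas", "sql"],
    ["python", "statistics", "pandas"],
    ["accounting", "excel", "audit"],
]
vectors, idf = tfidf_vectors(keyword_docs)
print(nearest(["python", "pandas"], vectors, idf, k=2))  # [1, 2]
```

To classify rather than recommend, you could take a majority vote over the labels of the returned neighbours; adding LSA would mean reducing these sparse vectors to a lower-dimensional space before the similarity step.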

