What algorithms should I use to perform job classification based on resume data?



Note that I am doing everything in R.

The problem is as follows:

I have a list of resumes (CVs). Some candidates have prior work experience and some do not. The goal is to classify the candidates into job sectors based on the text of their CVs. I am particularly interested in candidates who have no experience or are still students: for them, I want to predict which job sector they will most likely belong to after graduation.



Question 1: I know machine learning algorithms, but I have never done NLP before. I came across Latent Dirichlet Allocation (LDA) on the internet, but I am not sure whether it is the best approach to my problem.

My original idea: make this a supervised learning problem. Suppose we already have a large amount of labelled data, meaning that the job sectors have been correctly labelled for a list of candidates. We train a model using an ML algorithm (e.g. nearest neighbour) and then feed in the unlabelled data, i.e. the candidates who have no work experience or are students, to predict which job sector each of them will belong to.



Update

Question 2: Would it be a good idea to extract everything in a resume into a text file, so that each resume is associated with one text file containing unstructured strings, and then apply text mining techniques to those files to make the data structured, or even to build a frequency matrix of the terms used in them? For example, a text file may look something like this:

I deployed ML algorithm in this project and... Skills: Java, Python, c++ ...

This is what I mean by 'unstructured', i.e. collapsing everything into a single-line string.

Is this approach wrong? Please correct me if you think it is.



Question 3: The tricky part is how to identify and extract the keywords. Should I use the tm package in R? What algorithm is the tm package based on? Should I use NLP algorithms? If so, which ones should I look at? Please point me to some good resources as well.

Any ideas would be great.










      machine-learning classification nlp text-mining






edited Jul 9 '14 at 0:19 by Stephane Rolland
asked Jul 3 '14 at 16:11 by user1769197




















          4 Answers


















Answer (score 14)

Check out this link.

It walks you through loading unstructured text and building a word cloud. You can adapt the same strategy to build a frequency matrix of the terms used instead of a word cloud. The idea is to take the unstructured text and give it structure: convert everything to lowercase (or uppercase), remove stop words, and find the frequent terms for each job function via document-term matrices. You can also stem the words so that different forms of a word are counted as the same term; for example, 'programmed' and 'programming' are both stemmed to 'program'. The occurrence of these frequent terms can then be added as weighted features when training your ML model.

You can also adapt this to frequent phrases by finding common groups of 2-3 words for each job function.
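The frequent-phrase idea can be sketched in base R by counting word n-grams. This is only an illustration under assumptions: the two `docs` strings and the helper name `ngrams` are made up for the example, and the tokenizer simply splits on anything that is not a letter, digit, `+` or `#`.

```r
# Count word n-grams (here bigrams) in base R.
# `docs` is hypothetical example text, not data from the question.
docs <- c(
  "developed web applications and tested web applications",
  "designed test procedures for web applications"
)

# Return all n-grams of a text as a character vector
ngrams <- function(text, n = 2) {
  words <- strsplit(tolower(text), "[^a-z0-9+#]+")[[1]]
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# One frequency table per document
bigram_counts <- lapply(docs, function(d) table(ngrams(d, 2)))

# Bigrams pooled over all documents, most frequent first
all_bigrams <- sort(table(unlist(lapply(docs, ngrams, n = 2))),
                    decreasing = TRUE)
print(head(all_bigrams, 3))
```

Grouping the documents by job function before pooling would give the most common phrases per job function.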



Example:

1) Load the libraries and build the example data

    library(tm)         # corpus and document-term matrix tools
    library(SnowballC)  # provides wordStem()

    doc1 = "I am highly skilled in Java Programming. I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
    job1 = "Software Engineer"
    doc2 = "Tested new software releases for major program enhancements. Designed and executed test procedures and worked with relational databases. I helped organize and lead meetings and work independently and in a group setting."
    job2 = "Quality Assurance"
    doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers. Perform database design and debugging of current releases."
    job3 = "Software Engineer"
    # stringsAsFactors = FALSE keeps the text as character strings (matters on R < 4.0)
    jobInfo = data.frame("text" = c(doc1, doc2, doc3),
                         "job" = c(job1, job2, job3),
                         stringsAsFactors = FALSE)


2) Now we do some text structuring. There are surely quicker or shorter ways to do the following.

    # Convert to lowercase (tolower is vectorized, so sapply is not needed)
    jobInfo$text = tolower(jobInfo$text)

    # Replace punctuation with spaces
    jobInfo$text = gsub("[[:punct:]]", " ", jobInfo$text)

    # Collapse repeated spaces and trim the ends
    jobInfo$text = trimws(gsub("[ ]+", " ", jobInfo$text))

    # Remove stop words. Filter with %in% rather than setdiff(), because
    # setdiff() also drops repeated words and would flatten the term counts.
    jobInfo$text = sapply(strsplit(jobInfo$text, " "), function(words)
      paste(words[!words %in% stopwords()], collapse = " "))

    # Stem words (also try without stemming)
    jobInfo$text = sapply(strsplit(jobInfo$text, " "), function(words)
      paste(wordStem(words), collapse = " "))


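3) Make a corpus source and document-term matrix.

    # Create the corpus from the cleaned text
    jobCorpus = Corpus(VectorSource(jobInfo$text))

    # Create the document-term matrix.
    # Note: by default tm drops terms shorter than 3 characters, so a term
    # such as "c" is lost; pass control = list(wordLengths = c(1, Inf)) to keep it.
    jobDTM = DocumentTermMatrix(jobCorpus)

    # Convert to a plain term-frequency matrix
    jobFreq = as.matrix(jobDTM)

We now have the frequency matrix jobFreq, a 3-by-X matrix with one row for each of the 3 documents and one column for each of the X distinct terms.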



Where you go from here is up to you. You can keep only specific (for example, the more common) words and use them as features in your model. Another way is to keep it simple and compute, for each word, the percentage of documents in each job category that use it; say, "java" might occur in 80% of 'software engineer' descriptions and only 50% of 'quality assurance' descriptions.
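The per-job occurrence percentages can be computed directly from a document-term count matrix. The sketch below uses base R only; the small `counts` matrix and the `job` labels are invented for illustration and stand in for something like jobFreq and the job column.

```r
# Hypothetical document-term count matrix: rows are resumes, columns are terms.
counts <- matrix(
  c(2, 0, 1,
    0, 3, 1,
    1, 0, 0,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("java", "test", "databas"))
)
job <- c("Software Engineer", "Quality Assurance",
         "Software Engineer", "Quality Assurance")

# TRUE where the term occurs at least once in the resume
present <- counts > 0

# Percentage of resumes in each job category that mention each term
occurrence <- 100 * sapply(colnames(counts), function(term)
  tapply(present[, term], job, mean))
print(occurrence)
```

Each row of `occurrence` is a job category and each column a term, so the rows can be compared to see which terms separate the categories.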



          Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.


















• I would love to see your example. – user1769197, Jul 3 '14 at 22:03











• Updated with quick example. – nfmcclure, Jul 3 '14 at 22:52


















Answer (score 10)

Just extract keywords and train a classifier on them. That's all, really.

Most of the text in a CV is not actually related to skills. Consider the sentence "I'm experienced and highly efficient in Java". Only 1 of its 7 words is a skill name; the rest is noise that will drag your classification accuracy down.

Most CVs are not really structured. Or they are structured too freely. Or they use unusual names for sections. Or they come in file formats that don't preserve structure when converted to text. I have experience extracting dates, times, names, addresses and even people's intents from unstructured text, but never a list of skills (or universities, or anything similar), not even close.

So just tokenize (and possibly stem) your CVs, keep only the words from a predefined list (you can use LinkedIn or something similar to get this list), create a feature vector and try out a couple of classifiers (say, SVM and Naive Bayes).

(Note: I used a similar approach to classify LinkedIn profiles into more than 50 classes with accuracy > 90%, so I'm pretty sure even a naive implementation will work well.)
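As a rough illustration of this pipeline, the sketch below tokenizes CVs, keeps only words from a predefined list, and trains a small multinomial Naive Bayes classifier written in base R. Everything in it is an assumption made for the example: the `skills` list, the four CVs, the sector labels, and the helper names `featurize`, `train_nb` and `predict_nb`. In practice a package such as e1071, which provides both SVM and Naive Bayes implementations, would likely replace the hand-written classifier.

```r
# Hypothetical skill list and labelled CVs, for illustration only.
skills <- c("java", "python", "sql", "selenium", "excel")
cvs <- c(
  "I'm experienced and highly efficient in Java and SQL",
  "Python and Java developer",
  "Manual testing, Selenium scripts, some SQL",
  "Selenium and Excel test reports"
)
sector <- c("dev", "dev", "qa", "qa")

# Feature vector: how often each listed skill appears in the text
featurize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z0-9+#]+")[[1]]
  vapply(skills, function(s) sum(tokens == s), numeric(1))
}
X <- t(sapply(cvs, featurize, USE.NAMES = FALSE))  # rows: CVs, columns: skills

# Multinomial Naive Bayes with add-one (Laplace) smoothing
train_nb <- function(X, y) {
  classes <- sort(unique(y))
  prior <- sapply(classes, function(k) mean(y == k))
  cond <- t(sapply(classes, function(k) {
    totals <- colSums(X[y == k, , drop = FALSE]) + 1
    totals / sum(totals)
  }))
  list(classes = classes, log_prior = log(prior), log_cond = log(cond))
}

# Pick the class with the highest log posterior score
predict_nb <- function(model, x) {
  scores <- model$log_prior + as.vector(model$log_cond %*% x)
  model$classes[which.max(scores)]
}

model <- train_nb(X, sector)
print(predict_nb(model, featurize("Student project in Python and Java")))
```

Words outside the skill list never reach the classifier, which is the noise-reduction effect described above.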


















• Say I am analyzing LinkedIn data: do you think it would be a good idea to merge the previous work experience, education, recommendations and skills of one profile into a single text file and extract keywords from it? – user1769197, Jul 5 '14 at 14:46











• LinkedIn now has skill tags that people assign to themselves and that other users can endorse, so there is basically no need to extract keywords manually. For less structured data, yes, it may be helpful to merge everything and then retrieve keywords. But remember the main rule: try it out. Theory is good, but only practical experiments with different approaches will reveal the best one. – ffriend, Jul 5 '14 at 20:33










• @ffriend, how do we get that keyword list? – NG_21, Jan 28 '16 at 9:53






• @ffriend What is the best way to extract "experience" = '5 years' and "Language" = 'C' from the following sentence: "I have spent 5 years developing bug-tracking systems and creating data managing system applications in C"? I used RAKE with NLTK and it just removed the stop words and punctuation, but I don't need words like developing, bug-tracking, systems, creating, data, etc. from that sentence. Thanks – Khalid Usman, Oct 4 '16 at 13:27






• @KhalidUsman: since you already work with NLTK, take a look at its named entity recognition tools, especially the "Chunking with Regular Expressions" section. In general, you would want to use a dictionary of keywords (e.g. "years", "C", etc.) and a simple set of rules (like "contains 'C'" or "<number> years") to extract named entities from free-form text. – ffriend, Oct 4 '16 at 18:41


















Answer (score 7)

This is a tricky problem, and there are many ways to handle it. Resumes can be treated as semi-structured documents, and it is sometimes beneficial to exploit whatever minimal structure the documents have. For instance, resumes often contain tabular data, which you might treat as attribute-value pairs: the attribute "Skill set" would give you a list of terms.

The key idea is to manually configure a list of key phrases such as "skill", "education", "publication", etc. The next step is to extract the terms that pertain to these key phrases, either by exploiting the structure in some way (such as tables) or by using the proximity of terms to the key phrases. For example, the fact that the word "Java" appears close to the term "skill" might indicate that the person is skilled in Java.

After you extract this information, the next step could be to build a feature vector for each key phrase. You can then represent a document as a vector with several fields (one per key phrase). For example, consider the following two resumes represented with two fields, project and education.

Doc1: project: (java, 3), (c, 4); education: (computer, 2), (physics, 1)

Doc2: project: (java, 3), (python, 2); education: (maths, 3), (computer, 2)

In the example above, each term is shown with its frequency. Of course, you need to stem the terms and remove stop words while extracting them. It is clear from the example that the person whose resume is Doc1 is more skilled in C than the person whose resume is Doc2. Implementation-wise, it is very easy to represent documents as field vectors in Lucene.

The next step is to retrieve a ranked list of resumes for a given job specification. That is fairly straightforward if you represent the queries (job specs) as field vectors as well: you just retrieve a ranked list of candidates (resumes) with Lucene from a collection of indexed resumes.
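To make the field-vector idea concrete without Lucene, here is a minimal base R sketch that ranks the two example resumes against a job specification. Note the substitutions: it scores each field with plain cosine similarity and averages the fields, which is a simple stand-in and not what Lucene or a BM25 model computes. The `job` query and the helper names `cosine` and `score` are hypothetical.

```r
# Field vectors from the example above: term counts per field.
resumes <- list(
  Doc1 = list(project = c(java = 3, c = 4),
              education = c(computer = 2, physics = 1)),
  Doc2 = list(project = c(java = 3, python = 2),
              education = c(maths = 3, computer = 2))
)
# Hypothetical job specification expressed in the same fields.
job <- list(project = c(python = 1, java = 1), education = c(maths = 1))

# Cosine similarity between two named term-count vectors
cosine <- function(a, b) {
  terms <- union(names(a), names(b))
  va <- setNames(numeric(length(terms)), terms)
  vb <- va
  va[names(a)] <- a
  vb[names(b)] <- b
  sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

# Score a resume as the mean similarity over the fields of the query
score <- function(resume, query) {
  mean(sapply(names(query), function(f) cosine(resume[[f]], query[[f]])))
}

ranking <- sort(sapply(resumes, score, query = job), decreasing = TRUE)
print(ranking)
```

Keeping the fields separate means a match on "java" in the project field is not diluted by unrelated terms in the education field.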


















• Algorithm-wise, what would you recommend? – user1769197, Jul 3 '14 at 21:59










• Do you mean an algorithm for computing the most similar resume vectors given a query job vector? You can use any standard model such as BM25 or a language model. – Debasis, Jul 4 '14 at 11:35










• I have never heard of these algorithms at all. Are they NLP algorithms or ML algorithms? – user1769197, Jul 4 '14 at 13:32










• These are standard retrieval models. A retrieval model defines how to compute the similarity between a document (a resume in your case) and a query (a job in your case). – Debasis, Jul 4 '14 at 15:23










• I have no knowledge of information retrieval. Do you think machine learning algorithms like clustering or nearest neighbour will also work in my case? – user1769197, Jul 4 '14 at 16:37


















Answer (score 7)

I work for an online jobs site and we build solutions to recommend jobs based on resumes. Our approach takes a person's job title (or desired job title, if they are a student and it is known), along with skills we extract from their resume and their location (which is very important to most people), and finds matching jobs based on that.

In terms of document classification, I would take a similar approach. I would recommend computing a tf-idf matrix over the resumes as a standard bag-of-words model, extracting just each person's job title and skills (for which you will need to define a list of skills to look for), and feeding that into an ML algorithm. I would recommend trying k-NN and an SVM; the latter works very well with high-dimensional text data. Linear SVMs tend to do better than non-linear ones (e.g. with RBF kernels). Once that is producing reasonable results, I would then experiment with extracting features using a natural language parser or chunker, and also with some custom-built phrases matched by regexes.
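The tf-idf plus k-NN combination can be sketched in base R as follows. The count matrix, the labels and the helper names `tfidf` and `knn_predict` are assumptions for illustration; the sketch uses one common tf-idf variant (raw counts times log(N / df), rows scaled to unit length) with cosine similarity, and it does not cover the SVM option.

```r
# Hypothetical term-count matrix (rows: labelled resumes, columns: terms).
train <- matrix(
  c(3, 1, 0, 0,
    2, 0, 1, 0,
    0, 1, 3, 2,
    0, 0, 2, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("java", "sql", "test", "selenium"))
)
labels <- c("dev", "dev", "qa", "qa")

# Inverse document frequency, estimated on the training resumes only
idf <- log(nrow(train) / colSums(train > 0))

# tf-idf weighting, then scale each row to unit length
# (a row with no known terms would give NaN; not handled in this sketch)
tfidf <- function(counts) {
  w <- sweep(counts, 2, idf, `*`)
  w / sqrt(rowSums(w^2))
}

# Majority vote among the k training resumes with highest cosine similarity
knn_predict <- function(new_counts, k = 3) {
  sims <- as.vector(tfidf(train) %*% t(tfidf(new_counts)))
  nearest <- order(sims, decreasing = TRUE)[seq_len(k)]
  names(which.max(table(labels[nearest])))
}

new_resume <- matrix(c(1, 1, 0, 0), nrow = 1,
                     dimnames = list(NULL, colnames(train)))
print(knn_predict(new_resume))
```

Because the rows are unit length, the matrix product gives cosine similarities directly, which is the usual choice for sparse text vectors.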


















• Do you still use an SVM when you have 3 or more classes? And what features do you want to extract using a natural language parser, and for what purpose? – user1769197, Jul 8 '14 at 15:40










• You can train n SVMs for n classes using a one-vs-rest strategy. scikit-learn has code to do that automatically. Technically you need n-1 classifiers, but I've found that having n works better. – Simon, May 18 '15 at 19:54










• @Simon Can you write out the complete steps for this recommendation system? I have a little experience in ML (I implemented it for my MS thesis), but I am totally new to the IR field. I am now working on this system and have written down the following steps: 1. use NLTK to extract keywords, 2. calculate scores for keywords and phrases, 3. stemming, 4. categorization (the most challenging task), and 5. a frequency matrix with tf-idf or BM25. Am I on the right track? Thanks – Khalid Usman, Oct 4 '16 at 13:17











• @KhalidUsman I can't tell you exactly how it works; that may get me in trouble. The easiest solution would be to put the data into Solr or Elasticsearch and use their MLT (More Like This) recommender implementations. A more sophisticated approach is to extract keywords and phrases, push the docs through LSA, and do k-NN on the resulting vectors. Then you may wish to use other signals such as collaborative filtering and overall popularity. – Simon, Oct 16 '16 at 19:27










• @Simon, thanks for your guidance. I am applying the second approach: I have extracted keywords/keyphrases using RAKE + NLTK, and after that I was planning to apply tf-idf or BM25. Is that right? Can you please elaborate on the k-NN approach a little? I mean, how do I apply k-NN to keywords; should I use the keywords as features? Thanks – Khalid Usman, Oct 17 '16 at 11:03











          Your Answer








          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "557"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f662%2fwhat-algorithms-should-i-use-to-perform-job-classification-based-on-resume-data%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          4 Answers
          4






          active

          oldest

          votes








          4 Answers
          4






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          14












          $begingroup$

          Check out this link.



          Here, they will take you through loading unstructured text to creating a wordcloud. You can adapt this strategy and instead of creating a wordcloud, you can create a frequency matrix of terms used. The idea is to take the unstructured text and structure it somehow. You change everything to lowercase (or uppercase), remove stop words, and find frequent terms for each job function, via Document Term Matrices. You also have the option of stemming the words. If you stem words you will be able to detect different forms of words as the same word. For example, 'programmed' and 'programming' could be stemmed to 'program'. You can possibly add the occurrence of these frequent terms as a weighted feature in your ML model training.



          You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function.



          Example:



          1) Load libraries and build the example data



          library(tm)
          library(SnowballC)

          doc1 = "I am highly skilled in Java Programming. I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
          job1 = "Software Engineer"
          doc2 = "Tested new software releases for major program enhancements. Designed and executed test procedures and worked with relational databases. I helped organize and lead meetings and work independently and in a group setting."
          job2 = "Quality Assurance"
          doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers. Perform database design and debugging of current releases."
          job3 = "Software Engineer"
          jobInfo = data.frame("text" = c(doc1,doc2,doc3),
          "job" = c(job1,job2,job3))


          2) Now we do some text structuring. I am positive there are quicker/shorter ways to do the following.



          # Convert to lowercase (tolower and gsub are vectorized, so no sapply is needed)
          jobInfo$text = tolower(jobInfo$text)

          # Replace punctuation with spaces
          jobInfo$text = gsub("[[:punct:]]", " ", jobInfo$text)

          # Collapse runs of spaces
          jobInfo$text = gsub("[ ]+", " ", jobInfo$text)

          # Remove stop words; %in% keeps repeated words, which setdiff() would drop
          jobInfo$text = sapply(jobInfo$text, function(x) {
            words = strsplit(x, " ")[[1]]
            paste(words[!words %in% stopwords()], collapse = " ")
          }, USE.NAMES = FALSE)

          # Stem words (also try without stemming)
          jobInfo$text = sapply(jobInfo$text, function(x) {
            stems = wordStem(strsplit(x, " ")[[1]])
            paste(stems[stems != ""], collapse = " ")
          }, USE.NAMES = FALSE)


          3) Make a corpus source and document term matrix.



          # Create Corpus Source
          jobCorpus = Corpus(VectorSource(jobInfo$text))

          # Create Document Term Matrix
          jobDTM = DocumentTermMatrix(jobCorpus)

          # Create Term Frequency Matrix
          jobFreq = as.matrix(jobDTM)


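          Now we have the frequency matrix, jobFreq, which is a 3-by-X matrix: 3 documents (rows) and X distinct terms (columns). Note that DocumentTermMatrix drops terms shorter than three characters by default, so a skill such as "c" will be missing unless you pass control = list(wordLengths = c(1, Inf)).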



          Where you go from here is up to you. You can keep only specific (more common) words and use them as features in your model. Another, simpler option is to compute, for each job function, the percentage of descriptions in which a word occurs; for example, "java" might occur in 80% of 'software engineer' descriptions and in only 50% of 'quality assurance' descriptions.
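The per-job occurrence percentage can be sketched in a few lines. This is a rough illustration in Python (standard library only) rather than R; the toy documents and the function name `occurrence_rates` are made up for the example, and the text is assumed to be already lowercased, stop-word-filtered, and stemmed.

```python
from collections import defaultdict

# Hypothetical toy data: (already cleaned text, job label) pairs.
docs = [
    ("java program bug track system c", "Software Engineer"),
    ("test software release databas test procedur", "Quality Assurance"),
    ("web applic java databas design debug", "Software Engineer"),
]

def occurrence_rates(docs):
    """Return {job: {term: fraction of that job's documents containing the term}}."""
    doc_counts = defaultdict(int)                        # number of documents per job
    term_counts = defaultdict(lambda: defaultdict(int))  # documents containing a term, per job
    for text, job in docs:
        doc_counts[job] += 1
        for term in set(text.split()):  # set(): count each term once per document
            term_counts[job][term] += 1
    return {
        job: {term: n / doc_counts[job] for term, n in terms.items()}
        for job, terms in term_counts.items()
    }

rates = occurrence_rates(docs)
print(rates["Software Engineer"]["java"])     # share of Software Engineer docs mentioning "java"
print(rates["Software Engineer"]["databas"])  # share mentioning the stem "databas"
```

The resulting rates could be used directly as features, or to pick the terms that best separate one job function from another.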



          Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.

















          $endgroup$












          • I would love to see your example. – user1769197, Jul 3 '14 at 22:03

          • Updated with quick example. – nfmcclure, Jul 3 '14 at 22:52























          edited Apr 2 at 2:47 by Stephen Rauch; answered Jul 3 '14 at 17:06 by nfmcclure






















          10












          $begingroup$

          Just extract keywords and train a classifier on them. That's all, really.



          Most of the text in CVs is not actually related to skills. For example, consider the sentence "I'm experienced and highly efficient in Java". Here only 1 out of 7 words is a skill name; the rest is just noise that will drag your classification accuracy down.



          Most CVs are not really structured. Or they are structured too freely. Or they use unusual names for sections. Or they come in file formats that don't preserve structure when converted to text. I have experience extracting dates, times, names, addresses and even people's intents from unstructured text, but not a list of skills (or universities, or anything similar), not even close.



          So just tokenize (and possibly stem) your CVs, select only the words from a predefined list (you can use LinkedIn or something similar to grab this list), create a feature vector and try out a couple of classifiers (say, SVM and Naive Bayes).
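As a rough sketch of that pipeline in Python (standard library only): the skill list, the training sentences, and the labels below are all invented for illustration, and the small hand-written multinomial Naive Bayes stands in for whatever library classifier you would actually use.

```python
import math
from collections import Counter, defaultdict

# Hypothetical skill list; in practice it would come from an external source.
SKILLS = {"java", "c", "python", "sql", "selenium", "testing"}

def features(text):
    """Lowercase, split on whitespace, and keep only known skill keywords."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    return [t for t in tokens if t in SKILLS]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.term_counts = defaultdict(Counter)  # keyword counts per class
        for text, label in zip(texts, labels):
            self.term_counts[label].update(features(text))
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(SKILLS)
            score = math.log(n_docs / total_docs)  # log prior
            for term in features(text):
                score += math.log((counts[term] + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train_texts = [
    "I'm experienced and highly efficient in Java and SQL.",
    "Wrote Python and C services.",
    "Manual testing and Selenium automation, some SQL.",
    "Testing web releases with Selenium.",
]
train_labels = ["developer", "developer", "qa", "qa"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("Five years of Java and Python"))
print(model.predict("Selenium testing for mobile apps"))
```

Filtering to the keyword list before counting is what removes the noise words; the classifier only ever sees skill names.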



          (Note: I used a similar approach to classify LinkedIn profiles into more than 50 classes with accuracy > 90%, so I'm pretty sure even a naive implementation will work well.)

















          $endgroup$












          • Say I am analyzing LinkedIn data: do you think it would be a good idea to merge the previous work experience, education, recommendations and skills of one profile into one text file and extract keywords from it? – user1769197, Jul 5 '14 at 14:46

          • LinkedIn now has skill tags that people assign themselves and that other users can endorse, so basically there's no need to extract keywords manually. But in the case of less structured data, yes, it may be helpful to merge everything and then retrieve keywords. However, remember the main rule: try it out. Theory is good, but only practical experiments with different approaches will reveal the best one. – ffriend, Jul 5 '14 at 20:33

          • @ffriend, how do we get that keyword list? – NG_21, Jan 28 '16 at 9:53

          • @ffriend What is the best way to extract "experience" = '5 years', "Language" = 'C' from the following sentence: "I have spent 5 years developing bug-tracking systems and creating data managing system applications in C". I used RAKE with NLTK and it just removed the stop words and punctuation, but from the above sentence I don't need words like developing, bug-tracking, systems, creating, data etc. Thanks – Khalid Usman, Oct 4 '16 at 13:27

          • @KhalidUsman: since you already work with NLTK, take a look at named entity recognition tools, especially the "Chunking with Regular Expressions" section. In general, you would want to use a dictionary of keywords (e.g. "years", "C", etc.) and a simple set of rules (like "contains 'C'" or "<number> years") to extract named entities out of free-form text. – ffriend, Oct 4 '16 at 18:41
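The dictionary-plus-rules idea from the last comment could look roughly like this. It is a minimal sketch using only Python's `re` module, not NLTK's chunker; the `LANGUAGES` list and the function name `extract_entities` are assumptions made for the example.

```python
import re

# Hypothetical keyword dictionary of languages to look for.
LANGUAGES = ["C++", "C#", "C", "Java", "Python"]

def extract_entities(sentence):
    """Extract years of experience and known language names with simple rules."""
    entities = {}
    # Rule "<number> years": digits followed by "year" or "years".
    match = re.search(r"\b\d+\s+years?\b", sentence, flags=re.IGNORECASE)
    if match:
        entities["experience"] = match.group(0)
    # Dictionary rule: a language counts only as a standalone token, so the
    # "C" inside "C++" or "Cobol" is not reported as the language C.
    found = []
    for lang in LANGUAGES:
        pattern = r"(?<![\w+#])" + re.escape(lang) + r"(?![\w+#])"
        if re.search(pattern, sentence):
            found.append(lang)
    if found:
        entities["languages"] = found
    return entities

sentence = ("I have spent 5 years developing bug-tracking systems and "
            "creating data managing system applications in C.")
print(extract_entities(sentence))
```

The match is case-sensitive on purpose, so a lowercase "c" elsewhere in the sentence is ignored. Real resumes will need more rules than this (for example "five years" spelled out).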























          edited May 23 '17 at 12:38 by Community; answered Jul 4 '14 at 22:46 by ffriend






















          7












          $begingroup$

          This is a tricky problem, and there are many ways to handle it. Resumes can be treated as semi-structured documents, and it helps to exploit whatever minimal structure they have. For example, resumes often contain tabular data, which you can treat as attribute-value pairs: the attribute "Skill set" would give you a list of terms.



          The key idea is to manually configure a list of key phrases such as "skill", "education", "publication", etc. The next step is to extract the terms that pertain to these key phrases, either by exploiting the structure in some way (such as tables) or by using the proximity of terms to the key phrases. For example, the fact that the word "Java" appears close to the term "skill" might indicate that the person is skilled in Java.
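A rough sketch of the proximity idea, assuming a fixed token window around each key phrase. The `KEY_PHRASES` set, the `WINDOW` size, and the helper name are illustrative assumptions, and a real version would also drop stop words and other key phrases from the collected terms.

```python
import re

# Hypothetical key phrases and window size; tune both for real resumes.
KEY_PHRASES = {"skill", "education", "publication"}
WINDOW = 3


def terms_near_key_phrases(text, window=WINDOW):
    """Collect the tokens that occur within `window` positions of each key phrase."""
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    found = {key: [] for key in KEY_PHRASES}
    for i, tok in enumerate(tokens):
        for key in KEY_PHRASES:
            if tok.startswith(key):  # matches "skill" as well as "skills"
                start, stop = max(0, i - window), i + window + 1
                # Neighbours on both sides, excluding the key phrase itself.
                found[key].extend(tokens[start:i] + tokens[i + 1:stop])
    return found


near = terms_near_key_phrases("Skills: Java, Python. Education: BSc in Physics.")
```

On this toy input the window around "education" also picks up "java", which shows why proximity alone is only a weak signal and why table or section structure is worth using when it is available.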



          After you extract this information, the next step could be to build a feature vector for each of these key phrases. You can then represent a document as a vector with different fields (one per key phrase). For example, consider the following two resumes represented with two fields, namely project and education.



          Doc1: project: (java, 3), (c, 4); education: (computer, 2), (physics, 1)



          Doc2: project: (java, 3), (python, 2); education: (maths, 3), (computer, 2)



          In the above example, each term is shown with its frequency. Of course, while extracting the terms you need to stem them and remove stop words. The examples suggest that the person whose resume is Doc1 is more skilled in C than the person whose resume is Doc2, since C appears four times in Doc1's project field and not at all in Doc2's. Implementation-wise, it's very easy to represent documents as field vectors in Lucene.
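To make the field-vector representation concrete, here is a small sketch in plain Python rather than Lucene. The dictionary layout and the `term_frequency` helper are assumptions made for illustration; only the term counts come from the Doc1 and Doc2 example.

```python
from collections import Counter

# Each resume maps a field name to a bag of (stemmed, stop-word-free) terms.
doc1 = {
    "project": Counter({"java": 3, "c": 4}),
    "education": Counter({"computer": 2, "physics": 1}),
}
doc2 = {
    "project": Counter({"java": 3, "python": 2}),
    "education": Counter({"maths": 3, "computer": 2}),
}


def term_frequency(doc, field, term):
    """Frequency of `term` in one field; 0 if the field or term is absent."""
    return doc.get(field, Counter())[term]


c_in_doc1 = term_frequency(doc1, "project", "c")
c_in_doc2 = term_frequency(doc2, "project", "c")
```

Keeping one bag of terms per field means that "computer" in the education field is never confused with the same word in a project description.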



          The next step is to retrieve a ranked list of resumes given a job specification. That is fairly straightforward if you represent queries (job specs) as field vectors as well: you just need to retrieve a ranked list of candidates (resumes) using Lucene from a collection of indexed resumes.
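As a rough illustration of ranking resumes against a job spec, the sketch below averages per-field cosine similarity. This is not Lucene's scoring; it is a simple stand-in, and the job spec, the equal weighting of fields, and the function names are all assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency bags."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def score(job, resume):
    """Average the cosine similarity over the fields named in the job spec."""
    return sum(cosine(job[f], resume.get(f, Counter())) for f in job) / len(job)


resumes = {
    "Doc1": {"project": Counter({"java": 3, "c": 4}),
             "education": Counter({"computer": 2, "physics": 1})},
    "Doc2": {"project": Counter({"java": 3, "python": 2}),
             "education": Counter({"maths": 3, "computer": 2})},
}

# Hypothetical job spec expressed with the same fields as the resumes.
job = {"project": Counter({"python": 1, "java": 1}),
       "education": Counter({"maths": 1})}

ranking = sorted(resumes, key=lambda name: score(job, resumes[name]), reverse=True)
```

In practice you might weight the fields differently, for example giving project experience more influence than education.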















          $endgroup$












          • $begingroup$
            Algorithm-wise, what would you recommend?
            $endgroup$
            – user1769197
            Jul 3 '14 at 21:59










          • $begingroup$
            Do you mean an algorithm for computing the most similar resume vectors given a query job vector? You can use any standard retrieval model, such as BM25 or a language model.
            $endgroup$
            – Debasis
            Jul 4 '14 at 11:35










          • $begingroup$
            I have never heard of these algorithms. Are they NLP algorithms or ML algorithms?
            $endgroup$
            – user1769197
            Jul 4 '14 at 13:32










          • $begingroup$
            These are standard retrieval models. A retrieval model defines how to compute the similarity between a document (a resume in your case) and a query (a job in your case).
            $endgroup$
            – Debasis
            Jul 4 '14 at 15:23










          • $begingroup$
            I have no knowledge of information retrieval. Do you think machine learning algorithms like clustering or nearest neighbour will also work in my case?
            $endgroup$
            – user1769197
            Jul 4 '14 at 16:37
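For readers who, like the commenter above, have not met retrieval models before, here is a minimal sketch of Okapi BM25 scoring in plain Python. It uses the non-negative idf variant and the common default parameters `k1=1.2` and `b=0.75`; the toy resumes, the job query, and the function name are assumptions for illustration, and a search engine such as Lucene would normally do this for you.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        total = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            total += idf * tf[term] * (k1 + 1) / denom
        scores.append(total)
    return scores


resumes = [
    ["java", "c", "computer", "physics"],
    ["java", "python", "maths", "computer"],
    ["excel", "accounting", "tax"],
]
job = ["python", "maths"]

scores = bm25_scores(job, resumes)
best = max(range(len(resumes)), key=scores.__getitem__)
```

Here the document is a resume and the query is a job spec, which matches the roles described in the comments.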

























          answered Jul 3 '14 at 20:47









          Debasis























          7












          $begingroup$

          I work for an online jobs site, and we build solutions to recommend jobs based on resumes. Our approach takes a person's job title (or desired job title, if they are a student and it is known), along with the skills we extract from their resume and their location (which is very important to most people), and finds matching jobs based on that.



          In terms of document classification, I would take a similar approach. I would recommend computing a tf-idf matrix (one row per resume) as a standard bag-of-words model, extracting just each person's job title and skills (for which you will need to define a list of skills to look for), and feeding that into an ML algorithm. I would recommend trying k-NN and an SVM; the latter works very well with high-dimensional text data. Linear SVMs tend to do better than non-linear ones (e.g. those using RBF kernels). Once that is producing reasonable results, I would then experiment with extracting features using a natural language parser/chunker, and also with some custom-built phrases matched by regexes.
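A minimal sketch of the tf-idf plus linear SVM idea using scikit-learn. The toy resumes and category labels are invented for illustration; in practice each string would hold the job title and skills extracted from a real resume, and you would evaluate on held-out data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data: each string stands for the title and skills pulled from one resume.
resumes = [
    "software engineer java spring sql",
    "backend developer java python sql",
    "data scientist python statistics machine learning",
    "machine learning engineer python statistics",
    "accountant excel bookkeeping tax",
    "financial analyst excel tax reporting",
]
labels = ["engineering", "engineering", "data", "data", "finance", "finance"]

# TfidfVectorizer builds the bag-of-words tf-idf matrix; LinearSVC is a linear SVM
# that handles more than two classes with a one-vs-rest scheme by default.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(resumes, labels)

predicted = model.predict(["python statistics modelling"])[0]
```

Swapping `LinearSVC()` for `KNeighborsClassifier()` from `sklearn.neighbors` gives the k-NN variant with the same pipeline.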















          $endgroup$












          • $begingroup$
            Do you still use an SVM when you have 3 or more classes? And what features do you want to extract using a natural language parser, and for what purpose?
            $endgroup$
            – user1769197
            Jul 8 '14 at 15:40










          • $begingroup$
            You can train n SVMs for n classes using a one-vs-the-rest strategy. scikit-learn has code to do that automatically. Technically you need n-1 classifiers, but I've found that having n works better.
            $endgroup$
            – Simon
            May 18 '15 at 19:54










          • $begingroup$
            @Simon Can you write out the complete steps for this recommendation system? I have a little experience in ML (I implemented my MS thesis), but I am totally new to the IR field. I am now working on this system and have written down the following steps: 1. use NLTK to extract keywords; 2. calculate scores for keywords and phrases; 3. stemming; 4. categorization (the most challenging task); and 5. a frequency matrix, tf-idf, or the BM25 algorithm. Am I on the right track? Thanks
            $endgroup$
            – Khalid Usman
            Oct 4 '16 at 13:17











          • $begingroup$
            @KhalidUsman I can't tell you exactly how it works; that may get me in trouble. The easiest solution would be to put the data into Solr or Elasticsearch and use their MLT (More Like This) recommender implementations. A more sophisticated approach is to extract keywords and phrases, push the docs through LSA, and do k-NN on the resulting vectors. Then you may wish to use other signals, such as collaborative filtering and overall popularity.
            $endgroup$
            – Simon
            Oct 16 '16 at 19:27










          • $begingroup$
            @Simon, thanks for your guidance. I am applying the second approach: I have extracted keywords/keyphrases using RAKE + NLTK, and after that I was planning to apply tf-idf or BM25. Am I right? Can you please elaborate on the k-NN approach a little? I mean, how do I apply k-NN to keywords? Should I use the keywords as features? Thanks
            $endgroup$
            – Khalid Usman
            Oct 17 '16 at 11:03
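One way to read the "LSA, then k-NN" suggestion in the comments above is sketched below with scikit-learn: the extracted keywords become tf-idf features, `TruncatedSVD` projects them into a low-dimensional LSA space, and `NearestNeighbors` looks up the closest job vectors for a resume. The job strings, the two-component space, and the variable names are assumptions for illustration, not a description of any production system.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline

# Toy job descriptions, already reduced to keywords and key phrases.
jobs = [
    "java backend developer spring sql",
    "python data scientist statistics machine learning",
    "accountant excel bookkeeping tax",
    "machine learning engineer python deep learning",
]

# tf-idf followed by truncated SVD is the usual way to compute LSA vectors.
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
job_vectors = lsa.fit_transform(jobs)

# Index the job vectors and query with a resume projected into the same space.
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(job_vectors)

resume_keywords = "python statistics machine learning"
distances, indices = index.kneighbors(lsa.transform([resume_keywords]))
nearest_jobs = [jobs[i] for i in indices[0]]
```

With real data you would use far more than two LSA components (a few hundred is common) and combine the neighbour list with other signals before showing recommendations.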

























          answered Jul 7 '14 at 18:36









          Simon












          • $begingroup$
            Do you still use SVM when you have 3 or more classes ? And what features do you want to extract using a natural language parser? For what purpose ?
            $endgroup$
            – user1769197
            Jul 8 '14 at 15:40










          • $begingroup$
            You can train n svm's for n classes using a one vs the rest strategy. SciKitLearn has code to do that automatically. Technically you need n-1 classifiers, but i've found having n works better.
            $endgroup$
            – Simon
            May 18 '15 at 19:54










          • $begingroup$
            @Simon Can you write the complete steps for this recommendation system? I am having little experience (implement MS thesis) in ML, but totally new in IR field. Now I am working on this system and I wrote the following steps. 1. Use NLTK to extract keywords, 2. Calculate score for keywords and phrases, 3. Stemmer , 4. Categorization (the most challenging task) and 5. Frequency matrix , tf-idf or BM25 algo. Am i on the right way of implementation? Thanks
            $endgroup$
            – Khalid Usman
            Oct 4 '16 at 13:17











          • $begingroup$
            @KhalidUsman I can't tell you exactly how it works, that may get me in trouble. The easiest solution would be to put the data into Solr or Elastic Search and use their MLT recommender implementations. A more sophisticated approach is to extract key words and phrases, push the docs through LSA, and do k-nn on the resulting vectors. Then you may wish to use other signals such as Collaborative Filtering, and overall popularity.
            $endgroup$
            – Simon
            Oct 16 '16 at 19:27










          • $begingroup$
            @Simon, thanks for your guidance. I am applying the 2nd way, I have extracted keywords/keyphrases using RAKE+NLTK and after that i was planning to apply tf-idf or BM25. Am i right? Can you please elaborate the KNN way a little bit, i mean how to apply knn on keywords, should i make keywords as a features? Thanks
            $endgroup$
            – Khalid Usman
            Oct 17 '16 at 11:03
















          • $begingroup$
            Do you still use SVM when you have 3 or more classes ? And what features do you want to extract using a natural language parser? For what purpose ?
            $endgroup$
            – user1769197
            Jul 8 '14 at 15:40










          • $begingroup$
            You can train n svm's for n classes using a one vs the rest strategy. SciKitLearn has code to do that automatically. Technically you need n-1 classifiers, but i've found having n works better.
            $endgroup$
            – Simon
            May 18 '15 at 19:54










          • $begingroup$
            @Simon Can you write the complete steps for this recommendation system? I am having little experience (implement MS thesis) in ML, but totally new in IR field. Now I am working on this system and I wrote the following steps. 1. Use NLTK to extract keywords, 2. Calculate score for keywords and phrases, 3. Stemmer , 4. Categorization (the most challenging task) and 5. Frequency matrix , tf-idf or BM25 algo. Am i on the right way of implementation? Thanks
            $endgroup$
            – Khalid Usman
            Oct 4 '16 at 13:17











          • $begingroup$
            @KhalidUsman I can't tell you exactly how it works, that may get me in trouble. The easiest solution would be to put the data into Solr or Elastic Search and use their MLT recommender implementations. A more sophisticated approach is to extract key words and phrases, push the docs through LSA, and do k-nn on the resulting vectors. Then you may wish to use other signals such as Collaborative Filtering, and overall popularity.
            $endgroup$
            – Simon
            Oct 16 '16 at 19:27










          • $begingroup$
            @Simon, thanks for your guidance. I am applying the 2nd way, I have extracted keywords/keyphrases using RAKE+NLTK and after that i was planning to apply tf-idf or BM25. Am i right? Can you please elaborate the KNN way a little bit, i mean how to apply knn on keywords, should i make keywords as a features? Thanks
            $endgroup$
            – Khalid Usman
            Oct 17 '16 at 11:03















          $begingroup$
          Do you still use SVM when you have 3 or more classes ? And what features do you want to extract using a natural language parser? For what purpose ?
          $endgroup$
          – user1769197
          Jul 8 '14 at 15:40




          $begingroup$
          Do you still use SVM when you have 3 or more classes ? And what features do you want to extract using a natural language parser? For what purpose ?
          $endgroup$
          – user1769197
          Jul 8 '14 at 15:40












          $begingroup$
          You can train n svm's for n classes using a one vs the rest strategy. SciKitLearn has code to do that automatically. Technically you need n-1 classifiers, but i've found having n works better.
          $endgroup$
          – Simon
          May 18 '15 at 19:54




          $begingroup$
          You can train n svm's for n classes using a one vs the rest strategy. SciKitLearn has code to do that automatically. Technically you need n-1 classifiers, but i've found having n works better.
          $endgroup$
          – Simon
          May 18 '15 at 19:54












          $begingroup$
          @Simon Can you write the complete steps for this recommendation system? I am having little experience (implement MS thesis) in ML, but totally new in IR field. Now I am working on this system and I wrote the following steps. 1. Use NLTK to extract keywords, 2. Calculate score for keywords and phrases, 3. Stemmer , 4. Categorization (the most challenging task) and 5. Frequency matrix , tf-idf or BM25 algo. Am i on the right way of implementation? Thanks
          $endgroup$
          – Khalid Usman
          Oct 4 '16 at 13:17





          $begingroup$
          @Simon Can you write the complete steps for this recommendation system? I am having little experience (implement MS thesis) in ML, but totally new in IR field. Now I am working on this system and I wrote the following steps. 1. Use NLTK to extract keywords, 2. Calculate score for keywords and phrases, 3. Stemmer , 4. Categorization (the most challenging task) and 5. Frequency matrix , tf-idf or BM25 algo. Am i on the right way of implementation? Thanks
          $endgroup$
          – Khalid Usman
          Oct 4 '16 at 13:17













          $begingroup$
          @KhalidUsman I can't tell you exactly how it works, that may get me in trouble. The easiest solution would be to put the data into Solr or Elastic Search and use their MLT recommender implementations. A more sophisticated approach is to extract key words and phrases, push the docs through LSA, and do k-nn on the resulting vectors. Then you may wish to use other signals such as Collaborative Filtering, and overall popularity.
          $endgroup$
          – Simon
          Oct 16 '16 at 19:27
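
The LSA-plus-k-NN idea from the comment above could look roughly like this. This is only a sketch under assumptions, not the commenter's actual system: it uses scikit-learn's `TfidfVectorizer`, `TruncatedSVD` (as the LSA step), and `NearestNeighbors` with cosine distance, and the documents are made up.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical keyword strings extracted from six documents.
docs = [
    "python machine learning statistics pandas",
    "deep learning python tensorflow statistics",
    "java spring backend microservices sql",
    "java backend rest api sql database",
    "photoshop illustrator branding typography",
    "illustrator typography layout branding print",
]

# Step 1: TF-IDF term-document matrix.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Step 2: LSA, i.e. a truncated SVD of the TF-IDF matrix.
# Three components is an arbitrary choice for this tiny corpus.
lsa = TruncatedSVD(n_components=3, random_state=0)
doc_vectors = lsa.fit_transform(tfidf)

# Step 3: k-NN index over the dense LSA vectors.
knn = NearestNeighbors(n_neighbors=2, metric="cosine")
knn.fit(doc_vectors)

# Recommend the two documents closest to a new query document.
query = ["python statistics pandas"]
query_vector = lsa.transform(vectorizer.transform(query))
distances, indices = knn.kneighbors(query_vector)
recommended = sorted(indices[0].tolist())
print(recommended)
```

Each keyword becomes one column of the TF-IDF matrix, so the keywords act as features and LSA compresses them into a few dense dimensions before the neighbor search.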
















          $begingroup$
          @Simon, thanks for your guidance. I am taking the second approach: I have extracted keywords/keyphrases using RAKE+NLTK, and after that I was planning to apply tf-idf or BM25. Is that right? Can you please elaborate a little on the k-NN approach? I mean, how do I apply k-NN to keywords? Should I use the keywords as features? Thanks
          $endgroup$
          – Khalid Usman
          Oct 17 '16 at 11:03

























































































