Fuzzy name and nickname match


I have a dataset with the following structure:

full_name,nickname,match
Christian Douglas,Chris,1
Jhon Stevens,Charlie,0
David Jr Simpson,Junior,1
Anastasia Williams,Stacie,1
Lara Williams,Ana,0
John Williams,Willy,1

where each predictor row is a pair (full name, nickname) and the target variable, match, is 1 when the nickname corresponds to the person with that name and 0 otherwise. As you can see, the way the nickname is derived from the full name doesn't follow any particular pattern.

I want to train an ML algorithm that, given the pair (full name, nickname), predicts the probability of a match.

My baseline just counts the number of matching characters, together with features like that. However, I am thinking about an NLP approach using deep learning. My question is whether there are neural network architectures specific to this problem.
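A minimal sketch of what such a character-overlap baseline could look like (the specific features here are illustrative assumptions, not code from the question):

from collections import Counter

def overlap_features(full_name, nickname):
    # Character-level overlap features for a (full_name, nickname) pair.
    a, b = full_name.lower(), nickname.lower()
    common = sum((Counter(a) & Counter(b)).values())   # multiset character overlap
    return [
        common / max(len(b), 1),   # fraction of nickname characters present in the name
        int(b[:1] == a[:1]),       # same first letter
        int(b in a),               # nickname is a literal substring of the name
    ]

print(overlap_features("Christian Douglas", "Chris"))  # -> [1.0, 1, 1]

Features like these would be fed to any standard classifier to produce the baseline match probability.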










deep-learning nlp

asked Mar 19 at 13:36, edited Mar 19 at 14:17 – David Masip

This question has an open bounty worth +50 reputation from David Masip, ending in 5 days (2019-03-29 13:56:55Z). Looking for an answer drawing from credible and/or official sources; it should give an overview of the state of the art on this question.

Comments:

  • How many samples are there for training? – Shamit Verma, Mar 19 at 13:38

  • Around 100k, but only 17% have match = 0. – David Masip, Mar 19 at 13:46

  • Is this an open dataset that you can share a link to? – Adarsh Chavakula, yesterday

  • How do you find a relationship between a name and a nickname when there is none, for instance "Anastasia Williams|Stacie"? I think you need more features to make this work. – iamklaus, 5 hours ago






2 Answers

I couldn't find any useful literature out there on using deep learning for this specific problem. Most methods seem to rely on non-machine-learning techniques like string similarities and Levenshtein distances. A reasonable deep-learning-based approach to this problem would be a recurrent neural network. An LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) would be ideal. The idea is to have an RNN that has an internal state and respects the order in which the inputs are fed.

Unlike text classification, sentiment analysis or sequence generation, the preferred encoding for the text here would be at the character level instead of the word level.

For example,

Christian Douglas,Chris,1
Jhon Stevens,Charlie,0

would become

[C,h,r,i,s,t,i,a,n, ,D,o,u,g,l,a,s, ,C,h,r,i,s] --> [1]
[J,h,o,n, ,S,t,e,v,e,n,s, ,C,h,a,r,l,i,e] --> [0]

The two strings to be matched are concatenated into a single sequence. The intuition here is that the RNN would process the sequence character by character and learn (read: update its weights) that the characters at the end follow a pattern similar to what it saw earlier in the same sequence, and so deduce that the label should be 1 instead of 0.

The vector of [1/0] is the target variable.

The standard RNN pre-processing steps apply as usual: we pad the sequences at the beginning so that they're all the same length (say 50), encode the characters as numbers instead of strings, and so on.

Since the dictionary here is pretty small (26 letters + space + pad), the network architecture can be fairly simple. A single embedding layer + recurrent layer should suffice.

Framing the problem in this manner allows us to use a vanilla RNN or an out-of-the-box LSTM/GRU instead of creating a custom architecture that takes two separate strings as input for each data point and outputs a number.

You could give this approach a shot and see if it is able to beat the baseline models satisfactorily.

A good read on character-level RNNs is Andrej Karpathy's blog post and code. The problem he's trying to solve is different and the code is in pure numpy, but it still captures the idea pretty well.
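A minimal sketch of this setup in Keras (my own illustration, which the answer does not prescribe; the layer sizes and two-row toy dataset are arbitrary):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical toy data: each (full_name, nickname) pair becomes one character sequence.
pairs = [("Christian Douglas", "Chris"), ("Jhon Stevens", "Charlie")]
labels = np.array([1, 0])

chars = "abcdefghijklmnopqrstuvwxyz "           # 26 letters + space; index 0 is the pad
vocab = {c: i + 1 for i, c in enumerate(chars)}
seqs = [[vocab[c] for c in (f + " " + n).lower()] for f, n in pairs]
X = pad_sequences(seqs, maxlen=50)               # pads at the beginning by default

model = Sequential([
    Embedding(input_dim=len(vocab) + 1, output_dim=16),  # single embedding layer
    LSTM(32),                                            # single recurrent layer
    Dense(1, activation="sigmoid"),                      # probability of a match
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=3, verbose=0)
print(model.predict(X))                          # match probability for each pair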






– Adarsh Chavakula, answered yesterday

I had a similar problem in my last job. My solution was to build features via many (transformation + comparison) combinations and feed them to models, then aggregate and model again, i.e. a two-layer model. The key is using encodings and similarity scores as features.

Transforms: remove vowels (great for certain roots), remove final vowels, remove doubled characters, convert to a phonetic string (IPA, Soundex, https://pypi.org/project/Fuzzy/), replace characters that either sound similar or have different sounds in other languages (J in Eastern Europe sounds like Y in the US; C can sound like K; D like T; Ts like TH; etc.), ... The strategy is to handle the many kinds of weirdness/irregularity in people's names.

Comparisons (similarity and difference): try similarity and difference scores at the character level, the block/root/prefix/suffix level, and the word level (which may not apply to you). Try Dice's coefficient, Levenshtein, Needleman–Wunsch, longest common (non)contiguous substring, character-histogram similarity, number of characters matching and not matching (each left and right), etc. You could also try using an RNN/LSTM and have it learn similarity for each transform, then use the output of the trained model(s) as another feature.

Experiment with different combos of the above and select a few that seem to have value. You could simply take all the scores and fit a logistic regression (or a neural net), or you could build statistical models and output a percent rank based on a small training set to normalize the scores. Another way to preprocess the raw scores is calibration encoding via a logistic function. Then add summary statistics of the normalized scores as additional features. Push all of this into the final model.

Will you handle names derived from Arabic, Spanish, French, etc. names? This is just extra, but consider downloading the Social Security and US Census name statistics to enrich your project with more name variations. I'll leave the how to you, but it helps to know the likely possibilities. Be aware that simply using Levenshtein doesn't work so well with William->Bill, Dianne->Di, Larry->Lawrence, Mohammed->Muhamed and Hamed, Danielle->Daniela, Thomas->Tom, and Jimmy->James. The strategy I mentioned should help you with all that variation.

Additional resources to explore:
https://github.com/jamesturk/jellyfish
https://nameberry.com/list/276/If-You-Like-Danielle-You-Might-Love
https://pypi.org/project/phonetics/
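A minimal sketch of the (transform + comparison) feature idea, using standard-library similarity scores and scikit-learn in place of the fuller toolkit the answer lists (the transforms, the single score, and the toy data are all illustrative assumptions):

from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

VOWELS = set("aeiou")

def transforms(s):
    # A few of the suggested transforms: lowercase, vowel-stripped, doubles removed.
    s = s.lower()
    return [s,
            "".join(c for c in s if c not in VOWELS),
            "".join(c for i, c in enumerate(s) if i == 0 or c != s[i - 1])]

def features(full_name, nickname):
    # One similarity score per (transform, transform) combination.
    return [SequenceMatcher(None, a, b).ratio()
            for a in transforms(full_name) for b in transforms(nickname)]

# Tiny hypothetical training set in the question's format.
rows = [("Christian Douglas", "Chris", 1), ("Jhon Stevens", "Charlie", 0),
        ("David Jr Simpson", "Junior", 1), ("Lara Williams", "Ana", 0)]
X = [features(f, n) for f, n, _ in rows]
y = [m for _, _, m in rows]

clf = LogisticRegression().fit(X, y)                       # the second-layer model
print(clf.predict_proba([features("John Williams", "Willy")])[:, 1])

In practice you would swap in richer transforms (Soundex, phonetic libraries such as jellyfish) and more comparison scores, exactly as described above.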






– ldmtwo, answered 12 hours ago; edited 10 hours ago by Siong Thye Goh