


Can I create a good Speech Recognition Engine while having millions of recorded conversations?















I have millions of WAV files containing recorded conversations between employees and clients, and I'm researching whether I could use them to build a good speech recognition engine. I've tested Google's Speech-to-Text and it's great. Is it possible to create something similar? (Of course, no one can match the quality and quantity of the data Google has, but how close can one get?) What are the technical limits (such as the hardware needed for this kind of training), and how long should it take to get there?

Note: I'm a beginner in ML. So far I've done some binary and multiclass classification, and I have a basic idea of neural networks but no hands-on work with them. The simpler the answer, the easier it is for me to understand. Thanks!























      deep-learning speech-to-text
















      asked Mar 29 at 8:46









Blenzus




















          1 Answer












Yes, having lots of recorded conversations is great for building a speech recognition system. You will still have to create labeled training samples (each sample maps a segment of a WAV file to its text), but you will need fewer of them, because the unlabeled recordings can be used for pre-training.

The high-level steps are:

1. Train a generative model (for example, a GAN) on the raw audio.

2. Train language models on raw text data (it need not come from these conversations, but it has to be from the same domain). For example, if the conversations are about medical topics, train the language model on medical text.

3. Merge these models and train on the labeled samples.

For step 1, Google's WaveNet is a good example of a generative model for raw audio. Strictly speaking it is an autoregressive model rather than a GAN, and it is mainly used for Text-to-Speech, but the WaveNet paper also applies it to speech recognition.

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

Papers that cover the design and overall approach:

https://arxiv.org/abs/1711.01567
https://arxiv.org/abs/1803.10132
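The labeled samples mentioned above (a segment of a WAV file paired with its text) could be prepared along these lines. This is only a minimal sketch using Python's standard `wave` module: the function names `split_wav` and `make_test_wav`, the 5-second segment length, and the 8 kHz mono test file are all assumptions made for illustration, not anything prescribed by the answer. Cutting at fixed intervals can split words in half, so real pipelines often cut at silences instead.

```python
import io
import wave


def split_wav(wav_file, segment_seconds=5.0):
    """Yield (start_second, raw_frames) chunks of at most segment_seconds each.

    wav_file may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
        frames_per_segment = int(rate * segment_seconds)
        start_frame = 0
        while True:
            frames = wav.readframes(frames_per_segment)
            if not frames:  # readframes returns b"" at the end of the file
                break
            yield start_frame / rate, frames
            start_frame += frames_per_segment


def make_test_wav(seconds, rate=8000):
    """Build an in-memory mono 16-bit WAV of silence (hypothetical demo input)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    buffer.seek(0)
    return buffer


# Each segment still needs a human-written transcript to become a labeled sample.
segments = list(split_wav(make_test_wav(12), segment_seconds=5.0))
samples = [{"start": start, "audio": frames, "text": None} for start, frames in segments]
```

Here `samples` holds one entry per segment with an empty `text` field; filling in those transcripts is the manual labeling work the answer refers to.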






















          answered Mar 29 at 9:23









Shamit Verma











• What's a GAN (in the first step)? So, if I understand correctly, I have to train one model on raw audio and another on raw text data, then merge them and train on voice samples paired with their corresponding text transcriptions? Thank you for the answer!
  – Blenzus
  Mar 29 at 9:31

• GAN: skymind.ai/wiki/generative-adversarial-network-gan. A GAN learns the "business domain", and what it learns can then be used to solve problems related to that domain.
  – Shamit Verma
  Mar 29 at 10:12

• Thanks a lot, Shamit!
  – Blenzus
  Mar 29 at 10:38















