Training an acoustic model for a speech-to-text engine



What are the steps for training an acoustic model, and what format should the training data (the audio) have, e.g. its length and other characteristics? If anyone could provide a simple example of how to train an acoustic model, it would be greatly appreciated.










Tags: neural-network, nlp, speech-to-text






asked Apr 4 at 9:25 by Blenzus; edited Apr 4 at 18:44 by Stephen Rauch




















2 Answers







These lecture slides have details on how to prepare audio data (and merge it with language models) for speech-to-text:






          http://slazebni.cs.illinois.edu/spring17/lec26_audio.pdf



Slide 16 has a very high-level description of the feature engineering.






The following papers provide a more detailed analysis of audio processing for speech-to-text:



          http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.701.6802&rep=rep1&type=pdf
          http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.689.4627&rep=rep1&type=pdf
          https://www.ijariit.com/manuscripts/v4i3/V4I3-1646.pdf
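
For concreteness, here is a minimal sketch of the usual first feature-engineering stage (my own addition, not taken from the slides): extracting MFCC frames with librosa. The file name is a placeholder.

    # Minimal sketch: turn raw audio into MFCC frames that an acoustic
    # model can be trained on. Assumes librosa is installed.
    import librosa

    def extract_mfcc(path, sr=16000, n_mfcc=13):
        """Load audio at a fixed sample rate and compute MFCC frames."""
        y, _ = librosa.load(path, sr=sr)  # resample to 16 kHz mono
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return mfcc.T  # shape (num_frames, n_mfcc): one vector per frame

    features = extract_mfcc("utterance.wav")  # placeholder path
    print(features.shape)

These fixed-size frame vectors (often with delta features stacked on) are what the acoustic model is trained to map to phoneme or character probabilities.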






answered Apr 4 at 10:32 by Shamit Verma

There are many different ways to do this; it depends a lot on what you want. A general dictation algorithm that works for most people who speak the same language? One that optimises itself for each person? Just a few keywords to input into a device? And so on.



I'd personally just download some of the open-source implementations on the internet (https://github.com/topics/speech-to-text), try to understand how they work (literature like the links above can help), and then write something that fits my requirements. That will also help you avoid unnecessary complications when interpreting sounds. Translating sounds into phonemes, for instance, is a step that for most applications only adds complexity and, given the differences in how people pronounce things, increases errors. A database/table with the sounds of complete syllables or words is more likely to lead to results in many use cases. The database should be optimised each time a user corrects something the computer wrote: by adding the corresponding sound-word combination, by increasing the priority of the solution that turned out to be correct when there were other possible choices, or by using surrounding words as indicators of the correct choice (dynamic word pairing).
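
As a toy illustration of that correction loop (my sketch, not the answerer's code; the "sound keys" are placeholder strings standing in for whatever acoustic representation you use):

    from collections import defaultdict

    class SoundTable:
        """Maps a sound key to candidate words, ranked by learned priority."""
        def __init__(self):
            self.table = defaultdict(lambda: defaultdict(int))

        def candidates(self, sound_key):
            # Return the known words for this sound, best priority first.
            words = self.table[sound_key]
            return sorted(words, key=words.get, reverse=True)

        def correct(self, sound_key, word):
            # User confirmed `word` for this sound: add it or bump its priority.
            self.table[sound_key][word] += 1

    t = SoundTable()
    t.correct("t-uw", "two")
    t.correct("t-uw", "too")
    t.correct("t-uw", "two")
    print(t.candidates("t-uw"))  # ['two', 'too']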



Be aware that even if your code identifies sounds better than any human, it will still make a lot of mistakes with similar-sounding words, or by missing where one word ends and the next begins. The usual workaround is word pairs: if you have several possibilities, you choose the one whose words most often occur together. That may even produce text better than most humans would, but still with some very funny misinterpretations. If you want to avoid those, I'm afraid you'd have to model a far more complex AI, which does not yet exist except maybe in a few labs: one that makes sense of what someone says, so it can make informed guesses about which of many similar-sounding options was actually meant. Not to mention catching errors made by the speaker.
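
A bare-bones sketch of that word-pair idea (the pair counts are invented for illustration; a real system would estimate them from a large text corpus):

    from itertools import product

    # Toy pair frequencies; in practice these come from a text corpus.
    bigram_counts = {("I", "see"): 50, ("I", "sea"): 1,
                     ("the", "sea"): 40, ("the", "see"): 2}

    def best_sequence(alternatives):
        """Pick the word sequence whose adjacent pairs co-occur most often."""
        def score(seq):
            return sum(bigram_counts.get(pair, 0) for pair in zip(seq, seq[1:]))
        return max(product(*alternatives), key=score)

    # Two acoustically confusable options for the second word:
    print(best_sequence([["I"], ["see", "sea"]]))  # ('I', 'see')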



If you only need a very limited vocabulary, you could also let people train the device (which would then work for any language) and maybe learn from experience, for example when a word is sometimes pronounced differently. This can be as simple as recording a sound for each word. Adding similar sounds to a word when the user doesn't quickly change their mind (a quick change indicating that the computer misunderstood), or different sounds when a correction occurs, is also an option and helps improve recognition. The sounds and words could also be averaged out or parsed for differentiating parts, for better hit rates. Such measures would also work for the database described above.
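
One way to implement that record-a-sound-per-word scheme (my assumption; the paragraph above doesn't name an algorithm) is nearest-template matching with dynamic time warping over feature frames:

    import numpy as np

    def dtw_distance(a, b):
        """Dynamic-time-warping distance between two (frames, dims) sequences."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    def recognise(utterance, templates):
        """Return the vocabulary word whose stored template is nearest."""
        return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))

    # Stand-in feature sequences; real ones would be MFCC frames per word.
    rng = np.random.default_rng(0)
    templates = {"yes": rng.normal(size=(30, 13)), "no": rng.normal(size=(25, 13))}
    print(recognise(templates["yes"] + 0.01, templates))  # 'yes'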






answered Apr 4 at 19:46 by Carl Dombrowski


























