Can I create a good Speech Recognition Engine while having millions of recorded conversations?
I have at my disposal millions of WAV files containing recorded conversations between employees and clients, and I'm researching the possibility of creating a good speech recognition engine. I've tested Google's Speech-to-Text and it's great. Is it possible to create something similar? (Of course, no one can beat the quality and quantity of the data Google has, but how close can one get?) Also, what are the technical limits (such as the hardware needed for this kind of learning), and how much time should it take to achieve it?
Note: I'm a beginner in ML. So far I've done some binary and multiclass classification, and I have an idea of how neural networks work but have not done any work with them. The simpler the answer, the easier it will be for me to understand. Thanks!
deep-learning speech-to-text
asked Mar 29 at 8:46 by Blenzus
1 Answer
Yes, having lots of recorded conversations is a great starting point for building a speech recognition system. You will still have to create labeled training samples (each sample pairs a segment of a WAV file with its text), but you will need fewer of them.
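As a rough illustration of what "a segment of a WAV file paired with its text" could look like, here is a minimal sketch using only Python's standard `wave` module. The helper names (`make_silence`, `cut_segment`), the 8 kHz sample rate, the timestamps, and the transcripts are all assumptions made up for this example; real recordings would be loaded from disk and annotated by a person or an existing tool.

```python
import io
import wave


def make_silence(seconds: float, rate: int = 8000) -> bytes:
    """Build a mono 16-bit WAV of silence in memory (stand-in for a real recording)."""
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return out.getvalue()


def cut_segment(wav_bytes: bytes, start_s: float, end_s: float) -> bytes:
    """Return a new WAV (as bytes) holding only the audio between start_s and end_s."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        first = int(start_s * rate)
        count = max(0, int(end_s * rate) - first)
        src.setpos(min(first, src.getnframes()))
        frames = src.readframes(count)
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)    # same channels, sample width and rate as the source
        dst.writeframes(frames)  # the frame count in the header is corrected on close
    return out.getvalue()


# Hypothetical annotations: (start in seconds, end in seconds, transcript).
annotations = [
    (0.0, 1.5, "hello how can i help you"),
    (1.5, 3.0, "i would like to check my order"),
]

call = make_silence(3.0)  # placeholder for one recorded conversation
samples = [(cut_segment(call, start, end), text) for start, end, text in annotations]
```

Each entry in `samples` is one (audio, text) pair of the kind a supervised speech model is trained on.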
High-level steps:
1. Train a GAN on raw audio.
2. Train language models on raw text data (it need not come from these conversations, but it has to be from the same domain). For example, if the conversations are medical, train the language model on medical text.
3. Merge these models and train on labeled samples.
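To make the language-model step above more concrete, here is a toy sketch: a bigram model with add-one smoothing trained on a few made-up medical sentences, then used to choose between two candidate transcripts. This is a deliberately simple stand-in, not the neural language models used in production systems, and the sentences, function names, and candidates are assumptions for illustration only.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over whitespace-tokenised, lower-cased sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed log-probability of a sentence under the bigram model."""
    vocab_size = len(unigrams) + 1  # +1 reserves some mass for unseen words
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


# Made-up in-domain text; a real model would need far more data.
domain_text = [
    "the patient was given a dose of insulin",
    "the patient has high blood pressure",
    "a dose of insulin lowers blood sugar",
]
unigrams, bigrams = train_bigram_lm(domain_text)

# Two transcripts that could sound alike; the domain model prefers the first.
candidates = ["the patient was given insulin", "the patient was given in sulin"]
best = max(candidates, key=lambda s: log_prob(s, unigrams, bigrams))
```

The point of the sketch is only that text from the right domain helps a recogniser prefer plausible word sequences; it says nothing about how the audio model itself is trained.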
For step 1, Google's WaveNet is a good example of a generative model trained on raw audio (strictly speaking, it is autoregressive rather than a GAN). It ended up being used for Text-to-Speech, and it is a component in Speech-to-Text as well:
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Papers that cover the design and overall approach:
https://arxiv.org/abs/1711.01567
https://arxiv.org/abs/1803.10132
answered Mar 29 at 9:23 by Shamit Verma
What's a GAN (in the first point)? So basically, if I understand correctly, I have to train one model on raw audio and another on raw text data, then merge them and train on voice samples with their corresponding text transcriptions? Thank you for the answer! – Blenzus, Mar 29 at 9:31
GAN: skymind.ai/wiki/generative-adversarial-network-gan. A GAN learns the "business domain", which can then be used to solve problems related to that domain. – Shamit Verma, Mar 29 at 10:12
Thanks a lot, Shamit! – Blenzus, Mar 29 at 10:38