Can I create a good Speech Recognition Engine while having millions of recorded conversations?


I have at my disposal millions of WAV files containing recorded conversations between employees and clients, and I'm researching the possibility of creating a good speech recognition engine. I've tested Google's Speech-to-Text and it's great. Is it possible to create something similar? (Of course, no one can beat the quality and quantity of the data Google has, but how close can one get?) Also, what are the technical limits (such as the hardware needed for this kind of training), and how much time should it take to achieve it?

Note: I'm a beginner in ML. So far I've done some binary and multiclass classification; I have an idea of how neural networks work but haven't done any work with them. The simpler the answer, the easier it is for me to understand. Thanks!

asked Mar 29 at 8:46

Blenzus


deep-learning speech-to-text


1 Answer

Yes, having lots of recorded conversations is great for building a speech recognition system. You will still have to create labeled training samples (each sample maps a segment of a WAV file to its text), but you will need fewer of them.
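To make "a segment of a WAV file mapped to its text" concrete, here is a minimal sketch using only Python's standard `wave` module. The `Sample` class, the `cut_samples` helper, and the `(start_sec, end_sec, text)` segment format are hypothetical names chosen for illustration; in practice the timestamps and transcripts would come from whoever labels the recordings.

```python
import wave
from dataclasses import dataclass


@dataclass
class Sample:
    """One labeled training sample: a slice of audio plus its transcript."""
    pcm: bytes        # raw PCM frames for the segment
    sample_rate: int  # frames per second of the source recording
    text: str         # human-written transcript of the segment


def cut_samples(wav_file, segments):
    """Cut (start_sec, end_sec, text) segments out of one WAV recording.

    `wav_file` is a path or a binary file object accepted by wave.open.
    """
    samples = []
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
        for start, end, text in segments:
            # Jump to the first frame of the segment, then read its length.
            wav.setpos(int(start * rate))
            frames = wav.readframes(int((end - start) * rate))
            samples.append(Sample(frames, rate, text))
    return samples
```

Calling `cut_samples("call_0001.wav", [(0.0, 2.5, "hello, how can I help you")])` (the file name is made up) would return one `Sample` holding the first 2.5 seconds of audio and its transcript.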

The high-level steps are:

1. Train a GAN on raw audio.

2. Train language models on raw text data (it need not come from these conversations, but it has to be from the same domain). For example, if the conversations are medical, train the language model on medical text.

3. Merge these models and train on labeled samples.
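As a toy illustration of why a domain language model helps, the sketch below trains an add-one-smoothed bigram model on a few domain sentences and uses it to re-rank candidate transcripts. This is a simplified stand-in, not the neural approach used in the papers cited in this answer: `BigramLM`, `rescore`, `lm_weight`, and the acoustic scores are all hypothetical, and combining the two scores with a weight at decoding time is only one possible reading of "merge".

```python
import math
from collections import Counter


class BigramLM:
    """Add-one-smoothed bigram language model over whitespace tokens."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, sentence):
        """Return the smoothed log-probability of a whole sentence."""
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            num = self.bigrams[(prev, word)] + 1
            den = self.unigrams[prev] + self.vocab_size
            total += math.log(num / den)
        return total


def rescore(candidates, lm, lm_weight=0.5):
    """Pick the best transcript from (text, acoustic_log_prob) candidates."""
    best = max(candidates, key=lambda c: c[1] + lm_weight * lm.log_prob(c[0]))
    return best[0]


# Domain text: medical sentences, matching the example in step 2.
lm = BigramLM([
    "the patient has a fever",
    "the patient needs rest",
    "take the medicine twice a day",
])
# Made-up acoustic scores: the wrong transcript sounds slightly more likely.
candidates = [("the patient has a fever", -4.0),
              ("the patient has a fee for", -3.9)]
print(rescore(candidates, lm))
```

The acoustic model alone slightly prefers the wrong transcript, but the language model has seen "a fever" in domain text and tips the combined score toward the correct one.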

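For step 1, Google's WaveNet is a good example of a generative model trained on raw audio (strictly speaking it is an autoregressive model rather than a GAN; it was ultimately used for text-to-speech, but the WaveNet paper also reports applying the architecture to speech recognition):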

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

Papers that cover the design and overall approach:

https://arxiv.org/abs/1711.01567
https://arxiv.org/abs/1803.10132

answered Mar 29 at 9:23

Shamit Verma


What's a GAN, in the first point? So basically, if I understand correctly, I have to train one model on raw audio and another on raw text data, then merge them and train on voice samples with their corresponding text transcriptions? Thank you for the answer!
– Blenzus
Mar 29 at 9:31

GAN: skymind.ai/wiki/generative-adversarial-network-gan. A GAN learns the "business domain", which can then be used to solve problems related to that domain.
– Shamit Verma
Mar 29 at 10:12

Thanks a lot, Shamit!
– Blenzus
Mar 29 at 10:38

