Training on accurate data versus noisy data

I currently have very accurate data, and I would like to train my classifiers on this clean set to learn the important markers for distinguishing between classes. In the future, however, my trained classifiers will not be making decisions on clean data; their inputs will likely contain much more noise, following some unknown distribution(s). So I am wondering: is it 'better' to train on noisy data, since I will likely see noisy data in the future, or to train on clean data, since the noisy data should (ideally) match the clean data once the noise is removed?

Intuitively, if my goal is simply to classify, then training on noisy data seems the 'better' approach, since it is more representative of my expected future inputs. But if my goal is to learn about the data and what drives a particular decision, then training on clean data appears to be the 'better' approach.

But am I overlooking anything? Would training on the clean and/or noisy data be preferable for different reasons (e.g. generalization, overfitting, reducing dimensionality)?

asked Sep 12 '18 at 20:05

Mathews24


Training on the clean data could definitely result in overfitting. If you don't have to choose, why not try both? Training on the clean data could help you understand the data you already have better. I would definitely recommend training on noisy data for predictions. Unless there's a way to clean incoming noisy data? What's the nature of the noise?
– Adrian Keister
Sep 12 '18 at 20:55

machine-learning error-handling noise generalization


1 Answer

The answer depends heavily on what you mean by 'noisy' data. Are the labels noisy, i.e. wrong? Are the features noisy? Or both? If only the features are noisy, definitely use the noisy data, and probably the clean data as well. If only the labels are noisy, definitely do not use the noisy data. If both are noisy, is it possible to correct the labels? At the very least, can you get a reliable test set (representative, with correct labels)? You could then train on the noisy data, on the clean data, and on both, and see which gives the best performance on that test set. Important methods to consider here are regularization and early stopping.
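As a rough illustration of the "train on each and compare" suggestion, here is a minimal sketch for the feature-noise case. Everything in it is an assumption made for the example: the synthetic two-class Gaussian data, the additive Gaussian noise and its scale, the small NumPy logistic regression, and the helper names (`make_data`, `add_feature_noise`, `fit_logreg`, `compare_training_sets`). The one point it is meant to show is that all candidates are scored on the same test set, which has correct labels and the kind of noise expected in deployment.

```python
import numpy as np


def make_data(n, rng):
    """Two Gaussian classes in 2-D (synthetic; labels are always correct)."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 1.5, -1.5)
    return X, y


def add_feature_noise(X, scale, rng):
    """Assumed noise model: additive Gaussian noise on every feature."""
    return X + rng.normal(scale=scale, size=X.shape)


def fit_logreg(X, y, l2=0.01, lr=0.1, epochs=300):
    """Gradient descent on the L2-regularized logistic loss.

    For brevity the bias term is regularized along with the other weights.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * (Xb.T @ (p - y) / len(y) + l2 * w)
    return w


def accuracy(w, X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean((Xb @ w > 0) == (y == 1)))


def compare_training_sets(seed=0, noise_scale=2.0):
    """Train on clean, noisy, and combined features; score all on one test set."""
    rng = np.random.default_rng(seed)
    X_clean, y_train = make_data(500, rng)
    X_noisy = add_feature_noise(X_clean, noise_scale, rng)

    # Reliable test set: correct labels, features noisy as in deployment.
    X_test, y_test = make_data(500, rng)
    X_test = add_feature_noise(X_test, noise_scale, rng)

    candidates = {
        "clean": (X_clean, y_train),
        "noisy": (X_noisy, y_train),
        "both": (np.vstack([X_clean, X_noisy]),
                 np.concatenate([y_train, y_train])),
    }
    return {name: accuracy(fit_logreg(X, y), X_test, y_test)
            for name, (X, y) in candidates.items()}


if __name__ == "__main__":
    for name, acc in compare_training_sets().items():
        print(f"{name:>5}: test accuracy = {acc:.3f}")
```

On a toy problem this symmetric the three scores may come out close to each other; the gap on real data depends on the actual noise, so the sketch shows only the comparison procedure, not which option wins.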

It also depends on which algorithm you are using. Linear regression, for example, is unlikely to overfit, while neural networks can be extremely sensitive to noise, and there is a variety of approaches between those two extremes.
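For models that can overfit noise, early stopping against a reliable validation set is one of the safeguards mentioned above. The following is only a sketch under assumptions: a NumPy logistic regression stands in for whatever model is actually used, and the function names, the `patience` rule, and the default hyperparameters are hypothetical choices for illustration.

```python
import numpy as np


def log_loss(w, X, y):
    """Mean logistic loss, using logaddexp for numerical stability."""
    z = X @ w
    return float(np.mean(np.logaddexp(0.0, z) - y * z))


def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              lr=0.1, l2=0.01, max_epochs=1000, patience=20):
    """Gradient descent that keeps the weights with the lowest validation loss.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs.
    """
    w = np.zeros(X_tr.shape[1])
    best_w, best_loss, best_epoch = w.copy(), log_loss(w, X_val, y_val), 0
    for epoch in range(1, max_epochs + 1):
        p = 1.0 / (1.0 + np.exp(-(X_tr @ w)))
        w = w - lr * (X_tr.T @ (p - y_tr) / len(y_tr) + l2 * w)
        val_loss = log_loss(w, X_val, y_val)
        if val_loss < best_loss:
            best_w, best_loss, best_epoch = w.copy(), val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no recent improvement on the validation set
    return best_w, best_loss, best_epoch
```

The validation set plays the role of the reliable, correctly labeled data the answer asks for: the training set may be noisy, but the decision about when to stop is made on data that can be trusted.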

answered Mar 21 at 19:27

Prachi
