Training on accurate data versus noisy data
I currently have data that is very accurate, and I would like to train my classification methods on this clean set to learn the important markers for distinguishing between classes. In the future, however, my trained classifiers will not be making decisions on cleaned data; the inputs will likely contain far more noise, following some unknown distribution(s). So I am wondering: is it 'better' to train on noisy data if I will likely see noisy data in the future, or to train on good data, since the noisy data should (ideally) correspond to the cleaned data once the noise is removed?
Intuitively, if my goal is simply to perform classifications, then training on noisy data seems the 'better' approach, since it is more representative of my expected future inputs. But if my goal is to learn about the data and what constitutes a particular decision, then training on cleaned data appears the 'better' approach.
Am I overlooking anything? Would training on the clean and/or noisy data be preferable for different reasons (e.g. generalization, overfitting, dimensionality reduction)?
machine-learning error-handling noise generalization
asked Sep 12 '18 at 20:05 by Mathews24
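One way to make the question concrete is to run the comparison as an experiment. A minimal sketch, assuming scikit-learn, a synthetic dataset, and additive Gaussian noise as a stand-in for the unknown deployment noise:

```python
# Illustrative experiment (not from the original post): train one classifier
# on clean features and one on artificially noised features, then evaluate
# both on a noisy test set that mimics deployment. The additive Gaussian
# noise is an assumption standing in for the unknown real noise.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

noise_scale = 0.5  # arbitrary; the real noise level is unknown
X_train_noisy = X_train + rng.normal(0.0, noise_scale, size=X_train.shape)
X_test_noisy = X_test + rng.normal(0.0, noise_scale, size=X_test.shape)

clf_clean = LogisticRegression(max_iter=1000).fit(X_train, y_train)
clf_noisy = LogisticRegression(max_iter=1000).fit(X_train_noisy, y_train)

# Both models face the same noisy inputs, as they would in deployment.
print("trained on clean data:", clf_clean.score(X_test_noisy, y_test))
print("trained on noisy data:", clf_noisy.score(X_test_noisy, y_test))
```

Whichever model scores higher answers the question for this particular noise model; with a different noise distribution the outcome can flip, which is why the noisy test set should mimic deployment as closely as possible.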
Training on the clean data could definitely result in overfitting. If you don't have to choose, why not try both? Training on the clean data could help you better understand the data you already have. I would definitely recommend training on noisy data for predictions. Unless there's a way to clean incoming noisy data? What's the nature of the noise?
– Adrian Keister, Sep 12 '18 at 20:55
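On the comment's question of cleaning incoming noisy data: one simple possibility, sketched below under the strong assumption that the clean features lie near a low-dimensional subspace, is to fit PCA on the clean data and project noisy inputs onto that subspace at prediction time. The rank-5 synthetic data and the noise level are arbitrary choices made so the sketch runs end to end.

```python
# Hypothetical denoising step for incoming data, per the comment's suggestion.
# Fit a PCA basis on the clean training features, then reconstruct noisy
# inputs from their projection, discarding variation the clean data never
# exhibited. Whether this helps depends entirely on the real noise structure.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic clean features that are exactly rank 5 (an assumption for the demo).
X_clean = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 20))

pca = PCA(n_components=5).fit(X_clean)

def denoise(X_incoming):
    # Project onto the clean data's principal subspace and reconstruct,
    # removing the noise components orthogonal to that subspace.
    return pca.inverse_transform(pca.transform(X_incoming))

X_noisy = X_clean[:10] + rng.normal(0.0, 0.5, size=(10, 20))
print("error before denoising:", np.linalg.norm(X_noisy - X_clean[:10]))
print("error after denoising: ", np.linalg.norm(denoise(X_noisy) - X_clean[:10]))
```

If the real noise lives inside the clean data's subspace rather than orthogonal to it, this projection does nothing, which is exactly why the comment asks about the nature of the noise.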
1 Answer
The answer depends heavily on what you mean by 'noisy' data. Are the labels noisy, i.e., wrong? Are the features noisy? Or both? If only the features are noisy, definitely use the noisy data, and probably the clean data as well. If only the labels are noisy, definitely do not use the noisy data. If both, is it possible to correct the labels? At the very least, can you get a reliable test set (representative, with correct labels)? You could try training on both the noisy and the clean data and see which gives better performance. Important methods to consider here are regularization and early stopping.
It also depends on which algorithm you are using. Linear regression, for example, is unlikely to overfit, while neural networks are extremely sensitive to noise, and there is a spectrum of approaches between those two extremes.
answered Mar 21 at 19:27 by Prachi
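As an illustration of the distinction the answer draws, a minimal sketch, assuming a 20% label-flip rate, a synthetic dataset, and scikit-learn's MLPClassifier (none of which come from the original thread):

```python
# Illustrative sketch: label noise is injected into the training set while
# the features stay clean, and early stopping plus L2 regularization (the
# two safeguards named in the answer) limit how much of that noise the
# network memorizes. All specifics here are demonstration assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Label noise: flip 20% of the training labels (features stay clean).
y_noisy = y_train.copy()
flip = rng.random(len(y_noisy)) < 0.2
y_noisy[flip] = 1 - y_noisy[flip]

# early_stopping holds out validation_fraction of the training data and
# stops once the validation score plateaus; alpha adds an L2 penalty.
mlp = MLPClassifier(early_stopping=True, validation_fraction=0.2,
                    alpha=1e-3, max_iter=500, random_state=0)
mlp.fit(X_train, y_noisy)

# Score on the clean test labels to see what the label noise actually cost.
print("test accuracy after training on flipped labels:", mlp.score(X_test, y_test))
```

Comparing this score against the same model trained on the unflipped labels measures directly how much the label noise costs, which is the experiment the answer recommends running.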