Training on accurate data versus noisy data

I currently have very accurate data, and I would like to train my classifiers on this clean set to learn the important markers that distinguish the classes. In the future, however, my trained classifiers will not be making decisions on cleaned data; their inputs will likely carry a lot more noise, following some unknown distribution(s). So I am wondering: is it 'better' to train on noisy data because that is what I will likely see in the future, or to train on clean data because the noisy data should (ideally) correspond to the clean data once the noise is removed?

Intuitively, if my goal is simply to classify, then training on noisy data seems the 'better' approach, since it is more representative of my expected future inputs. But if my goal is to learn about the data and what drives a particular decision, then training on cleaned data appears the 'better' approach.

But am I overlooking anything? Would training on the clean and/or noisy data be preferable for other reasons (e.g. generalization, overfitting, reducing dimensionality)?











  • Training on the clean data could definitely result in overfitting. If you don't have to choose, why not try both? Training on the clean data could help you better understand the data you already have. I would definitely recommend training on noisy data for predictions, unless there's a way to clean the incoming noisy data. What's the nature of the noise?
    – Adrian Keister, Sep 12 '18 at 20:55















machine-learning error-handling noise generalization
















asked Sep 12 '18 at 20:05 by Mathews24





















1 Answer
The answer depends heavily on what you mean by 'noisy' data. Are the labels noisy, i.e. wrong? Or are the features noisy? Or both? If only the features are noisy, definitely use the noisy data, and probably the clean data as well. If only the labels are noisy, definitely do not use the noisy data. If both, is it possible to correct the labels? At the very least, can you get a reliable test set (representative, with correct labels)? You could try training on both the noisy and the clean data and see which gives better performance on that test set. Regularization and early stopping are important methods to consider here.

It also depends on the algorithm you are using. Linear regression, for example, is comparatively unlikely to over-fit, while neural networks can be extremely sensitive to noise, and there is a range of approaches between those two extremes.
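To make the "train on each and compare on a reliable test set" suggestion concrete, here is a minimal sketch of the feature-noise case. Everything in it is an assumption for illustration: the data are synthetic, the noise is Gaussian with a made-up standard deviation, a small hand-written Gaussian naive Bayes stands in for whatever classifier you actually use, and the names `make_data`, `fit_gaussian_nb` and `run_experiment` are hypothetical. Feature 0 is very sharp when clean but gets noisy at deployment; feature 1 is only moderately informative but unaffected by the noise.

```python
import math
import random


def make_data(n, noise_std, rng):
    """Draw n labelled points; noise_std is the extra noise added to feature 0."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label == 1 else -1.0
        # Feature 0: sharp when clean, degraded by the (assumed Gaussian) noise.
        f0 = rng.gauss(center, 0.1) + rng.gauss(0.0, noise_std)
        # Feature 1: moderately informative, never degraded.
        f1 = rng.gauss(center, 1.0)
        rows.append(((f0, f1), label))
    return rows


def fit_gaussian_nb(rows):
    """Estimate a mean and variance per class and per feature (equal priors assumed)."""
    model = {}
    for label in (0, 1):
        points = [x for x, y in rows if y == label]
        stats = []
        for j in range(2):
            values = [p[j] for p in points]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values) + 1e-9
            stats.append((mean, var))
        model[label] = stats
    return model


def predict(model, x):
    """Return the class with the higher Gaussian log-likelihood."""
    def log_lik(label):
        return sum(
            -0.5 * math.log(2 * math.pi * var) - (xj - mean) ** 2 / (2 * var)
            for xj, (mean, var) in zip(x, model[label])
        )
    return 1 if log_lik(1) > log_lik(0) else 0


def accuracy(model, rows):
    return sum(predict(model, x) == y for x, y in rows) / len(rows)


def run_experiment(seed=0):
    rng = random.Random(seed)
    clean_train = make_data(500, 0.0, rng)
    noisy_train = make_data(500, 3.0, rng)
    # Reliable test set: noisy features like deployment, but correct labels.
    noisy_test = make_data(2000, 3.0, rng)
    training_sets = {
        "clean": clean_train,
        "noisy": noisy_train,
        "both": clean_train + noisy_train,
    }
    return {
        name: accuracy(fit_gaussian_nb(rows), noisy_test)
        for name, rows in training_sets.items()
    }


if __name__ == "__main__":
    for name, acc in run_experiment().items():
        print(f"trained on {name:5s}: accuracy on noisy test = {acc:.3f}")
```

In this toy setup the model trained only on clean data learns to trust feature 0 almost exclusively, so it degrades once that feature becomes noisy, while the models that saw noisy features spread their weight onto feature 1. Whether the same happens on real data depends on the nature of the noise, which is why the comparison on a representative, correctly labelled test set matters.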












answered Mar 21 at 19:27 by Prachi



