


How to plan an analysis to prevent overfitting?




Coming from a statistics background, I have just started learning machine learning. I've read a lot of ML tutorials, but I have no formal training.



I'm working on a small project where my dataset has 6k rows and around 300 features.



Following my tutorials, I split my dataset into a training set (80%) and a test set (20%), and then train my algorithm on the training set with 5-fold cross-validation.
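
For reference, the workflow described above might look roughly like the sketch below. It is only an illustration: the data are synthetic (generated with `make_classification` as a stand-in for the real 6k x 300 dataset), and `n_neighbors=5`, the `accuracy` scorer and the seed value are arbitrary assumptions rather than choices taken from the actual project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data (assumption): 6,000 rows, 300 features, binary outcome.
X, y = make_classification(
    n_samples=6000, n_features=300, n_informative=20, random_state=0
)

# 80/20 split. stratify=y keeps the class ratio the same in both parts;
# random_state makes the split repeatable from one run to the next.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 5-fold cross-validation on the training part only; the test part is not touched here.
knn = KNeighborsClassifier(n_neighbors=5)
cv_scores = cross_val_score(knn, X_train, y_train, cv=5, scoring="accuracy")

print("fold scores:", cv_scores)
print("mean / std:", cv_scores.mean(), cv_scores.std())
```

Without `random_state`, `train_test_split` draws a new random split on every run, which is one possible source of run-to-run differences.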



When I ran my program twice (so far I've only tested KNN, which I now know is not really appropriate here), I got very different results, with different sensitivity, specificity and precision.



I guess that if I re-run the program until the metrics look good, my model will be overfitted. I also guess this comes from the training/test samples being redrawn on each run, but please correct me if I'm wrong.



If I'm going to try many algorithms to see what I can get, should I fix my training and test samples once and for all? Is it even OK to do so? (In statistics, it would not always be.)
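
To make the idea of "fixing the samples" concrete, here is a minimal sketch of one way it could be done in scikit-learn: a single seed fixes both the train/test split and the cross-validation folds, so every candidate is scored on exactly the same data. The synthetic data, the `SEED` value, the two candidate models and the use of recall (sensitivity) to pick a winner are all assumptions made for illustration, not a recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic stand-in for the real data (assumption), so the sketch runs quickly.
X, y = make_classification(
    n_samples=2000, n_features=50, n_informative=10, random_state=0
)

SEED = 42  # arbitrary value; what matters is that it does not change between runs

# Fixed 80/20 split: the test part is set aside once and reused for every experiment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

# Fixed folds: every candidate is evaluated on the same five partitions.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logistic": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in candidates.items():
    scores = cross_validate(
        model, X_train, y_train, cv=folds, scoring=["recall", "precision"]
    )
    results[name] = {
        "recall": scores["test_recall"].mean(),      # recall is sensitivity
        "precision": scores["test_precision"].mean(),
    }
    print(name, results[name])

# Pick a model from the cross-validation results only (here by recall, an assumption),
# then look at the held-out test part once, at the very end.
best_name = max(results, key=lambda n: results[n]["recall"])
final_model = candidates[best_name].fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best_name, "test accuracy:", test_accuracy)
```

In this sketch the comparison between algorithms relies only on the cross-validation scores, and the test part is used a single time after a model has been chosen.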



In case it matters, I'm working with Python's scikit-learn library.



PS: my outcome is binary and my features are mostly binary, with a few categorical and a few numeric ones. I'm thinking about logistic regression, but which algorithm would be the best one?
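
As an illustration of the logistic-regression option with mixed feature types, the sketch below wraps the preprocessing and the model in a single scikit-learn `Pipeline`, so the encoder and scaler are re-fitted inside each cross-validation fold. The column names (`flag_a`, `flag_b`, `group`, `age`), the toy data and the `roc_auc` scorer are hypothetical and chosen only for the example. It does not claim that logistic regression is the best algorithm for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical columns: two binary flags, one categorical, one numeric.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "flag_a": rng.integers(0, 2, n),
    "flag_b": rng.integers(0, 2, n),
    "group": rng.choice(["x", "y", "z"], n),
    "age": rng.normal(50, 10, n),
})
# Synthetic binary outcome loosely related to the features.
y = ((df["flag_a"] + (df["age"] > 50) + rng.integers(0, 2, n)) >= 2).astype(int)

# One-hot encode the categorical column, scale the numeric one,
# and pass the binary flags through unchanged.
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["group"]),
        ("num", StandardScaler(), ["age"]),
    ],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The whole pipeline is cross-validated, so preprocessing never sees the held-out fold.
scores = cross_val_score(model, df, y, cv=5, scoring="roc_auc")
print("mean ROC AUC:", scores.mean())
```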

















      machine-learning project-planning















      asked 15 hours ago by Dan Chaltiel



