Machine learning testing data



I am new to machine learning, so this might be a bit of a stupid question.

I have implemented my model and it's working. I have a question about running it on the testing data. It's a binary classification problem. If I know the proportions of the classes in the test data, how could I use that to improve the performance of my model or the predictions it makes? Let's say 75% belong to class 1 and 25% to class 0.

Any help is greatly appreciated.

machine-learning python

asked Apr 2 at 16:32 by Jack

2 Answers

No. Your model isn't supposed to know anything about your test data. If you include clues about what's in your test data during training, you introduce something called data leakage.



Data leakage leads to overfitting: you get good results on that particular test set, but the model won't generalize to other data.
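
To make the leakage point concrete, here is a minimal sketch in plain Python. The toy feature values and the helper names `fit_scaler` and `transform` are made up for illustration and are not from any library; the only point is that the scaling statistics are computed from the training split and then reused unchanged on the test split.

```python
import statistics

def fit_scaler(values):
    """Learn mean and standard deviation from the training values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero spread
    return mean, std

def transform(values, mean, std):
    """Apply previously learned statistics; nothing is re-estimated here."""
    return [(v - mean) / std for v in values]

# Hypothetical toy feature, already split into train and test.
train_feature = [1.0, 2.0, 3.0, 4.0]
test_feature = [10.0, 12.0]

mean, std = fit_scaler(train_feature)  # the test data is never seen here
train_scaled = transform(train_feature, mean, std)
test_scaled = transform(test_feature, mean, std)
```

Computing `mean` and `std` over the train and test values together would be a small but real example of leakage, because information about the test set would then shape the inputs the model is trained on.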



Let's say you deploy this model in production and feed it real-life data that it has never encountered before. The predictions will fall below the expectations you formed in the training/testing phases because of the two phenomena I mentioned.



I suggest you tweak your model a bit more during the training phase: maybe clean your data more, apply oversampling or undersampling if the target classes are imbalanced (for example, 90% / 10% proportions in your training dataset), pick better features, etc.
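
As a rough illustration of the oversampling idea, the sketch below duplicates randomly chosen minority-class rows until every class has as many rows as the largest one. The function name `oversample_minority`, the toy data, and the fixed seed are assumptions made for this example; dedicated libraries offer more refined variants.

```python
import random

def oversample_minority(rows, labels, seed=0):
    """Duplicate random rows of smaller classes until all classes have equal counts."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies (with replacement) only for under-represented classes.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical imbalanced training set: 9 examples of class 1, 1 of class 0.
train_rows = [[float(i)] for i in range(10)]
train_labels = [1] * 9 + [0]

balanced_rows, balanced_labels = oversample_minority(train_rows, train_labels)
```

Resampling like this should be applied to the training split only, so that the test set keeps its natural class proportions.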



In conclusion: tuning your model to get good predictions on your test data in particular is not good practice and will produce a model that performs badly on unseen data.
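
One common way to follow this advice is to carve a validation set out of the available data, tune against that, and evaluate on the test set only once at the end. The sketch below shows such a split in plain Python; the function name `three_way_split`, the 60/20/20 fractions, and the stand-in data are assumptions for illustration.

```python
import random

def three_way_split(items, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into train, validation, and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

examples = list(range(100))  # stand-in for real labelled examples
train, val, test = three_way_split(examples)
```

Model choices and hyperparameters are then compared on `val`, while `test` stays untouched until the final evaluation.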







answered Apr 2 at 16:41 by Blenzus (edited Apr 2 at 16:49)

If your results differ from expectations, you should look at individual errors and use those to correct the model. If you use the percentage, the algorithm will potentially learn something completely different from what it's supposed to learn.
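
Looking at individual errors can start very simply. The sketch below, with made-up labels and predictions and hypothetical helper names, counts each (true, predicted) combination and lists the positions of the misclassified examples so they can be inspected by hand.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs for a classifier."""
    return Counter(zip(y_true, y_pred))

def misclassified_indices(y_true, y_pred):
    """Return positions of the examples the model got wrong."""
    return [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]

# Hypothetical labels and predictions on a validation set.
y_true = [1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
wrong = misclassified_indices(y_true, y_pred)
print(counts[(1, 0)], "false negatives;", counts[(0, 1)], "false positives")
print("inspect examples at indices:", wrong)
```

Doing this on a validation set rather than the test set keeps the final evaluation untouched.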



However, if you really do need to go by such stats (for instance, because your system is supposed to learn about its own mistakes and how to correct them autonomously), I suggest adding another dimension to your learning ability, such as 'confidence', which increases in the nodes that were involved in a better result and decreases in the nodes that were involved when things went worse. Nodes with low confidence might change faster or stop their activity completely. Nodes with high confidence would change less easily.
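
The answer does not specify an update rule, so the following is only one hypothetical reading of the 'confidence' idea. The `Node` class, the `update_node` function, the step sizes, and the linear scaling by `1 - confidence` are all assumptions chosen to keep the sketch small.

```python
from dataclasses import dataclass

@dataclass
class Node:
    weight: float
    confidence: float = 0.5  # kept between 0 and 1

def update_node(node, gradient, improved, base_lr=0.1, conf_step=0.1):
    """Apply one update whose size shrinks as the node's confidence grows."""
    # Low confidence -> larger step; high confidence -> smaller step.
    node.weight -= base_lr * (1.0 - node.confidence) * gradient
    # Raise confidence after a better result, lower it after a worse one.
    delta = conf_step if improved else -conf_step
    node.confidence = min(1.0, max(0.0, node.confidence + delta))

node = Node(weight=1.0)
update_node(node, gradient=2.0, improved=True)   # confidence rises to about 0.6
update_node(node, gradient=2.0, improved=False)  # confidence falls back to about 0.5
```

With this rule a node whose confidence reaches 1 stops changing entirely, which matches the "change less easily" behaviour only in its most extreme form; a softer scaling would be an equally valid choice.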



As you didn't say which learning algorithm you use, a node can be anything from a data point in a table to a simulated neuron or its individual connections.



One level higher, your model may include a module that tracks the connections between changes in confidence. This would help avoid cycles in which one increase always comes with a decrease elsewhere, and vice versa, or an actually wrong circuit developing false confidence and damaging the learning in the rest of your model. So if 5 increases in confidence in one area lead to 6 decreases in confidence in another area in the next round (i.e., worse long-term results), the confidence changes might be done differently. This module would obviously also learn to choose better what to influence and when, in the normal way.



It will require some fine-tuning before that module makes learning faster than the usual approach. Prepare for lots of test sets, or a game where your different models play against one another with similar data and might be fine-tuned using the evolutionary approach.



You should also make sure that you find a way to test the AI with atypical data, and a way for it to see whether the strange results were correct, in the same way humans take tests and then get the solutions.







answered Apr 2 at 19:03 by Carl Dombrowski
