Given a single discrete data set, how should I divide it into training data and test data?


I have a dataset in libSVM format consisting of 6000 entries, each with 5 features (indices 101–105), where each feature takes a binary value of 0 or 1. Each of the 6000 entries has a label of 0 or 1, and I am trying to use various machine learning algorithms to predict the correct label (0 or 1) from a given set of 5 feature values.



For example, consider the following dataset (the real one is 6000 lines):



0 101:1 102:1 103:0 104:1 105:1
0 101:0 102:1 103:0 104:1 105:1
0 101:0 102:1 103:1 104:1 105:1
1 101:1 102:1 103:1 104:1 105:1
1 101:0 102:1 103:0 104:0 105:1
1 101:1 102:1 103:1 104:0 105:0
1 101:0 102:1 103:0 104:0 105:0


For an algorithm that performs binary classification, like xgboost, how do I conceptually first use my dataset to train the model, and then apply that model to the data?



I ask because xgboost asks for two files: a training set and a test set. It seems to me that the algorithm should only require a single full dataset, use all of the data to train and build a model, and then apply that model to the original dataset to determine whether the labels are being assigned (0 or 1) accurately.



Any help in understanding this concept is much appreciated.










Tags: machine-learning, xgboost, training






asked Apr 10 at 2:39 by jake9115









2 Answers







In machine learning, it is important to evaluate the model you have built on data it was not trained on; otherwise you cannot tell whether it is overfitting. That is why you must split your data into a training set and a test set. There are many ways to make the split. You can split the dataset randomly so that 80% of the samples are used for training and 20% for testing. You may also want to consider stratified sampling, so that positive labels occur in both the training and test sets. This is especially important if you only have a few positively labeled samples, since you could easily end up with a test set that contains no positive samples at all. In Python, scikit-learn's train_test_split has a 'stratify' argument you can use so that the split preserves the class balance.

– fractalnature (answered Apr 11 at 3:08)
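As an illustration of the stratified split described above, here is a minimal sketch; it assumes the features and labels have already been loaded into X and y (for example with scikit-learn's load_svmlight_file, as in the other answer):

    from sklearn.model_selection import train_test_split

    # hold out 20% of the samples for testing; stratify=y keeps the 0/1 label
    # proportions roughly the same in both splits, and random_state makes the
    # split reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)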






• You can also use cross-validation: split your data into $k$ parts, train on $k-1$ of them and test on the remaining one, and repeat until every part has been used for testing once. This is called $k$-fold cross-validation; it tells you how trustworthy your model is (via the variance of the results) and how accurate it is (via the mean result). You can also do leave-one-out cross-validation, which is $k$-fold CV where $k$ equals the size of the dataset.
  – Pedro Henrique Monforte, Apr 11 at 3:25
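Following up on the cross-validation suggestion in this comment, here is a minimal sketch using scikit-learn's cross_val_score; it assumes X and y are loaded as above and that the xgboost Python package (with its scikit-learn wrapper XGBClassifier) is installed:

    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from xgboost import XGBClassifier

    # 5-fold stratified cross-validation: every sample is used for testing exactly once
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(XGBClassifier(), X, y, cv=cv, scoring='accuracy')

    # the mean tells you how accurate the model is; the spread tells you how stable that estimate is
    print('accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))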










• @PedroHenriqueMonforte I completely agree; cross-validation is a better route than using just one test set. However, since this individual seems to be new to ML and confused about the general concept, I decided to keep my response limited.
  – fractalnature, Apr 11 at 16:15
• I understood your approach; that is why I just added a comment for others to see, rather than editing your answer. I have also upvoted your answer.
  – Pedro Henrique Monforte, Apr 11 at 16:17

          Assuming you are using Python, an easy way to do this is to use utilities available in scikit-learn:



          from sklearn.datasets import load_svmlight_file, dump_svmlight_file
          from sklearn.model_selection import train_test_split

          # load features and labels
          X, y = load_svmlight_file('path/to/libsvm/data')

          # split into train/test sets (change test_size if you like)
          X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

          # write the train & test datasets to disk
          dump_svmlight_file(X_train, y_train, 'train.svm')
          dump_svmlight_file(X_test, y_test, 'test.svm')


          In reference to your comment




          It seems to me that the algorithm should just require a single full set of data, use all of the data to train and build a model, and then apply that model to the original data set and determine if the labels are being assigned "0 or 1" accurately.




          I would recommend reading about overfitting. In short, overfitting happens if your model is very good at classifying the data that you used to train the model, but performs poorly on unseen data. If you fit a model to a dataset, and then test the model on the same dataset, you will likely get very optimistic estimates for performance that may lead you to believe that your model is much better than it actually is.
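To make the point concrete, here is a minimal sketch (assuming the X_train/X_test split produced above and the xgboost Python package): accuracy measured on the training data will usually look better than accuracy on the held-out test data.

    from xgboost import XGBClassifier
    from sklearn.metrics import accuracy_score

    model = XGBClassifier().fit(X_train, y_train)

    # accuracy on the data the model was fit on is typically optimistic;
    # the held-out test accuracy is the honest estimate
    print('train accuracy:', accuracy_score(y_train, model.predict(X_train)))
    print('test accuracy: ', accuracy_score(y_test, model.predict(X_test)))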



          After finding a set of hyper-parameters that work well and testing to ensure that your model isn't overfitting, you can train the model on the full dataset using the hyper-parameters that worked.



          Some good references on overfitting:



          • Why Is Overfitting Bad in Machine Learning?

          • Overfitting in Machine Learning: What It Is and How to Prevent It





          – timleathart (answered Apr 10 at 3:09)
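For completeness, here is a minimal sketch of training and applying an XGBoost model on the train/test files written above, using the xgboost Python package's native interface; note that recent xgboost versions may require appending '?format=libsvm' to the file paths when loading libSVM-format files:

    import xgboost as xgb

    # DMatrix can load libSVM-format files directly
    dtrain = xgb.DMatrix('train.svm')
    dtest = xgb.DMatrix('test.svm')

    params = {'objective': 'binary:logistic', 'eval_metric': 'error', 'max_depth': 3}

    # train on the training split while monitoring error on the held-out test split
    bst = xgb.train(params, dtrain, num_boost_round=50,
                    evals=[(dtrain, 'train'), (dtest, 'test')])

    # predicted probabilities for the test split; threshold at 0.5 to get 0/1 labels
    preds = bst.predict(dtest)
    labels = (preds > 0.5).astype(int)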












