Given a single discrete data set, how should I divide it into training data and test data?


I have a dataset in libSVM format consisting of 6000 entries, each with 5 features (indices 101–105), where each feature takes a binary value of 0 or 1. Each of the 6000 entries has a label of 0 or 1, and I am trying to use various machine learning algorithms to predict the correct label (0 or 1) from a given set of 5 feature values.



For example, consider the following dataset (the real one is 6000 lines):



0 101:1 102:1 103:0 104:1 105:1
0 101:0 102:1 103:0 104:1 105:1
0 101:0 102:1 103:1 104:1 105:1
1 101:1 102:1 103:1 104:1 105:1
1 101:0 102:1 103:0 104:0 105:1
1 101:1 102:1 103:1 104:0 105:0
1 101:0 102:1 103:0 104:0 105:0


For an algorithm that performs binary classification, like xgboost, how do I conceptually first use my dataset to train the model, and then apply that model to the data?



I ask because xgboost asks for two files: a training set and a test set. It seems to me that the algorithm should only require a single full dataset, use all of the data to train and build a model, and then apply that model to the original dataset to determine whether the labels are being assigned (0 or 1) accurately.



Any help in understanding this concept is much appreciated.










Tags: machine-learning, xgboost, training






asked Apr 10 at 2:39 by jake9115









2 Answers







In machine learning, it is important to evaluate the model you have built on data it was not trained on; otherwise you cannot tell whether it is overfitting. That is why you must split your data into a training set and a test set. There are many ways to make the split. You can split the dataset randomly so that 80% of the samples are used for training and 20% for testing. You may also want to consider stratified sampling, so that positive labels occur in both the training and test sets. This is especially important if you only have a few positively labeled samples, since you could easily end up with a test set that contains no positive samples at all. In Python, scikit-learn's train_test_split has a 'stratify' argument you can use so that the split preserves the class balance.

– fractalnature (answered Apr 11 at 3:08)
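As an illustration of the stratified split described above, here is a minimal sketch; it assumes the features and labels have already been loaded into X and y (for example with scikit-learn's load_svmlight_file, as in the other answer):

    from sklearn.model_selection import train_test_split

    # hold out 20% of the samples for testing; stratify=y keeps the 0/1 label
    # proportions roughly the same in both splits, and random_state makes the
    # split reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)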






• You can also use cross-validation: split your data into $k$ parts, train on $k-1$ of them and test on the remaining one, and repeat until every part has been used for testing once. This is called $k$-fold cross-validation; it tells you how trustworthy your model is (via the variance of the results) and how accurate it is (via the mean result). You can also do leave-one-out cross-validation, which is $k$-fold CV where $k$ equals the size of the dataset.
  – Pedro Henrique Monforte, Apr 11 at 3:25
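Following up on the cross-validation suggestion in this comment, here is a minimal sketch using scikit-learn's cross_val_score; it assumes X and y are loaded as above and that the xgboost Python package (with its scikit-learn wrapper XGBClassifier) is installed:

    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from xgboost import XGBClassifier

    # 5-fold stratified cross-validation: every sample is used for testing exactly once
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(XGBClassifier(), X, y, cv=cv, scoring='accuracy')

    # the mean tells you how accurate the model is; the spread tells you how stable that estimate is
    print('accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))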










• @PedroHenriqueMonforte I completely agree; cross-validation is a better route than using just one test set. However, since this individual seems to be new to ML and confused about the general concept, I decided to keep my response limited.
  – fractalnature, Apr 11 at 16:15
• I understood your approach; that is why I just added a comment for others to see, rather than editing your answer. I have also upvoted your answer.
  – Pedro Henrique Monforte, Apr 11 at 16:17

          Assuming you are using Python, an easy way to do this is to use utilities available in scikit-learn:



          from sklearn.datasets import load_svmlight_file, dump_svmlight_file
          from sklearn.model_selection import train_test_split

          # load features and labels
          X, y = load_svmlight_file('path/to/libsvm/data')

          # split into train/test sets (change test_size if you like)
          X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

          # write the train & test datasets to disk
          dump_svmlight_file(X_train, y_train, 'train.svm')
          dump_svmlight_file(X_test, y_test, 'test.svm')


          In reference to your comment




          It seems to me that the algorithm should just require a single full set of data, use all of the data to train and build a model, and then apply that model to the original data set and determine if the labels are being assigned "0 or 1" accurately.




          I would recommend reading about overfitting. In short, overfitting happens if your model is very good at classifying the data that you used to train the model, but performs poorly on unseen data. If you fit a model to a dataset, and then test the model on the same dataset, you will likely get very optimistic estimates for performance that may lead you to believe that your model is much better than it actually is.
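To make the point concrete, here is a minimal sketch (assuming the X_train/X_test split produced above and the xgboost Python package): accuracy measured on the training data will usually look better than accuracy on the held-out test data.

    from xgboost import XGBClassifier
    from sklearn.metrics import accuracy_score

    model = XGBClassifier().fit(X_train, y_train)

    # accuracy on the data the model was fit on is typically optimistic;
    # the held-out test accuracy is the honest estimate
    print('train accuracy:', accuracy_score(y_train, model.predict(X_train)))
    print('test accuracy: ', accuracy_score(y_test, model.predict(X_test)))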



          After finding a set of hyper-parameters that work well and testing to ensure that your model isn't overfitting, you can train the model on the full dataset using the hyper-parameters that worked.



          Some good references on overfitting:



          • Why Is Overfitting Bad in Machine Learning?

          • Overfitting in Machine Learning: What It Is and How to Prevent It





          – timleathart (answered Apr 10 at 3:09)
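For completeness, here is a minimal sketch of training and applying an XGBoost model on the train/test files written above, using the xgboost Python package's native interface; note that recent xgboost versions may require appending '?format=libsvm' to the file paths when loading libSVM-format files:

    import xgboost as xgb

    # DMatrix can load libSVM-format files directly
    dtrain = xgb.DMatrix('train.svm')
    dtest = xgb.DMatrix('test.svm')

    params = {'objective': 'binary:logistic', 'eval_metric': 'error', 'max_depth': 3}

    # train on the training split while monitoring error on the held-out test split
    bst = xgb.train(params, dtrain, num_boost_round=50,
                    evals=[(dtrain, 'train'), (dtest, 'test')])

    # predicted probabilities for the test split; threshold at 0.5 to get 0/1 labels
    preds = bst.predict(dtest)
    labels = (preds > 0.5).astype(int)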












