Could not convert string to float error on KDDCup99 dataset
$begingroup$
I am trying to compare several algorithms on the KDD Cup 99 and NSL-KDD datasets using Python, and I am having an issue when building and evaluating the models on both datasets.
Whenever I run the algorithms on the datasets I get the following error: 'could not convert string to float: S0'
The error is raised during the evaluation of the six models: Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbors, Classification and Regression Trees, Gaussian Naive Bayes and Support Vector Machines.
Here is the code that I am using to evaluate the datasets:
import pandas
import matplotlib.pyplot as plt
from sklearn import cross_validation  # removed in scikit-learn 0.20 (replaced by sklearn.model_selection)
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Load KDD dataset
dataset = pandas.read_csv('Datasets/KDDCUP 99/kddcup.csv', names = ['duration','protocol_type','service','src_bytes','dst_bytes','flag','land','wrong_fragment','urgent',
    'hot','num_failed_logins','logged_in','num_compromised','root_shell','su_attempted','num_root','num_file_creations',
    'num_shells','num_access_files','num_outbound_cmds','is_host_login','is_guest_login','count','serror_rate',
    'rerror_rate','same_srv_rate','diff_srv_rate','srv_count','srv_serror_rate','srv_rerror_rate','srv_diff_host_rate',
    'dst_host_count','dst_host_srv_count','dst_host_same_srv_rate','dst_host_diff_srv_rate','dst_host_same_src_port_rate',
    'dst_host_srv_diff_host_rate','dst_host_serror_rate','dst_host_srv_serror_rate','dst_host_rerror_rate','dst_host_srv_rerror_rate','class'])

# Split data into X (the 41 feature columns) and Y (the class label)
array = dataset.values
X = array[:,0:41]
Y = array[:,41]

# Split out validation dataset
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = cross_validation.train_test_split(X, Y, test_size=validation_size, random_state=seed)

# Test options and evaluation metric
num_folds = 7
num_instances = len(X_train)
seed = 7
scoring = 'accuracy'

# Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# Evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds,
                                   random_state=seed)
    # Here is where the error is raised
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)  # Could not convert string to float happens here. Scoring uses string.
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std()*100)  # multiplying by 100 to show percentage
    print(msg)

# Compare algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)  # label each box with its model name
plt.show()
Here is a three-line sample from the KDD Cup 99 dataset:
0 tcp http SF 215 45076 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 normal.
0 tcp http SF 162 4528 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 normal.
0 tcp http SF 236 1228 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 2 2 1 0 0.5 0 0 0 0 0 normal.
I have tried label encoding and still get the same error. While looking through the scikit-learn documentation, I noticed that the scoring parameter takes a string. Is this the cause of the issue? If not, is there a problem with the way I have loaded the dataset?
EDIT: I tried removing the scoring argument from the code and still got the same error.
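For reference, a small sketch of how the offending columns could be located. It assumes a hand-made two-row frame whose column names and order follow the sample rows above (it is not the real file), and it only shows that the message comes from converting a feature value such as 'S0' to a number, not from the scoring string:

```python
import pandas as pd

# Two made-up rows shaped like the sample above (illustrative values only).
rows = [
    [0, "tcp", "http", "S0", 0],
    [0, "tcp", "http", "SF", 215],
]
frame = pd.DataFrame(
    rows, columns=["duration", "protocol_type", "service", "flag", "src_bytes"]
)

# Columns that hold strings are the ones an estimator cannot turn into floats.
non_numeric = [
    col for col in frame.columns
    if not pd.api.types.is_numeric_dtype(frame[col])
]
print("non-numeric columns:", non_numeric)

# Converting one of those values reproduces the wording of the error.
try:
    float(frame.loc[0, "flag"])
except ValueError as err:
    message = str(err)
    print(message)
```

Any column listed by this check would need to be encoded before the models are fitted.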
machine-learning python scikit-learn pandas
$endgroup$
edited Feb 5 '17 at 0:26
Scott
asked Feb 4 '17 at 3:17
Scott
1 Answer
$begingroup$
You mentioned that you tried label encoding, but I did it myself and the code runs just fine. I used the 10 percent version of the dataset. Just put this piece of code after you load the dataset:
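from sklearn.preprocessing import LabelEncoder

# Replace every string (object-dtype) column with integer codes.
for column in dataset.columns:
    if dataset[column].dtype == object:
        le = LabelEncoder()
        dataset[column] = le.fit_transform(dataset[column])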
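After label encoding you should use a one-hot encoder, which can improve the performance of some algorithms because label encoding alone imposes an arbitrary ordering on the categories. You should also avoid the cross_validation module: it is deprecated and will be removed in version 0.20; sklearn.model_selection replaces it.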
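A minimal sketch of that advice, assuming scikit-learn 0.20 or later (where ColumnTransformer is available) and a tiny made-up frame standing in for the KDD data. The column subset, the values, and the choice of a single decision tree are illustrative assumptions, not the original setup:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for the KDD data (values are illustrative only).
toy = pd.DataFrame({
    "duration": [0, 0, 2, 0, 1, 0, 3, 0],
    "protocol_type": ["tcp", "udp", "tcp", "icmp", "tcp", "udp", "tcp", "icmp"],
    "flag": ["SF", "S0", "SF", "REJ", "S0", "SF", "SF", "S0"],
    "src_bytes": [215, 0, 162, 0, 0, 105, 236, 0],
    "class": ["normal.", "neptune.", "normal.", "neptune.",
              "neptune.", "normal.", "normal.", "neptune."],
})
X = toy.drop(columns="class")
y = toy["class"]

# One-hot encode only the string columns; numeric columns pass through as-is.
categorical = [c for c in X.columns if not pd.api.types.is_numeric_dtype(X[c])]
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
model = make_pipeline(preprocess, DecisionTreeClassifier(random_state=7))

# model_selection.KFold takes n_splits instead of the old n / n_folds arguments.
kfold = KFold(n_splits=4, shuffle=True, random_state=7)
scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
print("CART: %.1f%% (%.1f%%)" % (scores.mean() * 100, scores.std() * 100))
```

Putting the encoder inside the pipeline means it is refitted on each training fold, so categories seen only in a test fold are ignored rather than leaking into training.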
$endgroup$
$begingroup$
Thanks! This works. Does it iterate over the dataset and generate new numerical values on every loop, or does it use the same numerical values for every model?
$endgroup$
– Scott
Feb 5 '17 at 23:50
$begingroup$
It iterates over the dataset just once and uses the same numerical values for every model.
$endgroup$
– feynman410
Feb 6 '17 at 2:34
$begingroup$
How long did it take you to run the algorithms and get an output? My machine seems to be taking a while.
$endgroup$
– Scott
Feb 8 '17 at 1:28
$begingroup$
It took a while, but I think this is because of the size of the dataset. For testing purposes you could use a subset of the data.
$endgroup$
– feynman410
Feb 8 '17 at 15:51
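The subset suggestion in the last comment could be sketched like this, assuming a made-up frame in place of the loaded KDD `dataset`. The 5 percent fraction and the seed are arbitrary choices for illustration:

```python
import pandas as pd

# Stand-in for the loaded KDD frame (1000 made-up rows).
dataset = pd.DataFrame({
    "duration": range(1000),
    "class": ["normal."] * 900 + ["smurf."] * 100,
})

# Draw a reproducible random 5% of the rows for quick experiments.
subset = dataset.sample(frac=0.05, random_state=7)
print(len(subset), "of", len(dataset), "rows kept")
```

A plain random sample can miss rare attack classes, so it is worth checking `subset["class"].value_counts()` before trusting results from a small subset.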
answered Feb 5 '17 at 1:38
feynman410