Could not convert string to float error on KDDCup99 dataset

I am trying to compare six algorithms on the KDD Cup 99 and NSL-KDD datasets using Python, and I am having an issue when building and evaluating the models against both datasets.

Whenever I run the algorithms on the datasets I get the following error: 'could not convert string to float: S0'

This error is produced during the evaluation of the six models: Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbors, Classification and Regression Trees, Gaussian Naive Bayes and Support Vector Machines.

Here is the code that I am using to evaluate the datasets:



    import pandas
    import matplotlib.pyplot as plt
    from sklearn import cross_validation
    from sklearn.linear_model import LogisticRegression
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC

    # Load KDD dataset
    dataset = pandas.read_csv('Datasets/KDDCUP 99/kddcup.csv', names = ['duration','protocol_type','service','src_bytes','dst_bytes','flag','land','wrong_fragment','urgent',
        'hot','num_failed_logins','logged_in','num_compromised','root_shell','su_attempted','num_root','num_file_creations',
        'num_shells','num_access_files','num_outbound_cmds','is_host_login','is_guest_login','count','serror_rate',
        'rerror_rate','same_srv_rate','diff_srv_rate','srv_count','srv_serror_rate','srv_rerror_rate','srv_diff_host_rate',
        'dst_host_count','dst_host_srv_count','dst_host_same_srv_rate','dst_host_diff_srv_rate','dst_host_same_src_port_rate',
        'dst_host_srv_diff_host_rate','dst_host_serror_rate','dst_host_srv_serror_rate','dst_host_rerror_rate','dst_host_srv_rerror_rate','class'])

    # Split data into X (first 41 columns) and Y (last column, the class label)
    array = dataset.values
    X = array[:,0:41]
    Y = array[:,41]

    # Split out validation dataset
    validation_size = 0.20
    seed = 7
    X_train, X_validation, Y_train, Y_validation = cross_validation.train_test_split(X, Y, test_size=validation_size, random_state=seed)

    # Test options and evaluation metric
    num_folds = 7
    num_instances = len(X_train)
    scoring = 'accuracy'

    # Algorithms
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC()))

    # Evaluate each model in turn
    results = []
    names = []
    for name, model in models:
        kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
        # "Could not convert string to float" is raised here. Scoring uses a string.
        cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        # Multiply by 100 to show percentages
        msg = "%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std()*100)
        print(msg)

    # Compare algorithms
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(111)
    plt.boxplot(results)
    ax.set_xticklabels(names)  # one tick label per model
    plt.show()


Here is a 3-line sample from the KDD Cup 99 dataset:

    0 tcp http SF 215 45076 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 normal.
    0 tcp http SF 162 4528 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 normal.
    0 tcp http SF 236 1228 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 2 2 1 0 0.5 0 0 0 0 0 normal.


I have tried using label encoding and it still produces the same error. While looking through the scikit-learn documentation, I noticed that the scoring parameter takes a string. Is this the cause of the issue? If not, is there a problem with the way I have loaded the dataset?

EDIT: I tried removing the scoring value from the code and still got the same error.

machine-learning python scikit-learn pandas

asked Feb 4 '17 at 3:17 by Scott, edited Feb 5 '17 at 0:26 by Scott

          1 Answer

You mentioned that you tried label encoding, but I applied it myself and the code runs fine. I used the 10 percent version of the dataset. Put this piece of code right after you load the dataset:



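    from sklearn.preprocessing import LabelEncoder

    # Replace every string (object-dtype) column with integer codes
    for column in dataset.columns:
        if dataset[column].dtype == object:
            le = LabelEncoder()
            dataset[column] = le.fit_transform(dataset[column])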
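To see what the loop above does, here is a minimal sketch on a made-up three-row frame. The frame `toy`, the `encoders` dictionary and the use of `is_numeric_dtype` to find text columns are illustrative assumptions, not part of the original code; the numeric check is used so the sketch also covers pandas versions that store text in a dedicated string dtype rather than `object`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.preprocessing import LabelEncoder

# Hypothetical three-row frame standing in for the real KDD Cup 99 data
toy = pd.DataFrame({
    "duration": [0, 0, 2],
    "protocol_type": ["tcp", "udp", "tcp"],
    "flag": ["SF", "S0", "SF"],
})

# Non-numeric columns are the ones that trigger
# "could not convert string to float" inside the estimators
string_columns = [c for c in toy.columns if not is_numeric_dtype(toy[c])]

encoders = {}
for column in string_columns:
    le = LabelEncoder()
    toy[column] = le.fit_transform(toy[column])
    encoders[column] = le  # keep the fitted encoder to map codes back later

# classes_ holds the original strings in sorted order;
# each string's position in that list is its integer code
flag_classes = encoders["flag"].classes_.tolist()
print(string_columns, flag_classes, toy["flag"].tolist())
```

Keeping the fitted encoders around is optional, but it lets you translate predicted integer codes back to the original strings with `inverse_transform`.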


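After label encoding, you should apply a one-hot encoder to the categorical feature columns to improve the performance of some algorithms: label encoding assigns arbitrary integer codes, which linear and distance-based models such as logistic regression, k-nearest neighbors and SVM would otherwise read as an ordering. You should also avoid the cross_validation module: it is deprecated in favor of sklearn.model_selection and was removed in scikit-learn 0.20.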
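As a rough sketch of both suggestions together, the block below one-hot encodes the categorical features and runs cross-validation through `sklearn.model_selection`. Everything about the data is an assumption for illustration: the random frame `toy`, its three feature columns, and the rule that derives `class` from `flag`. Only one of the question's models (the decision tree) is shown. Note that `KFold` in `model_selection` takes `n_splits` instead of the old `n` and `n_folds` arguments.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical miniature dataset; the real one has 41 feature columns
rng = np.random.RandomState(7)
n_rows = 70
toy = pd.DataFrame({
    "duration": rng.randint(0, 10, size=n_rows),
    "protocol_type": rng.choice(["tcp", "udp", "icmp"], size=n_rows),
    "flag": rng.choice(["SF", "S0", "REJ"], size=n_rows),
})
# Made-up labelling rule so the model has something to learn
toy["class"] = np.where(toy["flag"] == "SF", "normal.", "attack.")

X = toy.drop(columns="class")
y = toy["class"]  # string class labels are fine for scikit-learn classifiers

# One-hot encode the string features, pass numeric features through unchanged
categorical = ["protocol_type", "flag"]
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
model = make_pipeline(preprocess, DecisionTreeClassifier(random_state=7))

kfold = KFold(n_splits=7, shuffle=True, random_state=7)
cv_results = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
print("CART: %.2f%% (%.2f%%)" % (cv_results.mean() * 100, cv_results.std() * 100))
```

Putting the encoder inside the pipeline means it is refitted on the training part of each fold, so categories seen only in a held-out fold are handled by `handle_unknown="ignore"` instead of raising an error.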







answered Feb 5 '17 at 1:38 by feynman410











• Thanks! This works. Does it iterate over the dataset every time and generate numerical values on each loop, or does it use the same numerical values for every model? (Scott, Feb 5 '17 at 23:50)










• It iterates over the dataset just once and uses the same numerical values for every model. (feynman410, Feb 6 '17 at 2:34)










• How long did it take you to run the algorithms and get an output? My machine seems to be taking a while to perform the actions. (Scott, Feb 8 '17 at 1:28)










• It took a while, but I think this is because of the size of the dataset. For testing purposes you could use a subset of the data. (feynman410, Feb 8 '17 at 15:51)
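The two points from the comments above (encode once, then experiment on a subset) could be sketched as follows. The frame `dataset` here is random, already-numeric stand-in data, and the 10% fraction and the names `subset`, `X_small` and `Y_small` are assumptions for illustration. Stratifying on the label keeps the class proportions of the full data, but it requires at least two rows per class; if some attack types are rarer than that, a plain `dataset.sample(frac=0.10, random_state=7)` is a simpler fallback.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the dataset after label encoding has run once
rng = np.random.RandomState(7)
dataset = pd.DataFrame({
    "duration": rng.randint(0, 100, size=1000),
    "flag": rng.randint(0, 3, size=1000),
    "class": rng.choice([0, 1], size=1000, p=[0.8, 0.2]),
})

# Keep 10% of the rows, stratified so each class keeps its share
subset, _ = train_test_split(
    dataset, train_size=0.10, stratify=dataset["class"], random_state=7
)

# Every model in the comparison loop is then trained on the same arrays
X_small = subset.drop(columns="class").values
Y_small = subset["class"].values
print(X_small.shape, Y_small.mean())
```

Because the encoding happens before the subset is drawn, the integer codes stay identical across models and across subsets of different sizes.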
















