

ValueError: could not convert string to float: '���'















$begingroup$


I have a NumPy array X of shape (2M, 23). Its dtype is <U26, i.e. Unicode strings of at most 26 characters.



array([['143347', '1325', '28.19148936', ..., '61', '0', '0'],
['50905', '0', '0', ..., '110', '0', '0'],
['143899', '1325', '28.80434783', ..., '61', '0', '0'],
...,
['85', '0', '0', ..., '1980', '0', '0'],
['233', '54', '27', ..., '-1', '0', '0'],
['���', '�', '�����', ..., '�', '��', '���']], dtype='<U26')


When I convert it to a float datatype, using



X_f = X.astype(float)


I get the error shown in the title. How can I resolve this conversion error for '���'?



I realize that some characters were not read properly into the dataframe, and the Unicode replacement character (U+FFFD) is simply the result of that.
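As a small illustration of where such characters usually come from, here is a minimal sketch. The byte values are made up for the example and are not taken from the dataset; it only assumes that the text was decoded with a "replace" error policy somewhere along the way.

```python
# Hypothetical raw bytes: 0xFF and 0xFE can never appear in valid UTF-8.
raw = b"\xff\xfe\xff"

# With errors="replace", each undecodable byte becomes U+FFFD.
text = raw.decode("utf-8", errors="replace")
print(repr(text))

# float() cannot parse a string made of replacement characters,
# which is the same failure that astype(float) reports.
try:
    float(text)
    conversion_failed = False
except ValueError as exc:
    conversion_failed = True
    print(exc)
```

If this is what happened, the original values are already lost by the time the array is built, so the fix has to happen either at read time or by treating those cells as missing.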



My questions:



  1. How do I handle this misreading?

  2. Should I ignore these characters, or should I replace them with zero?

Additional information on how the data was read:



Importing the relevant packages:



from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col


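Loading the dataset into a PySpark DataFrame:



# sql_sc is assumed to be an existing SQLContext; its creation is not shown in this post
def loading_data(dataset):
    # read the CSV with a header row and let Spark infer the column types
    dataset = sql_sc.read.format('csv').options(header='true', inferSchema='true').load(dataset)
    # rename the ' Label' column (note the leading space) to 'Label'
    dataset = dataset.select(*[col(s).alias('Label') if s == ' Label' else s for s in dataset.columns])
    # drop the 'External IP' column
    dataset = dataset.drop('External IP')
    # keep only rows that have a label
    dataset = dataset.filter(dataset.Label.isNotNull())
    # remove repeated header rows, where the Label value is the literal ' Label'
    dataset = dataset.filter(dataset.Label != ' Label')
    print(dataset.groupBy('Label').count().collect())
    return dataset

# invoking
ds_path = '../final.csv'
dataset = loading_data(ds_path)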


check type of dataset.



type(dataset)


pyspark.sql.dataframe.DataFrame



Convert to a NumPy array (data_preprocessing is not shown in this post):



import numpy as np
np_dfr = np.array(data_preprocessing(dataset).collect())


split features and labels



X = np_dfr[:,0:22]
Y = np_dfr[:,-1]


show X



>> X
array([['143347', '1325', '28.19148936', ..., '61', '0', '0'],
['50905', '0', '0', ..., '110', '0', '0'],
['143899', '1325', '28.80434783', ..., '61', '0', '0'],
...,
['85', '0', '0', ..., '1980', '0', '0'],
['233', '54', '27', ..., '-1', '0', '0'],
['���', '�', '�����', ..., '�', '��', '���']], dtype='<U26')
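One way to check whether the problem is already present in the file, rather than introduced later, is to scan the raw bytes before loading. The following is a minimal sketch under the assumption that the CSV is meant to be UTF-8; the helper name `find_undecodable_lines` is hypothetical and not part of any library.

```python
def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, raw_bytes) for every line that fails to decode."""
    bad = []
    # Open in binary mode so nothing is decoded (or replaced) implicitly.
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad.append((number, raw))
    return bad
```

Calling `find_undecodable_lines('../final.csv')` would list the offending lines, if any. If nearly every line is reported, the file is probably in a different encoding altogether; if only a few are, those rows may be corrupted.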


















$endgroup$
















      python dataframe csv data-formats
















      asked Mar 26 at 17:18









      venom8914




















          1 Answer












          $begingroup$

          Though not the best solution, I had some success by converting the array into a pandas DataFrame and cleaning it there.



          Code snippet:



          import pandas as pd

          # convert X into a DataFrame
          X_pd = pd.DataFrame(data=X)
          # replace every cell containing the Unicode replacement character (U+FFFD) with 0
          X_replace = X_pd.replace('\ufffd', 0, regex=True)
          # convert it back to a NumPy array
          X_np = X_replace.values
          # cast the object array to float
          X_fa = X_np.astype(float)


          input



          array([['85', '0', '0', '1980', '0', '0'],
          ['233', '54', '27', '-1', '0', '0'],
          ['���', '�', '�����', '�', '��', '���']], dtype='<U5')


          output



          array([[ 8.50e+01, 0.00e+00, 0.00e+00, 1.98e+03, 0.00e+00, 0.00e+00],
          [ 2.33e+02, 5.40e+01, 2.70e+01, -1.00e+00, 0.00e+00, 0.00e+00],
          [ 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00]])
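Replacing the unreadable cells with 0 makes them indistinguishable from genuine zeros. As an alternative sketch (not part of the original answer), the affected cells can be located with NumPy alone and then either dropped row-wise or marked as NaN. The array `X_demo` below is a small stand-in for `X`, and all variable names are illustrative.

```python
import numpy as np

REPLACEMENT = "\ufffd"  # Unicode replacement character

# Small stand-in for X, mirroring the input shown above.
X_demo = np.array([
    ["85", "0", "0", "1980", "0", "0"],
    ["233", "54", "27", "-1", "0", "0"],
    [REPLACEMENT * 3, REPLACEMENT, REPLACEMENT * 5,
     REPLACEMENT, REPLACEMENT * 2, REPLACEMENT * 3],
])

# True for every cell that contains at least one replacement character.
bad_cells = np.char.find(X_demo, REPLACEMENT) >= 0
bad_rows = bad_cells.any(axis=1)

# Option 1: drop the affected rows, then convert.
X_dropped = X_demo[~bad_rows].astype(float)

# Option 2: keep the rows but mark the unreadable cells as NaN.
X_nan = np.where(bad_cells, "nan", X_demo).astype(float)

print(X_dropped.shape)
print(X_nan)
```

Which option is appropriate depends on the model: dropping is simplest when only a few rows are affected, while NaN keeps the rows available for later imputation. If the labels in `Y` are used, the same `bad_rows` mask would need to be applied to them as well.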











          $endgroup$













                answered Mar 27 at 13:31









                venom8914



