ValueError: could not convert string to float: '��'


I have a NumPy array X of shape (2M, 23). Its dtype is <U26, i.e. Unicode strings of at most 26 characters.

array([['143347', '1325', '28.19148936', ..., '61', '0', '0'],
 ['50905', '0', '0', ..., '110', '0', '0'],
 ['143899', '1325', '28.80434783', ..., '61', '0', '0'],
 ...,
 ['85', '0', '0', ..., '1980', '0', '0'],
 ['233', '54', '27', ..., '-1', '0', '0'],
 ['���', '�', '�����', ..., '�', '��', '���']], dtype='<U26')

When I convert it to a float dtype using

X_f = X.astype(float)

I get the error shown above. How do I solve this conversion error for '��'?

I realize that some characters were not read correctly into the dataframe, and that the Unicode replacement character (U+FFFD) is a result of that.
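As background on where U+FFFD comes from: a decoder that is asked to substitute rather than fail emits U+FFFD for every byte it cannot interpret. The following is a minimal sketch of that behaviour. The byte values are made up for illustration and are not taken from final.csv, and can_be_float is a hypothetical helper.

```python
# Bytes that are not valid UTF-8 (made up for illustration).
raw = bytes([0xFF, 0xFE, 0x80])

# Strict decoding fails loudly ...
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ... while errors="replace" substitutes U+FFFD for each undecodable byte.
garbled = raw.decode("utf-8", errors="replace")


def can_be_float(text):
    """Return True if float() accepts the string (hypothetical helper)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


print(strict_ok)                  # False
print(garbled == "\ufffd" * 3)    # True
print(can_be_float(garbled))      # False
```

Once the original bytes have been replaced this way they cannot be recovered from the string, so any real fix has to happen where the file is read.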

My questions:

How do I handle this misreading?

Should I ignore these characters, or should I replace them with zero?

Additional information on how the data was read:

importing relevant packages

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col

loading the dataset in a pyspark dataframe

def loading_data(dataset):
    # sql_sc is assumed to be an existing SQLContext
    dataset = sql_sc.read.format('csv').options(header='true', inferSchema='true').load(dataset)
    # rename the ' Label' column (note the leading space) to 'Label'
    dataset = dataset.select(*[col(s).alias('Label') if s == ' Label' else s for s in dataset.columns])
    # drop a column that is not needed
    dataset = dataset.drop('External IP')
    dataset = dataset.filter(dataset.Label.isNotNull())
    # remove rows whose Label value is the literal string ' Label' (e.g. repeated header rows)
    dataset = dataset.filter(dataset.Label != ' Label')
    print(dataset.groupBy('Label').count().collect())
    return dataset

# invoking
ds_path = '../final.csv'
dataset=loading_data(ds_path)

check type of dataset.

type(dataset)

pyspark.sql.dataframe.DataFrame

convert to a NumPy array (data_preprocessing is a helper whose definition is not shown)

import numpy as np
np_dfr = np.array(data_preprocessing(dataset).collect())

split features and labels

X = np_dfr[:,0:22]
Y = np_dfr[:,-1]

show X

>> X
array([['143347', '1325', '28.19148936', ..., '61', '0', '0'],
 ['50905', '0', '0', ..., '110', '0', '0'],
 ['143899', '1325', '28.80434783', ..., '61', '0', '0'],
 ...,
 ['85', '0', '0', ..., '1980', '0', '0'],
 ['233', '54', '27', ..., '-1', '0', '0'],
 ['���', '�', '�����', ..., '�', '��', '���']], dtype='<U26')
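Before deciding how to treat the garbled values, it may help to know how many rows are affected. Here is a minimal sketch, assuming the bad cells all contain U+FFFD. X_demo is a small stand-in built from the rows printed above, and np.char.find does an element-wise substring search.

```python
import numpy as np

REPLACEMENT = "\ufffd"  # Unicode replacement character

# Small stand-in for X (values copied from the rows shown above).
X_demo = np.array([
    ["85", "0", "0", "1980", "0", "0"],
    ["233", "54", "27", "-1", "0", "0"],
    [REPLACEMENT * 3, REPLACEMENT, REPLACEMENT * 5,
     REPLACEMENT, REPLACEMENT * 2, REPLACEMENT * 3],
])

# np.char.find returns -1 for cells that do not contain the substring.
has_bad_cell = np.char.find(X_demo, REPLACEMENT) >= 0

# Indices of rows with at least one garbled cell.
bad_rows = np.flatnonzero(has_bad_cell.any(axis=1))

print(bad_rows)
print(len(bad_rows), "of", len(X_demo), "rows affected")

# Rows without garbled cells convert cleanly.
X_clean = X_demo[~has_bad_cell.any(axis=1)].astype(float)
```

If only a handful of rows are affected, dropping them may distort the data less than substituting a made-up value.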

asked Mar 26 at 17:18

venom8914

Tags: python, dataframe, csv, data-formats


1 Answer

Though not the best solution, I had some success by converting the array into a pandas DataFrame and cleaning it there.

code snippet

import pandas as pd

# convert X into a DataFrame
X_pd = pd.DataFrame(data=X)
# replace every cell containing the Unicode replacement character (U+FFFD) with 0
X_replace = X_pd.replace('�', 0, regex=True)
# convert it back to a NumPy array
X_np = X_replace.values
# cast the array to float
X_fa = X_np.astype(float)

input

array([['85', '0', '0', '1980', '0', '0'],
 ['233', '54', '27', '-1', '0', '0'],
 ['���', '�', '�����', '�', '��', '���']], dtype='<U5')

output

array([[ 8.50e+01, 0.00e+00, 0.00e+00, 1.98e+03, 0.00e+00, 0.00e+00],
 [ 2.33e+02, 5.40e+01, 2.70e+01, -1.00e+00, 0.00e+00, 0.00e+00],
 [ 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00]])
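An alternative worth considering, sketched here under the assumption that unparseable cells should become missing values rather than zeros: pd.to_numeric with errors="coerce" converts anything it cannot parse to NaN, after which the affected rows can be dropped or imputed. X_demo is a stand-in for X. Note that coercion also hides any other non-numeric strings, not only U+FFFD.

```python
import numpy as np
import pandas as pd

BAD = "\ufffd"  # Unicode replacement character

# Stand-in for X, using the sample rows from the input above.
X_demo = np.array([
    ["85", "0", "0", "1980", "0", "0"],
    ["233", "54", "27", "-1", "0", "0"],
    [BAD * 3, BAD, BAD * 5, BAD, BAD * 2, BAD * 3],
])

df = pd.DataFrame(X_demo)

# Column by column, parse numbers and turn unparseable cells into NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Option 1: drop every row that contains at least one unparseable cell.
X_dropped = numeric.dropna(axis=0, how="any").to_numpy(dtype=float)

# Option 2: keep all rows and leave NaN as an explicit "missing" marker.
X_nan = numeric.to_numpy(dtype=float)

print(X_dropped.shape)
print(np.isnan(X_nan).any(axis=1))
```

Compared with replacing by 0, NaN keeps the missing cells distinguishable from genuine zeros, which matters if 0 is a meaningful value in the features.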

answered Mar 27 at 13:31

venom8914
