ValueError: could not convert string to float: '���'
I have a (2M, 23)-dimensional NumPy array X. Its dtype is <U26, i.e. Unicode strings of up to 26 characters.
array([['143347', '1325', '28.19148936', ..., '61', '0', '0'],
['50905', '0', '0', ..., '110', '0', '0'],
['143899', '1325', '28.80434783', ..., '61', '0', '0'],
...,
['85', '0', '0', ..., '1980', '0', '0'],
['233', '54', '27', ..., '-1', '0', '0'],
['���', '�', '�����', ..., '�', '��', '���']], dtype='<U26')
When I try to convert it to a float dtype with
X_f = X.astype(float)
I get the error shown in the title. How can I resolve this conversion error caused by the '���' entries?
I realize that some characters were not read properly into the dataframe, and the Unicode replacement character (�) is a result of that.
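To gauge how widespread the problem is, a quick diagnostic along these lines can help (a sketch, not part of the original post):
import numpy as np
# rows that contain the Unicode replacement character in any column
bad_rows = (np.char.find(X, '�') >= 0).any(axis=1)
print(bad_rows.sum(), 'of', X.shape[0], 'rows are affected')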
My questions:
- How do I handle this misreading?
- Should I ignore these characters, or transform them to zero instead?
Additional information on how the data was read:
importing relevant packages
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col
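Note that sql_sc is used below but never defined in the snippet; presumably it was created roughly as follows (this setup is assumed, it is not shown in the original question):
# assumed setup, not shown in the question: sql_sc is presumably a SQLContext
sc = SparkContext.getOrCreate()
sql_sc = SQLContext(sc)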
Loading the dataset into a PySpark dataframe:
def loading_data(dataset):
    dataset = sql_sc.read.format('csv').options(header='true', inferSchema='true').load(dataset)
    # rename the ' Label' column header to 'Label'
    dataset = dataset.select(*[col(s).alias('Label') if s == ' Label' else s for s in dataset.columns])
    # drop an unused column and remove rows with missing labels
    dataset = dataset.drop('External IP')
    dataset = dataset.filter(dataset.Label.isNotNull())
    dataset = dataset.filter(dataset.Label != ' Label')  # filter stray header rows out of Label
    print(dataset.groupBy('Label').count().collect())
    return dataset
# invoking
ds_path = '../final.csv'
dataset=loading_data(ds_path)
Check the type of dataset:
type(dataset)
pyspark.sql.dataframe.DataFrame
convert to np array
import numpy as np
np_dfr = np.array(data_preprocessing(dataset).collect())
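(Note: data_preprocessing is a separate user-defined step that is not shown in this question.)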
split features and labels
X = np_dfr[:,0:22]
Y = np_dfr[:,-1]
show X
>> X
array([['143347', '1325', '28.19148936', ..., '61', '0', '0'],
['50905', '0', '0', ..., '110', '0', '0'],
['143899', '1325', '28.80434783', ..., '61', '0', '0'],
...,
['85', '0', '0', ..., '1980', '0', '0'],
['233', '54', '27', ..., '-1', '0', '0'],
['���', '�', '�����', ..., '�', '��', '���']], dtype='<U26')
Tags: python, dataframe, csv, data-formats
asked Mar 26 at 17:18 by venom8914
1 Answer
Though not the best solution, I had some success by converting the array into a pandas DataFrame and working from there.
Code snippet:
import pandas as pd

# convert X into a DataFrame
X_pd = pd.DataFrame(data=X)
# replace all instances of the Unicode replacement character with 0
X_replace = X_pd.replace('�', 0, regex=True)
# convert it back to a NumPy array
X_np = X_replace.values
# cast the values to float
X_fa = X_np.astype(float)
input
array([['85', '0', '0', '1980', '0', '0'],
['233', '54', '27', '-1', '0', '0'],
['���', '�', '�����', '�', '��', '���']], dtype='<U5')
output
array([[ 8.50e+01, 0.00e+00, 0.00e+00, 1.98e+03, 0.00e+00, 0.00e+00],
[ 2.33e+02, 5.40e+01, 2.70e+01, -1.00e+00, 0.00e+00, 0.00e+00],
[ 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00]])
answered Mar 27 at 13:31 by venom8914 (new contributor)
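As an aside (not part of the original answer), a similar result can be obtained by letting pandas coerce anything non-numeric to NaN and then filling those values with 0:
import pandas as pd

# coerce non-numeric strings (e.g. '���') to NaN, then replace NaN with 0
X_pd = pd.DataFrame(data=X)
X_fa = X_pd.apply(pd.to_numeric, errors='coerce').fillna(0).to_numpy(dtype=float)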