
Classify sensor data (multivariate time series) with Python's scikit-learn decision tree


I'm trying to apply scikit-learn's decision tree to the following dataset, with the goal of classifying the data:

Sensor data:



  • multiple .csv files

  • every .csv file has multiple sensors (see here)

  • each .csv file has one label (0 or 1)

So far I've tried to train my model with pandas Series. It worked, but the decision tree couldn't differentiate between the features/sensors. Is a pandas Series the right approach for analysing data like this, or does anyone have another solution for this problem?
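For reference, a minimal way to load a dataset laid out like this is to build one DataFrame per file plus a parallel list of labels. The filenames and the "label encoded in the filename" convention below are assumptions for illustration only; the sketch writes two tiny files itself so it is self-contained:

```python
import glob
import os
import tempfile

import pandas

# Build two tiny example files so the sketch is self-contained;
# the label-in-the-filename convention is a hypothetical assumption
folder = tempfile.mkdtemp()
for name in ['rec01_label0.csv', 'rec02_label1.csv']:
    df = pandas.DataFrame({'time': [0.1, 0.2],
                           'sensor1': [1.0, 2.0],
                           'sensor2': [3.0, 4.0]})
    df.to_csv(os.path.join(folder, name), index=False)

samples = []   # one DataFrame per .csv file
labels = []    # one 0/1 label per file
for path in sorted(glob.glob(os.path.join(folder, '*.csv'))):
    samples.append(pandas.read_csv(path))
    # adapt this line to wherever your labels actually live
    labels.append(int(os.path.splitext(path)[0][-1]))

print(len(samples), labels)
```

Keeping the per-file DataFrames and the labels in two parallel lists makes the later feature-extraction step a simple list comprehension.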










Comments:

  • Try to include part of your code (related to the question), giving the community a chance to help you. – Tasos, Apr 3 at 16:15

  • Hi, I think code isn't that relevant at this point. My question is more general, like "how to handle this kind of data". – Kev Schl, Apr 3 at 16:28

  • You need to show the code so we can show where things went wrong, or possibly provide updated code to illustrate a better approach. – jonnor, Apr 6 at 11:46

  • "Train model with pandas Series" does not make any sense; there is no training functionality in pandas. – jonnor, Apr 6 at 11:47

  • Please also provide a CSV with your example data. It is much easier to read and to show an example from. – jonnor, Apr 6 at 11:49















machine-learning python scikit-learn decision-trees






asked Apr 3 at 15:41









Kev Schl

1 Answer



















To use a scikit-learn classifier you need to flatten the 2D raw sensor data into 1D feature vectors; the code below demonstrates the basics.

What kind of feature engineering to apply for the best predictive effect depends entirely on the nature of your sensors and of the problem, and the question and data provided give no details about this.



Feature Engineering



The overall process is:



  1. Look for patterns in the data (Exploratory Data Analysis)

  2. Attempt to create a new feature which describes this pattern

  3. Evaluate the new set of features using cross-validation

  4. Analyze the samples that your classifier got wrong (Error Analysis)

  5. Repeat from 1) until performance is good enough
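Step 3 of the loop above, evaluating a feature set with cross-validation, might look like the following sketch; the feature matrix `X` and labels `y` here are random stand-ins, not real sensor features:

```python
import numpy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = numpy.random.RandomState(0)
X = rng.random_sample((100, 8))   # hypothetical feature matrix: 100 samples, 8 features
y = rng.randint(0, 2, 100)        # hypothetical binary labels

est = RandomForestClassifier(n_estimators=10, random_state=0)
scores = cross_val_score(est, X, y, cv=5)  # accuracy on each of 5 held-out folds
print('accuracy per fold:', scores, 'mean:', scores.mean())
```

Comparing the mean fold score before and after adding a feature tells you whether the feature actually helps.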

Here are some things you should try:



  • Plot the raw sensor data from a few samples of the positive and negative class.


  • Plot the distributions (histograms) of each raw sensor's values for each class, across the entire dataset.


  • Try to standardize the data. For each time-series of sensor data, remove the mean and divide by the standard deviation for each sample.


  • Try some standard statistical summarizations of each time series: max, min, mean, std, skew, kurtosis. These are unlikely to beat features tailored to the patterns you see, but they sometimes perform OK.


Focus first on univariate features per sensor; the decision tree will be good at combining them.
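The standardization and summary-statistics suggestions above can be sketched with pandas; `agg` computes the per-sensor statistics in one call. The sensor columns below are toy data, not the asker's actual sensors:

```python
import numpy
import pandas

rng = numpy.random.RandomState(0)
data = pandas.DataFrame({
    'time': numpy.linspace(0.1, 1.0, 10),
    'sensor1': rng.random_sample(10),
    'sensor2': rng.random_sample(10),
})
sensors = data.drop(columns=['time'])

# standardize each sensor's time series: zero mean, unit variance
standardized = (sensors - sensors.mean()) / sensors.std()

# summary statistics per sensor -> one flat, named feature vector
stats = sensors.agg(['max', 'min', 'mean', 'std', 'skew', 'kurt'])
features = stats.T.stack()  # index like ('sensor1', 'max'), value is the feature
print(features)
```

Because the resulting feature vector keeps (sensor, statistic) names, the splits of a fitted decision tree stay interpretable, which matters for the asker's goal of understanding the model's decisions.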



Fitting to a classifier



import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier

def get_sensor_data():
    timesteps = 10
    times = numpy.linspace(0.1, 1.0, timesteps)
    df = pandas.DataFrame({
        'time': times,
        'sensor1': numpy.random.random(timesteps),
        'sensor2': numpy.random.random(timesteps),
        'sensor3': numpy.random.random(timesteps),
        'sensor4': numpy.random.random(timesteps),
    })
    return df

samples = [ get_sensor_data() for _ in range(100) ]
labels = [ int(numpy.random.random() > 0.5) for _ in range(100) ]
assert len(samples) == len(labels)

print('sample from CSV file:\n', samples[0], '\nlabel', labels[0], '\n')

def to_features(data):
    # remove the time column
    feature_columns = list(set(data.columns) - set(['time']))
    # TODO: do smarter feature engineering here
    sensor_values = data[feature_columns].values
    # Note: the features must be 1D for scikit-learn classifiers
    features = sensor_values.flatten()
    assert len(features.shape) == 1, features.shape
    return features

features = numpy.stack([ to_features(d) for d in samples ])

assert features.shape[0] == len(samples)
print('Features:', features.shape, '\n', features[0])

# XXX: do train/test splits etc
est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(features, labels)


Example output



sample from CSV file:
time sensor1 sensor2 sensor3 sensor4
0 0.1 0.820667 0.346542 0.625512 0.774050
1 0.2 0.821934 0.241652 0.485608 0.188131
2 0.3 0.264697 0.780841 0.137018 0.117096
3 0.4 0.464143 0.457126 0.972894 0.600710
4 0.5 0.530302 0.027401 0.876191 0.563788
5 0.6 0.598231 0.291814 0.588032 0.143753
6 0.7 0.627435 0.036549 0.276131 0.311099
7 0.8 0.527908 0.197046 0.580293 0.123796
8 0.9 0.068682 0.880533 0.956394 0.787993
9 1.0 0.244478 0.306716 0.586049 0.373013
label 1

Features: (100, 40)
[0.82066682 0.62551234 0.77405 0.34654243 0.82193414 0.48560828
0.18813108 0.24165186 0.26469686 0.1370181 0.11709553 0.78084136
0.46414318 0.97289382 0.60070974 0.45712632 0.53030219 0.8761905
0.5637877 0.02740072 0.59823073 0.58803188 0.14375282 0.29181434
0.62743516 0.27613083 0.31109894 0.03654882 0.52790773 0.58029298
0.1237963 0.19704597 0.06868206 0.95639405 0.78799333 0.88053276
0.24447754 0.5860489 0.37301339 0.30671624]

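The train/test split that the code above leaves as an exercise (the `# XXX` comment) might look like this; the features and labels here are random stand-ins with the same shapes as in the example:

```python
import numpy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = numpy.random.RandomState(0)
features = rng.random_sample((100, 40))  # stand-in for the flattened features
labels = rng.randint(0, 2, 100)          # stand-in binary labels

# hold out 25% of samples; stratify keeps the class balance in both halves
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0)

est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01, random_state=0)
est.fit(X_train, y_train)
print('train accuracy:', est.score(X_train, y_train))
print('test accuracy:', est.score(X_test, y_test))
```

A large gap between train and test accuracy is the overfitting signal to watch for when iterating on features.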







Comments:

  • Thanks for your answer! – Kev Schl, Apr 8 at 5:52

  • That's a similar approach to the one I had using pandas Series; I've converted my DataFrames to 1D too. Is the decision tree able to differentiate between the features, though? As I mentioned in my question, that's the problem I'm not able to solve right now: "...but the decision tree couldn't differentiate the features/sensors...". – Kev Schl, Apr 8 at 5:55

  • Whether a decision tree can predict what you want is not a question we can answer without more details. Open a new question with: 1) a description of the sensors and the target phenomenon, 2) plots of the sensor data from the positive class and from the negative class (if a human cannot tell the difference, then machines usually cannot either), and 3) plots of the test and training scores. – jonnor, Apr 8 at 9:43

  • PS: always use RandomForest instead of DecisionTree; it performs much better. – jonnor, Apr 8 at 9:44

  • RandomForest isn't that good for my sort of problem, because of its "black box" method. I want to understand and reproduce the decisions of the decision tree. My question was, and still is, how to classify multivariate time series on the basis of the sensors. – Kev Schl, Apr 8 at 10:37











1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









0












$begingroup$

For usage you need to flatten the 2D raw sensor data into 1D features. Below code demonstrates the basics.



What kind of feature engineering to apply for best predictive effect depends entirely on the nature of your sensors and problem. There are no details about this in the question or data provided.



Feature Engineering



The overall process is:



  1. Look for patterns in the data (Exploratory Data Analysis)

  2. Attempt to create a new feature which describes this pattern

  3. Evaluate the new set of features using cross-validation

  4. Analyze the samples that your classifier got wrong (Error Analysis)

  5. Repeat from 1) until performance is good enough

Here are some things you should try:



  • Plot the raw sensor data from a few samples of the positive and negative class.


  • Plot the distributions (histogram) for each class of each raw sensor values across the entire dataset.


  • Try to standardize the data. For each time-series of sensor data, remove the mean and divide by the standard deviation for each sample.


  • Try some standard statistical summarizations on each time-series. Max, min, mean, std, skew, kurtosis. Unlikely to be better than something tailored to the patterns you see, but sometimes performs OK.


Focus first on uni-variate features per sensor. The decision-tree will be good at combining these together.



Fitting to classifier



import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier

def get_sensor_data():

timesteps = 10
times = numpy.linspace(0.1, 1.0, timesteps)
df = pandas.DataFrame(
'time': times,
'sensor1': numpy.random.random(timesteps),
'sensor2': numpy.random.random(timesteps),
'sensor3': numpy.random.random(timesteps),
'sensor4': numpy.random.random(timesteps),
)

return df

samples = [ get_sensor_data() for _ in range(100) ]
labels = [ int(numpy.random.random() > 0.5) for _ in range(100) ]
assert len(samples) == len(labels)

print('sample from CSV file:n', samples[0], 'nlabel', labels[0], 'n')

def to_features(data):
# remove time column
feature_columns = list(set(data.columns) - set(['time']))
# TODO: do smarter feature engineering here
sensor_values = data[feature_columns].values
# Note: the features must be 1D for scikit-learn classifiers
features = sensor_values.flatten()
assert len(features.shape) == 1, features.shape
return features

features = numpy.stack([ to_features(d) for d in samples ])

assert features.shape[0] == len(samples)
print('Features:', features.shape, 'n', features[0])

# XXX: do train/test splits etc
est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(features, labels)


Example output



sample from CSV file:
time sensor1 sensor2 sensor3 sensor4
0 0.1 0.820667 0.346542 0.625512 0.774050
1 0.2 0.821934 0.241652 0.485608 0.188131
2 0.3 0.264697 0.780841 0.137018 0.117096
3 0.4 0.464143 0.457126 0.972894 0.600710
4 0.5 0.530302 0.027401 0.876191 0.563788
5 0.6 0.598231 0.291814 0.588032 0.143753
6 0.7 0.627435 0.036549 0.276131 0.311099
7 0.8 0.527908 0.197046 0.580293 0.123796
8 0.9 0.068682 0.880533 0.956394 0.787993
9 1.0 0.244478 0.306716 0.586049 0.373013
label 1

Features: (100, 40)
[0.82066682 0.62551234 0.77405 0.34654243 0.82193414 0.48560828
0.18813108 0.24165186 0.26469686 0.1370181 0.11709553 0.78084136
0.46414318 0.97289382 0.60070974 0.45712632 0.53030219 0.8761905
0.5637877 0.02740072 0.59823073 0.58803188 0.14375282 0.29181434
0.62743516 0.27613083 0.31109894 0.03654882 0.52790773 0.58029298
0.1237963 0.19704597 0.06868206 0.95639405 0.78799333 0.88053276
0.24447754 0.5860489 0.37301339 0.30671624]

```





share|improve this answer











$endgroup$












  • $begingroup$
    thanks for your answer!
    $endgroup$
    – Kev Schl
    Apr 8 at 5:52










  • $begingroup$
    That's a similiar approach to the one i had with using pandas Series. I've converted my DataFrames in 1D too. Is the decision tree able to differate the features tho? As I have mentioned in my question, that's a problem I'm not able to solve right now - "...but the decision tree couldn't differate the features/sensors..."
    $endgroup$
    – Kev Schl
    Apr 8 at 5:55










  • $begingroup$
    Whether a decision tree can predict what you want is not a question we can answer without more details. Open a new question with: 1) description of the sensors and the target fenomen 2) plot of the sensor data from positive class and from negative class (if a human cannot tell difference then machines usually cannot either) 3) plots of the test and training score.
    $endgroup$
    – jonnor
    Apr 8 at 9:43











  • $begingroup$
    PS, always use RandomForest instead of DecisionTree, it performs much better
    $endgroup$
    – jonnor
    Apr 8 at 9:44










  • $begingroup$
    RandomForest isn't that good for my sort of problem, because of its "black box" method. I want to understand and reproduce the decisions of the decision tree. My question was and still is how to classify multivariate time series on the basis of the sensors.
    $endgroup$
    – Kev Schl
    Apr 8 at 10:37















0












$begingroup$

For usage you need to flatten the 2D raw sensor data into 1D features. Below code demonstrates the basics.



What kind of feature engineering to apply for best predictive effect depends entirely on the nature of your sensors and problem. There are no details about this in the question or data provided.



Feature Engineering



The overall process is:



  1. Look for patterns in the data (Exploratory Data Analysis)

  2. Attempt to create a new feature which describes this pattern

  3. Evaluate the new set of features using cross-validation

  4. Analyze the samples that your classifier got wrong (Error Analysis)

  5. Repeat from 1) until performance is good enough

Here are some things you should try:



  • Plot the raw sensor data from a few samples of the positive and negative class.


  • Plot the distributions (histogram) for each class of each raw sensor values across the entire dataset.


  • Try to standardize the data. For each time-series of sensor data, remove the mean and divide by the standard deviation for each sample.


  • Try some standard statistical summarizations on each time-series. Max, min, mean, std, skew, kurtosis. Unlikely to be better than something tailored to the patterns you see, but sometimes performs OK.


Focus first on uni-variate features per sensor. The decision-tree will be good at combining these together.



Fitting to classifier



import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier

def get_sensor_data():

timesteps = 10
times = numpy.linspace(0.1, 1.0, timesteps)
df = pandas.DataFrame(
'time': times,
'sensor1': numpy.random.random(timesteps),
'sensor2': numpy.random.random(timesteps),
'sensor3': numpy.random.random(timesteps),
'sensor4': numpy.random.random(timesteps),
)

return df

samples = [ get_sensor_data() for _ in range(100) ]
labels = [ int(numpy.random.random() > 0.5) for _ in range(100) ]
assert len(samples) == len(labels)

print('sample from CSV file:n', samples[0], 'nlabel', labels[0], 'n')

def to_features(data):
# remove time column
feature_columns = list(set(data.columns) - set(['time']))
# TODO: do smarter feature engineering here
sensor_values = data[feature_columns].values
# Note: the features must be 1D for scikit-learn classifiers
features = sensor_values.flatten()
assert len(features.shape) == 1, features.shape
return features

features = numpy.stack([ to_features(d) for d in samples ])

assert features.shape[0] == len(samples)
print('Features:', features.shape, 'n', features[0])

# XXX: do train/test splits etc
est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(features, labels)


Example output



sample from CSV file:
time sensor1 sensor2 sensor3 sensor4
0 0.1 0.820667 0.346542 0.625512 0.774050
1 0.2 0.821934 0.241652 0.485608 0.188131
2 0.3 0.264697 0.780841 0.137018 0.117096
3 0.4 0.464143 0.457126 0.972894 0.600710
4 0.5 0.530302 0.027401 0.876191 0.563788
5 0.6 0.598231 0.291814 0.588032 0.143753
6 0.7 0.627435 0.036549 0.276131 0.311099
7 0.8 0.527908 0.197046 0.580293 0.123796
8 0.9 0.068682 0.880533 0.956394 0.787993
9 1.0 0.244478 0.306716 0.586049 0.373013
label 1

Features: (100, 40)
[0.82066682 0.62551234 0.77405 0.34654243 0.82193414 0.48560828
0.18813108 0.24165186 0.26469686 0.1370181 0.11709553 0.78084136
0.46414318 0.97289382 0.60070974 0.45712632 0.53030219 0.8761905
0.5637877 0.02740072 0.59823073 0.58803188 0.14375282 0.29181434
0.62743516 0.27613083 0.31109894 0.03654882 0.52790773 0.58029298
0.1237963 0.19704597 0.06868206 0.95639405 0.78799333 0.88053276
0.24447754 0.5860489 0.37301339 0.30671624]

```





share|improve this answer











$endgroup$












  • $begingroup$
    thanks for your answer!
    $endgroup$
    – Kev Schl
    Apr 8 at 5:52










  • $begingroup$
    That's a similiar approach to the one i had with using pandas Series. I've converted my DataFrames in 1D too. Is the decision tree able to differate the features tho? As I have mentioned in my question, that's a problem I'm not able to solve right now - "...but the decision tree couldn't differate the features/sensors..."
    $endgroup$
    – Kev Schl
    Apr 8 at 5:55










  • $begingroup$
    Whether a decision tree can predict what you want is not a question we can answer without more details. Open a new question with: 1) description of the sensors and the target fenomen 2) plot of the sensor data from positive class and from negative class (if a human cannot tell difference then machines usually cannot either) 3) plots of the test and training score.
    $endgroup$
    – jonnor
    Apr 8 at 9:43











  • $begingroup$
    PS, always use RandomForest instead of DecisionTree, it performs much better
    $endgroup$
    – jonnor
    Apr 8 at 9:44










  • $begingroup$
    RandomForest isn't that good for my sort of problem, because of its "black box" method. I want to understand and reproduce the decisions of the decision tree. My question was and still is how to classify multivariate time series on the basis of the sensors.
    $endgroup$
    – Kev Schl
    Apr 8 at 10:37













0












0








0





$begingroup$

For usage you need to flatten the 2D raw sensor data into 1D features. Below code demonstrates the basics.



What kind of feature engineering to apply for best predictive effect depends entirely on the nature of your sensors and problem. There are no details about this in the question or data provided.



Feature Engineering



The overall process is:



  1. Look for patterns in the data (Exploratory Data Analysis)

  2. Attempt to create a new feature which describes this pattern

  3. Evaluate the new set of features using cross-validation

  4. Analyze the samples that your classifier got wrong (Error Analysis)

  5. Repeat from 1) until performance is good enough

Here are some things you should try:



  • Plot the raw sensor data from a few samples of the positive and negative class.


  • Plot the distributions (histogram) for each class of each raw sensor values across the entire dataset.


  • Try to standardize the data. For each time-series of sensor data, remove the mean and divide by the standard deviation for each sample.


  • Try some standard statistical summarizations on each time-series. Max, min, mean, std, skew, kurtosis. Unlikely to be better than something tailored to the patterns you see, but sometimes performs OK.


Focus first on uni-variate features per sensor. The decision-tree will be good at combining these together.



Fitting to classifier



import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier

def get_sensor_data():

timesteps = 10
times = numpy.linspace(0.1, 1.0, timesteps)
df = pandas.DataFrame(
'time': times,
To use a scikit-learn classifier you need to flatten each sample's 2D raw sensor data into a 1D feature vector. The code below demonstrates the basics.

What kind of feature engineering will have the best predictive effect depends entirely on the nature of your sensors and problem; the question and data provided give no details about this.



Feature Engineering



The overall process is:



  1. Look for patterns in the data (Exploratory Data Analysis)

  2. Attempt to create a new feature which describes this pattern

  3. Evaluate the new set of features using cross-validation

  4. Analyze the samples that your classifier got wrong (Error Analysis)

  5. Repeat from 1) until performance is good enough
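Step 3 above can be done with scikit-learn's `cross_val_score`. A minimal sketch, where `X` and `y` are random stand-ins for your engineered feature matrix and labels:

```python
import numpy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Random stand-ins for the engineered feature matrix and labels
X = numpy.random.random((100, 40))
y = numpy.random.randint(0, 2, size=100)

est = RandomForestClassifier(n_estimators=10)
# 5-fold cross-validated accuracy; compare this as you add/remove features
scores = cross_val_score(est, X, y, cv=5)
print('mean CV accuracy: %.3f' % scores.mean())
```

With random labels the score should hover around 0.5; a new feature set is worth keeping when it moves this estimate up consistently.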

Here are some things you should try:



  • Plot the raw sensor data from a few samples of the positive and negative class.


  • Plot the distributions (histogram) for each class of each raw sensor values across the entire dataset.


  • Try to standardize the data. For each time-series of sensor data, remove the mean and divide by the standard deviation for each sample.


  • Try some standard statistical summarizations on each time-series. Max, min, mean, std, skew, kurtosis. Unlikely to be better than something tailored to the patterns you see, but sometimes performs OK.


Focus first on univariate features per sensor; the decision tree will be good at combining them.
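The standardization and summary-statistics ideas above can be sketched like this (the helper names `summarize_sensor` and `to_summary_features` are hypothetical, not from the original code):

```python
import numpy
import pandas

def summarize_sensor(series):
    # Standardize the time-series, then compute per-sensor summary statistics
    std = series.std()
    z = (series - series.mean()) / std if std > 0 else series * 0.0
    return [series.min(), series.max(), series.mean(), std,
            series.skew(), series.kurtosis(), z.abs().max()]

def to_summary_features(data):
    # Preserve column order so feature positions stay stable between samples
    feature_columns = [c for c in data.columns if c != 'time']
    features = []
    for col in feature_columns:
        features.extend(summarize_sensor(data[col]))
    return numpy.array(features)

df = pandas.DataFrame({
    'time': numpy.linspace(0.1, 1.0, 10),
    'sensor1': numpy.random.random(10),
    'sensor2': numpy.random.random(10),
})
print(to_summary_features(df).shape)  # 2 sensors x 7 statistics = (14,)
```

Note the list comprehension keeps the DataFrame's column order, unlike set subtraction, so each feature index always refers to the same sensor statistic across samples.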



Fitting a classifier



import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier

def get_sensor_data():
    timesteps = 10
    times = numpy.linspace(0.1, 1.0, timesteps)
    df = pandas.DataFrame({
        'time': times,
        'sensor1': numpy.random.random(timesteps),
        'sensor2': numpy.random.random(timesteps),
        'sensor3': numpy.random.random(timesteps),
        'sensor4': numpy.random.random(timesteps),
    })
    return df

samples = [ get_sensor_data() for _ in range(100) ]
labels = [ int(numpy.random.random() > 0.5) for _ in range(100) ]
assert len(samples) == len(labels)

print('sample from CSV file:\n', samples[0], '\nlabel', labels[0], '\n')

def to_features(data):
    # remove time column
    feature_columns = list(set(data.columns) - set(['time']))
    # TODO: do smarter feature engineering here
    sensor_values = data[feature_columns].values
    # Note: each sample's features must be 1D for scikit-learn classifiers
    features = sensor_values.flatten()
    assert len(features.shape) == 1, features.shape
    return features

features = numpy.stack([ to_features(d) for d in samples ])

assert features.shape[0] == len(samples)
print('Features:', features.shape, '\n', features[0])

# XXX: do train/test splits etc
est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(features, labels)


Example output



sample from CSV file:
time sensor1 sensor2 sensor3 sensor4
0 0.1 0.820667 0.346542 0.625512 0.774050
1 0.2 0.821934 0.241652 0.485608 0.188131
2 0.3 0.264697 0.780841 0.137018 0.117096
3 0.4 0.464143 0.457126 0.972894 0.600710
4 0.5 0.530302 0.027401 0.876191 0.563788
5 0.6 0.598231 0.291814 0.588032 0.143753
6 0.7 0.627435 0.036549 0.276131 0.311099
7 0.8 0.527908 0.197046 0.580293 0.123796
8 0.9 0.068682 0.880533 0.956394 0.787993
9 1.0 0.244478 0.306716 0.586049 0.373013
label 1

Features: (100, 40)
[0.82066682 0.62551234 0.77405 0.34654243 0.82193414 0.48560828
0.18813108 0.24165186 0.26469686 0.1370181 0.11709553 0.78084136
0.46414318 0.97289382 0.60070974 0.45712632 0.53030219 0.8761905
0.5637877 0.02740072 0.59823073 0.58803188 0.14375282 0.29181434
0.62743516 0.27613083 0.31109894 0.03654882 0.52790773 0.58029298
0.1237963 0.19704597 0.06868206 0.95639405 0.78799333 0.88053276
0.24447754 0.5860489 0.37301339 0.30671624]

edited Apr 8 at 12:23

answered Apr 7 at 20:25 by jonnor
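The `# XXX: do train/test splits etc` comment in the code is worth acting on: hold out a test set before fitting. A minimal sketch, with random stand-in data of the same shape as the features built above:

```python
import numpy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Random stand-ins with the same shape as the flattened features above
features = numpy.random.random((100, 40))
labels = numpy.random.randint(0, 2, size=100)

# Hold out 25% of samples; stratify keeps the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=1, stratify=labels)

est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(X_train, y_train)
print('held-out accuracy: %.3f' % est.score(X_test, y_test))
```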
  • $begingroup$
    thanks for your answer!
    $endgroup$
    – Kev Schl
    Apr 8 at 5:52










  • $begingroup$
That's a similar approach to the one I had using pandas Series. I've converted my DataFrames to 1D too. Is the decision tree able to differentiate the features though? As I have mentioned in my question, that's a problem I'm not able to solve right now - "...but the decision tree couldn't differentiate the features/sensors..."
    $endgroup$
    – Kev Schl
    Apr 8 at 5:55










  • $begingroup$
Whether a decision tree can predict what you want is not a question we can answer without more details. Open a new question with: 1) a description of the sensors and the target phenomenon 2) plots of the sensor data from the positive class and from the negative class (if a human cannot tell the difference, then machines usually cannot either) 3) plots of the test and training scores.
    $endgroup$
    – jonnor
    Apr 8 at 9:43











  • $begingroup$
PS: always use RandomForest instead of DecisionTree; it performs much better.
    $endgroup$
    – jonnor
    Apr 8 at 9:44










  • $begingroup$
    RandomForest isn't that good for my sort of problem, because of its "black box" method. I want to understand and reproduce the decisions of the decision tree. My question was and still is how to classify multivariate time series on the basis of the sensors.
    $endgroup$
    – Kev Schl
    Apr 8 at 10:37
















