
Classify sensor data (multivariate time series) with Python's scikit-learn decision tree


I'm trying to apply scikit-learn's decision tree to the following dataset, with the goal of classifying the data:

Sensor data:



  • multiple .csv files

  • every .csv file has multiple sensors (see here)

  • each .csv file has one label (0 or 1)

So far I've tried to train my model with pandas Series. It worked, but the decision tree couldn't differentiate between the features/sensors. Is a pandas Series the right approach for analysing data like this, or does anyone have another solution for this problem?
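For reference, a minimal way to load a dataset laid out like this is to build one DataFrame per file plus a parallel list of labels. The filenames and the "label encoded in the filename" convention below are assumptions for illustration only; the sketch writes two tiny files itself so it is self-contained:

```python
import glob
import os
import tempfile

import pandas

# Build two tiny example files so the sketch is self-contained;
# the label-in-the-filename convention is a hypothetical assumption
folder = tempfile.mkdtemp()
for name in ['rec01_label0.csv', 'rec02_label1.csv']:
    df = pandas.DataFrame({'time': [0.1, 0.2],
                           'sensor1': [1.0, 2.0],
                           'sensor2': [3.0, 4.0]})
    df.to_csv(os.path.join(folder, name), index=False)

samples = []   # one DataFrame per .csv file
labels = []    # one 0/1 label per file
for path in sorted(glob.glob(os.path.join(folder, '*.csv'))):
    samples.append(pandas.read_csv(path))
    # adapt this line to wherever your labels actually live
    labels.append(int(os.path.splitext(path)[0][-1]))

print(len(samples), labels)
```

Keeping the per-file DataFrames and the labels in two parallel lists makes the later feature-extraction step a simple list comprehension.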










Comments:

  • Try to include part of your code (related to the question), giving the community a chance to help you. – Tasos, Apr 3 at 16:15

  • Hi, I think code isn't that relevant at this point. My question is more general, like "how to handle this kind of data". – Kev Schl, Apr 3 at 16:28

  • You need to show the code so we can show where things went wrong, or possibly provide updated code to illustrate a better approach. – jonnor, Apr 6 at 11:46

  • "Train model with pandas Series" does not make any sense; there is no training functionality in pandas. – jonnor, Apr 6 at 11:47

  • Please also provide a CSV with your example data. It is much easier to read and to show an example from. – jonnor, Apr 6 at 11:49















machine-learning python scikit-learn decision-trees






asked Apr 3 at 15:41









Kev Schl

1 Answer



















To use a scikit-learn classifier you need to flatten the 2D raw sensor data into 1D feature vectors; the code below demonstrates the basics.

What kind of feature engineering to apply for the best predictive effect depends entirely on the nature of your sensors and of the problem, and the question and data provided give no details about this.



Feature Engineering



The overall process is:



  1. Look for patterns in the data (Exploratory Data Analysis)

  2. Attempt to create a new feature which describes this pattern

  3. Evaluate the new set of features using cross-validation

  4. Analyze the samples that your classifier got wrong (Error Analysis)

  5. Repeat from 1) until performance is good enough
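Step 3 of the loop above, evaluating a feature set with cross-validation, might look like the following sketch; the feature matrix `X` and labels `y` here are random stand-ins, not real sensor features:

```python
import numpy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = numpy.random.RandomState(0)
X = rng.random_sample((100, 8))   # hypothetical feature matrix: 100 samples, 8 features
y = rng.randint(0, 2, 100)        # hypothetical binary labels

est = RandomForestClassifier(n_estimators=10, random_state=0)
scores = cross_val_score(est, X, y, cv=5)  # accuracy on each of 5 held-out folds
print('accuracy per fold:', scores, 'mean:', scores.mean())
```

Comparing the mean fold score before and after adding a feature tells you whether the feature actually helps.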

Here are some things you should try:



  • Plot the raw sensor data from a few samples of the positive and negative class.


  • Plot the distributions (histograms) of each raw sensor's values for each class, across the entire dataset.


  • Try to standardize the data. For each time-series of sensor data, remove the mean and divide by the standard deviation for each sample.


  • Try some standard statistical summarizations of each time series: max, min, mean, std, skew, kurtosis. These are unlikely to beat features tailored to the patterns you see, but they sometimes perform OK.


Focus first on univariate features per sensor; the decision tree will be good at combining them.
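The standardization and summary-statistics suggestions above can be sketched with pandas; `agg` computes the per-sensor statistics in one call. The sensor columns below are toy data, not the asker's actual sensors:

```python
import numpy
import pandas

rng = numpy.random.RandomState(0)
data = pandas.DataFrame({
    'time': numpy.linspace(0.1, 1.0, 10),
    'sensor1': rng.random_sample(10),
    'sensor2': rng.random_sample(10),
})
sensors = data.drop(columns=['time'])

# standardize each sensor's time series: zero mean, unit variance
standardized = (sensors - sensors.mean()) / sensors.std()

# summary statistics per sensor -> one flat, named feature vector
stats = sensors.agg(['max', 'min', 'mean', 'std', 'skew', 'kurt'])
features = stats.T.stack()  # index like ('sensor1', 'max'), value is the feature
print(features)
```

Because the resulting feature vector keeps (sensor, statistic) names, the splits of a fitted decision tree stay interpretable, which matters for the asker's goal of understanding the model's decisions.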



Fitting to a classifier



import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier

def get_sensor_data():
    timesteps = 10
    times = numpy.linspace(0.1, 1.0, timesteps)
    df = pandas.DataFrame({
        'time': times,
        'sensor1': numpy.random.random(timesteps),
        'sensor2': numpy.random.random(timesteps),
        'sensor3': numpy.random.random(timesteps),
        'sensor4': numpy.random.random(timesteps),
    })
    return df

samples = [ get_sensor_data() for _ in range(100) ]
labels = [ int(numpy.random.random() > 0.5) for _ in range(100) ]
assert len(samples) == len(labels)

print('sample from CSV file:\n', samples[0], '\nlabel', labels[0], '\n')

def to_features(data):
    # remove the time column
    feature_columns = list(set(data.columns) - set(['time']))
    # TODO: do smarter feature engineering here
    sensor_values = data[feature_columns].values
    # Note: the features must be 1D for scikit-learn classifiers
    features = sensor_values.flatten()
    assert len(features.shape) == 1, features.shape
    return features

features = numpy.stack([ to_features(d) for d in samples ])

assert features.shape[0] == len(samples)
print('Features:', features.shape, '\n', features[0])

# XXX: do train/test splits etc
est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(features, labels)


Example output



sample from CSV file:
time sensor1 sensor2 sensor3 sensor4
0 0.1 0.820667 0.346542 0.625512 0.774050
1 0.2 0.821934 0.241652 0.485608 0.188131
2 0.3 0.264697 0.780841 0.137018 0.117096
3 0.4 0.464143 0.457126 0.972894 0.600710
4 0.5 0.530302 0.027401 0.876191 0.563788
5 0.6 0.598231 0.291814 0.588032 0.143753
6 0.7 0.627435 0.036549 0.276131 0.311099
7 0.8 0.527908 0.197046 0.580293 0.123796
8 0.9 0.068682 0.880533 0.956394 0.787993
9 1.0 0.244478 0.306716 0.586049 0.373013
label 1

Features: (100, 40)
[0.82066682 0.62551234 0.77405 0.34654243 0.82193414 0.48560828
0.18813108 0.24165186 0.26469686 0.1370181 0.11709553 0.78084136
0.46414318 0.97289382 0.60070974 0.45712632 0.53030219 0.8761905
0.5637877 0.02740072 0.59823073 0.58803188 0.14375282 0.29181434
0.62743516 0.27613083 0.31109894 0.03654882 0.52790773 0.58029298
0.1237963 0.19704597 0.06868206 0.95639405 0.78799333 0.88053276
0.24447754 0.5860489 0.37301339 0.30671624]

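The train/test split that the code above leaves as an exercise (the `# XXX` comment) might look like this; the features and labels here are random stand-ins with the same shapes as in the example:

```python
import numpy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = numpy.random.RandomState(0)
features = rng.random_sample((100, 40))  # stand-in for the flattened features
labels = rng.randint(0, 2, 100)          # stand-in binary labels

# hold out 25% of samples; stratify keeps the class balance in both halves
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0)

est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01, random_state=0)
est.fit(X_train, y_train)
print('train accuracy:', est.score(X_train, y_train))
print('test accuracy:', est.score(X_test, y_test))
```

A large gap between train and test accuracy is the overfitting signal to watch for when iterating on features.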







Comments:

  • Thanks for your answer! – Kev Schl, Apr 8 at 5:52

  • That's a similar approach to the one I had using pandas Series; I've converted my DataFrames to 1D too. Is the decision tree able to differentiate between the features, though? As I mentioned in my question, that's the problem I'm not able to solve right now: "...but the decision tree couldn't differentiate the features/sensors...". – Kev Schl, Apr 8 at 5:55

  • Whether a decision tree can predict what you want is not a question we can answer without more details. Open a new question with: 1) a description of the sensors and the target phenomenon, 2) plots of the sensor data from the positive class and from the negative class (if a human cannot tell the difference, then machines usually cannot either), and 3) plots of the test and training scores. – jonnor, Apr 8 at 9:43

  • PS: always use RandomForest instead of DecisionTree; it performs much better. – jonnor, Apr 8 at 9:44

  • RandomForest isn't that good for my sort of problem, because of its "black box" method. I want to understand and reproduce the decisions of the decision tree. My question was, and still is, how to classify multivariate time series on the basis of the sensors. – Kev Schl, Apr 8 at 10:37











1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









0












$begingroup$

For usage you need to flatten the 2D raw sensor data into 1D features. Below code demonstrates the basics.



What kind of feature engineering to apply for best predictive effect depends entirely on the nature of your sensors and problem. There are no details about this in the question or data provided.



Feature Engineering



The overall process is:



  1. Look for patterns in the data (Exploratory Data Analysis)

  2. Attempt to create a new feature which describes this pattern

  3. Evaluate the new set of features using cross-validation

  4. Analyze the samples that your classifier got wrong (Error Analysis)

  5. Repeat from 1) until performance is good enough

Here are some things you should try:



  • Plot the raw sensor data from a few samples of the positive and negative class.


  • Plot the distributions (histogram) for each class of each raw sensor values across the entire dataset.


  • Try to standardize the data. For each time-series of sensor data, remove the mean and divide by the standard deviation for each sample.


  • Try some standard statistical summarizations on each time-series. Max, min, mean, std, skew, kurtosis. Unlikely to be better than something tailored to the patterns you see, but sometimes performs OK.


Focus first on uni-variate features per sensor. The decision-tree will be good at combining these together.



Fitting to classifier



import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier

def get_sensor_data():

timesteps = 10
times = numpy.linspace(0.1, 1.0, timesteps)
df = pandas.DataFrame(
'time': times,
'sensor1': numpy.random.random(timesteps),
'sensor2': numpy.random.random(timesteps),
'sensor3': numpy.random.random(timesteps),
'sensor4': numpy.random.random(timesteps),
)

return df

samples = [ get_sensor_data() for _ in range(100) ]
labels = [ int(numpy.random.random() > 0.5) for _ in range(100) ]
assert len(samples) == len(labels)

print('sample from CSV file:n', samples[0], 'nlabel', labels[0], 'n')

def to_features(data):
# remove time column
feature_columns = list(set(data.columns) - set(['time']))
# TODO: do smarter feature engineering here
sensor_values = data[feature_columns].values
# Note: the features must be 1D for scikit-learn classifiers
features = sensor_values.flatten()
assert len(features.shape) == 1, features.shape
return features

features = numpy.stack([ to_features(d) for d in samples ])

assert features.shape[0] == len(samples)
print('Features:', features.shape, 'n', features[0])

# XXX: do train/test splits etc
est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(features, labels)


Example output



sample from CSV file:
time sensor1 sensor2 sensor3 sensor4
0 0.1 0.820667 0.346542 0.625512 0.774050
1 0.2 0.821934 0.241652 0.485608 0.188131
2 0.3 0.264697 0.780841 0.137018 0.117096
3 0.4 0.464143 0.457126 0.972894 0.600710
4 0.5 0.530302 0.027401 0.876191 0.563788
5 0.6 0.598231 0.291814 0.588032 0.143753
6 0.7 0.627435 0.036549 0.276131 0.311099
7 0.8 0.527908 0.197046 0.580293 0.123796
8 0.9 0.068682 0.880533 0.956394 0.787993
9 1.0 0.244478 0.306716 0.586049 0.373013
label 1

Features: (100, 40)
[0.82066682 0.62551234 0.77405 0.34654243 0.82193414 0.48560828
0.18813108 0.24165186 0.26469686 0.1370181 0.11709553 0.78084136
0.46414318 0.97289382 0.60070974 0.45712632 0.53030219 0.8761905
0.5637877 0.02740072 0.59823073 0.58803188 0.14375282 0.29181434
0.62743516 0.27613083 0.31109894 0.03654882 0.52790773 0.58029298
0.1237963 0.19704597 0.06868206 0.95639405 0.78799333 0.88053276
0.24447754 0.5860489 0.37301339 0.30671624]

```





share|improve this answer











$endgroup$












  • $begingroup$
    thanks for your answer!
    $endgroup$
    – Kev Schl
    Apr 8 at 5:52










  • $begingroup$
    That's a similiar approach to the one i had with using pandas Series. I've converted my DataFrames in 1D too. Is the decision tree able to differate the features tho? As I have mentioned in my question, that's a problem I'm not able to solve right now - "...but the decision tree couldn't differate the features/sensors..."
    $endgroup$
    – Kev Schl
    Apr 8 at 5:55










  • $begingroup$
    Whether a decision tree can predict what you want is not a question we can answer without more details. Open a new question with: 1) description of the sensors and the target fenomen 2) plot of the sensor data from positive class and from negative class (if a human cannot tell difference then machines usually cannot either) 3) plots of the test and training score.
    $endgroup$
    – jonnor
    Apr 8 at 9:43











  • $begingroup$
    PS, always use RandomForest instead of DecisionTree, it performs much better
    $endgroup$
    – jonnor
    Apr 8 at 9:44










  • $begingroup$
    RandomForest isn't that good for my sort of problem, because of its "black box" method. I want to understand and reproduce the decisions of the decision tree. My question was and still is how to classify multivariate time series on the basis of the sensors.
    $endgroup$
    – Kev Schl
    Apr 8 at 10:37















0












$begingroup$

For usage you need to flatten the 2D raw sensor data into 1D features. Below code demonstrates the basics.



What kind of feature engineering to apply for best predictive effect depends entirely on the nature of your sensors and problem. There are no details about this in the question or data provided.



Feature Engineering



The overall process is:



  1. Look for patterns in the data (Exploratory Data Analysis)

  2. Attempt to create a new feature which describes this pattern

  3. Evaluate the new set of features using cross-validation

  4. Analyze the samples that your classifier got wrong (Error Analysis)

  5. Repeat from 1) until performance is good enough

Here are some things you should try:



  • Plot the raw sensor data from a few samples of the positive and negative class.


  • Plot the distributions (histogram) for each class of each raw sensor values across the entire dataset.


  • Try to standardize the data. For each time-series of sensor data, remove the mean and divide by the standard deviation for each sample.


  • Try some standard statistical summarizations on each time-series. Max, min, mean, std, skew, kurtosis. Unlikely to be better than something tailored to the patterns you see, but sometimes performs OK.


Focus first on uni-variate features per sensor. The decision-tree will be good at combining these together.



Fitting to classifier



import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier

def get_sensor_data():

timesteps = 10
times = numpy.linspace(0.1, 1.0, timesteps)
df = pandas.DataFrame(
'time': times,
'sensor1': numpy.random.random(timesteps),
'sensor2': numpy.random.random(timesteps),
'sensor3': numpy.random.random(timesteps),
'sensor4': numpy.random.random(timesteps),
)

return df

samples = [ get_sensor_data() for _ in range(100) ]
labels = [ int(numpy.random.random() > 0.5) for _ in range(100) ]
assert len(samples) == len(labels)

print('sample from CSV file:n', samples[0], 'nlabel', labels[0], 'n')

def to_features(data):
# remove time column
feature_columns = list(set(data.columns) - set(['time']))
# TODO: do smarter feature engineering here
sensor_values = data[feature_columns].values
# Note: the features must be 1D for scikit-learn classifiers
features = sensor_values.flatten()
assert len(features.shape) == 1, features.shape
return features

features = numpy.stack([ to_features(d) for d in samples ])

assert features.shape[0] == len(samples)
print('Features:', features.shape, 'n', features[0])

# XXX: do train/test splits etc
est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(features, labels)


Example output



sample from CSV file:
time sensor1 sensor2 sensor3 sensor4
0 0.1 0.820667 0.346542 0.625512 0.774050
1 0.2 0.821934 0.241652 0.485608 0.188131
2 0.3 0.264697 0.780841 0.137018 0.117096
3 0.4 0.464143 0.457126 0.972894 0.600710
4 0.5 0.530302 0.027401 0.876191 0.563788
5 0.6 0.598231 0.291814 0.588032 0.143753
6 0.7 0.627435 0.036549 0.276131 0.311099
7 0.8 0.527908 0.197046 0.580293 0.123796
8 0.9 0.068682 0.880533 0.956394 0.787993
9 1.0 0.244478 0.306716 0.586049 0.373013
label 1

Features: (100, 40)
[0.82066682 0.62551234 0.77405 0.34654243 0.82193414 0.48560828
0.18813108 0.24165186 0.26469686 0.1370181 0.11709553 0.78084136
0.46414318 0.97289382 0.60070974 0.45712632 0.53030219 0.8761905
0.5637877 0.02740072 0.59823073 0.58803188 0.14375282 0.29181434
0.62743516 0.27613083 0.31109894 0.03654882 0.52790773 0.58029298
0.1237963 0.19704597 0.06868206 0.95639405 0.78799333 0.88053276
0.24447754 0.5860489 0.37301339 0.30671624]

```





share|improve this answer











$endgroup$












  • $begingroup$
    thanks for your answer!
    $endgroup$
    – Kev Schl
    Apr 8 at 5:52










  • $begingroup$
    That's a similiar approach to the one i had with using pandas Series. I've converted my DataFrames in 1D too. Is the decision tree able to differate the features tho? As I have mentioned in my question, that's a problem I'm not able to solve right now - "...but the decision tree couldn't differate the features/sensors..."
    $endgroup$
    – Kev Schl
    Apr 8 at 5:55










  • $begingroup$
    Whether a decision tree can predict what you want is not a question we can answer without more details. Open a new question with: 1) description of the sensors and the target fenomen 2) plot of the sensor data from positive class and from negative class (if a human cannot tell difference then machines usually cannot either) 3) plots of the test and training score.
    $endgroup$
    – jonnor
    Apr 8 at 9:43











  • $begingroup$
    PS, always use RandomForest instead of DecisionTree, it performs much better
    $endgroup$
    – jonnor
    Apr 8 at 9:44










  • $begingroup$
    RandomForest isn't that good for my sort of problem, because of its "black box" method. I want to understand and reproduce the decisions of the decision tree. My question was and still is how to classify multivariate time series on the basis of the sensors.
    $endgroup$
    – Kev Schl
    Apr 8 at 10:37













0












0








0





$begingroup$

For usage you need to flatten the 2D raw sensor data into 1D features. Below code demonstrates the basics.



What kind of feature engineering to apply for best predictive effect depends entirely on the nature of your sensors and problem. There are no details about this in the question or data provided.



Feature Engineering



The overall process is:



  1. Look for patterns in the data (Exploratory Data Analysis)

  2. Attempt to create a new feature which describes this pattern

  3. Evaluate the new set of features using cross-validation

  4. Analyze the samples that your classifier got wrong (Error Analysis)

  5. Repeat from 1) until performance is good enough

Here are some things you should try:



  • Plot the raw sensor data from a few samples of the positive and negative class.


  • Plot the distributions (histogram) for each class of each raw sensor values across the entire dataset.


  • Try to standardize the data. For each time-series of sensor data, remove the mean and divide by the standard deviation for each sample.


  • Try some standard statistical summarizations on each time-series. Max, min, mean, std, skew, kurtosis. Unlikely to be better than something tailored to the patterns you see, but sometimes performs OK.


Focus first on uni-variate features per sensor. The decision-tree will be good at combining these together.



Fitting to classifier



import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier

def get_sensor_data():

timesteps = 10
times = numpy.linspace(0.1, 1.0, timesteps)
df = pandas.DataFrame(
'time': times,
To use a scikit-learn classifier you need to flatten each sample's 2D raw sensor data into a 1D feature vector. The code below demonstrates the basics.

What kind of feature engineering will have the best predictive effect depends entirely on the nature of your sensors and problem; the question and data provided give no details about this.



Feature Engineering



The overall process is:



  1. Look for patterns in the data (Exploratory Data Analysis)

  2. Attempt to create a new feature which describes this pattern

  3. Evaluate the new set of features using cross-validation

  4. Analyze the samples that your classifier got wrong (Error Analysis)

  5. Repeat from 1) until performance is good enough
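Step 3 above can be done with scikit-learn's `cross_val_score`. A minimal sketch, where `X` and `y` are random stand-ins for your engineered feature matrix and labels:

```python
import numpy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Random stand-ins for the engineered feature matrix and labels
X = numpy.random.random((100, 40))
y = numpy.random.randint(0, 2, size=100)

est = RandomForestClassifier(n_estimators=10)
# 5-fold cross-validated accuracy; compare this as you add/remove features
scores = cross_val_score(est, X, y, cv=5)
print('mean CV accuracy: %.3f' % scores.mean())
```

With random labels the score should hover around 0.5; a new feature set is worth keeping when it moves this estimate up consistently.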

Here are some things you should try:



  • Plot the raw sensor data from a few samples of the positive and negative class.


  • Plot the distributions (histogram) for each class of each raw sensor values across the entire dataset.


  • Try to standardize the data. For each time-series of sensor data, remove the mean and divide by the standard deviation for each sample.


  • Try some standard statistical summarizations on each time-series. Max, min, mean, std, skew, kurtosis. Unlikely to be better than something tailored to the patterns you see, but sometimes performs OK.


Focus first on univariate features per sensor; the decision tree will be good at combining them.
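The standardization and summary-statistics ideas above can be sketched like this (the helper names `summarize_sensor` and `to_summary_features` are hypothetical, not from the original code):

```python
import numpy
import pandas

def summarize_sensor(series):
    # Standardize the time-series, then compute per-sensor summary statistics
    std = series.std()
    z = (series - series.mean()) / std if std > 0 else series * 0.0
    return [series.min(), series.max(), series.mean(), std,
            series.skew(), series.kurtosis(), z.abs().max()]

def to_summary_features(data):
    # Preserve column order so feature positions stay stable between samples
    feature_columns = [c for c in data.columns if c != 'time']
    features = []
    for col in feature_columns:
        features.extend(summarize_sensor(data[col]))
    return numpy.array(features)

df = pandas.DataFrame({
    'time': numpy.linspace(0.1, 1.0, 10),
    'sensor1': numpy.random.random(10),
    'sensor2': numpy.random.random(10),
})
print(to_summary_features(df).shape)  # 2 sensors x 7 statistics = (14,)
```

Note the list comprehension keeps the DataFrame's column order, unlike set subtraction, so each feature index always refers to the same sensor statistic across samples.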



Fitting a classifier



import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier

def get_sensor_data():
    timesteps = 10
    times = numpy.linspace(0.1, 1.0, timesteps)
    df = pandas.DataFrame({
        'time': times,
        'sensor1': numpy.random.random(timesteps),
        'sensor2': numpy.random.random(timesteps),
        'sensor3': numpy.random.random(timesteps),
        'sensor4': numpy.random.random(timesteps),
    })
    return df

samples = [ get_sensor_data() for _ in range(100) ]
labels = [ int(numpy.random.random() > 0.5) for _ in range(100) ]
assert len(samples) == len(labels)

print('sample from CSV file:\n', samples[0], '\nlabel', labels[0], '\n')

def to_features(data):
    # remove time column
    feature_columns = list(set(data.columns) - set(['time']))
    # TODO: do smarter feature engineering here
    sensor_values = data[feature_columns].values
    # Note: each sample's features must be 1D for scikit-learn classifiers
    features = sensor_values.flatten()
    assert len(features.shape) == 1, features.shape
    return features

features = numpy.stack([ to_features(d) for d in samples ])

assert features.shape[0] == len(samples)
print('Features:', features.shape, '\n', features[0])

# XXX: do train/test splits etc
est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(features, labels)


Example output



sample from CSV file:
time sensor1 sensor2 sensor3 sensor4
0 0.1 0.820667 0.346542 0.625512 0.774050
1 0.2 0.821934 0.241652 0.485608 0.188131
2 0.3 0.264697 0.780841 0.137018 0.117096
3 0.4 0.464143 0.457126 0.972894 0.600710
4 0.5 0.530302 0.027401 0.876191 0.563788
5 0.6 0.598231 0.291814 0.588032 0.143753
6 0.7 0.627435 0.036549 0.276131 0.311099
7 0.8 0.527908 0.197046 0.580293 0.123796
8 0.9 0.068682 0.880533 0.956394 0.787993
9 1.0 0.244478 0.306716 0.586049 0.373013
label 1

Features: (100, 40)
[0.82066682 0.62551234 0.77405 0.34654243 0.82193414 0.48560828
0.18813108 0.24165186 0.26469686 0.1370181 0.11709553 0.78084136
0.46414318 0.97289382 0.60070974 0.45712632 0.53030219 0.8761905
0.5637877 0.02740072 0.59823073 0.58803188 0.14375282 0.29181434
0.62743516 0.27613083 0.31109894 0.03654882 0.52790773 0.58029298
0.1237963 0.19704597 0.06868206 0.95639405 0.78799333 0.88053276
0.24447754 0.5860489 0.37301339 0.30671624]

edited Apr 8 at 12:23

answered Apr 7 at 20:25 by jonnor
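The `# XXX: do train/test splits etc` comment in the code is worth acting on: hold out a test set before fitting. A minimal sketch, with random stand-in data of the same shape as the features built above:

```python
import numpy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Random stand-ins with the same shape as the flattened features above
features = numpy.random.random((100, 40))
labels = numpy.random.randint(0, 2, size=100)

# Hold out 25% of samples; stratify keeps the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=1, stratify=labels)

est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(X_train, y_train)
print('held-out accuracy: %.3f' % est.score(X_test, y_test))
```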
  • $begingroup$
    thanks for your answer!
    $endgroup$
    – Kev Schl
    Apr 8 at 5:52










  • $begingroup$
That's a similar approach to the one I had using pandas Series. I've converted my DataFrames to 1D too. Is the decision tree able to differentiate the features though? As I have mentioned in my question, that's a problem I'm not able to solve right now - "...but the decision tree couldn't differentiate the features/sensors..."
    $endgroup$
    – Kev Schl
    Apr 8 at 5:55










  • $begingroup$
Whether a decision tree can predict what you want is not a question we can answer without more details. Open a new question with: 1) a description of the sensors and the target phenomenon 2) plots of the sensor data from the positive class and from the negative class (if a human cannot tell the difference, then machines usually cannot either) 3) plots of the test and training scores.
    $endgroup$
    – jonnor
    Apr 8 at 9:43











  • $begingroup$
PS: always use RandomForest instead of DecisionTree; it performs much better.
    $endgroup$
    – jonnor
    Apr 8 at 9:44










  • $begingroup$
    RandomForest isn't that good for my sort of problem, because of its "black box" method. I want to understand and reproduce the decisions of the decision tree. My question was and still is how to classify multivariate time series on the basis of the sensors.
    $endgroup$
    – Kev Schl
    Apr 8 at 10:37
















