Classify sensor data (multivariate time series) with Python's scikit-learn decision tree
I'm trying to apply scikit-learn's decision tree to the following dataset, with the goal of classifying the data:

Sensor data:
- multiple .csv files
- every .csv file contains multiple sensors (see here)
- each .csv file has one label (0 or 1)

So far I've tried to train my model with a pandas Series. It worked, but the decision tree couldn't differentiate between the features/sensors. Is a pandas Series the right approach for analysing this kind of data, or does anyone have a better solution for this problem?

machine-learning python scikit-learn decision-trees
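The data layout described can be sketched in code. This is a minimal illustration only: the CSV content, the file paths mentioned in the comment, and the source of the per-file label are all assumptions, since none of them are given in the question.

```python
import io
import pandas as pd

# Hypothetical content of one sensor .csv file: a time column plus
# several sensor columns, one file per labelled recording.
csv_text = """time,sensor1,sensor2
0.1,0.52,0.31
0.2,0.61,0.28
0.3,0.47,0.35
"""

# One DataFrame per file; with real data this would be something like
#   samples = [pd.read_csv(p) for p in sorted(glob.glob('data/*.csv'))]
# with labels taken from filenames or a separate index file (assumed).
sample = pd.read_csv(io.StringIO(csv_text))
label = 0  # assumed: one label (0 or 1) per file

print(sample.shape)  # (3, 3): 3 timesteps, time column + 2 sensors
```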
Try to include the part of your code related to the question, giving the community a chance to help you. – Tasos, Apr 3 at 16:15

Hi, I think code isn't that relevant at this point. My question is more general, like "how to handle this kind of data". – Kev Schl, Apr 3 at 16:28

You need to show the code so we can show where things went wrong, or possibly provide updated code to illustrate a better approach. – jonnor, Apr 6 at 11:46

"Train model with pandas Series" does not make any sense; there is no training functionality in pandas. – jonnor, Apr 6 at 11:47

Please also provide a CSV with your example data. It is much easier to read, and to show an example from. – jonnor, Apr 6 at 11:49
asked Apr 3 at 15:41 by Kev Schl
1 Answer
To use a scikit-learn classifier, you need to flatten the 2D raw sensor data into 1D feature vectors. The code below demonstrates the basics.

What kind of feature engineering to apply for the best predictive effect depends entirely on the nature of your sensors and your problem; no details about this are given in the question or the data provided.

Feature Engineering

The overall process is:

1. Look for patterns in the data (exploratory data analysis)
2. Attempt to create a new feature that describes such a pattern
3. Evaluate the new set of features using cross-validation
4. Analyse the samples that your classifier got wrong (error analysis)
5. Repeat from 1) until performance is good enough
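The evaluation step of this loop can be sketched with scikit-learn's cross-validation utilities. Here the feature matrix and labels are random stand-ins (assumptions, since no real data is provided), so the score only illustrates the mechanics:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data: 100 samples with 40 features each (random, for illustration)
rng = np.random.default_rng(0)
X = rng.random((100, 40))
y = rng.integers(0, 2, size=100)

est = RandomForestClassifier(n_estimators=10, random_state=0)
scores = cross_val_score(est, X, y, cv=5)  # 5-fold accuracy, one score per fold
print(scores.mean())
```

On random labels like these the mean score hovers around chance level; with real features it becomes the yardstick for step 3.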
Here are some things you should try:

- Plot the raw sensor data from a few samples of the positive and the negative class.
- Plot the distribution (histogram) of each raw sensor's values for each class, across the entire dataset.
- Try standardizing the data: for each time series of sensor data, remove the mean and divide by the standard deviation, per sample.
- Try some standard statistical summarizations of each time series: max, min, mean, std, skew, kurtosis. These are unlikely to beat features tailored to the patterns you see, but they sometimes perform OK.
- Focus first on univariate features per sensor; the decision tree will be good at combining them.
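The statistical summarization suggestion can be sketched with pandas aggregation. The column names are hypothetical, and a `time` column is assumed and dropped, matching the data layout in the question:

```python
import numpy as np
import pandas as pd

def summary_features(df):
    # Per-sensor summary statistics -- max, min, mean, std, skew, kurtosis --
    # flattened into one 1D feature vector per recording
    sensors = df.drop(columns=['time'])
    summary = sensors.agg(['max', 'min', 'mean', 'std', 'skew', 'kurt'])
    return summary.to_numpy().flatten()

# Stand-in recording: 10 timesteps of 2 hypothetical sensors
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'time': np.linspace(0.1, 1.0, 10),
    'sensor1': rng.random(10),
    'sensor2': rng.random(10),
})
features = summary_features(df)
print(features.shape)  # (12,): 2 sensors x 6 statistics
```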
Fitting a classifier
```python
import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier

def get_sensor_data():
    # Stand-in for reading one .csv file: a time column plus four sensors
    timesteps = 10
    times = numpy.linspace(0.1, 1.0, timesteps)
    df = pandas.DataFrame({
        'time': times,
        'sensor1': numpy.random.random(timesteps),
        'sensor2': numpy.random.random(timesteps),
        'sensor3': numpy.random.random(timesteps),
        'sensor4': numpy.random.random(timesteps),
    })
    return df

samples = [ get_sensor_data() for _ in range(100) ]
labels = [ int(numpy.random.random() > 0.5) for _ in range(100) ]
assert len(samples) == len(labels)

print('sample from CSV file:\n', samples[0], '\nlabel', labels[0], '\n')

def to_features(data):
    # remove time column
    feature_columns = list(set(data.columns) - set(['time']))
    # TODO: do smarter feature engineering here
    sensor_values = data[feature_columns].values
    # Note: the features must be 1D for scikit-learn classifiers
    features = sensor_values.flatten()
    assert len(features.shape) == 1, features.shape
    return features

features = numpy.stack([ to_features(d) for d in samples ])
assert features.shape[0] == len(samples)
print('Features:', features.shape, '\n', features[0])

# XXX: do train/test splits etc
est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(features, labels)
```
Example output
```
sample from CSV file:
   time   sensor1   sensor2   sensor3   sensor4
0   0.1  0.820667  0.346542  0.625512  0.774050
1   0.2  0.821934  0.241652  0.485608  0.188131
2   0.3  0.264697  0.780841  0.137018  0.117096
3   0.4  0.464143  0.457126  0.972894  0.600710
4   0.5  0.530302  0.027401  0.876191  0.563788
5   0.6  0.598231  0.291814  0.588032  0.143753
6   0.7  0.627435  0.036549  0.276131  0.311099
7   0.8  0.527908  0.197046  0.580293  0.123796
8   0.9  0.068682  0.880533  0.956394  0.787993
9   1.0  0.244478  0.306716  0.586049  0.373013
label 1

Features: (100, 40)
[0.82066682 0.62551234 0.77405    0.34654243 0.82193414 0.48560828
 0.18813108 0.24165186 0.26469686 0.1370181  0.11709553 0.78084136
 0.46414318 0.97289382 0.60070974 0.45712632 0.53030219 0.8761905
 0.5637877  0.02740072 0.59823073 0.58803188 0.14375282 0.29181434
 0.62743516 0.27613083 0.31109894 0.03654882 0.52790773 0.58029298
 0.1237963  0.19704597 0.06868206 0.95639405 0.78799333 0.88053276
 0.24447754 0.5860489  0.37301339 0.30671624]
```
Thanks for your answer! – Kev Schl, Apr 8 at 5:52

That's a similar approach to the one I had using a pandas Series; I've converted my DataFrames to 1D too. Is the decision tree able to differentiate between the features, though? As I mentioned in my question, that's the problem I'm not able to solve right now: "...but the decision tree couldn't differentiate the features/sensors...". – Kev Schl, Apr 8 at 5:55

Whether a decision tree can predict what you want is not a question we can answer without more details. Open a new question with: 1) a description of the sensors and the target phenomenon, 2) plots of the sensor data from the positive class and from the negative class (if a human cannot tell the difference, then machines usually cannot either), and 3) plots of the test and training scores. – jonnor, Apr 8 at 9:43

PS: always use RandomForest instead of DecisionTree; it performs much better. – jonnor, Apr 8 at 9:44

RandomForest isn't that good for my sort of problem, because of its "black box" nature. I want to understand and reproduce the decisions of the decision tree. My question was, and still is, how to classify multivariate time series on the basis of the sensors. – Kev Schl, Apr 8 at 10:37
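Regarding the interpretability concern raised in this thread: a single decision tree's learned decisions can be printed as human-readable if/else rules. A sketch with scikit-learn's `export_text`; the data and the engineered feature names are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in engineered features; the label depends only on the first column,
# so the tree will split on it
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] > 0.5).astype(int)

names = ['sensor1_mean', 'sensor1_std', 'sensor2_mean', 'sensor2_std']  # hypothetical
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=names))  # human-readable decision rules
```

With named summary features (rather than raw flattened timesteps), the printed rules map directly back to which sensor drove each decision.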
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
For usage you need to flatten the 2D raw sensor data into 1D features. Below code demonstrates the basics.
What kind of feature engineering to apply for best predictive effect depends entirely on the nature of your sensors and problem. There are no details about this in the question or data provided.
Feature Engineering
The overall process is:
- Look for patterns in the data (Exploratory Data Analysis)
- Attempt to create a new feature which describes this pattern
- Evaluate the new set of features using cross-validation
- Analyze the samples that your classifier got wrong (Error Analysis)
- Repeat from 1) until performance is good enough
Here are some things you should try:
Plot the raw sensor data from a few samples of the positive and negative class.
Plot the distributions (histogram) for each class of each raw sensor values across the entire dataset.
Try to standardize the data. For each time-series of sensor data, remove the mean and divide by the standard deviation for each sample.
Try some standard statistical summarizations on each time-series. Max, min, mean, std, skew, kurtosis. Unlikely to be better than something tailored to the patterns you see, but sometimes performs OK.
Focus first on uni-variate features per sensor. The decision-tree will be good at combining these together.
Fitting to classifier
import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier
def get_sensor_data():
timesteps = 10
times = numpy.linspace(0.1, 1.0, timesteps)
df = pandas.DataFrame(
'time': times,
'sensor1': numpy.random.random(timesteps),
'sensor2': numpy.random.random(timesteps),
'sensor3': numpy.random.random(timesteps),
'sensor4': numpy.random.random(timesteps),
)
return df
samples = [ get_sensor_data() for _ in range(100) ]
labels = [ int(numpy.random.random() > 0.5) for _ in range(100) ]
assert len(samples) == len(labels)
print('sample from CSV file:n', samples[0], 'nlabel', labels[0], 'n')
def to_features(data):
# remove time column
feature_columns = list(set(data.columns) - set(['time']))
# TODO: do smarter feature engineering here
sensor_values = data[feature_columns].values
# Note: the features must be 1D for scikit-learn classifiers
features = sensor_values.flatten()
assert len(features.shape) == 1, features.shape
return features
features = numpy.stack([ to_features(d) for d in samples ])
assert features.shape[0] == len(samples)
print('Features:', features.shape, 'n', features[0])
# XXX: do train/test splits etc
est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(features, labels)
Example output
sample from CSV file:
time sensor1 sensor2 sensor3 sensor4
0 0.1 0.820667 0.346542 0.625512 0.774050
1 0.2 0.821934 0.241652 0.485608 0.188131
2 0.3 0.264697 0.780841 0.137018 0.117096
3 0.4 0.464143 0.457126 0.972894 0.600710
4 0.5 0.530302 0.027401 0.876191 0.563788
5 0.6 0.598231 0.291814 0.588032 0.143753
6 0.7 0.627435 0.036549 0.276131 0.311099
7 0.8 0.527908 0.197046 0.580293 0.123796
8 0.9 0.068682 0.880533 0.956394 0.787993
9 1.0 0.244478 0.306716 0.586049 0.373013
label 1
Features: (100, 40)
[0.82066682 0.62551234 0.77405 0.34654243 0.82193414 0.48560828
0.18813108 0.24165186 0.26469686 0.1370181 0.11709553 0.78084136
0.46414318 0.97289382 0.60070974 0.45712632 0.53030219 0.8761905
0.5637877 0.02740072 0.59823073 0.58803188 0.14375282 0.29181434
0.62743516 0.27613083 0.31109894 0.03654882 0.52790773 0.58029298
0.1237963 0.19704597 0.06868206 0.95639405 0.78799333 0.88053276
0.24447754 0.5860489 0.37301339 0.30671624]
```
$endgroup$
$begingroup$
thanks for your answer!
$endgroup$
– Kev Schl
Apr 8 at 5:52
$begingroup$
That's a similiar approach to the one i had with using pandas Series. I've converted my DataFrames in 1D too. Is the decision tree able to differate the features tho? As I have mentioned in my question, that's a problem I'm not able to solve right now - "...but the decision tree couldn't differate the features/sensors..."
$endgroup$
– Kev Schl
Apr 8 at 5:55
$begingroup$
Whether a decision tree can predict what you want is not a question we can answer without more details. Open a new question with: 1) description of the sensors and the target fenomen 2) plot of the sensor data from positive class and from negative class (if a human cannot tell difference then machines usually cannot either) 3) plots of the test and training score.
$endgroup$
– jonnor
Apr 8 at 9:43
$begingroup$
PS, always use RandomForest instead of DecisionTree, it performs much better
$endgroup$
– jonnor
Apr 8 at 9:44
$begingroup$
RandomForest isn't that good for my sort of problem, because of its "black box" method. I want to understand and reproduce the decisions of the decision tree. My question was and still is how to classify multivariate time series on the basis of the sensors.
$endgroup$
– Kev Schl
Apr 8 at 10:37
|
show 2 more comments
$begingroup$
For usage you need to flatten the 2D raw sensor data into 1D features. Below code demonstrates the basics.
What kind of feature engineering to apply for best predictive effect depends entirely on the nature of your sensors and problem. There are no details about this in the question or data provided.
Feature Engineering
The overall process is:
- Look for patterns in the data (Exploratory Data Analysis)
- Attempt to create a new feature which describes this pattern
- Evaluate the new set of features using cross-validation
- Analyze the samples that your classifier got wrong (Error Analysis)
- Repeat from 1) until performance is good enough
Here are some things you should try:
Plot the raw sensor data from a few samples of the positive and negative class.
Plot the distributions (histogram) for each class of each raw sensor values across the entire dataset.
Try to standardize the data. For each time-series of sensor data, remove the mean and divide by the standard deviation for each sample.
Try some standard statistical summarizations on each time-series. Max, min, mean, std, skew, kurtosis. Unlikely to be better than something tailored to the patterns you see, but sometimes performs OK.
Focus first on uni-variate features per sensor. The decision-tree will be good at combining these together.
Fitting to classifier
import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier
def get_sensor_data():
timesteps = 10
times = numpy.linspace(0.1, 1.0, timesteps)
df = pandas.DataFrame(
'time': times,
'sensor1': numpy.random.random(timesteps),
'sensor2': numpy.random.random(timesteps),
'sensor3': numpy.random.random(timesteps),
'sensor4': numpy.random.random(timesteps),
)
return df
samples = [ get_sensor_data() for _ in range(100) ]
labels = [ int(numpy.random.random() > 0.5) for _ in range(100) ]
assert len(samples) == len(labels)
print('sample from CSV file:n', samples[0], 'nlabel', labels[0], 'n')
def to_features(data):
# remove time column
feature_columns = list(set(data.columns) - set(['time']))
# TODO: do smarter feature engineering here
sensor_values = data[feature_columns].values
# Note: the features must be 1D for scikit-learn classifiers
features = sensor_values.flatten()
assert len(features.shape) == 1, features.shape
return features
features = numpy.stack([ to_features(d) for d in samples ])
assert features.shape[0] == len(samples)
print('Features:', features.shape, 'n', features[0])
# XXX: do train/test splits etc
est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(features, labels)
Example output
sample from CSV file:
time sensor1 sensor2 sensor3 sensor4
0 0.1 0.820667 0.346542 0.625512 0.774050
1 0.2 0.821934 0.241652 0.485608 0.188131
2 0.3 0.264697 0.780841 0.137018 0.117096
3 0.4 0.464143 0.457126 0.972894 0.600710
4 0.5 0.530302 0.027401 0.876191 0.563788
5 0.6 0.598231 0.291814 0.588032 0.143753
6 0.7 0.627435 0.036549 0.276131 0.311099
7 0.8 0.527908 0.197046 0.580293 0.123796
8 0.9 0.068682 0.880533 0.956394 0.787993
9 1.0 0.244478 0.306716 0.586049 0.373013
label 1
Features: (100, 40)
[0.82066682 0.62551234 0.77405 0.34654243 0.82193414 0.48560828
0.18813108 0.24165186 0.26469686 0.1370181 0.11709553 0.78084136
0.46414318 0.97289382 0.60070974 0.45712632 0.53030219 0.8761905
0.5637877 0.02740072 0.59823073 0.58803188 0.14375282 0.29181434
0.62743516 0.27613083 0.31109894 0.03654882 0.52790773 0.58029298
0.1237963 0.19704597 0.06868206 0.95639405 0.78799333 0.88053276
0.24447754 0.5860489 0.37301339 0.30671624]
```
$endgroup$
$begingroup$
thanks for your answer!
$endgroup$
– Kev Schl
Apr 8 at 5:52
$begingroup$
That's a similiar approach to the one i had with using pandas Series. I've converted my DataFrames in 1D too. Is the decision tree able to differate the features tho? As I have mentioned in my question, that's a problem I'm not able to solve right now - "...but the decision tree couldn't differate the features/sensors..."
$endgroup$
– Kev Schl
Apr 8 at 5:55
$begingroup$
Whether a decision tree can predict what you want is not a question we can answer without more details. Open a new question with: 1) description of the sensors and the target fenomen 2) plot of the sensor data from positive class and from negative class (if a human cannot tell difference then machines usually cannot either) 3) plots of the test and training score.
$endgroup$
– jonnor
Apr 8 at 9:43
$begingroup$
PS, always use RandomForest instead of DecisionTree, it performs much better
$endgroup$
– jonnor
Apr 8 at 9:44
$begingroup$
RandomForest isn't that good for my sort of problem, because of its "black box" method. I want to understand and reproduce the decisions of the decision tree. My question was and still is how to classify multivariate time series on the basis of the sensors.
$endgroup$
– Kev Schl
Apr 8 at 10:37
|
show 2 more comments
$begingroup$
For usage you need to flatten the 2D raw sensor data into 1D features. Below code demonstrates the basics.
What kind of feature engineering to apply for best predictive effect depends entirely on the nature of your sensors and problem. There are no details about this in the question or data provided.
Feature Engineering
The overall process is:
- Look for patterns in the data (Exploratory Data Analysis)
- Attempt to create a new feature which describes this pattern
- Evaluate the new set of features using cross-validation
- Analyze the samples that your classifier got wrong (Error Analysis)
- Repeat from 1) until performance is good enough
Here are some things you should try:
Plot the raw sensor data from a few samples of the positive and negative class.
Plot the distributions (histogram) for each class of each raw sensor values across the entire dataset.
Try to standardize the data. For each time-series of sensor data, remove the mean and divide by the standard deviation for each sample.
Try some standard statistical summarizations on each time-series. Max, min, mean, std, skew, kurtosis. Unlikely to be better than something tailored to the patterns you see, but sometimes performs OK.
Focus first on uni-variate features per sensor. The decision-tree will be good at combining these together.
Fitting to classifier
import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier
def get_sensor_data():
timesteps = 10
times = numpy.linspace(0.1, 1.0, timesteps)
df = pandas.DataFrame(
'time': times,
'sensor1': numpy.random.random(timesteps),
'sensor2': numpy.random.random(timesteps),
'sensor3': numpy.random.random(timesteps),
'sensor4': numpy.random.random(timesteps),
)
return df
samples = [ get_sensor_data() for _ in range(100) ]
labels = [ int(numpy.random.random() > 0.5) for _ in range(100) ]
assert len(samples) == len(labels)
print('sample from CSV file:n', samples[0], 'nlabel', labels[0], 'n')
def to_features(data):
# remove time column
feature_columns = list(set(data.columns) - set(['time']))
# TODO: do smarter feature engineering here
sensor_values = data[feature_columns].values
# Note: the features must be 1D for scikit-learn classifiers
features = sensor_values.flatten()
assert len(features.shape) == 1, features.shape
return features
features = numpy.stack([ to_features(d) for d in samples ])
assert features.shape[0] == len(samples)
print('Features:', features.shape, 'n', features[0])
# XXX: do train/test splits etc
est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(features, labels)
Example output
sample from CSV file:
time sensor1 sensor2 sensor3 sensor4
0 0.1 0.820667 0.346542 0.625512 0.774050
1 0.2 0.821934 0.241652 0.485608 0.188131
2 0.3 0.264697 0.780841 0.137018 0.117096
3 0.4 0.464143 0.457126 0.972894 0.600710
4 0.5 0.530302 0.027401 0.876191 0.563788
5 0.6 0.598231 0.291814 0.588032 0.143753
6 0.7 0.627435 0.036549 0.276131 0.311099
7 0.8 0.527908 0.197046 0.580293 0.123796
8 0.9 0.068682 0.880533 0.956394 0.787993
9 1.0 0.244478 0.306716 0.586049 0.373013
label 1
Features: (100, 40)
[0.82066682 0.62551234 0.77405 0.34654243 0.82193414 0.48560828
0.18813108 0.24165186 0.26469686 0.1370181 0.11709553 0.78084136
0.46414318 0.97289382 0.60070974 0.45712632 0.53030219 0.8761905
0.5637877 0.02740072 0.59823073 0.58803188 0.14375282 0.29181434
0.62743516 0.27613083 0.31109894 0.03654882 0.52790773 0.58029298
0.1237963 0.19704597 0.06868206 0.95639405 0.78799333 0.88053276
0.24447754 0.5860489 0.37301339 0.30671624]
```
$endgroup$
For usage you need to flatten the 2D raw sensor data into 1D features. Below code demonstrates the basics.
What kind of feature engineering to apply for best predictive effect depends entirely on the nature of your sensors and problem. There are no details about this in the question or data provided.
Feature Engineering
The overall process is:
- Look for patterns in the data (Exploratory Data Analysis)
- Attempt to create a new feature which describes this pattern
- Evaluate the new set of features using cross-validation
- Analyze the samples that your classifier got wrong (Error Analysis)
- Repeat from 1) until performance is good enough
Here are some things you should try:
Plot the raw sensor data from a few samples of the positive and negative class.
Plot the distributions (histogram) for each class of each raw sensor values across the entire dataset.
Try to standardize the data. For each time-series of sensor data, remove the mean and divide by the standard deviation for each sample.
Try some standard statistical summarizations on each time-series. Max, min, mean, std, skew, kurtosis. Unlikely to be better than something tailored to the patterns you see, but sometimes performs OK.
Focus first on uni-variate features per sensor. The decision-tree will be good at combining these together.
Fitting to classifier
import numpy
import pandas
from sklearn.ensemble import RandomForestClassifier
def get_sensor_data():
timesteps = 10
times = numpy.linspace(0.1, 1.0, timesteps)
df = pandas.DataFrame(
'time': times,
'sensor1': numpy.random.random(timesteps),
'sensor2': numpy.random.random(timesteps),
'sensor3': numpy.random.random(timesteps),
'sensor4': numpy.random.random(timesteps),
)
return df
samples = [ get_sensor_data() for _ in range(100) ]
labels = [ int(numpy.random.random() > 0.5) for _ in range(100) ]
assert len(samples) == len(labels)
print('sample from CSV file:n', samples[0], 'nlabel', labels[0], 'n')
def to_features(data):
# remove time column
feature_columns = list(set(data.columns) - set(['time']))
# TODO: do smarter feature engineering here
sensor_values = data[feature_columns].values
# Note: the features must be 1D for scikit-learn classifiers
features = sensor_values.flatten()
assert len(features.shape) == 1, features.shape
return features
features = numpy.stack([ to_features(d) for d in samples ])
assert features.shape[0] == len(samples)
print('Features:', features.shape, 'n', features[0])
# XXX: do train/test splits etc
est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01)
est.fit(features, labels)
Example output
sample from CSV file:
time sensor1 sensor2 sensor3 sensor4
0 0.1 0.820667 0.346542 0.625512 0.774050
1 0.2 0.821934 0.241652 0.485608 0.188131
2 0.3 0.264697 0.780841 0.137018 0.117096
3 0.4 0.464143 0.457126 0.972894 0.600710
4 0.5 0.530302 0.027401 0.876191 0.563788
5 0.6 0.598231 0.291814 0.588032 0.143753
6 0.7 0.627435 0.036549 0.276131 0.311099
7 0.8 0.527908 0.197046 0.580293 0.123796
8 0.9 0.068682 0.880533 0.956394 0.787993
9 1.0 0.244478 0.306716 0.586049 0.373013
label 1
Features: (100, 40)
[0.82066682 0.62551234 0.77405 0.34654243 0.82193414 0.48560828
0.18813108 0.24165186 0.26469686 0.1370181 0.11709553 0.78084136
0.46414318 0.97289382 0.60070974 0.45712632 0.53030219 0.8761905
0.5637877 0.02740072 0.59823073 0.58803188 0.14375282 0.29181434
0.62743516 0.27613083 0.31109894 0.03654882 0.52790773 0.58029298
0.1237963 0.19704597 0.06868206 0.95639405 0.78799333 0.88053276
0.24447754 0.5860489 0.37301339 0.30671624]
```
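The `# XXX: do train/test splits` note above can be filled in with a standard holdout split. A sketch, using random stand-in data in place of the feature matrix built above:

```python
import numpy
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Random features and balanced labels standing in for the real data
rng = numpy.random.RandomState(1)
features = rng.random_sample((100, 40))
labels = numpy.array([0, 1] * 50)

# Hold out 30% of samples for evaluation, preserving class balance
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=1, stratify=labels)

est = RandomForestClassifier(n_estimators=10, min_samples_leaf=0.01, random_state=1)
est.fit(X_train, y_train)
print('train accuracy:', est.score(X_train, y_train))
print('test accuracy:', est.score(X_test, y_test))  # near chance here: features are pure noise
```

Comparing train and test accuracy like this is also the quickest way to spot the overfitting that a large gap between the two scores indicates.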
edited Apr 8 at 12:23
answered Apr 7 at 20:25 by jonnor
Thanks for your answer!
– Kev Schl, Apr 8 at 5:52

That's a similar approach to the one I had using pandas Series; I've converted my DataFrames to 1D too. But is the decision tree able to differentiate the features? As I mentioned in my question, that's a problem I'm not able to solve right now: "...but the decision tree couldn't differentiate the features/sensors..."
– Kev Schl, Apr 8 at 5:55

Whether a decision tree can predict what you want is not a question we can answer without more details. Open a new question with: 1) a description of the sensors and the target phenomenon, 2) plots of the sensor data from the positive and the negative class (if a human cannot tell the difference, then machines usually cannot either), 3) plots of the test and training scores.
– jonnor, Apr 8 at 9:43

PS: always use RandomForest instead of DecisionTree; it performs much better.
– jonnor, Apr 8 at 9:44

RandomForest isn't that good for my sort of problem because of its "black box" nature: I want to understand and reproduce the decisions of the decision tree. My question was and still is how to classify multivariate time series on the basis of the sensors.
– Kev Schl, Apr 8 at 10:37
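On the interpretability point raised in the comments: each member of a fitted scikit-learn `RandomForestClassifier`'s `estimators_` list is an ordinary `DecisionTreeClassifier`, so individual trees can still be printed and inspected. A sketch with synthetic stand-in data (the feature names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic stand-in data with 4 "sensor" features
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
est = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
est.fit(X, y)

# Each tree in the forest is a plain DecisionTreeClassifier,
# so its decision rules can be exported as text and read
rules = export_text(est.estimators_[0],
                    feature_names=['sensor1', 'sensor2', 'sensor3', 'sensor4'])
print(rules)
```

This does not make the whole ensemble transparent, but it lets you verify which features and thresholds individual trees rely on.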
Try to include the part of your code related to the question, giving the community a chance to help you.
– Tasos, Apr 3 at 16:15

Hi, I think code isn't that relevant at this point. My question is more general, like "how to handle this kind of data".
– Kev Schl, Apr 3 at 16:28

You need to show the code so we can show where things went wrong, or possibly provide updated code to illustrate a better approach.
– jonnor, Apr 6 at 11:46

"Train model with pandas Series" does not make any sense; there is no training functionality in pandas.
– jonnor, Apr 6 at 11:47

Please also provide a CSV with your example data. It is much easier to read, and to show an example from.
– jonnor, Apr 6 at 11:49