
How can I detect anomalies/outliers in my online streaming data on a real-time basis?


Say I have an unbounded stream of data consisting of alternating sine waves and step pulses, one after the other. I want my model to parse the data sequence-wise or point-wise. The first time it has parsed a sine wave and starts seeing step pulses, it should raise an alert and flag them as outliers. As it goes on parsing, however, it must recognise the alternating sine waves and step pulses as the normal pattern. If it then sees something outside this trend, it must flag it as an outlier; but if that new pattern keeps repeating, it must come to treat it as normal as well. In other words, my model must "remember" what it saw in the past, to some extent, in order to predict what is "normal" in the near future, and detect anomalies in my constantly streaming data on that basis.

I've tried a conventional stateless LSTM to meet these requirements, but an LSTM, being trained in a supervised fashion, needs an initial training phase and always predicts based on that initial data. So if the pattern it learned during training changes in the test phase, it always treats the new pattern as an outlier, irrespective of how many times it repeats. Simply put, it fails to update itself over time.

I've gone through relevant papers on anomaly detection in online streaming data and found that HTM, implemented by Numenta and tested on the NAB benchmark, is the best solution in this respect, but I am looking for something open source and absolutely free to use.

As I'm a newbie in this field, pointers to any existing open-source implementation would be highly appreciated. I'd prefer not to write something from scratch, though I will if that turns out to be the only option.

asked Nov 10 '18 at 23:01

Goutam Bose

In the LSTM, isn't an anomaly the target? You need to have a dataset with anomalies marked, so the classifier can learn what to look for.
– Harsh
Nov 11 '18 at 3:05

Can you give some details on your input data, what it represents, or what problem domain it comes from? It could be that there are well-established approaches that can be used for your problem.
– jonnor
Dec 5 '18 at 2:49

My input data is live-streamed network packet statistics, such as throughput per second and number of connections established per second, captured from a client-server interaction. The only solution I've found so far is HTM by Numenta. However, NuPIC by Numenta isn't open source for production use, so I'm looking for alternative solutions.
– Goutam Bose
Dec 6 '18 at 3:05


deep-learning classification unsupervised-learning anomaly-detection stacked-lstm

1 Answer

There are two well-known algorithms for outlier detection: Isolation Forest and One-Class SVM. You will find implementations of both in scikit-learn.
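As a rough illustration of the two scikit-learn estimators mentioned above, here is a minimal batch-mode sketch. The synthetic data, the `contamination` and `nu` values, and the variable names are assumptions chosen for the example, not recommendations for the network data in the question.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Synthetic stand-in for "normal" observations (assumption: 2 numeric features).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# One typical point and one point far away from everything seen in training.
X_new = np.array([[0.1, -0.2], [10.0, 10.0]])

# Both estimators are fit on unlabeled data; predict() returns +1 (inlier) or -1 (outlier).
iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(X_train)
svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

iso_labels = iso.predict(X_new)
svm_labels = svm.predict(X_new)
print(iso_labels, svm_labels)
```

Note that both models here are fit once on a fixed batch, so on their own they do not adapt as the stream changes.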

Searching for "Anomaly Detection" on GitHub turns up publicly available entries to the NAB competition, e.g. nareshkumar66675/Numenta. This one has a Jupyter notebook which mainly uses scikit-learn and some custom, but simple, feature engineering. It may serve your purpose. Although the author has not included licensing information, it seems simple enough to re-implement.

However, as I understand it, the NAB datasets are more about "time series" detection, i.e. a value is an anomaly if it is very different from previous / recent values. That approach has no notion of patterns in the data, such as sine waves following step pulses, and does not involve learning larger patterns as the dataset grows.
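To make the "very different from recent values" idea concrete, here is a minimal point-by-point sketch using a rolling z-score. The class name, window size and threshold are hypothetical choices for illustration; this is not how any particular NAB entry works.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flags a value that lies far from the mean of the most recent values."""

    def __init__(self, window=50, threshold=3.0, min_points=10):
        self.values = deque(maxlen=window)  # only the recent past is remembered
        self.threshold = threshold
        self.min_points = min_points

    def update(self, x):
        """Return True if x looks anomalous against the window, then store x."""
        is_anomaly = False
        if len(self.values) >= self.min_points:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma == 0:
                is_anomaly = x != mu
            else:
                is_anomaly = abs(x - mu) / sigma > self.threshold
        self.values.append(x)
        return is_anomaly


detector = RollingZScoreDetector()
stream = [10.0, 10.5, 9.5] * 10 + [50.0] + [10.0] * 5
flags = [detector.update(x) for x in stream]
print([i for i, flagged in enumerate(flags) if flagged])
```

Because every value is added to the window, flagged or not, a new level that persists is eventually treated as normal. The detector still judges single values only and knows nothing about the shape of a sine wave or a step pulse.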

I'm not aware of algorithms that solve your specific problem, though they might well exist in the literature. The key issue is that you cannot tell whether a long sequence is an anomaly until you've seen enough data, and keeping track of every candidate pattern may suffer from combinatorial explosion.

The sines and pulses in your problem can be replaced with 0s and 1s, so your problem becomes one of detecting patterns in strings. Genomics is concerned with patterns in DNA, so that body of work may have what you need. (Note that this is very different from genetic algorithms.)
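Once the stream is reduced to symbols, one simple way to "remember" patterns is to count short symbol sequences (n-grams) and flag any n-gram that has not yet been seen often enough. The sketch below is an assumption-laden illustration of that idea, not an algorithm taken from the genomics literature; the class name, `n` and `min_count` are hypothetical.

```python
from collections import Counter, deque


class NGramNoveltyDetector:
    """Flags an n-gram of symbols until it has been seen min_count times."""

    def __init__(self, n=2, min_count=2):
        self.n = n
        self.min_count = min_count
        self.counts = Counter()        # how often each n-gram has occurred so far
        self.recent = deque(maxlen=n)  # the last n symbols

    def update(self, symbol):
        self.recent.append(symbol)
        if len(self.recent) < self.n:
            return False  # not enough context yet
        gram = tuple(self.recent)
        is_anomaly = self.counts[gram] < self.min_count
        self.counts[gram] += 1
        return is_anomaly


detector = NGramNoveltyDetector(n=2, min_count=2)
# "0" = sine wave, "1" = step pulse, "2" = some new kind of segment.
stream = list("01" * 6 + "2" + "01" * 2)
flags = [detector.update(s) for s in stream]
print(flags)
```

The alternating 0/1 pattern is flagged only until both transitions have been seen twice, the first "2" is flagged on arrival, and it would stop being flagged if it kept recurring. Memory grows with the number of distinct n-grams, which is where the combinatorial explosion shows up for larger `n`.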

There is an older family of algorithms, variously called Market Basket Analysis, the Apriori algorithm, or association rule mining, which has the flavor of growing set sizes, though it is not anomaly detection. See this video explaining it. Apriori finds sets of items commonly bought together. When you have small amounts of data, you can reliably find only small patterns. As the amount of data increases, you can find larger patterns.
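For readers unfamiliar with Apriori, the level-wise search it performs can be sketched in a few lines. This is a simplified illustration with a hypothetical function name and toy baskets; it finds frequent itemsets only and does not generate association rules.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    result = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        # Join frequent k-itemsets into (k+1)-itemsets, keeping a candidate only
        # if all of its k-item subsets are frequent (the Apriori pruning step).
        k += 1
        current = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in frequent for s in combinations(union, k - 1)
            ):
                current.add(union)
    return result


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
itemsets = frequent_itemsets(baskets, min_support=2)
for itemset, count in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With only four baskets, no three-item set reaches the support threshold, which mirrors the point above: larger patterns only become reliable as more data arrives.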

answered Nov 11 '18 at 0:46

Harsh


Thanks a lot, sir. But can Isolation Forest be used to detect outliers in streaming data, where the full dataset is not available beforehand and is being streamed point by point, as I mentioned in my question?
– Goutam Bose
Nov 11 '18 at 2:34

How important is that restriction? If you have a trained classifier, you can use it to detect new anomalies immediately, and you can train a new classifier, say, once every hour using all the newly collected data. If you do this, you'll be able to use all the batch learning algorithms and software available. To answer your question: there are papers that discuss online Isolation Forests, but I'm not aware of any implementations.
– Harsh
Nov 11 '18 at 3:02

Thanks for replying. The restriction is pretty stringent, as the model will be used by network load-testing software to detect anomalies in network packets streamed between client and server. So the data to be tested is not available beforehand, and any form of up-front training is not a preferred option. The model should preferably be unsupervised.
– Goutam Bose
Nov 11 '18 at 6:10
