How can I detect patterns and/or keywords or phrases?
$begingroup$
I am collecting data from Apache into a database via PHP.
For now, I am interested in detecting patterns in each column.
For example, manual examination of the data shows the pattern phpmyadmin in various forms and capitalizations and at different positions in the text. I would also like to detect any other patterns.
How would I detect that programmatically, using the computer instead of my brain?
I am going to need a detailed explanation, as I am brand new to this kind of thing.
A pattern is any sequence of characters that repeats together. For example, phpmyadmin is a sequence of 10 letters that repeats together. I detected "phpmyadmin" manually; I would like to detect it programmatically, excluding single-character matches of common characters (ASCII 32-127).
dataset
$endgroup$
$begingroup$
Using regular expressions + a string library?
$endgroup$
– Aditya
Apr 6 at 7:19
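As a rough illustration of the regex suggestion above, here is a minimal Python sketch (Python is chosen only for illustration; the question itself uses PHP). It assumes the keyword is already known, and the sample request strings are made up.

```python
import re

# Hypothetical request paths; real values would come from the database column.
requests = [
    "/phpMyAdmin/index.php",
    "/admin/PHPMYADMIN/setup.php",
    "/index.html",
    "/old/phpmyadmin2/",
]

# re.IGNORECASE matches every capitalization; search() finds the keyword
# at any position in the string, not only at the start.
pattern = re.compile(r"phpmyadmin", re.IGNORECASE)

matches = [r for r in requests if pattern.search(r)]
print(matches)
```

Note that this only finds a keyword you already know about; it does not discover new patterns on its own.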
edited Apr 7 at 16:15
cybernard
asked Apr 6 at 3:19
cybernard
1 Answer
$begingroup$
It depends on what you want to do and what you define as a "pattern". If you are interested in frequent terms, tokenize the text and count the words. If you want to compare various forms of the same term, I suggest you build two term matrices: one built from the input as is, and one built from a lower-cased version of the input.
BTW, a term frequency matrix is simply a matrix where the rows are your examples (I guess the columns in your database) and the columns are the discovered tokens (i.e., the words).
For example, for the phrase 'the cat sat on the mat', the corresponding row vector of word counts would be:
the 2
cat 1
sat 1
on 1
mat 1
To get frequencies, you just divide the resulting vectors by the total count of words.
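The counting described above can be sketched in a few lines of Python. This is only a minimal illustration, not a full term-matrix implementation: the `tokenize`, `term_counts`, and `term_frequencies` helpers are hypothetical names, and splitting on non-alphanumeric characters is an assumed tokenization rule that may not suit every column.

```python
from collections import Counter
import re

def tokenize(text):
    # Assumed rule: split on anything that is not a letter or digit,
    # then drop the empty strings the split can leave behind.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def term_counts(text, lowercase=False):
    # lowercase=True gives the case-folded variant of the counts.
    if lowercase:
        text = text.lower()
    return Counter(tokenize(text))

def term_frequencies(counts):
    # Divide each count by the total number of tokens.
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

counts = term_counts("the cat sat on the mat")
freqs = term_frequencies(counts)
print(counts)
```

Calling `term_counts(text, lowercase=True)` alongside the default call gives the two variants (raw and lower-cased) that can then be compared.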
[EDIT] After reading the comments, I would recommend that you explore anomaly detection. Here is a good paper tackling a problem similar to yours. The authors encode documents as bags of words using term frequencies (see above) and build a classifier that flags a whole document as an outlier.
In your case, it is a little trickier because you want to flag specific terms as good or bad, and you don't have a labelled dataset. So either find a labelled dataset (it need not be fully labelled, since semi-supervised solutions exist, but at least a few labels would be nice) or explore unsupervised approaches to outlier detection. As for pattern flagging, your best bet is recurrent neural networks, which are designed to process sequential data such as yours. That being said, I doubt you will achieve that with PHP, so maybe consider something else. TensorFlow and PyTorch are very good alternatives for building such models, and many others exist.
$endgroup$
$begingroup$
A pattern is any sequence of characters that repeats together. For example, phpmyadmin is a sequence of 10 letters that repeats together. I detected "phpmyadmin" manually; I would like to detect it programmatically, excluding single-character matches of common characters (ASCII 32-127).
$endgroup$
– cybernard
Apr 6 at 15:34
$begingroup$
Have you tried what I suggested? Does it work for you?
$endgroup$
– qmeeus
Apr 9 at 15:04
$begingroup$
I am at kind of a loss as to where to begin, since I haven't done this before. I have thought of setting up arrays two[], three[], four[], etc., where each holds the substrings of that length: two(x)=mid(string,x,2), three(x)=mid(string,x,3), and so on. Then I would search each array and keep a counter for each substring, so if "aa" appeared twice its counter would be 2. The counts could then be sorted in descending order. However, this all sounds very inefficient.
$endgroup$
– cybernard
Apr 9 at 15:35
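The substring-counting idea in the comment above can be sketched with a single hash-based counter instead of one array per length. This is a minimal Python sketch under stated assumptions: `ngram_counts` is a hypothetical helper, the length limits are arbitrary defaults, and the sample rows are made up.

```python
from collections import Counter

def ngram_counts(strings, min_len=2, max_len=10, lowercase=True):
    # One Counter holds every substring of every length, so there is no
    # need to search separate arrays to find an existing entry.
    counts = Counter()
    for s in strings:
        if lowercase:
            s = s.lower()
        for n in range(min_len, max_len + 1):
            # Slide a window of width n over the string.
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return counts

# Hypothetical column values.
rows = ["/phpMyAdmin/index.php", "/db/PHPMYADMIN/", "/index.html"]
counts = ngram_counts(rows)

# Keep only substrings seen at least twice, most frequent first.
repeated = sorted(
    ((g, c) for g, c in counts.items() if c >= 2),
    key=lambda item: item[1],
    reverse=True,
)
print(repeated[:5])
```

Each lookup in the counter is a hash operation, so the work grows roughly with the total number of characters times the number of lengths tried. Short substrings will dominate the top of the list, so some filtering by length is likely needed in practice.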
$begingroup$
I can't help you with PHP, but I'm pretty sure you can find libraries to analyse text (a quick Google search for "php tokenize text" gave me 96,000 results). Text processing is more my domain, though, and if I can give you a piece of advice: first split the text into tokens rather than building all possible sequences of characters observed in your input. If you explain why you want to do that, or what result you expect, I may be able to help you.
$endgroup$
– qmeeus
Apr 10 at 10:02
$begingroup$
The data comes from a web server, where random people/bots request things, many of which don't exist. I would like to find the individual patterns so I can tag them as good, bad, or indifferent. A pattern like phpmyadmin is moved around in the path just in case I renamed it. If an IP accumulates too many bad points, it will be banned.
$endgroup$
– cybernard
Apr 10 at 12:01
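The tagging-and-banning workflow described in the comment above could look roughly like the following Python sketch. Everything specific here is an assumption for illustration: the pattern list, the penalty values, the threshold, and the sample log entries are all invented, and `score_requests` and `ips_to_ban` are hypothetical helpers.

```python
from collections import defaultdict

# Hypothetical hand-assigned tags: pattern -> penalty points.
BAD_PATTERNS = {"phpmyadmin": 5, "wp-login": 3}
BAN_THRESHOLD = 10  # assumed value, tune to taste

def score_requests(log):
    """Sum penalty points per IP; log is an iterable of (ip, path) pairs."""
    points = defaultdict(int)
    for ip, path in log:
        lowered = path.lower()  # ignore capitalization
        for pattern, penalty in BAD_PATTERNS.items():
            # Substring test, so the position in the path does not matter.
            if pattern in lowered:
                points[ip] += penalty
    return points

def ips_to_ban(points, threshold=BAN_THRESHOLD):
    return sorted(ip for ip, p in points.items() if p >= threshold)

# Made-up log entries using documentation-range IP addresses.
log = [
    ("203.0.113.7", "/phpMyAdmin/"),
    ("203.0.113.7", "/old/PHPMYADMIN/index.php"),
    ("198.51.100.2", "/index.html"),
    ("198.51.100.2", "/wp-login.php"),
]
points = score_requests(log)
banned = ips_to_ban(points)
print(banned)
```

This covers only the scoring step once patterns have been tagged by hand; discovering the patterns in the first place is a separate problem.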
edited Apr 10 at 17:06
answered Apr 6 at 10:43
qmeeus