How can I detect patterns and/or keywords or phrases?

I am collecting data in a database via PHP from Apache.

I am interested in detecting patterns in each column for now.

For example, manual examination of the data shows the pattern phpmyadmin in various forms and capitalizations, and at different positions in the text. I would also like to detect any other patterns.

How would I detect that programmatically, using the computer instead of my brain?

I am going to need a detailed explanation, as I am brand new to doing this kind of thing.

A pattern is any sequence of characters that repeat together. For example, phpmyadmin is a sequence of 10 letters that repeat together. I detected "phpmyadmin" manually; I would like to detect it programmatically, excluding single-character matches of common characters (ASCII 32-127).











  • Using regular expressions + a string library?
    – Aditya
    Apr 6 at 7:19
















dataset














edited Apr 7 at 16:15
cybernard

asked Apr 6 at 3:19
cybernard






















1 Answer












It depends on what you want to do and what you define as a "pattern". If you are interested in frequent terms, tokenize the text and count the words. If you want to compare various forms of the same term, I suggest you build two term matrices: one from the input as is, and one from a lower-cased version of the input.

BTW, a term frequency matrix is simply a matrix where the rows are your examples (I guess the values in your database columns) and the columns are the discovered tokens (i.e., the words).

For example, for the phrase 'the cat sat on the mat', the corresponding row vector of word counts would be:

the 2
cat 1
sat 1
on 1
mat 1

To get frequencies, you just divide the resulting vector by the total count of words (here, 6).

[EDIT] After reading the comments, I would recommend that you explore solutions for anomaly detection. Here is a good paper that tackles a problem similar to yours. The authors encode documents as bags of words using term frequencies (see above) and build a classifier to flag a whole document as an outlier.

In your case, it is a little trickier because you want to flag specific terms as good or bad, and you don't have a labelled dataset. So either find a labelled dataset (it need not be fully labelled, since semi-supervised solutions exist, but at least a few labels would be nice) or explore unsupervised solutions to outlier detection. As for pattern flagging, your best bet is to use recurrent neural networks, which are designed to process sequential data such as yours. That being said, I doubt you will achieve that with PHP, so maybe consider something else. TensorFlow and PyTorch are very good alternatives for building such models, and many others exist.
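As a rough illustration of the word-count idea in this answer, here is a minimal Python sketch. The tokenizer (runs of letters and digits), the function names, and the sample request paths are all assumptions made for the example; they are not taken from the asker's data or from any particular library.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into runs of letters and digits (an assumed, very simple tokenizer)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def term_counts(text, lowercase=False):
    """Count tokens, optionally after lower-casing the input first."""
    if lowercase:
        text = text.lower()
    return Counter(tokenize(text))


def term_frequencies(counts):
    """Divide each count by the total number of tokens."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


# The phrase used in the answer.
counts = term_counts("the cat sat on the mat")
freqs = term_frequencies(counts)

# Hypothetical request paths, used only to show the effect of lower-casing.
requests = "/phpMyAdmin/index.php /PHPMYADMIN/ /db/phpmyadmin/setup.php"
as_is = term_counts(requests)
folded = term_counts(requests, lowercase=True)
```

Comparing `as_is` with `folded` shows which terms only differ by capitalization: the three spellings of phpmyadmin are separate entries in the first and a single entry in the second.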












  • A pattern is any sequence of characters that repeat together. For example, phpmyadmin is a sequence of 10 letters that repeat together. I detected "phpmyadmin" manually; I would like to detect it programmatically, excluding single-character matches of common characters (ASCII 32-127).
    – cybernard
    Apr 6 at 15:34











  • Have you tried what I suggested? Does it work for you?
    – qmeeus
    Apr 9 at 15:04










  • I am at kind of a loss on where to begin, since I haven't done this before. I have thought of setting up arrays two[], three[], four[], etc., where each holds the substrings of that length: two(x)=mid(string,x,2), three(x)=mid(string,x,3), and so on. Then I would search each array and keep a counter for each distinct substring, so if "aa" appeared twice its counter would be 2. The counts could then be sorted in descending order. However, this all sounds very inefficient.
    – cybernard
    Apr 9 at 15:35
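The substring-counting idea in the comment above can be sketched in a few lines of Python using a single counter instead of one array per length. This is only a brute-force sketch under stated assumptions: the function names, the length limits, the choice to count each substring once per row, and the sample rows are all hypothetical.

```python
from collections import Counter


def substring_counts(rows, min_len=2, max_len=12):
    """Count every substring of length min_len..max_len across all rows.

    Each substring is counted at most once per row, so a count is the
    number of rows that contain it. Case is ignored.
    """
    counts = Counter()
    for row in rows:
        text = row.lower()
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(text) - n + 1):
                seen.add(text[i:i + n])
        counts.update(seen)
    return counts


def repeated_patterns(rows, min_rows=2, **kwargs):
    """Return (substring, row_count) pairs found in at least min_rows rows,
    longest first, then most frequent first."""
    counts = substring_counts(rows, **kwargs)
    hits = [(s, c) for s, c in counts.items() if c >= min_rows]
    return sorted(hits, key=lambda sc: (-len(sc[0]), -sc[1], sc[0]))


# Hypothetical rows standing in for one database column.
rows = ["/phpMyAdmin/index.php", "/admin/PHPMYADMIN/", "/old/phpmyadmin2/setup.php"]
top = repeated_patterns(rows, min_rows=3)
```

Sorting longest-first puts the most specific shared pattern at the top and its shorter fragments below it. The work grows with row length times the number of lengths tried, so for large tables a suffix array or suffix tree is the usual next step.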










  • I can't help you with PHP, but I'm pretty sure you should be able to find libraries to analyse text (a quick Google search for "php tokenize text" gave me 96,000 results). Text processing is more my domain, though, and if I can give you one piece of advice: first split the text into tokens rather than building all possible sequences of characters observed in your input. If you explain why you want to do that, or what result you expect, I may be able to help you.
    – qmeeus
    Apr 10 at 10:02










  • The data comes from a web server, where random people/bots request things, many of which don't exist. I would like to find the individual patterns so I can tag them good, bad, or indifferent. A pattern like phpmyadmin shows up at different positions in the request, just in case I renamed it. Then, if an IP accumulates too many bad points, it will be banned.
    – cybernard
    Apr 10 at 12:01
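Once patterns have been tagged by hand, the point-based banning described in the comment above could look roughly like the following Python sketch. The pattern table, the point values, the threshold, and the sample log are all made up for illustration; the IP addresses come from the ranges reserved for documentation.

```python
from collections import defaultdict

# Hypothetical pattern scores; positive points mean "bad", zero means "indifferent".
PATTERN_POINTS = {"phpmyadmin": 5, "wp-login": 3, "favicon.ico": 0}
BAN_THRESHOLD = 10  # assumed cut-off, to be tuned


def score_request(path, pattern_points=PATTERN_POINTS):
    """Sum the points of every known pattern found anywhere in the path, ignoring case."""
    lowered = path.lower()
    return sum(points for pattern, points in pattern_points.items() if pattern in lowered)


def ips_to_ban(requests, threshold=BAN_THRESHOLD):
    """Take (ip, path) pairs and return the IPs whose total points reach the threshold."""
    totals = defaultdict(int)
    for ip, path in requests:
        totals[ip] += score_request(path)
    return sorted(ip for ip, total in totals.items() if total >= threshold)


# Hypothetical request log.
log = [
    ("203.0.113.7", "/phpMyAdmin/index.php"),
    ("203.0.113.7", "/db/PHPMYADMIN/setup.php"),
    ("198.51.100.2", "/favicon.ico"),
    ("198.51.100.2", "/wp-login.php"),
]
banned = ips_to_ban(log)
```

Because the match is a case-insensitive substring test, the pattern is found wherever it appears in the path, which covers the "moved around" case from the comment.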











edited Apr 10 at 17:06

answered Apr 6 at 10:43
qmeeus










