How can I detect patterns and/or keywords or phrases?

I am collecting data from Apache into a database via PHP.

For now, I am interested in detecting patterns in each column.

For example, manual examination of the data shows that the pattern phpmyadmin appears in various forms and capitalizations, and at different positions in the text. I would also like to detect any other patterns.

How would I detect that programmatically, using the computer instead of my brain?

I am going to need a detailed explanation, as I am brand new to this kind of thing.

A pattern is any sequence of characters that repeat together. For example, phpmyadmin is a sequence of 10 letters that repeat together. I detected "phpmyadmin" manually; I would like to detect it programmatically, excluding single-character matches of common characters (ASCII 32-127).

edited Apr 7 at 16:15

asked Apr 6 at 3:19

cybernard


Using regular expressions + a string library?
– Aditya
Apr 6 at 7:19
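For a keyword that is already known, the regular-expression suggestion might look like the minimal sketch below. It assumes Python rather than PHP, and the sample paths are hypothetical stand-ins for values read from the database. It only finds a keyword you already know; it does not discover new patterns.

```python
import re

# Hypothetical sample of logged request paths; real values would come from the database.
paths = [
    "/phpMyAdmin/index.php",
    "/admin/PHPMYADMIN/setup.php",
    "/index.html",
    "/old/phpmyadmin2/",
]

# re.IGNORECASE makes the match independent of capitalization;
# search() finds the keyword at any position in the string.
keyword = re.compile(r"phpmyadmin", re.IGNORECASE)

def find_keyword(text):
    """Return the (start, end) span of the first match, or None if absent."""
    match = keyword.search(text)
    return match.span() if match else None

hits = {path: find_keyword(path) for path in paths}
for path, span in hits.items():
    print(path, span)
```

Paths that do not contain the keyword map to `None`, so they are easy to filter out afterwards.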

1 Answer

It depends on what you want to do and on how you define a "pattern". If you are interested in frequent terms, tokenize the text and count the words. If you want to compare various forms of the same term, I suggest building two term matrices: one from the input as is, and one from a lower-cased version of the input.
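As a rough sketch of the raw versus lower-cased comparison, the Python snippet below keeps both sets of counts and records which spellings collapse onto the same lower-cased term. The sample rows and the whitespace tokenizer are assumptions, and the per-row matrices are collapsed into overall counts for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical column values; splitting on whitespace is an assumed tokenizer.
rows = ["GET phpMyAdmin", "get PHPMYADMIN", "GET phpmyadmin index"]

raw_counts = Counter()        # terms exactly as they appear
folded_counts = Counter()     # terms after lower-casing
variants = defaultdict(set)   # lower-cased term -> spellings seen in the data

for row in rows:
    for token in row.split():
        raw_counts[token] += 1
        folded_counts[token.lower()] += 1
        variants[token.lower()].add(token)

# Terms whose lower-cased form covers more than one raw spelling.
multi_form = {term: sorted(forms) for term, forms in variants.items() if len(forms) > 1}
print(multi_form)
```

Comparing `raw_counts` with `folded_counts` shows how much of a term's total is hidden behind capitalization variants.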

By the way, a term frequency matrix is simply a matrix whose rows are your examples (I guess the entries of the columns in your database) and whose columns are the discovered tokens (i.e., the words).

For example, in the phrase 'the cat sat on the mat', the corresponding row vector of word counts would be:

the 2
cat 1
sat 1
on 1
mat 1

To get frequencies, divide the resulting vector by the total word count (here 6, so 'the' has a frequency of 2/6 ≈ 0.33).
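The counts and frequencies for the example phrase can be reproduced with a few lines of Python. This is only a sketch: splitting on whitespace is an assumed tokenizer, and real request data would probably need a different one.

```python
from collections import Counter

phrase = "the cat sat on the mat"
tokens = phrase.split()          # naive whitespace tokenizer (an assumption)

counts = Counter(tokens)         # word -> number of occurrences
total = sum(counts.values())     # total number of words in the phrase
frequencies = {word: n / total for word, n in counts.items()}

print(counts)
print(frequencies)
```

Stacking one such row per example, with one column per distinct token, gives the term frequency matrix described above.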

[EDIT] After reading the comments, I would recommend that you explore anomaly detection. Here is a good paper that tackles a problem similar to yours. The authors encode documents as bags of words using term frequencies (see above) and build a classifier that flags a whole document as an outlier.

In your case, it is a little trickier because you want to flag specific terms as good or bad and you don't have a labelled dataset. So either find a labelled dataset (it need not be fully labelled, since semi-supervised solutions exist, but at least a few labels would help) or explore unsupervised approaches to outlier detection. As for pattern flagging, your best bet is recurrent neural networks, which are designed to process sequential data such as yours. That said, I doubt you will achieve that with PHP, so consider something else. TensorFlow and PyTorch are very good alternatives for building such models, and many others exist.
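Before reaching for a neural network, a much simpler unsupervised baseline may be worth trying. The sketch below is neither the paper's classifier nor a recurrent network: it scores each request by the mean smoothed inverse document frequency of its tokens, so requests made of rarely seen tokens rank highest. The sample paths, the tokenizer and the scoring formula are all assumptions, and a rare request is not necessarily malicious (nor a frequent one harmless).

```python
import math
import re
from collections import Counter

# Hypothetical request paths; real values would come from the database.
paths = [
    "/index.html",
    "/index.html",
    "/about.html",
    "/index.html",
    "/phpMyAdmin/setup.php",
]

def tokenize(path):
    """Lower-case the path and split it on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", path.lower()) if t]

# Document frequency: in how many paths each token occurs at least once.
doc_freq = Counter()
for path in paths:
    doc_freq.update(set(tokenize(path)))

def rarity(path):
    """Mean smoothed inverse document frequency of the tokens (higher = more unusual)."""
    tokens = tokenize(path)
    if not tokens:
        return 0.0
    n_docs = len(paths)
    return sum(math.log((n_docs + 1) / (doc_freq[t] + 1)) for t in tokens) / len(tokens)

# Distinct paths, most unusual first.
ranked = sorted(set(paths), key=rarity, reverse=True)
print(ranked)
```

The highest-ranked requests are candidates for manual labelling, which could later feed a semi-supervised model.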

edited Apr 10 at 17:06

answered Apr 6 at 10:43

qmeeus


A pattern is any sequence of characters that repeat together. For example, phpmyadmin is a sequence of 10 letters that repeat together. I detected "phpmyadmin" manually; I would like to detect it programmatically, excluding single-character matches of common characters (ASCII 32-127).
– cybernard
Apr 6 at 15:34

Have you tried what I suggested? Does it work for you?
– qmeeus
Apr 9 at 15:04

I am at a bit of a loss about where to begin, since I haven't done this before. I have thought of setting up arrays two[], three[], four[] and so on, where each name gives the number of letters per entry: two(x)=mid(string,x,2), three(x)=mid(string,x,3), and so on. I would then search each array and keep a counter per entry, so if "aa" occurred twice its counter would be 2. The count array could then be sorted in descending order. However, this all sounds very inefficient.
– cybernard
Apr 9 at 15:35
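The substring-counting idea in this comment can be sketched with a single hash-based counter instead of one searched array per length. The Python version below is only an illustration: the function name `count_substrings`, the length limits and the sample strings are assumptions, and memory use grows with the amount of text times the number of lengths considered.

```python
from collections import Counter

def count_substrings(texts, min_len=2, max_len=10):
    """Count every substring of min_len..max_len characters across all texts."""
    counts = Counter()
    for text in texts:
        text = text.lower()  # fold case so 'phpMyAdmin' and 'PHPMYADMIN' agree
        for n in range(min_len, max_len + 1):
            # Slide a window of width n over the text.
            for start in range(len(text) - n + 1):
                counts[text[start:start + n]] += 1
    return counts

# Hypothetical sample strings; real values would come from the database.
samples = ["/phpMyAdmin/index.php", "/db/PHPMYADMIN/", "/index.html"]
counts = count_substrings(samples)

# Keep substrings seen at least twice, longest first, then most frequent.
repeated = sorted(
    (s for s, c in counts.items() if c >= 2),
    key=lambda s: (-len(s), -counts[s]),
)
print(repeated[:5])
```

Overlapping fragments of one long pattern (for example "phpmyadmin" and "hpmyadmin/") all show up in the result, so some post-filtering is needed to keep only the most meaningful candidates.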

I can't help you with PHP, but I'm pretty sure you can find libraries to analyse text (a quick Google search for "php tokenize text" gave me 96,000 results). Text processing is more my domain, though, and if I can give you one piece of advice: first split the text into tokens rather than building all possible sequences of characters observed in your input. If you explain why you want to do that, or what result you expect, I may be able to help you.
– qmeeus
Apr 10 at 10:02

The data comes from a web server, where random people/bots request things, many of which don't exist. I would like to find the individual patterns so I can tag them as good, bad, or indifferent. A pattern like phpmyadmin is tried at different locations, just in case I renamed it. Then, if an IP accumulates too many bad points, it will be banned.
– cybernard
Apr 10 at 12:01
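The tagging and banning scheme described in this comment could be sketched as below. Everything specific here is a hypothetical assumption: the pattern list, the point values, the threshold and the documentation-range IP addresses are illustrative only.

```python
from collections import defaultdict

# Hypothetical pattern tags expressed as points; all values are illustrative.
PATTERN_POINTS = {
    "phpmyadmin": 5,   # tagged bad
    "wp-login": 3,     # tagged bad
    "favicon.ico": 0,  # tagged indifferent
}
BAN_THRESHOLD = 10     # assumed cut-off for banning an IP

def score_request(path):
    """Sum the points of every tagged pattern found in the path, ignoring case."""
    lowered = path.lower()
    return sum(points for pattern, points in PATTERN_POINTS.items() if pattern in lowered)

def banned_ips(log):
    """Return the set of IPs whose accumulated points reach the threshold."""
    totals = defaultdict(int)
    for ip, path in log:
        totals[ip] += score_request(path)
    return {ip for ip, total in totals.items() if total >= BAN_THRESHOLD}

# Hypothetical (ip, path) log entries.
log = [
    ("203.0.113.7", "/phpMyAdmin/index.php"),
    ("203.0.113.7", "/old/PHPMYADMIN/"),
    ("198.51.100.2", "/favicon.ico"),
    ("198.51.100.2", "/wp-login.php"),
]
print(banned_ips(log))
```

Because matching is done on the lower-cased path with a substring test, a pattern is caught wherever it appears in the request and however it is capitalized.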
