


Is there any clustering algorithm to find longest continuous subsequences?



























I have data that records the access history of some items.

Example: t0~t4 are time slots; 1 means the item was accessed during that slot, 0 means it was not.

ID,t0,t1,t2,t3,t4
0,0,0,1,1,1
1,0,1,1,1,1
2,0,1,1,0,0
3,1,1,0,0,1
4,1,1,0,0,1

In the example above, IDs 0 and 1 form the kind of group I want, because their accesses overlap in a long contiguous run of time slots.

IDs 3 and 4 should not form such a group: their distance is small (the rows are identical), but their access pattern is not contiguous.

I tried KMeans and DBSCAN; both cluster IDs 3 and 4 together, which makes sense given the distance measure, but it is not what I want.

Is there any preprocessing of the data that would achieve this? Or should I use a different analytic tool?
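For concreteness, here is a small reproduction of the behaviour described above (a sketch; the eps and min_samples values are arbitrary choices of mine, not taken from the question): because rows 3 and 4 are identical bit vectors, any clustering based on Euclidean distance will put them in the same cluster.

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([
        [0, 0, 1, 1, 1],  # ID 0
        [0, 1, 1, 1, 1],  # ID 1
        [0, 1, 1, 0, 0],  # ID 2
        [1, 1, 0, 0, 1],  # ID 3
        [1, 1, 0, 0, 1],  # ID 4
    ])

    # Plain DBSCAN on the raw rows: IDs 3 and 4 are at distance 0,
    # so they are always grouped together, as observed in the question.
    labels = DBSCAN(eps=1.1, min_samples=2).fit_predict(X)
    print(labels)  # e.g. [ 0  0 -1  1  1]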





































machine-learning python clustering

asked Mar 29 at 9:13 by code_worker, edited Apr 1 at 7:53




















2 Answers






























What might help is a custom distance computation as input to the clustering algorithm. These algorithms usually take Euclidean distance as the measure of dissimilarity.

You can try DBSCAN (in Python scikit-learn) with metric='precomputed' and X as a custom distance matrix. You can construct this distance matrix to conform to your requirement, e.g. specify that rows 3 and 4 have a large distance even though they are equal.

– raghu, answered Mar 29 at 17:34
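A minimal sketch of this idea, assuming one particular custom distance (my own illustrative choice, not prescribed by the answer): two rows are considered close when they share a long contiguous run of active time slots, so identical but non-contiguous rows such as 3 and 4 end up farther apart than rows 0 and 1.

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([
        [0, 0, 1, 1, 1],  # ID 0
        [0, 1, 1, 1, 1],  # ID 1
        [0, 1, 1, 0, 0],  # ID 2
        [1, 1, 0, 0, 1],  # ID 3
        [1, 1, 0, 0, 1],  # ID 4
    ])

    def longest_common_run(a, b):
        # Length of the longest contiguous stretch where both rows are 1.
        best = cur = 0
        for x, y in zip(a, b):
            cur = cur + 1 if (x == 1 and y == 1) else 0
            best = max(best, cur)
        return best

    n, T = X.shape
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # The shorter the contiguous overlap, the larger the distance.
            D[i, j] = T - longest_common_run(X[i], X[j])
    np.fill_diagonal(D, 0.0)  # a point is at distance 0 from itself

    # eps/min_samples are illustrative; with this matrix IDs 0 and 1
    # (contiguous overlap of 3 slots) cluster together, while IDs 3 and 4 do not.
    labels = DBSCAN(eps=2.0, min_samples=2, metric='precomputed').fit_predict(D)
    print(labels)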



























• What does "custom distance matrix" mean? thx – code_worker, Mar 30 at 14:45

• This is the matrix of pairwise distances that you will compute in the way that you want. E.g. 3 and 4 will have a large distance. You then pass this as input to the clustering algorithm. – raghu, Mar 31 at 4:16

• So I need to define the matrix myself. thx. I originally thought there's a common way to define it. – code_worker, Mar 31 at 8:26






























It is probably worth transforming the data so that each record is "continuous" (better to call it something else, for example "contiguous", since "continuous" has a well-known mathematical meaning), making multiple copies of a record if necessary.

K-means minimizes the sum of squares. I don't see how that is beneficial here.

There is generalized DBSCAN. You can define arbitrary neighbor predicates for it. For example, you could define that neighbors (candidates for merging into the same cluster) must have a contiguous overlap of at least two active time points. Then check whether the resulting clusters still satisfy your notion of a cluster, because DBSCAN merges neighbors transitively.

My guess is that you will rather want to extract all contiguous subsequences of some minimum length, say 2, from all records and simply count them to identify the most frequent subsequences. If you implement this with an efficient bit representation, it will be very fast.

– Anony-Mousse, answered Mar 30 at 9:59
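A rough sketch of that last suggestion (the minimum length of 2 follows the example in the answer; the function and variable names are mine, and no bit-level optimisation is attempted): enumerate every fully active contiguous window of each record and count how often each window occurs across records.

    from collections import Counter

    rows = {
        0: [0, 0, 1, 1, 1],
        1: [0, 1, 1, 1, 1],
        2: [0, 1, 1, 0, 0],
        3: [1, 1, 0, 0, 1],
        4: [1, 1, 0, 0, 1],
    }

    MIN_LEN = 2
    counts = Counter()
    for rid, bits in rows.items():
        T = len(bits)
        for start in range(T):
            for end in range(start + MIN_LEN, T + 1):
                if all(bits[start:end]):          # the whole window is active
                    counts[(start, end)] += 1     # identify a window by its slot range

    # Most frequent contiguous windows across all records; the longest window
    # shared by at least two records is t2~t4 (IDs 0 and 1).
    for (start, end), c in counts.most_common():
        print(f"t{start}~t{end - 1}: {c} record(s)")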



























• Now I don't know how to transform data and cluster it to reach what I want. – code_worker, Apr 1 at 1:59










