Action Recognition for multiple objects and localization















I have a question about action detection in video with proposed frames. I trained a Temporal 3D ConvNet for action recognition, and it successfully recognizes actions in videos.

At inference time, I collect 20 frames from the video, feed them to the model, and get a result. The problem is that events in different videos differ in size: some cover 90% of the frame, while others cover only 10%. For example, two objects may collide, and the collision can happen at different scales; I want to detect this action.

How can I give the model the exact position of the action when it can happen at different scales and with different objects? What comes to mind is using YOLO to collect regions of interest and feeding the cropped frames to the 3D ConvNet each time. But if there are many objects, this will be very slow. How should I handle it?
Is there an end-to-end solution that combines action recognition with object location proposals for the action recognition network?
I have already looked through papers and blogs, but could not find a solution to the localization problem that ensures the action recognition model gets the correct frames.
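
To make the region-of-interest idea above concrete, here is a minimal sketch of one way the number of crops might be kept small. It is not taken from YOLO or any other library: the function names, the `max_gap` and `pad` values, and the frame size are all assumptions for illustration. Instead of cropping around every detected object, it proposes a region only for pairs of boxes that are close enough to plausibly interact (for example, collide), so the 3D ConvNet would run on far fewer crops.

```python
from itertools import combinations


def union_box(a, b):
    """Smallest box (x1, y1, x2, y2) that contains both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def gap(a, b):
    """Largest axis-wise pixel gap between two boxes; 0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)


def candidate_regions(boxes, max_gap=20, pad=10, frame_size=(1280, 720)):
    """Return one padded crop region per pair of boxes closer than max_gap."""
    width, height = frame_size
    regions = []
    for a, b in combinations(boxes, 2):
        if gap(a, b) <= max_gap:
            x1, y1, x2, y2 = union_box(a, b)
            # Pad the union box, clamped to the frame borders.
            regions.append((max(x1 - pad, 0), max(y1 - pad, 0),
                            min(x2 + pad, width), min(y2 + pad, height)))
    return regions


# Hypothetical detections for one frame: two nearby objects and one far away.
boxes = [(100, 100, 200, 200), (210, 120, 300, 220), (900, 500, 1000, 600)]
print(candidate_regions(boxes))
```

Each returned region would then be cropped from all 20 frames and resized to the network's input size, so the action fills the crop regardless of its original scale.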



Any advice? Could someone explain an approach?

Thank you

Regards, Dmitry







      machine-learning classification object-detection activity-recognition
















      asked Mar 23 at 9:45









Dmitry




















          1 Answer



































































Finding actions in videos is a tricky task. I am not familiar with the Temporal 3D ConvNet, but to tackle a problem like this I would apply a CNN to the individual frames of the video and then feed the resulting per-frame feature sequence into an LSTM layer to capture the temporal context of the video.

Video action

Since the action can cover anywhere from 10% to 90% of the frame, you can apply test-time augmentation (TTA) to the video to find the action with higher confidence. A similar approach can be found in this video by Google.
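
As a rough illustration of test-time augmentation for the scale problem, the sketch below uses multi-scale crops as the augmentation and keeps the most confident window; averaging the scores over all windows is the other common choice. Everything here is an assumption for illustration: the scales, the grid layout, and the `predict` callable, which stands in for "crop the 20-frame clip to this window, resize it, and run the trained model".

```python
def multi_scale_windows(width, height, scales=(1.0, 0.5, 0.25), steps=3):
    """Return (x1, y1, x2, y2) crop windows laid out on a grid for each scale."""
    windows = []
    for s in scales:
        w, h = int(width * s), int(height * s)
        # The full frame needs a single window; smaller scales tile a grid.
        n = 1 if (w == width and h == height) else steps
        for i in range(n):
            for j in range(n):
                x1 = (width - w) * i // max(n - 1, 1)
                y1 = (height - h) * j // max(n - 1, 1)
                windows.append((x1, y1, x1 + w, y1 + h))
    return windows


def most_confident_window(predict, width, height):
    """Run `predict` on every window and keep the one with the highest class score."""
    best_window, best_probs = None, None
    for window in multi_scale_windows(width, height):
        probs = predict(window)
        if best_probs is None or max(probs) > max(best_probs):
            best_window, best_probs = window, probs
    return best_window, best_probs


# Hypothetical stand-in for the real model: confident only on one small window.
def fake_predict(window):
    return [0.9, 0.1] if window == (120, 90, 200, 150) else [0.5, 0.5]


print(most_confident_window(fake_predict, 320, 240))
```

The winning window also gives a coarse location for the action, at the cost of one forward pass per window (19 with the defaults above).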

















          answered Mar 23 at 10:07









thanatoz











• Thank you for the response. I've considered using a CNN + LSTM, but I lack understanding here too. First, how do I stitch the CNN and LSTM together? Should I use an end-to-end approach or train the networks separately? If separately, how should I pass features from the CNN to the LSTM?
  – Dmitry
  Mar 23 at 11:19










• This is where your mathematics and deep learning concepts will come in handy. How you combine a CNN with an RNN or LSTM depends on the framework you are using. In Keras, refer to the documentation for the functional model. Hope it helps.
  – thanatoz
  Mar 23 at 18:12
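
For the Keras route mentioned above, one possible way to stitch the two networks together end to end is to wrap a per-frame CNN in `TimeDistributed` and pass its output sequence to an LSTM. This is only a sketch: the input size, layer widths, and number of classes are assumptions, and the tiny CNN is a placeholder for whatever backbone is actually used.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed sizes for illustration only.
NUM_FRAMES, HEIGHT, WIDTH, NUM_CLASSES = 20, 64, 64, 5

# Small per-frame CNN that maps one image to a 32-dimensional feature vector.
frame_in = keras.Input(shape=(HEIGHT, WIDTH, 3))
x = layers.Conv2D(16, 3, activation="relu")(frame_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
frame_cnn = keras.Model(frame_in, x, name="frame_cnn")

# Apply the same CNN to every frame, then let the LSTM read the feature sequence.
clip_in = keras.Input(shape=(NUM_FRAMES, HEIGHT, WIDTH, 3))
features = layers.TimeDistributed(frame_cnn)(clip_in)  # (batch, 20, 32)
hidden = layers.LSTM(64)(features)
out = layers.Dense(NUM_CLASSES, activation="softmax")(hidden)

model = keras.Model(clip_in, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Calling `model.fit` on this model trains both parts jointly. To train them separately instead, the CNN could be trained (or loaded pretrained) first and then frozen with `frame_cnn.trainable = False` before compiling, so that only the LSTM and the final dense layer are updated; the per-frame features reach the LSTM through the same `TimeDistributed` wrapper either way.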


































