Action Recognition for multiple objects and localization
I have a question about action detection in video with proposed regions. I trained a Temporal 3D ConvNet for action recognition on video, and it recognizes actions successfully.
At inference time I collect 20 frames from the video, feed them to the model, and get a result. The problem is that events in different videos do not appear at a similar scale: some cover 90% of the frame, others only 10%. As an example, take two objects colliding. This can happen at different scales, and I want to detect this action.
How can I give the model the exact position of the action if it can happen at different scales and with different objects? What comes to mind is using YOLO to collect regions of interest and feeding the cropped frames to the 3D ConvNet each time. But if there are many objects, this will be very slow. How should I handle it?
Are there any end-to-end solutions for action recognition that include object location proposals for the action recognition network?
I have already looked through papers and blogs to see what people suggest, but I couldn't find a solution to the localization issue that ensures the action recognition model gets the correct frames.
Any advice? Could someone explain an approach?
Thank you
Regards, Dmitry
machine-learning classification object-detection activity-recognition
asked Mar 23 at 9:45
Dmitry
1 Answer
Finding actions in videos is a tricky task. I am not familiar with temporal 3D ConvNets, but to tackle a problem like this I would apply a CNN to the individual frames of the video and then feed the resulting sequence of per-frame features into an LSTM layer to capture the temporal context of the video.

Since the action covers anywhere from 10% to 90% of the frame, you can apply test-time augmentation (TTA) to the video to detect the action with higher confidence. A similar approach can be found in this video by Google.
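To make the test-time augmentation idea concrete, here is a minimal sketch, not a definitive implementation. It assumes the trained network is available as a hypothetical callable `predict_clip` that takes a clip shaped `(frames, height, width, channels)` and returns class probabilities. The tile scales, the `112x112` input size, and all function names are illustrative assumptions, and the nearest-neighbour resize only stands in for a proper image resize.

```python
import numpy as np


def resize_nearest(clip, out_h, out_w):
    """Nearest-neighbour resize of a clip shaped (T, H, W, C)."""
    _, h, w, _ = clip.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return clip[:, rows[:, None], cols[None, :], :]


def multi_scale_crops(clip, scales=(1.0, 0.5, 0.25)):
    """Yield ((top, left, height, width), crop) for non-overlapping tiles at each scale."""
    _, h, w, _ = clip.shape
    for s in scales:
        ch, cw = int(h * s), int(w * s)
        for top in range(0, h - ch + 1, ch):
            for left in range(0, w - cw + 1, cw):
                yield (top, left, ch, cw), clip[:, top:top + ch, left:left + cw, :]


def predict_with_tta(clip, predict_clip, input_size=(112, 112), scales=(1.0, 0.5, 0.25)):
    """Run the (hypothetical) model on every crop and collect boxes and probabilities."""
    boxes, probs = [], []
    for box, crop in multi_scale_crops(clip, scales):
        boxes.append(box)
        probs.append(predict_clip(resize_nearest(crop, *input_size)))
    return boxes, np.stack(probs)  # probs has shape (num_crops, num_classes)


def best_box_for_class(boxes, probs, class_id):
    """Return the crop box where the given class is most probable, with its probability."""
    i = int(probs[:, class_id].argmax())
    return boxes[i], float(probs[i, class_id])
```

Averaging `probs` over the crops gives the usual TTA prediction, while `best_box_for_class` gives a coarse location for one action class. Each crop costs one forward pass, so in practice the crops would be stacked into a single batch, and the tile grid could be replaced by detector boxes.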
Thank you for the response. I've considered using CNN + LSTM, but here I also lack understanding. First, how do I stitch the CNN and the LSTM together? Should I use an end-to-end approach or train the networks separately? If separately, how should I pass features from the CNN to the LSTM?
– Dmitry
Mar 23 at 11:19
This is where your mathematics and deep learning fundamentals will come in handy. How you combine a CNN with an RNN or LSTM depends on the framework you are using. In Keras, refer to the documentation for the functional API. Hope it helps.
– thanatoz
Mar 23 at 18:12
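As a rough illustration of what the comment above points at, one way to connect a per-frame CNN to an LSTM with the Keras functional API is to wrap the CNN in `TimeDistributed`. This is only a sketch under assumptions: the tiny CNN, the `112x112` frame size, the number of classes, and the name `build_cnn_lstm` are placeholders, and only the 20-frame clip length comes from the question.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_cnn_lstm(num_frames=20, height=112, width=112, channels=3, num_classes=10):
    """Per-frame CNN followed by an LSTM over the frame features (illustrative sizes)."""
    # Small stand-in CNN applied to a single frame; a pretrained backbone could go here.
    frame_cnn = keras.Sequential(
        [
            keras.Input(shape=(height, width, channels)),
            layers.Conv2D(16, 3, activation="relu"),
            layers.MaxPooling2D(),
            layers.Conv2D(32, 3, activation="relu"),
            layers.GlobalAveragePooling2D(),  # one 32-dim feature vector per frame
        ],
        name="frame_cnn",
    )

    clip = keras.Input(shape=(num_frames, height, width, channels))
    # TimeDistributed applies the same CNN to every frame: (batch, T, 32).
    features = layers.TimeDistributed(frame_cnn)(clip)
    # The LSTM consumes the sequence of frame features and returns its final state.
    context = layers.LSTM(64)(features)
    outputs = layers.Dense(num_classes, activation="softmax")(context)
    return keras.Model(clip, outputs)
```

Built this way, the model trains end to end. To train the parts separately instead, the frame CNN can be trained or loaded first and then frozen by setting its `trainable` attribute to `False` before fitting the LSTM and classifier on top.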
answered Mar 23 at 10:07
thanatoz
