Object detection without predicting bounding box #608
Thanks for the link. My question is more about how to modify an RCNN-type network that takes in images and predicts a bounding box and label, so that it instead takes in an image and a bounding box and predicts a label only. Would that just be a simple image classification problem where we focus the network's attention on the bounding box area?
Aaah, gotcha! I think the simpler approach would be to just do image classification as you suggested. Crop the regions of the bboxes and feed them to a simple classifier. You can of course build a model that takes the image and the bbox coordinates as input, but there is currently no "easy" way of doing that. The parser itself will not change. Can I ask what your pipeline looks like? I'm curious how you have the bboxes but are only interested in predicting labels.
Thanks for the information. It is very helpful. I am doing image-based human action classification, i.e. classifying actions from static images, no video. When the system is done, I expect the model to take detections of humans from a typical object detector and then classify the human pose into an action label. To train the system, I got data from the ActivityNet challenge, the AVA-Actions dataset. The dataset has label and bounding box annotations at specific timeframes for different actions. To get images, I take their videos and extract the annotated frames. This becomes the dataset of images and bounding boxes. It needed a lot of cleaning, but I have about 500 images per class of interest for my work. I've actually tried two image classification models (fastai transfer learning with resnet and densenet). The first model uses the full image, and the second uses images cropped to the bounding box. The second works better than the first, but both suffer on the test set, so I know something is wrong somewhere. Hence I want to try using the image and bounding box as input and see how that does. Hope this helps, and thanks again for the help.
Have you tried training a "normal" detection model to see how it does?
Why doesn't the first object detector output labels for the actions as well?
For the first question: I did have that idea, and that brought me here from the fastai forums. When I trained the model, it predicted actions and bounding boxes but missed detections on the person of interest in my image. Since I care more about the action than the person detection, I wanted to see if I could focus the detection on the bounding box without cropping the image, since cropping can distort the aspect ratio and remove context information that may be helpful for the system. For the second question: I am making a modular system so each part can be finetuned separately. I already have a person detector running that works very well, so I'm just adding the action-from-pose portion.
Got it! I'm afraid I can't offer much help here. The second point is definitely possible: you would need to add another input branch to your model that takes the bbox coordinates, but since the number of inputs is variable you will need to get creative here.
Thank you again for all the help. When I figure out the modification, I'll put in a PR with a tutorial for it.
This link is not reachable.
@lennyjuma try this one: https://airctic.com/0.8.0/inference/ |
📓 New <Tutorial/Example>
Is this a request for a tutorial or for an example?
This is a request for an example of how to set up the parser and model training for an object prediction setup. Some guidance of how to
What is the task?
Object prediction. In this case, the training set is the same as for object detection: images with bounding boxes. However, we are interested in predicting the label of a provided bounding box, not the bounding box itself. Could I set this up using FasterRCNN or DETR by removing the components that predict bounding boxes?
Is this example for a specific model?
Object detection is typically centered around finding bounding boxes and labels. Here I only want the labels, since the bounding boxes are given.
Is this example for a specific dataset?
Any dataset (PASCAL VOC, etc.) would do.
Don't remove
Main issue for examples: #39