Add FlashInternImage models #2167
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
FYI, the InternImage links in your PR, at the top of the implementation, and in the model class all point to the Swin Transformer paper.

A few things from looking at the cross attention implementation:
- The projection bias would be simpler and would match other models by setting
- Most other hierarchical models are structured as stages of blocks rather than blocks of layers; the change in naming threw me off at first. The InternImage paper also used the stages-and-blocks naming scheme.
- The seq_out forward for detection or segmentation is handled by the feature-extraction wrappers in timm, although I'm not sure whether that applies here given the way the clip forwards and the potential throughput penalty. I'm not sure about the approach that's hardcoded into the model forward, but using feature extraction on other models usually incurs a ~10% throughput penalty IME.
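The hook-based feature extraction mentioned above can be illustrated with a minimal sketch. This is a hedged illustration of the general technique, not timm's actual wrapper code; the `FeatureHooks` class and the tiny `nn.Sequential` backbone here are hypothetical stand-ins. Tapping intermediate outputs this way is roughly why a generic wrapper can add forward overhead versus a hardcoded seq_out path.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of hook-based feature extraction (not timm's code):
# forward hooks record intermediate outputs by module name, so features
# can be collected without changing the model's forward().
class FeatureHooks:
    def __init__(self, model, module_names):
        self.outputs = {}
        for name, module in model.named_modules():
            if name in module_names:
                module.register_forward_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(module, inputs, output):
            self.outputs[name] = output
        return hook

# Toy two-stage backbone standing in for a real hierarchical model.
body = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1),
    nn.Conv2d(8, 16, 3, stride=2, padding=1),
)
hooks = FeatureHooks(body, {'0', '1'})
body(torch.randn(1, 3, 32, 32))
features = [hooks.outputs['0'], hooks.outputs['1']]  # one map per resolution
```

Each hook fires during the normal forward pass, so the caller gets a pyramid of feature maps (one per tapped module) from a single call.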
Please merge this. FlashIntern is really good |
Thanks for the information. The link issue was a mistake I made when copying the paper's link. As for the concerns about specific implementation details, I need to double-check the code and try to contact the original authors to determine whether those implementations serve a special purpose.
@rwightman @fffffgggg54 Hi, based on the information provided and some of my own thoughts, I have updated the code as follows:
If you find anything else that needs improvement, please point it out in this PR and I will optimize the implementation accordingly.
@IridescentPig thanks for the PR and info... I took a brief look and the quality is good, but I haven't had a chance to dig in; I've been trying to wrap up a few other models/features to get a release out, so this will probably fall into the next release if everything looks okay.

A note on feature extraction though: for models with hierarchical feature maps, we do want the 'deepest feature at each feature map resolution' to be extracted by default. I've often ended up remapping models that don't make this easy (i.e. ones that have a downsample right at the end of a stage/block instead of at the start). I also implemented a new feature

EDIT: will probably look at merging & testing this + #2169 + maybe an initial MobileNetV4 as the next push after the one I'm currently wrapping up.
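The stage convention described above (downsample at the start of a stage, not the end) can be sketched as follows. This is a hypothetical minimal example, not FlashInternImage's or timm's actual code; `Stage` and its plain-conv blocks are stand-ins. With this layout, the output of the last block is the deepest feature at that resolution and can be tapped directly.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the stages-of-blocks convention: the stage
# downsamples at its *start*, so the final block output is the deepest
# feature at the stage's resolution.
class Stage(nn.Module):
    def __init__(self, in_ch, out_ch, depth):
        super().__init__()
        self.downsample = nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.blocks = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.GELU())
            for _ in range(depth)
        ])

    def forward(self, x):
        return self.blocks(self.downsample(x))

stage = Stage(3, 8, depth=2)
out = stage(torch.randn(1, 3, 16, 16))  # spatial size halves at stage entry
```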
Paper:
Adapted from official impl at https://github.com/OpenGVLab/DCNv4
Some clarifications:
- Using DCNv4 requires installing the `DCNv4` package with `pip install DCNv4` (this might take some time). A warning is raised when a user creates a FlashInternImage model with DCNv4 not available.
- The implementation uses `torch.linspace()`, which raises errors when creating an fx model of FlashInternImage. I tried to fix this but failed, so I excluded the FlashInternImage model from the tests related to fx models.
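The optional-dependency behavior described in the first point can be sketched as a guarded import. This is a hedged illustration of the general pattern, not the PR's actual code; `select_core_op` and the `"fallback"` placeholder are hypothetical.

```python
import warnings

# Hypothetical sketch of an optional-dependency guard: try to import the
# DCNv4 package, and warn at model-construction time if it is missing.
try:
    import DCNv4  # noqa: F401  # installed via `pip install DCNv4`
    _HAS_DCNV4 = True
except ImportError:
    _HAS_DCNV4 = False

def select_core_op():
    """Pick the core op; warn and fall back when DCNv4 is unavailable."""
    if not _HAS_DCNV4:
        warnings.warn(
            "DCNv4 is not available; using a slower fallback. "
            "Install it with `pip install DCNv4`."
        )
        return "fallback"  # placeholder for a reference implementation
    return "DCNv4"

core = select_core_op()
```

Guarding at import time keeps `import` of the model file cheap, while the warning only fires when a model is actually built without the fused kernels.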