The results without pre-training #2

Open
Gyann-z opened this issue Apr 27, 2022 · 13 comments


Gyann-z commented Apr 27, 2022

Thanks for your implementation. Have you tried training on TextVQA without the layout-aware pre-training? Can you reproduce the results of the paper? For example, LaTr-base reports 44.06 with Rosetta-en OCR and 52.29 with Amazon-OCR.


uakarsh commented Apr 27, 2022

Actually, I do want to try to reproduce the results of the paper, but there are two problems:

  1. I don't have access to Amazon-OCR, which could lead to results that deviate from the paper's (although the IDL dataset can be used for pre-training, and I have added a script for that in the examples section).
  2. While going through the paper, I saw that they used heavy resources (in the ablations they mention using 8 A100 GPUs), which I don't have.

However, since I have already added the pre-training script, I will soon add a script for training on any custom dataset, and I hope that will help the community as well.

And I will surely update the repo and add weights if I find a way to pre-train and fine-tune the model to the results mentioned in the paper.

Regards,


MIL-VLG commented May 2, 2022

Thanks for your wonderful work. Looking forward to the training scripts for the downstream TextVQA datasets!

@furkanbiten

Hey @uakarsh,

I am the first author of LaTr. Thank you for the implementation, since I couldn't publish the code myself (well, it is Amazon's code).

Here are some things I can offer though:

  1. After my internship, I was able to run Amazon-OCR on 26 million pages of IDL; this can be found in my repo: https://github.com/furkanbiten/idl_data. When used for pre-training, it should give a reasonable improvement and get very close to the original numbers.

  2. Unfortunately, I don't have the resources at my university, so I can't offer the pre-training weights; however, if someone wants to do the pre-training, I can review the code and try to provide the hyperparameters as far as I remember them.

  3. You can ask me about, or point me to, any lines of code and I will try to answer as much as possible.

Sorry I can't do much more, since it is mostly out of my hands.

PS: I will try to get the Amazon-OCR results on TextVQA and ST-VQA sometime soon, hopefully.


uakarsh commented Jun 13, 2022

Hi @furkanbiten

Thanks for your reply and for appreciating the work. Looking forward to having a great conversation with you about these doubts.

Regarding your points:

  1. I have already included the pre-training script for the IDL dataset (it is definitely an amazing document dataset).

  2. I think this will require frequent iterations, and I will be doing that as well. This is my first time implementing scene-text VQA together with a multi-modal transformer, so it should be great. Hopefully I can get the desired results and a simple demo.

  3. I will surely do that.

I have two questions, so I thought I would ask:

  1. Visual Question Answering can be framed as either of two tasks: 1. classification, or 2. generation (i.e., predicting answer tokens and then decoding them). So, in LaTr, did you treat it as the 1st or the 2nd category?

  2. In pre-training, did you take the spatial features as well as the image features and concatenate them along dim = 1, and do the same in the fine-tuning stage as well? Am I correct?

Regards,
Akarsh

@furkanbiten

Hi again,

If you would like, we can move the discussion to a pinned issue so that it would get more visibility. Your call.

Here are the answers to the questions:

  1. First, let's set the terminology straight. You are correct on the first one: VQA people treat the problem as a classification problem, and this has been the usual approach in many VQA papers (almost all, as far as I know). The second one, which is what we try to do, is what we call "vocabulary-free" decoding. Even this name has problems, since we DO use a vocabulary; what we meant is that we DO NOT use the 5k most common answers built from the training set as a fixed vocabulary. Instead, we use the vocabulary from T5, which is a SentencePiece vocabulary. So we are closer to "generation", as you mentioned, but technically we are still using the SentencePiece vocabulary, just not a fixed vocabulary built from the training set. A better name is certainly needed.

  2. I will assume a couple of things, but let me know if they are correct. By spatial features, I assume you mean the embedding of the bounding box coordinates. In both pre-training and fine-tuning, we embed the OCR bounding boxes (BB) with nn.Embedding (as you did correctly in the code) and then simply SUM the BB features with the word embeddings (obtained from T5's shared embeddings). (i) We DO NOT use visual features in pre-training; I think this is an important detail that we may have missed mentioning. (ii) In the fine-tuning phase, the concatenation happens between the image features from ViT, the question embeddings summed with their BB embeddings (we set the BB for question tokens to 0, 0, 1000, 1000), and the OCR token embeddings summed with their corresponding BB embeddings. (A rough sketch of this is below.)

I hope this makes it a bit clearer.
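
To make points 1 and 2 concrete, here is a rough PyTorch sketch using Hugging Face T5 and ViT (with a recent transformers version). The class name `LaTrSketch`, the 1001-bin coordinate embeddings, and the ViT projection layer are assumptions for illustration only, not the original Amazon code; for pre-training, the image part would simply be dropped.

```python
# Rough sketch of points (1) and (2) above: layout is SUMMED into the token
# embeddings, ViT features are CONCATENATED at fine-tuning, and the answer is
# decoded with T5's SentencePiece vocabulary instead of a fixed answer list.
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, ViTModel


class LaTrSketch(nn.Module):
    def __init__(self, t5_name="t5-base",
                 vit_name="google/vit-base-patch16-224-in21k", n_bins=1001):
        super().__init__()
        self.t5 = T5ForConditionalGeneration.from_pretrained(t5_name)
        self.vit = ViTModel.from_pretrained(vit_name)
        d = self.t5.config.d_model
        # one embedding table per box coordinate (x1, y1, x2, y2), quantized to [0, 1000]
        self.box_emb = nn.ModuleList([nn.Embedding(n_bins, d) for _ in range(4)])
        self.vit_proj = nn.Linear(self.vit.config.hidden_size, d)

    def embed_tokens_with_layout(self, token_ids, boxes):
        # SUM of T5 word embeddings and bounding-box embeddings (point 2)
        x = self.t5.get_input_embeddings()(token_ids)      # (B, L, d)
        for i, emb in enumerate(self.box_emb):
            x = x + emb(boxes[..., i])                     # boxes: (B, L, 4) ints in [0, 1000]
        return x

    def build_inputs(self, pixel_values, q_ids, q_boxes, ocr_ids, ocr_boxes):
        img = self.vit_proj(self.vit(pixel_values).last_hidden_state)  # ViT patch features
        q = self.embed_tokens_with_layout(q_ids, q_boxes)              # question boxes = (0, 0, 1000, 1000)
        ocr = self.embed_tokens_with_layout(ocr_ids, ocr_boxes)        # OCR tokens + their boxes
        return torch.cat([img, q, ocr], dim=1)                         # CONCAT along the sequence dim

    def forward(self, pixel_values, q_ids, q_boxes, ocr_ids, ocr_boxes, labels):
        # fine-tuning step: seq2seq loss against the tokenized answer
        inputs_embeds = self.build_inputs(pixel_values, q_ids, q_boxes, ocr_ids, ocr_boxes)
        return self.t5(inputs_embeds=inputs_embeds, labels=labels)

    @torch.no_grad()
    def answer(self, tokenizer, pixel_values, q_ids, q_boxes, ocr_ids, ocr_boxes):
        inputs_embeds = self.build_inputs(pixel_values, q_ids, q_boxes, ocr_ids, ocr_boxes)
        out = self.t5.generate(inputs_embeds=inputs_embeds, max_length=20)
        # "vocabulary-free" decoding (point 1): the answer is whatever SentencePiece
        # decodes, not a pick from a fixed 5k answer list
        return tokenizer.batch_decode(out, skip_special_tokens=True)
```

In this reading, layout enters by addition to the token embeddings, the image enters by concatenation along the sequence dimension (only at fine-tuning), and the answer is free-form T5 generation rather than classification over a fixed answer set.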


uakarsh commented Jun 14, 2022

Thank you for your detailed answer; a lot of things became clear. And yes, you are right, I will create a new issue and link it to this issue's discussion. I am currently working on a step-by-step walkthrough of training LaTr on TextVQA, and hopefully, as we proceed, a lot more will become clear along the way!

Regards,
Akarsh

@furkanbiten

Glad it helped! You can contact me at any time and ask about anything that is unclear or needs clarification.


Gyann-z commented Jul 13, 2022

Thank you for your responses and contributions, @uakarsh.


Gyann-z commented Jul 13, 2022

Hey @furkanbiten

Thank you for your excellent work and detailed suggestions.

May I ask when you will release the Amazon-OCR results for the TextVQA and ST-VQA datasets? I would like to give them a try.

@furkanbiten

Hey @Gyann-z,
Thanks for the kind words, glad you liked the work.

I am actually trying to write my thesis, and in the meantime I am trying to run Amazon-OCR on TextVQA and ST-VQA at my university, since I couldn't take the data out of Amazon.
Of course, there are lots of errors when running the OCR outside of Amazon, and I am trying to fix them. Hopefully, in a week or two, I will be able to get it done and create a new repo for it.

I will also ask @uakarsh to refer to the repo so that more people know about it.
For the moment, all I can say is stay tuned.


Gyann-z commented Jul 14, 2022

Thanks! Looking forward to getting Amazon-OCR results soon.

@furkanbiten

Hey @Gyann-z @uakarsh,

I have some good news. I finally had the time to run Amazon-OCR on ST-VQA and TextVQA.

I have created a repo where you can find a small code snippet and the raw JSON files returned by the Amazon-OCR pipeline.

Here is the repo: https://github.com/furkanbiten/stvqa_amazon_ocr
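
As a quick, unverified sketch of how the raw files might be consumed: the snippet below assumes the JSON follows Amazon Textract's DetectDocumentText response format (WORD blocks with a normalized Geometry.BoundingBox) and scales the boxes into the [0, 1000] range discussed above. Check the snippet in the repo for the actual schema and adjust the keys if they differ; the helper name is made up for illustration.

```python
# Minimal reader for one raw OCR JSON file. ASSUMES Textract-style output
# ("Blocks" with "BlockType", "Text", "Geometry.BoundingBox"); adapt the keys
# to whatever the repo's files actually contain.
import json


def load_words_and_boxes(path, scale=1000):
    with open(path) as f:
        ocr = json.load(f)
    words, boxes = [], []
    for block in ocr.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue
        bb = block["Geometry"]["BoundingBox"]  # normalized Left/Top/Width/Height
        x1 = int(bb["Left"] * scale)
        y1 = int(bb["Top"] * scale)
        x2 = int((bb["Left"] + bb["Width"]) * scale)
        y2 = int((bb["Top"] + bb["Height"]) * scale)
        words.append(block["Text"])
        boxes.append((x1, y1, x2, y2))         # in the [0, 1000] range used for the BB embeddings
    return words, boxes
```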

Let me know if you have any problems.


Gyann-z commented Jul 19, 2022

Thank you very much, @furkanbiten! That's really good news for me.

@uakarsh uakarsh self-assigned this Mar 29, 2023