
Clarification regarding the implementation and training of LaTr #3

uakarsh opened this issue Jun 21, 2022 · 7 comments


uakarsh commented Jun 21, 2022

This thread contains the discussion of the implementation of LaTr with one of the authors of the paper.

The earlier discussion with the first author is mentioned here


uakarsh commented Jun 21, 2022

Hi @furkanbiten, although I have provided the script for VQA, there are a few questions I wanted to clear up.

Q.1 For the embedding, did you define the embedding layer separately or go with T5's encoder and decoder embedding? (I went with separate embeddings.)

Q.2 Since you said this is a classification task, how did you formulate it? I.e., did you take each sentence as a class, or convert each word of the answer into a token and then pad it to an appropriate length?

Thoughts about Q2: If we go with the second approach, then during validation there are a few words which are not present in the training set, and I could not figure out how to deal with them.

And if we go about tokenizing each word and then padding the answers to a desired length (for me it was 512), won't the number of class labels be very high, similar to MLM (Masked Language Modeling) pretraining? For me, it was around 37K classes (I took all the words from the training and validation answers and assigned an id to each of them), and I got an out-of-memory error (you can run and see the same in the examples/textvqa part 4 notebook).

What I was thinking was to take the top K words (following the second approach) and map all the remaining words to a single unknown token, as in the sketch below. I am not sure if that would work, but maybe it could at least help train the model and observe the results.
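A rough sketch of what I mean (the vocabulary size and the token names here are placeholders I picked, not anything from the paper):

```python
from collections import Counter

def build_answer_vocab(train_answers, k=5000):
    # Keep the K most frequent answer words as classes and map every
    # other word to a single unknown token.
    counts = Counter(word for ans in train_answers for word in ans.split())
    vocab = {"<unk>": 0, "<pad>": 1}
    for word, _ in counts.most_common(k):
        vocab[word] = len(vocab)
    return vocab

def encode_answer(answer, vocab, max_len=20):
    # Out-of-vocabulary words (e.g. validation-only words) fall back to <unk>.
    ids = [vocab.get(word, vocab["<unk>"]) for word in answer.split()]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))
```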

Thanks,

@furkanbiten

Hey,

A.1 We went with T5's word embedding layer, more specifically self.shared in the Huggingface T5 implementation.
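For reference, that shared table can be inspected directly on the Huggingface model (a minimal sketch, assuming the t5-base checkpoint):

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")

# T5 shares a single word-embedding table between the encoder and decoder.
# In the Huggingface implementation it is the `shared` attribute, an
# nn.Embedding of shape [vocab_size, d_model].
shared = model.shared
assert shared is model.get_input_embeddings()
print(shared.weight.shape)  # torch.Size([32128, 768]) for t5-base
```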

A.2 First of all, the answer can be a sentence, a word, or simply a number. You simply use the tokenizer from Huggingface. Since T5 uses the SentencePiece tokenizer, each answer will be tokenized accordingly. So you use this one:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
```

or

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
```

depending on which model weights you use to fine-tune.

So, with the tokenizer provided by Huggingface, you don't need to construct a vocabulary or anything.
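A minimal sketch of this formulation (the question and answer strings below are placeholders; in practice the input would include the question plus the OCR tokens):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

question = "question: what is written on the sign?"  # placeholder input
answer = "stop"                                       # placeholder target

inputs = tokenizer(question, return_tensors="pt")
labels = tokenizer(answer, return_tensors="pt").input_ids

# The answer is just another token sequence for the SentencePiece tokenizer,
# so the loss is cross-entropy over T5's own vocabulary; no separate answer
# vocabulary or 37K-class head is needed.
loss = model(**inputs, labels=labels).loss
```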


uakarsh commented Jun 30, 2022

Thanks for your clarification. I have tried to train the model; however, I currently get 23% validation accuracy.

  • The kaggle notebook for the same is attached here
  • The metrics and the progress report are attached here

However, there are just a few differences between your paper's implementation and mine (referring to Pg. 12, Fine-tuning section):

  • Your batch size is 25, while mine is 1 (definitely not even 2 could be used for training)
  • My training runs for 50K steps, while yours runs for 100K

All the other steps including the warm-up and linear decay have been taken care of.
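As a minimal sketch, a warm-up plus linear-decay schedule can be set up with the transformers helper as follows; the warm-up step count and learning rate here are placeholders, not values taken from the paper:

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in for the LaTr/T5 model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

total_steps = 50_000   # my run; the paper uses 100K
warmup_steps = 1_000   # placeholder value

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)

# In the training loop, call scheduler.step() after each optimizer.step().
```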

Can you suggest something which could improve the performance? The configuration for the runs (i.e., V4 and V6) is the same; you can see it here.

It is really a great learning experience for me, so thanks to you and the other authors of this paper.

Future Step:

  • I am thinking of using the current weights to train for more than 50K steps and see the results (maybe next week, since my GPU quota for Kaggle is over)

Update 1:

I did the above steps again and got almost the same validation accuracy.

Update 2:

I was able to achieve 28 percent accuracy, and I hope that continuing the training would increase it further.

@furkanbiten

Hey @uakarsh,

Sorry for the late reply, I have been really busy lately.

The first thing to realize is that if you use a batch size of 1, roughly speaking you "need" to train for 25 times more iterations, so that your model sees the whole data the same number of times. In other words, think in terms of epochs instead of iterations. So my guess is that simply training more should get you to a reasonable accuracy.

That said, you have to be careful with a batch size of 1. In my experience, it is usually harder for the model to converge with smaller batch sizes. I understand the limitations of the resources, but you can accumulate gradients over several batches before updating the weights. You will need more time to train, but at least it will be easier to make the model converge.
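A minimal sketch of the gradient-accumulation idea (the model, data, and loss below are dummies just to make it runnable; accumulating 25 micro-batches of size 1 gives an effective batch size of 25):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy model and data; swap in the LaTr model and the TextVQA dataloader.
model = torch.nn.Linear(16, 2)
dataset = TensorDataset(torch.randn(100, 16), torch.randint(0, 2, (100,)))
dataloader = DataLoader(dataset, batch_size=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

accumulation_steps = 25  # micro-batches of size 1 -> effective batch size 25

optimizer.zero_grad()
for step, (x, y) in enumerate(dataloader):
    loss = criterion(model(x), y)
    # Scale so the accumulated gradient matches the mean over the large batch.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```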


uakarsh commented Jul 15, 2022

Not an issue, take your time, I understand that your time is also valuable.

That's a really great idea, accumulating gradients (I did not think about it earlier). Also, I made a demo of LaTr: link. At the very least, it is amazing to observe the performance of the model on an unseen dataset.

And I will update here as soon as I get some new findings.

Cheers

@furkanbiten

Cool! I will check out the demo.

Thanks.


uakarsh commented Dec 15, 2022

Hi @furkanbiten, regarding the pre-training of LaTr, could you let me know the idea behind how it was done?

By idea, I mean: did you take a subset of the IDL dataset and then overfit on that entire subset? Or was it similar to other training procedures, where we split the data into train and validation and save the checkpoint where the validation loss is minimum? (I have implemented the pre-training task, but I am not sure how to train the model in such a setup.)
