
arwen is checkpoint progression outlier? #199

Open
hawkrobe opened this issue Jul 4, 2023 · 12 comments

@hawkrobe commented Jul 4, 2023

I had a quick backchannel with @siddk, but was curious whether anyone else had noticed that the Arwen seed is an extreme outlier in its checkpoint progression. We've been examining properties of attention matrices across the training trajectory, and noticed that at Arwen's first checkpoint (checkpoint-10), its internal state and behavior look almost exactly like the internal states and behavior that the 9 other seeds reach significantly later, around checkpoint-4000. It made us wonder whether the checkpoint labeling scheme might be different for Arwen.

Some (internal) plots are attached as examples; Arwen shows up as an outlier on every metric we've tried. The most dramatic example for us was the final plot, which shows a rather complex summary statistic computed on attention matrices across layers. It was striking how this highly derived metric shows, at the very beginning, precisely the same profile across layers that the other models only reach much later, and it also stays rather stable for Arwen up to that point, when it starts changing again.

We've checked carefully for bugs in our own code, and it's possible there's something we're missing, but we're running all the different models through the same pipeline with a fresh pull of the checkpoints, so it does seem to be a property of the checkpoints themselves (a sketch of the kind of per-checkpoint statistic we compute is below, after the attached plots). We're trying to determine whether the Arwen seed genuinely stumbled onto this pattern extremely early on, which seems unlikely given the learning rate and the relatively small number of observations up to that point, or whether something got jumbled up with the labels.

We're extremely grateful for MISTRAL as an incredible resource, and would very much appreciate any advice from others who have played with the checkpoints.

Accuracy on task (pdf)
Aggregated attention matrix statistic (pdf)
Layerwise attention matrix statistic (pdf)
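
For concreteness, here's a minimal sketch of the kind of per-checkpoint statistic we compute. The `stanford-crfm/<seed>-gpt2-medium-x<run>` repo ids, the `checkpoint-<step>` revision scheme, and the entropy statistic itself are illustrative assumptions, not our exact pipeline:

```python
# Illustrative sketch only -- repo ids, revision names, and the statistic
# are assumptions, not the actual analysis pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SEEDS = {  # hypothetical Hub repo ids for two of the ten seeds
    "arwen": "stanford-crfm/arwen-gpt2-medium-x21",
    "celebrimbor": "stanford-crfm/celebrimbor-gpt2-medium-x81",
}

def attention_entropy_by_layer(repo: str, revision: str, text: str) -> list[float]:
    """Mean entropy of the attention rows in each layer -- one simple summary."""
    tok = AutoTokenizer.from_pretrained(repo, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(repo, revision=revision).eval()
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    stats = []
    for attn in out.attentions:  # one (1, heads, seq, seq) tensor per layer
        ent = -(attn * (attn + 1e-12).log()).sum(-1)  # entropy of each attention row
        stats.append(ent.mean().item())
    return stats

for seed, repo in SEEDS.items():
    print(seed, attention_entropy_by_layer(repo, "checkpoint-10", "Attention is all you need."))
```

If the labels were jumbled, Arwen's checkpoint-10 profile here should line up with the other seeds' much later checkpoints rather than with their checkpoint-10 profiles.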

@siddk (Contributor) commented Jul 4, 2023

CC @J38 @dlwh @Tiiiger and @lorr1; do y'all remember if other folks who've been doing interpretability work with Mistral checkpoints have run into this before?

@J38 (Contributor) commented Jul 6, 2023

I don't see any evidence that arwen is different from celebrimbor ... if you look at the logged loss curves they are very similar ... so this seems to suggest there is some kind of labeling issue ...

@J38 (Contributor) commented Jul 6, 2023

We should probably download the step-10 checkpoints for each model run and check the loss on wikitext ...
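
A minimal sketch of that spot check (the wikitext-103 split, the repo ids, and the `checkpoint-10` revision name are assumptions; swap in whichever wikitext variant the original eval used):

```python
# Sketch of the step-10 wikitext spot check. Repo ids, the "checkpoint-10"
# revision name, and the wikitext-103 split are assumptions.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def wikitext_loss(repo: str, revision: str, n_docs: int = 50) -> float:
    tok = AutoTokenizer.from_pretrained(repo, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(repo, revision=revision).eval()
    ds = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
    losses = []
    for text in ds["text"][:n_docs]:
        if len(text.strip()) < 20:  # skip empty lines and bare headings
            continue
        ids = tok(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            out = model(**ids, labels=ids["input_ids"])
        losses.append(out.loss.item())
    return sum(losses) / len(losses)

for seed, repo in SEEDS.items():  # SEEDS dict from the sketch above
    loss = wikitext_loss(repo, "checkpoint-10")
    print(f"{seed}: loss={loss:.3f}  ppl={math.exp(loss):.1f}")
```

If arwen's step-10 loss comes out near the other seeds' late-training loss rather than near their step-10 loss, that's consistent with a labeling mix-up.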

@J38 (Contributor) commented Jul 6, 2023

So for whatever reason the arwen step-10 checkpoint is wrong ... I am not sure where that error occurred ... if you download the arwen checkpoint and the celebrimbor checkpoint they have wildly different losses ...

@J38 (Contributor) commented Jul 6, 2023

The arwen step-10 checkpoint does not have a loss on wikitext or lambada consistent with the trainer_state logging ... I will spot sample some other checkpoints ...
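
One way to check that programmatically, assuming each checkpoint revision carries a Trainer-style trainer_state.json whose log_history entries record a "step" and a "loss" (the file layout and keys are assumptions):

```python
# Sketch of checking a checkpoint's measured loss against the trainer_state
# logging. Assumes a Trainer-style trainer_state.json lives in each revision.
import json
from huggingface_hub import hf_hub_download

def logged_loss(repo: str, revision: str, step: int) -> float | None:
    path = hf_hub_download(repo, "trainer_state.json", revision=revision)
    with open(path) as f:
        state = json.load(f)
    for entry in state.get("log_history", []):
        if entry.get("step") == step and "loss" in entry:
            return entry["loss"]
    return None

repo = "stanford-crfm/arwen-gpt2-medium-x21"  # hypothetical repo id
print("logged:", logged_loss(repo, "checkpoint-10", 10))
print("measured:", wikitext_loss(repo, "checkpoint-10"))  # from the sketch above
```

The two losses aren't computed on the same data, so this is an order-of-magnitude comparison: a step-10 checkpoint that evaluates like a step-4000 one is the red flag.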

@J38 (Contributor) commented Jul 6, 2023

At some point all of these checkpoints were stored on Google Cloud (before we deleted them) ... when they were migrated to Hugging Face I did a random sample, comparing checkpoints on HF against Google Cloud, and none of the samples was a mismatch ...
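
For anyone re-running that kind of comparison, here is a sketch of a file-level spot check, assuming both copies have first been synced to local directories (the paths are hypothetical; e.g. `gsutil -m cp -r` for the GCS side and a local clone of the HF revision):

```python
# Sketch of a file-level spot check between two local mirrors of a checkpoint.
# Directory paths are hypothetical.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

gcs_dir = Path("gcs/arwen/checkpoint-10")  # local mirror of the GCS copy
hf_dir = Path("hf/arwen/checkpoint-10")    # local clone of the HF revision
for f in sorted(gcs_dir.rglob("*")):
    if f.is_file():
        twin = hf_dir / f.relative_to(gcs_dir)
        status = "OK" if twin.exists() and sha256(f) == sha256(twin) else "MISMATCH"
        print(f"{status}  {f.relative_to(gcs_dir)}")
```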

@J38 (Contributor) commented Jul 6, 2023

My basic analysis right now is that something is off with the arwen checkpoints below 3000 (maybe even higher) ... it looks like after 3000 the checkpoints have the expected loss values ... the celebrimbor ones below 3000 seem fine ... hopefully this is isolated to the early checkpoints for arwen ...
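
Given that, one could bisect over the saved steps to localize the first checkpoint where arwen agrees with celebrimbor. The step list, tolerance, and repo ids below are assumptions, and `wikitext_loss` is from the sketch above:

```python
# Sketch of localizing the boundary by bisection. Assumes the predicate is
# monotone (bad ... bad, then good ... good), which matches the observation
# that everything after ~3000 looks fine. Step list and tolerance are assumed.
ARWEN = "stanford-crfm/arwen-gpt2-medium-x21"            # hypothetical repo ids
CELEBRIMBOR = "stanford-crfm/celebrimbor-gpt2-medium-x81"
steps = [10, 100, 500, 1000, 2000, 3000, 4000]           # assumed save points

def consistent(step: int, tol: float = 0.5) -> bool:
    rev = f"checkpoint-{step}"
    return abs(wikitext_loss(ARWEN, rev) - wikitext_loss(CELEBRIMBOR, rev)) < tol

lo, hi = 0, len(steps) - 1  # assumes steps[hi] is known-good (post-3000)
while lo < hi:
    mid = (lo + hi) // 2
    if consistent(steps[mid]):
        hi = mid
    else:
        lo = mid + 1
print("first consistent checkpoint:", steps[lo])
```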

@J38 (Contributor) commented Jul 6, 2023

As I said before, I am not sure at what point in the process this issue emerged ... it's possible the original arwen checkpoints were incorrect, or that something happened in the copying and uploading to HF ...

@siddk (Contributor) commented Jul 6, 2023

@J38 @dlwh - are the original checkpoints still in the GCP bucket? Can we try finding the originals somewhere? They might also be on the NLP cluster?

@hawkrobe (Author) commented Jul 6, 2023

@J38 thanks so much for looking into this. It's a relief (on our end) to hear that the deviations from expected loss values pre-3000 are consistent with our observations of other properties pre-3000 (everything else seems to align after 3000).

@siddk (Contributor) commented Jul 6, 2023

Glad we're starting to get to the bottom of this. @hawkrobe - sorry that I didn't surface this sooner in the original email thread. Hopefully we still have the originals around, and can rectify this!

@J38 (Contributor) commented Jul 7, 2023

They're deleted and I think you did it ... or me ... don't remember ...
