
Provide context for this repository #4

Closed · russellb opened this issue Jun 24, 2024 · 9 comments

Comments

@russellb (Member) commented Jun 24, 2024

I can't find any explanation for why this repository was necessary. Maybe it's in a PR somewhere? The README points to the original repo, but doesn't explain why a fork is necessary.

The typical process for adopting a new repository under InstructLab is a proposal to the instructlab/dev-docs repository and getting it approved by the Oversight Committee. This shouldn't block any work, as all technical work could be done in a personal fork of the repo in the meantime.

Some comments in this issue providing background would help me give guidance on what to do from here.

@russellb (Member, Author)

At a minimum, here's the output I'd like to see to resolve this issue:

  1. an update to the README that explains how this fork differs from the upstream repo,
  2. an explanation of what it would take for this fork to go away so we can move back to the upstream repo, and
  3. some explanation of the process used to ensure we stay in sync with the upstream repo where appropriate.

@mayank31398 commented Jun 24, 2024

  1. The upstream repo has a training framework for both pretraining and finetuning.
  2. The upstream repo supports many more model architectures and systems optimizations.
  3. None of this is needed for ILAB.
  4. The current fork contains only the functionality relevant to ILAB: the GPTDolomiteModel class, which is needed to use the Padding Free transformer optimization (sketched below).
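
For context on item 4: padding-free training packs a batch's variable-length sequences into one flat token stream plus a vector of cumulative sequence lengths, rather than padding every sequence to the batch maximum, so no compute is spent on pad tokens. Below is a minimal sketch of just the packing step, assuming PyTorch and the flash-attn-style `cu_seqlens` convention; the function and variable names are illustrative and not this repo's actual API.

```python
import torch

def pack_padding_free(sequences: list[torch.Tensor]):
    """Pack variable-length token sequences into a single flat tensor.

    Instead of padding each sequence to the batch maximum, concatenate
    them and record cumulative sequence lengths (cu_seqlens), which
    varlen attention kernels (e.g. flash-attn) use to keep sequences
    from attending across their boundaries.
    """
    input_ids = torch.cat(sequences)  # shape: (total_tokens,)
    lengths = torch.tensor([len(s) for s in sequences])
    cu_seqlens = torch.zeros(len(sequences) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(lengths, dim=0)
    max_seqlen = int(lengths.max())
    return input_ids, cu_seqlens, max_seqlen

# Example: sequences of lengths 3, 5, and 2 pack into 10 token slots
# with cu_seqlens = [0, 3, 8, 10], versus 15 slots (3 x 5) if padding
# every sequence to the longest one.
seqs = [torch.arange(3), torch.arange(5), torch.arange(2)]
ids, cu, max_len = pack_padding_free(seqs)
```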

@russellb (Member, Author)

> 1. The upstream repo has a training framework for both pretraining and finetuning.
> 2. The upstream repo supports many more model architectures and systems optimizations.
> 3. None of this is needed for ILAB.
> 4. The current fork contains only the functionality relevant to ILAB: the GPTDolomiteModel class, which is needed to use the Padding Free transformer optimization.

Only needing a subset of a library is incredibly common. That doesn't justify a fork.

I see from instructlab/training#55 that one blocking issue was that the original repo does not publish releases on PyPI, which InstructLab requires. That would be a reason for a temporary fork, at least, if we were unable to get the original library released quickly enough for our needs. If that's the only issue, the fork could go away as soon as the upstream library starts making its own releases.

I assume there's still more to it that I'm missing?

@RobotSail (Member)

We need to publish GPTDolomite as a package and consume it in training. Since the dolomite-engine repo owned by IBM contains more than just the model, we pulled the model out into this repo.

@russellb (Member, Author)

> We need to publish GPTDolomite as a package and consume it in training. Since the dolomite-engine repo owned by IBM contains more than just the model, we pulled the model out into this repo.

But if that repo had published its own releases, would that have solved the main issue?

Or did the extra features in the upstream repo cause a problem of some kind? If so, are those problems tracked somewhere?

@aldopareja (Member)

Let me lay out a few more reasons why we need this:

  1. The upstream repository is an internal tool used by the pretraining team at IBM, and they won't maintain it for our needs. We carved this out because otherwise maintainability would be a nightmare.
  2. We won't be modifying this for the time being; it's already a pretty robust pipeline.
  3. It makes it easier to expose this to the community, who can contribute other import functions to build all kinds of models (all the building blocks are here), which gives a 20% boost in training.
  4. It fits the philosophy of the training library, which should be a small, extremely fast (throughput-wise) library where everything is exposed enough that anyone upstream can easily modify it. Without this repository, that would not be possible.

We will be including more documentation on what this repo is and how it fits into the training repository itself.

@russellb (Member, Author)

OK - my takeaway is that the upstream project is not usable by InstructLab:

  • not published as a library for others to use
  • no interest in doing so along with the maintenance and API stability that it would require?

So, we had no choice but to fork it and maintain our own version of the subset needed by InstructLab?

Is that about right?

@RobotSail (Member)

Long-term we should come up with a better solution, but for RHEL AI GA this will be fine.

@russellb (Member, Author)

No suggestion for an immediate change -- I just wanted to make sure the context was recorded so that the reason for the fork is clear to those who come after us. I'll propose a README update as a resolution to this issue.
