
Provide context for this repository #4

Closed · russellb opened this issue Jun 24, 2024 · 9 comments

Comments

@russellb (Member) commented Jun 24, 2024

I can't find any explanation for why this repository was necessary. Maybe it's in a PR somewhere? The README points to the original repo, but doesn't explain why a fork is necessary.

The typical process for adopting a new repository under InstructLab is a proposal to the instructlab/dev-docs repository and getting it approved by the Oversight Committee. This shouldn't block any work, as all technical work could be done in a personal fork of the repo in the meantime.

Some comments in this issue providing background would help me give guidance on what to do from here.

@russellb (Member, Author)

At a minimum, here's the output I'd like to see to resolve this issue:

  1. an update to the README that explains how this fork differs from the upstream repo,
  2. an explanation of what it would take for this fork to go away so we can move back to the upstream repo, and
  3. some explanation of the process used to ensure we stay in sync with the upstream repo where appropriate.

@mayank31398 commented Jun 24, 2024

  1. The upstream repo has a training framework for both pretraining and finetuning.
  2. The upstream repo supports many more model architectures and systems optimizations.
  3. None of this is needed for ILAB.
  4. The current fork contains only the functionality relevant to ILAB: the GPTDolomiteModel class, which is needed to use the Padding Free transformer optimization (sketched below).
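
For context on item 4: padding-free training packs a batch's variable-length sequences into one flat token stream plus a vector of cumulative sequence lengths, rather than padding every sequence to the batch maximum, so no compute is spent on pad tokens. Below is a minimal sketch of just the packing step, assuming PyTorch and the flash-attn-style `cu_seqlens` convention; the function and variable names are illustrative and not this repo's actual API.

```python
import torch

def pack_padding_free(sequences: list[torch.Tensor]):
    """Pack variable-length token sequences into a single flat tensor.

    Instead of padding each sequence to the batch maximum, concatenate
    them and record cumulative sequence lengths (cu_seqlens), which
    varlen attention kernels (e.g. flash-attn) use to keep sequences
    from attending across their boundaries.
    """
    input_ids = torch.cat(sequences)  # shape: (total_tokens,)
    lengths = torch.tensor([len(s) for s in sequences])
    cu_seqlens = torch.zeros(len(sequences) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(lengths, dim=0)
    max_seqlen = int(lengths.max())
    return input_ids, cu_seqlens, max_seqlen

# Example: sequences of lengths 3, 5, and 2 pack into 10 token slots
# with cu_seqlens = [0, 3, 8, 10], versus 15 slots (3 x 5) if padding
# every sequence to the longest one.
seqs = [torch.arange(3), torch.arange(5), torch.arange(2)]
ids, cu, max_len = pack_padding_free(seqs)
```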

@russellb (Member, Author)

> 1. The upstream repo has a training framework for both pretraining and finetuning.
> 2. The upstream repo supports many more model architectures and systems optimizations.
> 3. None of this is needed for ILAB.
> 4. The current fork contains only the functionality relevant to ILAB: the GPTDolomiteModel class, which is needed to use the Padding Free transformer optimization.

Only needing a subset of a library is incredibly common. That doesn't justify a fork.

I see from instructlab/training#55 that one blocking issue was that the original repo does not publish releases on PyPI, which InstructLab requires. That would be a reason for a temporary fork, at least, if we were unable to get the original library released quickly enough for our needs. If that's the only issue, the fork could go away as soon as the upstream library starts making its own releases.

I assume there's still more to it that I'm missing?

@RobotSail (Member)

We need to publish GPTDolomite as a package and consume it in training. Since the dolomite-engine repo owned by IBM contains more than just the model, we pulled the model out into this repo.

@russellb (Member, Author)

> We need to publish GPTDolomite as a package and consume it in training. Since the dolomite-engine repo owned by IBM contains more than just the model, we pulled the model out into this repo.

But if that repo had published its own releases, would that have solved the main issue?

Or did the extra features in the upstream repo cause a problem of some kind? If so, are those problems tracked somewhere?

@aldopareja (Member)

Let me lay out a few more reasons why we need this:

  1. The upstream repository is an internal tool used by the pretraining team at IBM, and they won't maintain it for our needs. We carved this out because otherwise maintainability would be a nightmare.
  2. We won't be modifying this for the time being; it's already a pretty robust pipeline.
  3. It makes it easier to expose this to the community, who can contribute other import functions to build all kinds of models (all the building blocks are here), which gives a 20% boost in training.
  4. It fits the philosophy of the training library, which should be a small, extremely fast (throughput-wise) library where everything is exposed enough that anyone upstream can easily modify it. Without this repository, that would not be possible.

We will be including more documentation on what this repo is and how it fits into the training repository itself.

@russellb (Member, Author)

OK - my takeaway is that the upstream project is not usable by InstructLab:

  • not published as a library for others to use
  • no interest in doing so along with the maintenance and API stability that it would require?

So, we had no choice but to fork it and maintain our own version of the subset needed by InstructLab?

Is that about right?

@RobotSail (Member)

Long-term we should come up with a better solution, but for RHEL AI GA this will be fine.

@russellb (Member, Author)

No suggestion for an immediate change -- I just wanted to make sure the context was recorded so that the reason for the fork is clear to those who come after us. I'll propose a README update as a resolution to this issue.
