Embeddings finetuning (#124)
* add embeddings finetuning quick and dirty code

* formatting

* add embeddings finetuning pipeline

* refactor

* works for multi-gpu setup

* update gitignore

* scratch test in notebook

* pipeline is working again

* no need for custom class

* visualize performance comparison

* compare apples to apples

* fix viz bug and strip out code / logs examples

* update print statement

* mini updates

* add dummy embedding pipeline to compare against

* updates to collate

* add dummy pipeline and refactor

* new pipelines

* run locally, fix attribute naming error

* full run

* next experiment

* Add ollama ignores to .gitignore

* add retries to question generation

* generate 3 sets of questions

* handle multiple question generation

* add retries for question generation function

* more robustness around failures

* formatting

* fix dataloader splitting (on newline)

* inspect test dataloader construction WIP

* split chunks by headers

* strip out boilerplate at top of markdown docs

* refactor & fix chunking logic

* add artifact metadata logging

* fix annotations and artifact metadata logging

* update metadata

* update logging statements

* add time estimates to logging

* log estimated completion time

* update log statements

* add rate for logging

* Update generation rate message

* fix eval

* limited run + limited eval

* full eval

* distilroberta as base model

* smaller batch size for a larger base model

* fix eval mismatch and chunking

* chore: Update .gitignore to exclude .flashrank_cache file

* chore: Add notebook for argilla

* chore: Add .gitignore entry for bge-base-financial-matryoshka

* Update sentence-transformers to version 3 and add transformers dependency

* add data and get finetuning working

* chore: Add .gitignore entry for embeddings

* Update zenml dependency to version 0.58.2 with server support

* chore: Update dependencies and add .gitignore entry for embeddings

* Add comments Argilla - David Berenstein (#119)

* Add introduction to argilla_embeddings

* Update context for data and synthetic query generation

* Added argilla data section

* embeddings updates

* update gitignore

* format

* add push to HF imports correctly

* remove extra connection cell

* add final step to pipeline as active

* fix typo

* updates

* Feature/finetune embeddings (#120)

* Add introduction to argilla_embeddings

* Update context for data and synthetic query generation

* Added argilla data section

* Update Argilla code

* Remove cell outputs

* Remove secrets

* Remove duplicate Argilla section

* Update dataset name

* Update phrasing Argilla

* Update merge conflicts

* fixes for dataset generation

* Update vector naming (#121)

* Add introduction to argilla_embeddings

* Update context for data and synthetic query generation

* Added argilla data section

* Update Argilla code

* Remove cell outputs

* Remove secrets

* Remove duplicate Argilla section

* Update dataset name

* Update phrasing Argilla

* Update merge conflicts

* Update vector naming

* working argilla upload

* fix argilla upload

* add wandb requirement

* add finetuning step to pipeline

* ignore local model files

* working finetune step

* finetune step pushes to the hub now

* update gitignore

* update pipeline

* remove wandb tracking

* broken version

* remove automatic wandb tracking

* return results

* try with larger base model

* add new model to gitignore

* use snowflake model

* add model registration

* remove constant

* add model throughout both pipelines

* finetuning steps in logical order now

* add new constant

* update model version

* use constant for model name

* various improvements :)

* add argilla flywheel functionality

* refactor out a constant for the dataset name

* add license

* more small changes

* Various improvements (#123)

* Resolve argilla client default workspace warning

* Add constant for other pipeline steps

* Add constant for finetune_embeddings

* Remove step embeddings constants

* import constants in pipeline directly

* Remove constants from distilabel_generation

* Add workflow for basic dataset filtering

* Add updated plots

---------

Co-authored-by: Alex Strick van Linschoten <[email protected]>

* format and remove unused imports

* fix push step

* fixed finetuning step

* use github repo to install for latest changes

* update README

* Delete llm-complete-guide/__init__.py

* Update llm-complete-guide/requirements.txt

* update README

* format and add results visualization

* remove extra comma

* add visualization step

* visualization corrections

* make distilabel imports clearer

* update README docs

* address TODO comment

* credit phil

* fix typo

* add actual values to chart

* add link to LLMOps guide

* make visualization a bit nicer

* update main README

---------

Co-authored-by: David Berenstein <[email protected]>
strickvl and davidberenstein1957 authored Aug 8, 2024
1 parent 2bbac4e commit a0eee84
Showing 35 changed files with 5,244 additions and 26 deletions.
16 changes: 14 additions & 2 deletions .gitignore
@@ -150,6 +150,18 @@ zencoder/cloned_public_repos
llm-lora-finetuning/ckpt/
llm-lora-finetuning/data_generation/
llm-lora-finetuning/datagen/
nohup.out
fiftyone-ls-demo/
llm-lora-finetuning/mistral-zenml-finetune/
.flashrank_cache

bge-base-financial-matryoshka/
embeddings
llm-lora-finetuning/meta-llama/
llm-lora-finetuning/microsoft/
llm-lora-finetuning/unsloth/
llm-lora-finetuning/configs/shopify.yaml
finetuned-matryoshka/
finetuned-all-MiniLM-L6-v2/
finetuned-snowflake-arctic-embed-m/

# ollama ignores
nohup.out
2 changes: 1 addition & 1 deletion README.md
@@ -73,7 +73,7 @@ A list of updated and maintained projects by the ZenML team and the community:
| [LLM RAG Pipeline with Langchain and OpenAI](llm-agents/) | NLP, LLMs | `slack` `langchain` `llama_index` |
| [Orbit User Analysis](orbit-user-analysis) | Data Analysis, Tabular | - |
| [Huggingface to Sagemaker](huggingface-sagemaker) | NLP | `pytorch` `mlflow` `huggingface` `aws` `s3` `kubeflow` `slack` `github` |
| [Complete Guide to LLMs (from RAG to finetuning)](llm-complete-guide) | NLP, LLMs | `openai` `supabase` |
| [Complete Guide to LLMs (from RAG to finetuning)](llm-complete-guide) | NLP, LLMs, embeddings, finetuning | `openai` `supabase` `huggingface` `argilla` |
| [LLM LoRA Finetuning (Phi3 and Llama 3.1)](llm-lora-finetuning) | NLP, LLMs | `gcp` |
| [ECP Price Prediction with GCP Cloud Composer](airflow-cloud-composer-etl-feature-train/) | Regression, Airflow | `cloud-composer` `airflow` |

48 changes: 47 additions & 1 deletion llm-complete-guide/README.md
@@ -116,7 +116,7 @@ Note that Claude will require a different API key from Anthropic. See [the
`litellm` docs](https://docs.litellm.ai/docs/providers/anthropic) on how to set
this up.

### Run the evaluation pipeline
### Run the LLM RAG evaluation pipeline

To run the evaluation pipeline, you can use the following command:

@@ -127,6 +127,52 @@ python run.py --evaluation
You'll need to have run the RAG pipeline first so that the necessary assets
are in the database to evaluate.

## Embeddings finetuning

For embeddings finetuning we first generate synthetic data and then finetune the
embeddings. Both of these pipelines are described in [the LLMOps guide](https://docs.zenml.io/v/docs/user-guide/llmops-guide/finetuning-embeddings) and
instructions for how to run them are provided below.

### Run the `distilabel` synthetic data generation pipeline

To run the `distilabel` synthetic data generation pipeline, you can use the following commands:

```shell
pip install -r requirements-argilla.txt # special requirements
python run.py --synthetic
```

You will also need to have set up and connected to an Argilla instance for this
to work. Please follow the instructions in the [Argilla
documentation](https://docs.argilla.io/latest/getting_started/quickstart/)
to set up and connect to an Argilla instance on the Hugging Face Hub. [ZenML's
Argilla integration
documentation](https://docs.zenml.io/v/docs/stack-components/annotators/argilla)
will guide you through the process of connecting to your instance as a stack
component.
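The commit history above mentions adding retries to the question generation step, since calls to an LLM for synthetic data can fail transiently. A minimal retry-with-backoff wrapper sketching that idea (the `flaky_generate` helper is a hypothetical stand-in, not the repository's code):

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Call fn(); on failure, retry with exponential backoff.

    Re-raises the last exception once attempts are exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2**attempt))


# Toy usage: a "generator" that fails twice, then succeeds.
calls = {"n": 0}


def flaky_generate():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient LLM error")
    return ["What is ZenML?"]


questions = with_retries(flaky_generate, base_delay=0.01)
```

In the real pipeline the wrapped function would call the generation model; the backoff keeps transient API errors from failing the whole run.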

### Finetune the embeddings

To run the pipeline for finetuning the embeddings, you can use the following
commands:

```shell
pip install -r requirements-argilla.txt # special requirements
python run.py --embeddings
```

As with the previous pipeline, you will need to have set up and connected to
an Argilla instance for this to work; see the instructions in the previous
section for how to do this.

*Credit to Phil Schmid for his [tutorial on embeddings finetuning with Matryoshka
loss function](https://www.philschmid.de/fine-tune-embedding-model-for-rag) which we adapted for this project.*
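The Matryoshka loss mentioned above trains embeddings whose leading dimensions are useful on their own, so vectors can be truncated at query time to trade accuracy for speed and storage. A toy sketch of the truncation step (illustrative only, not the repository's code):

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components of a Matryoshka-style embedding
    and re-normalize to unit length so cosine similarity stays meaningful."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


full = [0.5, 0.5, 0.5, 0.5]  # toy 4-dim "full" embedding
small = truncate_and_normalize(full, 2)  # each component becomes 1/sqrt(2)
```

This is why the training dimensions are listed large to small: the loss weights the leading dimensions so the shortest prefix still retrieves well.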

## ☁️ Running in your own VPC

The basic RAG pipeline will run using a local stack, but if you want to improve
Empty file removed llm-complete-guide/__init__.py
Empty file.
41 changes: 40 additions & 1 deletion llm-complete-guide/constants.py
@@ -15,7 +15,6 @@
# limitations under the License.
#


# Vector Store constants
CHUNK_SIZE = 2000
CHUNK_OVERLAP = 50
@@ -35,3 +34,43 @@
"claude3": "claude-3-opus-20240229",
"claudehaiku": "claude-3-haiku-20240307",
}

# CHUNKING_METHOD = "split-by-document"
CHUNKING_METHOD = "split-by-header"
DATASET_NAME = f"zenml/rag_qa_embedding_questions_{CHUNKING_METHOD}"
MODEL_PATH = "all-MiniLM-L6-v2"
# MODEL_PATH = "embedding-data/distilroberta-base-sentence-transformer"
NUM_EPOCHS = 30
WARMUP_STEPS = 0.1  # 10% of training steps
NUM_GENERATIONS = 2
EVAL_BATCH_SIZE = 64

DUMMY_DATASET_NAME = "embedding-data/sentence-compression"
# DUMMY_MODEL_PATH = "embedding-data/distilroberta-base-sentence-transformer"
DUMMY_MODEL_PATH = "all-MiniLM-L6-v2"
DUMMY_EPOCHS = 10

# Markdown Loader constants
FILES_TO_IGNORE = [
"toc.md",
]

# embeddings finetuning constants
EMBEDDINGS_MODEL_NAME_ZENML = "finetuned-zenml-docs-embeddings"
DATASET_NAME_DEFAULT = "zenml/rag_qa_embedding_questions_0_60_0"
DATASET_NAME_DISTILABEL = f"{DATASET_NAME_DEFAULT}_distilabel"
DATASET_NAME_ARGILLA = DATASET_NAME_DEFAULT.replace("zenml/", "")
OPENAI_MODEL_GEN = "gpt-4o"
OPENAI_MODEL_GEN_KWARGS_EMBEDDINGS = {
"temperature": 0.7,
"max_new_tokens": 512,
}
EMBEDDINGS_MODEL_ID_BASELINE = "Snowflake/snowflake-arctic-embed-m"
EMBEDDINGS_MODEL_ID_FINE_TUNED = "finetuned-snowflake-arctic-embed-m"
EMBEDDINGS_MODEL_MATRYOSHKA_DIMS: list[int] = [
384,
256,
128,
64,
] # Important: large to small
USE_ARGILLA_ANNOTATIONS = False
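The `CHUNKING_METHOD = "split-by-header"` constant above selects splitting the markdown docs at headers rather than keeping whole documents. A minimal sketch of header-based chunking (a hypothetical helper, not the repository's implementation):

```python
import re


def split_by_header(markdown: str) -> list[str]:
    """Split a markdown document into chunks at level-1/2 headers.

    Each chunk starts with its header line, keeping the header as
    context for the text beneath it.
    """
    chunks = re.split(r"\n(?=#{1,2} )", markdown)
    return [c.strip() for c in chunks if c.strip()]


doc = "# Intro\nhello\n## Setup\nsteps\n# Usage\nrun it"
chunks = split_by_header(doc)  # three chunks, one per header
```

Chunking at headers keeps each chunk topically coherent, which matters because the synthetic questions are generated per chunk.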
166 changes: 166 additions & 0 deletions llm-complete-guide/data/test_dataset.json

Large diffs are not rendered by default.

