Embeddings finetuning (#124)
* add embeddings finetuning quick and dirty code

* formatting

* add embeddings finetuning pipeline

* refactor

* works for multi-gpu setup

* update gitignore

* scratch test in notebook

* pipeline is working again

* no need for custom class

* visualize performance comparison

* compare apples to apples

* fix viz bug and strip out code / logs examples

* update print statement

* mini updates

* add dummy embedding pipeline to compare against

* updates to collate

* add dummy pipeline and refactor

* new pipelines

* run locally, fix attribute naming error

* full run

* next experiment

* Add ollama ignores to .gitignore

* add retries to question generation

* generate 3 sets of questions

* handle multiple question generation

* add retries for question generation function

* more robustness around failures

* formatting

* fix dataloader splitting (on newline)

* inspect test dataloader construction WIP

* split chunks by headers

* strip out boilerplate at top of markdown docs

* refactor & fix chunking logic

* add artifact metadata logging

* fix annotations and artifact metadata logging

* update metadata

* update logging statements

* add time estimates to logging

* log estimated completion time

* update log statements

* add rate for logging

* Update generation rate message

* fix eval

* limited run + limited eval

* full eval

* distilroberta as base model

* smaller batch size for a larger base model

* fix eval mismatch and chunking

* chore: Update .gitignore to exclude .flashrank_cache file

* chore: Add notebook for argilla

* chore: Add .gitignore entry for bge-base-financial-matryoshka

* Update sentence-transformers to version 3 and add transformers dependency

* add data and get finetuning working

* chore: Add .gitignore entry for embeddings

* Update zenml dependency to version 0.58.2 with server support

* chore: Update dependencies and add .gitignore entry for embeddings

* Add comments Argilla - David Berenstein (#119)

* Add introduction to argilla_embeddings

* Update context for data and synthetic query generation

* Added argilla data section

* embeddings updates

* update gitignore

* format

* add push to HF imports correctly

* remove extra connection cell

* add final step to pipeline as active

* fix typo

* updates

* Feature/finetune embeddings (#120)

* Add introduction to argilla_embeddings

* Update context for data and synthetic query generation

* Added argilla data section

* Update Argilla code

* Remove cell outputs

* Remove secrets

* Remove duplicate Argilla section

* Update dataset name

* Update phrasing Argilla

* Update merge conflicts

* fixes for dataset generation

* Update vector naming (#121)

* Add introduction to argilla_embeddings

* Update context for data and synthetic query generation

* Added argilla data section

* Update Argilla code

* Remove cell outputs

* Remove secrets

* Remove duplicate Argilla section

* Update dataset name

* Update phrasing Argilla

* Update merge conflicts

* Update vector naming

* working argilla upload

* fix argilla upload

* add wandb requirement

* add finetuning step to pipeline

* ignore local model files

* working finetune step

* finetune step pushes to the hub now

* update gitignore

* update pipeline

* remove wandb tracking

* broken version

* remove automatic wandb tracking

* return results

* try with larger base model

* add new model to gitignore

* use snowflake model

* add model registration

* remove constant

* add model throughout both pipelines

* finetuning steps in logical order now

* add new constant

* update model version

* use constant for model name

* various improvements :)

* add argilla flywheel functionality

* refactor out a constant for the dataset name

* add license

* more small changes

* Various improvements (#123)

* Resolve argilla client default workspace warning

* Add constant for other pipeline steps

* Add constant for finetune_embeddings

* Remove step embeddings constants

* import constants in pipeline directly

* Remove constants from distilabel_generation

* Add workflow for basic dataset filtering

* Add updated plots

---------

Co-authored-by: Alex Strick van Linschoten <[email protected]>

* format and remove unused imports

* fix push step

* fixed finetuning step

* use github repo to install for latest changes

* update README

* Delete llm-complete-guide/__init__.py

* Update llm-complete-guide/requirements.txt

* update README

* format and add results visualization

* remove extra comma

* add visualization step

* visualization corrections

* make distilabel imports clearer

* update README docs

* address TODO comment

* credit phil

* fix typo

* add actual values to chart

* add link to LLMOps guide

* make visualization a bit nicer

* update main README

---------

Co-authored-by: David Berenstein <[email protected]>
strickvl and davidberenstein1957 authored Aug 8, 2024
1 parent 2bbac4e commit a0eee84
Showing 35 changed files with 5,244 additions and 26 deletions.
16 changes: 14 additions & 2 deletions .gitignore
@@ -150,6 +150,18 @@ zencoder/cloned_public_repos
llm-lora-finetuning/ckpt/
llm-lora-finetuning/data_generation/
llm-lora-finetuning/datagen/
nohup.out
fiftyone-ls-demo/
llm-lora-finetuning/mistral-zenml-finetune/
.flashrank_cache

bge-base-financial-matryoshka/
embeddings
llm-lora-finetuning/meta-llama/
llm-lora-finetuning/microsoft/
llm-lora-finetuning/unsloth/
llm-lora-finetuning/configs/shopify.yaml
finetuned-matryoshka/
finetuned-all-MiniLM-L6-v2/
finetuned-snowflake-arctic-embed-m/

# ollama ignores
nohup.out
2 changes: 1 addition & 1 deletion README.md
@@ -73,7 +73,7 @@ A list of updated and maintained projects by the ZenML team and the community:
| [LLM RAG Pipeline with Langchain and OpenAI](llm-agents/) | NLP, LLMs | `slack` `langchain` `llama_index` |
| [Orbit User Analysis](orbit-user-analysis) | Data Analysis, Tabular | - |
| [Huggingface to Sagemaker](huggingface-sagemaker) | NLP | `pytorch` `mlflow` `huggingface` `aws` `s3` `kubeflow` `slack` `github` |
| [Complete Guide to LLMs (from RAG to finetuning)](llm-complete-guide) | NLP, LLMs | `openai` `supabase` |
| [Complete Guide to LLMs (from RAG to finetuning)](llm-complete-guide) | NLP, LLMs, embeddings, finetuning | `openai` `supabase` `huggingface` `argilla` |
| [LLM LoRA Finetuning (Phi3 and Llama 3.1)](llm-lora-finetuning) | NLP, LLMs | `gcp` |
| [ECP Price Prediction with GCP Cloud Composer](airflow-cloud-composer-etl-feature-train/) | Regression, Airflow | `cloud-composer` `airflow` |

48 changes: 47 additions & 1 deletion llm-complete-guide/README.md
@@ -116,7 +116,7 @@ Note that Claude will require a different API key from Anthropic. See [the
`litellm` docs](https://docs.litellm.ai/docs/providers/anthropic) on how to set
this up.

### Run the evaluation pipeline
### Run the LLM RAG evaluation pipeline

To run the evaluation pipeline, you can use the following command:

@@ -127,6 +127,52 @@ python run.py --evaluation
You'll need to have run the RAG pipeline first so that the necessary assets
are in the database to evaluate.

## Embeddings finetuning

For embeddings finetuning we first generate synthetic data and then finetune the
embeddings. Both of these pipelines are described in [the LLMOps guide](https://docs.zenml.io/v/docs/user-guide/llmops-guide/finetuning-embeddings) and
instructions for how to run them are provided below.

### Run the `distilabel` synthetic data generation pipeline

To run the `distilabel` synthetic data generation pipeline, you can use the following commands:

```shell
pip install -r requirements-argilla.txt # special requirements
python run.py --synthetic
```

You will also need to have set up and connected to an Argilla instance for this
to work. Please follow the instructions in the [Argilla
documentation](https://docs.argilla.io/latest/getting_started/quickstart/)
to set up and connect to an Argilla instance on the Hugging Face Hub. [ZenML's
Argilla integration
documentation](https://docs.zenml.io/v/docs/stack-components/annotators/argilla)
will guide you through the process of connecting to your instance as a stack
component.
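The commit history above mentions adding retries to the question generation step, since calls to an LLM for synthetic data can fail transiently. A minimal retry-with-backoff wrapper sketching that idea (the `flaky_generate` helper is a hypothetical stand-in, not the repository's code):

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Call fn(); on failure, retry with exponential backoff.

    Re-raises the last exception once attempts are exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2**attempt))


# Toy usage: a "generator" that fails twice, then succeeds.
calls = {"n": 0}


def flaky_generate():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient LLM error")
    return ["What is ZenML?"]


questions = with_retries(flaky_generate, base_delay=0.01)
```

In the real pipeline the wrapped function would call the generation model; the backoff keeps transient API errors from failing the whole run.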

### Finetune the embeddings

To run the pipeline for finetuning the embeddings, you can use the following
commands:

```shell
pip install -r requirements-argilla.txt # special requirements
python run.py --embeddings
```

As with the previous pipeline, you will need to have set up and connected to
an Argilla instance for this to work; see the instructions in the previous
section for how to do this.

*Credit to Phil Schmid for his [tutorial on embeddings finetuning with Matryoshka
loss function](https://www.philschmid.de/fine-tune-embedding-model-for-rag) which we adapted for this project.*
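The Matryoshka loss mentioned above trains embeddings whose leading dimensions are useful on their own, so vectors can be truncated at query time to trade accuracy for speed and storage. A toy sketch of the truncation step (illustrative only, not the repository's code):

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components of a Matryoshka-style embedding
    and re-normalize to unit length so cosine similarity stays meaningful."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


full = [0.5, 0.5, 0.5, 0.5]  # toy 4-dim "full" embedding
small = truncate_and_normalize(full, 2)  # each component becomes 1/sqrt(2)
```

This is why the training dimensions are listed large to small: the loss weights the leading dimensions so the shortest prefix still retrieves well.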

## ☁️ Running in your own VPC

The basic RAG pipeline will run using a local stack, but if you want to improve
Empty file removed llm-complete-guide/__init__.py
Empty file.
41 changes: 40 additions & 1 deletion llm-complete-guide/constants.py
@@ -15,7 +15,6 @@
# limitations under the License.
#


# Vector Store constants
CHUNK_SIZE = 2000
CHUNK_OVERLAP = 50
@@ -35,3 +34,43 @@
"claude3": "claude-3-opus-20240229",
"claudehaiku": "claude-3-haiku-20240307",
}

# CHUNKING_METHOD = "split-by-document"
CHUNKING_METHOD = "split-by-header"
DATASET_NAME = f"zenml/rag_qa_embedding_questions_{CHUNKING_METHOD}"
MODEL_PATH = "all-MiniLM-L6-v2"
# MODEL_PATH = "embedding-data/distilroberta-base-sentence-transformer"
NUM_EPOCHS = 30
WARMUP_STEPS = 0.1  # 10% of training steps
NUM_GENERATIONS = 2
EVAL_BATCH_SIZE = 64

DUMMY_DATASET_NAME = "embedding-data/sentence-compression"
# DUMMY_MODEL_PATH = "embedding-data/distilroberta-base-sentence-transformer"
DUMMY_MODEL_PATH = "all-MiniLM-L6-v2"
DUMMY_EPOCHS = 10

# Markdown Loader constants
FILES_TO_IGNORE = [
"toc.md",
]

# embeddings finetuning constants
EMBEDDINGS_MODEL_NAME_ZENML = "finetuned-zenml-docs-embeddings"
DATASET_NAME_DEFAULT = "zenml/rag_qa_embedding_questions_0_60_0"
DATASET_NAME_DISTILABEL = f"{DATASET_NAME_DEFAULT}_distilabel"
DATASET_NAME_ARGILLA = DATASET_NAME_DEFAULT.replace("zenml/", "")
OPENAI_MODEL_GEN = "gpt-4o"
OPENAI_MODEL_GEN_KWARGS_EMBEDDINGS = {
"temperature": 0.7,
"max_new_tokens": 512,
}
EMBEDDINGS_MODEL_ID_BASELINE = "Snowflake/snowflake-arctic-embed-m"
EMBEDDINGS_MODEL_ID_FINE_TUNED = "finetuned-snowflake-arctic-embed-m"
EMBEDDINGS_MODEL_MATRYOSHKA_DIMS: list[int] = [
384,
256,
128,
64,
] # Important: large to small
USE_ARGILLA_ANNOTATIONS = False
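The `CHUNKING_METHOD = "split-by-header"` constant above selects splitting the markdown docs at headers rather than keeping whole documents. A minimal sketch of header-based chunking (a hypothetical helper, not the repository's implementation):

```python
import re


def split_by_header(markdown: str) -> list[str]:
    """Split a markdown document into chunks at level-1/2 headers.

    Each chunk starts with its header line, keeping the header as
    context for the text beneath it.
    """
    chunks = re.split(r"\n(?=#{1,2} )", markdown)
    return [c.strip() for c in chunks if c.strip()]


doc = "# Intro\nhello\n## Setup\nsteps\n# Usage\nrun it"
chunks = split_by_header(doc)  # three chunks, one per header
```

Chunking at headers keeps each chunk topically coherent, which matters because the synthetic questions are generated per chunk.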
166 changes: 166 additions & 0 deletions llm-complete-guide/data/test_dataset.json

Large diffs are not rendered by default.

