From d41050babad454d46fa28063ea2749048573a68a Mon Sep 17 00:00:00 2001 From: Raphael Mitsch Date: Wed, 27 Dec 2023 13:10:48 +0100 Subject: [PATCH 1/5] Updated docs w.r.t. infinite doc length. --- website/docs/api/large-language-models.mdx | 150 ++++++++++++++++--- website/docs/usage/large-language-models.mdx | 48 +++++- 2 files changed, 169 insertions(+), 29 deletions(-) diff --git a/website/docs/api/large-language-models.mdx b/website/docs/api/large-language-models.mdx index d658e9dda3e..934ee050783 100644 --- a/website/docs/api/large-language-models.mdx +++ b/website/docs/api/large-language-models.mdx @@ -9,8 +9,8 @@ menu: - ['Various Functions', 'various-functions'] --- -[The spacy-llm package](https://github.com/explosion/spacy-llm) integrates Large -Language Models (LLMs) into spaCy, featuring a modular system for **fast +[The `spacy-llm` package](https://github.com/explosion/spacy-llm) integrates +Large Language Models (LLMs) into spaCy, featuring a modular system for **fast prototyping** and **prompting**, and turning unstructured responses into **robust outputs** for various NLP tasks, **no training data** required. @@ -202,13 +202,82 @@ not require labels. ## Tasks {id="tasks"} -### Task implementation {id="task-implementation"} +In `spacy-llm`, a _task_ defines an NLP problem or question and its solution +using an LLM. It does so by implementing the following responsibilities: -A _task_ defines an NLP problem or question, that will be sent to the LLM via a -prompt. Further, the task defines how to parse the LLM's responses back into -structured information. All tasks are registered in the `llm_tasks` registry. +1. Loading a prompt template and injecting documents' data into the prompt. + Optionally, include fewshot examples in the prompt. +2. Splitting the prompt into several pieces following a map-reduce paradigm, + _if_ the prompt is too long to fit into the model's context and the task + supports sharding prompts. +3. 
Parsing the LLM's responses back into structured information and validating
+   the parsed output.
 
-#### task.generate_prompts {id="task-generate-prompts"}
+Two different task interfaces are supported: `ShardingLLMTask` and
+`NonShardingLLMTask`. Only the former supports the sharding of documents, i. e.
+splitting up prompts if they are too long.
+
+All tasks are registered in the `llm_tasks` registry.
+
+### On Sharding {id="sharding"}
+
+"Sharding" describes, generally speaking, the process of distributing parts of a
+dataset across multiple storage units for easier processing and lookups. In
+`spacy-llm` we use this term (synonymously: "mapping") to describe the splitting
+up of prompts if they are too long for a model to handle, and "fusing"
+(synonymously: "reducing") to describe how the model responses for several shars
+are merged back together into a single document.
+
+Prompts are broken up in a manner that _always_ keeps the prompt in the template
+intact, meaning that the instructions to the LLM will always stay complete. The
+document content however will be split, if the length of the fully rendered
+prompt exceeds a model context length.
+
+A toy example: let's assume a model has a context window of 25 tokens and the
+prompt template for our fictional, sharding-supporting task looks like this:
+
+```
+Estimate the sentiment of this text:
+"{text}"
+Estimated entiment:
+```
+
+Depening on how tokens are counted exactly (this is a config setting), we might
+come up with `n = 12` tokens for the number of tokens in the prompt
+instructions. Furthermore let's assume that our `text` is "This has been
+amazing - I can't remember the last time I left the cinema so impressed." -
+which has roughly 19 tokens.
+
+Considering we only have 13 tokens to add to our prompt before we hit the
+context limit, we'll have to split our prompt into two parts. 
Thus `spacy-llm`,
+assuming the task used supports sharding, will split the prompt into two (the
+default splitting strategy splits by tokens, but alternative splitting
+strategies splitting e. g. by sentences can be configured):
+
+_(Prompt 1/2)_
+
+```
+Estimate the sentiment of this text:
+"This has been amazing - I can't remember "
+Estimated entiment:
+```
+
+_(Prompt 2/2)_
+
+```
+Estimate the sentiment of this text:
+"the last time I left the cinema so impressed."
+Estimated entiment:
+```
+
+The reduction step is task-specific - a sentiment estimation task might e. g. do
+a weighted average of the sentiment scores. Note that prompt sharding introduces
+potential inaccuracies, as the LLM won't have access to the entire document at
+once. Depending on your use case this might or might not be problematic.
+
+### `NonShardingLLMTask` {id="task-nonsharding"}
+
+#### task.generate_prompts {id="task-nonsharding-generate-prompts"}
 
 Takes a collection of documents, and returns a collection of "prompts", which
 can be of type `Any`. Often, prompts are of type `str` - but this is not
@@ -219,7 +288,7 @@ enforced to allow for maximum flexibility in the framework.
 | `docs`      | The input documents. ~~Iterable[Doc]~~   |
 | **RETURNS** | The generated prompts. ~~Iterable[Any]~~ |
 
-#### task.parse_responses {id="task-parse-responses"}
+#### task.parse_responses {id="task-nonsharding-parse-responses"}
 
 Takes a collection of LLM responses and the original documents, parses the
 responses into structured information, and sets the annotations on the
@@ -230,19 +299,44 @@ defined fields. The `responses` are of type `Iterable[Any]`, though they will
 often be `str` objects. This depends on the return type of the [model](#models).
 
-| Argument    | Description                                |
-| ----------- | ------------------------------------------ |
-| `docs`      | The input documents. ~~Iterable[Doc]~~     |
-| `responses` | The generated prompts. ~~Iterable[Any]~~   |
-| **RETURNS** | The annotated documents. 
~~Iterable[Doc]~~ | +| Argument | Description | +| ----------- | ------------------------------------------------------ | +| `docs` | The input documents. ~~Iterable[Doc]~~ | +| `responses` | The responses received from the LLM. ~~Iterable[Any]~~ | +| **RETURNS** | The annotated documents. ~~Iterable[Doc]~~ | -### Raw prompting {id="raw"} +### `ShardingLLMTask` {id="task-sharding"} -Different to all other tasks `spacy.Raw.vX` doesn't provide a specific prompt, -wrapping doc data, to the model. Instead it instructs the model to reply to the -doc content. This is handy for use cases like question answering (where each doc -contains one question) or if you want to include customized prompts for each -doc. +#### task.generate_prompts {id="task-sharding-generate-prompts"} + +Takes a collection of documents, breaks them up into shards if necessary to fit +all content into the model's context, and returns a collection of collections of +"prompts" (i. e. each doc can have multiple shards, each of which have exactly +one prompt), which can be of type `Any`. Often, prompts are of type `str` - but +this is not enforced to allow for maximum flexibility in the framework. + +| Argument | Description | +| ----------- | -------------------------------------------------- | +| `docs` | The input documents. ~~Iterable[Doc]~~ | +| **RETURNS** | The generated prompts. ~~Iterable[Iterable[Any]]~~ | + +#### task.parse_responses {id="task-sharding-parse-responses"} + +Receives a collection of collection of LLM responses (i. e. each doc can have +multiple shards, each of which have exactly one prompt / prompt response) and +the original shards, parses the responses into structured information, sets the +annotations on the shards, and merges back doc shards into single docs. The +`parse_responses` function is free to set the annotations in any way, including +`Doc` fields like `ents`, `spans` or `cats`, or using custom defined fields. 
+ +The `responses` are of type `Iterable[Iterable[Any]]`, though they will often be +`str` objects. This depends on the return type of the [model](#models). + +| Argument | Description | +| ----------- | ---------------------------------------------------------------- | +| `shards` | The input document shards. ~~Iterable[Iterable[Doc]]~~ | +| `responses` | The responses received from the LLM. ~~Iterable[Iterable[Any]]~~ | +| **RETURNS** | The annotated documents. ~~Iterable[Doc]~~ | ### Translation {id="translation"} @@ -295,6 +389,14 @@ target_lang = "Spanish" path = "translation_examples.yml" ``` +### Raw prompting {id="raw"} + +Different to all other tasks `spacy.Raw.vX` doesn't provide a specific prompt, +wrapping doc data, to the model. Instead it instructs the model to reply to the +doc content. This is handy for use cases like question answering (where each doc +contains one question) or if you want to include customized prompts for each +doc. + #### spacy.Raw.v1 {id="raw-v1"} Note that since this task may request arbitrary information, it doesn't do any @@ -1239,9 +1341,15 @@ A _model_ defines which LLM model to query, and how to query it. It can be a simple function taking a collection of prompts (consistent with the output type of `task.generate_prompts()`) and returning a collection of responses (consistent with the expected input of `parse_responses`). Generally speaking, -it's a function of type `Callable[[Iterable[Any]], Iterable[Any]]`, but specific +it's a function of type +`Callable[[Iterable[Iterable[Any]]], Iterable[Iterable[Any]]]`, but specific implementations can have other signatures, like -`Callable[[Iterable[str]], Iterable[str]]`. +`Callable[[Iterable[Iterable[str]]], Iterable[Iterable[str]]]`. + +Note: the model signature expects a nested iterable so it's able to deal with +sharded docs. Unsharded docs (i. e. 
those produced by [nonsharding
+tasks](/api/large-language-models#task-nonsharding)) are reshaped to fit the
+expected data structure.
 
 ### Models via REST API {id="models-rest"}
 
diff --git a/website/docs/usage/large-language-models.mdx b/website/docs/usage/large-language-models.mdx
index 43b22ce0728..9507e556c49 100644
--- a/website/docs/usage/large-language-models.mdx
+++ b/website/docs/usage/large-language-models.mdx
@@ -340,15 +340,45 @@ A _task_ defines an NLP problem or question, that will be sent to the LLM via a
 prompt. Further, the task defines how to parse the LLM's responses back into
 structured information. All tasks are registered in the `llm_tasks` registry.
 
-Practically speaking, a task should adhere to the `Protocol` `LLMTask` defined
-in [`ty.py`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/ty.py).
-It needs to define a `generate_prompts` function and a `parse_responses`
-function.
+Practically speaking, a task should adhere to the `Protocol` named `LLMTask`
+defined in
+[`ty.py`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/ty.py). It
+needs to define a `generate_prompts` function and a `parse_responses` function.
 
-| Task | Description |
-| --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| [`task.generate_prompts`](/api/large-language-models#task-generate-prompts) | Takes a collection of documents, and returns a collection of "prompts", which can be of type `Any`. |
-| [`task.parse_responses`](/api/large-language-models#task-parse-responses) | Takes a collection of LLM responses and the original documents, parses the responses into structured information, and sets the annotations on the documents. 
| +Tasks may support prompt sharding (for more info see the API docs on +[sharding](/api/large-language-models#task-sharding) and +[non-sharding](/api/large-language-models#task-nonsharding) tasks). The function +signatures for `generate_prompts` and `parse_responses` depend on whether they +do. + +| _For tasks *not supporting* sharding:_ | Task | Description | | +| -------------------------------------- | ---- | ----------- | --- | + +--- + +| | +[`task.generate_prompts`](/api/large-language-models#task-nonsharding-generate-prompts) +| Takes a collection of documents, and returns a collection of prompts, which +can be of type `Any`. | | +[`task.parse_responses`](/api/large-language-models#task-nonsharding-parse-responses) +| Takes a collection of LLM responses and the original documents, parses the +responses into structured information, and sets the annotations on the +documents. | + +| _For tasks *supporting* sharding:_ | Task | Description | | +| ---------------------------------- | ---- | ----------- | --- | + +--- + +| | +[`task.generate_prompts`](/api/large-language-models#task-sharding-generate-prompts) +| Takes a collection of documents, and returns a collection of collection of +prompt shards, which can be of type `Any`. | | +[`task.parse_responses`](/api/large-language-models#task-sharding-parse-responses) +| Takes a collection of collection of LLM responses (one per prompt shard) and +the original documents, parses the responses into structured information, sets +the annotations on the doc shards, and merges those doc shards back into a +single doc instance. | Moreover, the task may define an optional [`scorer` method](/api/scorer#score). It should accept an iterable of `Example` objects as input and return a score @@ -370,7 +400,9 @@ evaluate the component. | [`spacy.TextCat.v2`](/api/large-language-models#textcat-v2) | Version 2 builds on v1 and includes an improved prompt template. 
| | [`spacy.TextCat.v1`](/api/large-language-models#textcat-v1) | Version 1 of the built-in TextCat task supports both zero-shot and few-shot prompting. | | [`spacy.Lemma.v1`](/api/large-language-models#lemma-v1) | Lemmatizes the provided text and updates the `lemma_` attribute of the tokens accordingly. | +| [`spacy.Raw.v1`](/api/large-language-models#raw-v1) | Executes raw doc content as prompt to LLM. | | [`spacy.Sentiment.v1`](/api/large-language-models#sentiment-v1) | Performs sentiment analysis on provided texts. | +| [`spacy.Translation.v1`](/api/large-language-models#translation-v1) | Translates doc content into the specified target language. | | [`spacy.NoOp.v1`](/api/large-language-models#noop-v1) | This task is only useful for testing - it tells the LLM to do nothing, and does not set any fields on the `docs`. | #### Providing examples for few-shot prompts {id="few-shot-prompts"} From fbf255f891f356154517376ebc54ee81f2f6255a Mon Sep 17 00:00:00 2001 From: Raphael Mitsch Date: Wed, 27 Dec 2023 13:16:17 +0100 Subject: [PATCH 2/5] Fix typo. --- website/docs/api/large-language-models.mdx | 2 +- website/docs/usage/large-language-models.mdx | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/website/docs/api/large-language-models.mdx b/website/docs/api/large-language-models.mdx index 934ee050783..1fa6a27cb30 100644 --- a/website/docs/api/large-language-models.mdx +++ b/website/docs/api/large-language-models.mdx @@ -322,7 +322,7 @@ this is not enforced to allow for maximum flexibility in the framework. #### task.parse_responses {id="task-sharding-parse-responses"} -Receives a collection of collection of LLM responses (i. e. each doc can have +Receives a collection of collections of LLM responses (i. e. 
each doc can have multiple shards, each of which have exactly one prompt / prompt response) and the original shards, parses the responses into structured information, sets the annotations on the shards, and merges back doc shards into single docs. The diff --git a/website/docs/usage/large-language-models.mdx b/website/docs/usage/large-language-models.mdx index 9507e556c49..e185b726f69 100644 --- a/website/docs/usage/large-language-models.mdx +++ b/website/docs/usage/large-language-models.mdx @@ -372,10 +372,10 @@ documents. | | | [`task.generate_prompts`](/api/large-language-models#task-sharding-generate-prompts) -| Takes a collection of documents, and returns a collection of collection of +| Takes a collection of documents, and returns a collection of collections of prompt shards, which can be of type `Any`. | | [`task.parse_responses`](/api/large-language-models#task-sharding-parse-responses) -| Takes a collection of collection of LLM responses (one per prompt shard) and +| Takes a collection of collections of LLM responses (one per prompt shard) and the original documents, parses the responses into structured information, sets the annotations on the doc shards, and merges those doc shards back into a single doc instance. | From 30d7c917be989174c52a43c57c3820947f6b3464 Mon Sep 17 00:00:00 2001 From: Sofie Van Landeghem Date: Fri, 29 Dec 2023 21:28:03 +0100 Subject: [PATCH 3/5] fix typo's --- website/docs/api/large-language-models.mdx | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/website/docs/api/large-language-models.mdx b/website/docs/api/large-language-models.mdx index 1fa6a27cb30..9e6616ceaee 100644 --- a/website/docs/api/large-language-models.mdx +++ b/website/docs/api/large-language-models.mdx @@ -225,7 +225,7 @@ All tasks are registered in the `llm_tasks` registry. dataset across multiple storage units for easier processing and lookups. 
In `spacy-llm` we use this term (synonymously: "mapping") to describe the splitting up of prompts if they are too long for a model to handle, and "fusing" -(synonymously: "reducing") to describe how the model responses for several shars +(synonymously: "reducing") to describe how the model responses for several shards are merged back together into a single document. Prompts are broken up in a manner that _always_ keeps the prompt in the template @@ -239,10 +239,10 @@ prompt template for our fictional, sharding-supporting task looks like this: ``` Estimate the sentiment of this text: "{text}" -Estimated entiment: +Estimated sentiment: ``` -Depening on how tokens are counted exactly (this is a config setting), we might +Depending on how tokens are counted exactly (this is a config setting), we might come up with `n = 12` tokens for the number of tokens in the prompt instructions. Furthermore let's assume that our `text` is "This has been amazing - I can't remember the last time I left the cinema so impressed." - @@ -259,7 +259,7 @@ _(Prompt 1/2)_ ``` Estimate the sentiment of this text: "This has been amazing - I can't remember " -Estimated entiment: +Estimated sentiment: ``` _(Prompt 2/2)_ @@ -267,7 +267,7 @@ _(Prompt 2/2)_ ``` Estimate the sentiment of this text: "the last time I left the cinema so impressed." -Estimated entiment: +Estimated sentiment: ``` The reduction step is task-specific - a sentiment estimation task might e. g. do From f6e9814a1d9b27b291e61b788a9a18ba483dc58d Mon Sep 17 00:00:00 2001 From: Raphael Mitsch Date: Fri, 5 Jan 2024 12:55:58 +0100 Subject: [PATCH 4/5] Fix table formatting. 
--- website/docs/usage/large-language-models.mdx | 35 ++++++-------------- 1 file changed, 10 insertions(+), 25 deletions(-) diff --git a/website/docs/usage/large-language-models.mdx b/website/docs/usage/large-language-models.mdx index e185b726f69..cb35696a353 100644 --- a/website/docs/usage/large-language-models.mdx +++ b/website/docs/usage/large-language-models.mdx @@ -351,34 +351,19 @@ Tasks may support prompt sharding (for more info see the API docs on signatures for `generate_prompts` and `parse_responses` depend on whether they do. -| _For tasks *not supporting* sharding:_ | Task | Description | | -| -------------------------------------- | ---- | ----------- | --- | +_For tasks *not supporting* sharding:_ ---- - -| | -[`task.generate_prompts`](/api/large-language-models#task-nonsharding-generate-prompts) -| Takes a collection of documents, and returns a collection of prompts, which -can be of type `Any`. | | -[`task.parse_responses`](/api/large-language-models#task-nonsharding-parse-responses) -| Takes a collection of LLM responses and the original documents, parses the -responses into structured information, and sets the annotations on the -documents. | +| Task | Description | | +| --------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | --- | +| [`task.generate_prompts`](/api/large-language-models#task-nonsharding-generate-prompts) | Takes a collection of documents, and returns a collection of prompts, which can be of type `Any`. | +| [`task.parse_responses`](/api/large-language-models#task-nonsharding-parse-responses) | Takes a collection of LLM responses and the original documents, parses the responses into structured information, and sets the annotations on the documents. 
| -| _For tasks *supporting* sharding:_ | Task | Description | | -| ---------------------------------- | ---- | ----------- | --- | - ---- +_For tasks *supporting* sharding:_ -| | -[`task.generate_prompts`](/api/large-language-models#task-sharding-generate-prompts) -| Takes a collection of documents, and returns a collection of collections of -prompt shards, which can be of type `Any`. | | -[`task.parse_responses`](/api/large-language-models#task-sharding-parse-responses) -| Takes a collection of collections of LLM responses (one per prompt shard) and -the original documents, parses the responses into structured information, sets -the annotations on the doc shards, and merges those doc shards back into a -single doc instance. | +| Task | Description | | +| ------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --- | +| [`task.generate_prompts`](/api/large-language-models#task-sharding-generate-prompts) | Takes a collection of documents, and returns a collection of collections of prompt shards, which can be of type `Any`. | +| [`task.parse_responses`](/api/large-language-models#task-sharding-parse-responses) | Takes a collection of collections of LLM responses (one per prompt shard) and the original documents, parses the responses into structured information, sets the annotations on the doc shards, and merges those doc shards back into a single doc instance. | Moreover, the task may define an optional [`scorer` method](/api/scorer#score). It should accept an iterable of `Example` objects as input and return a score From 4e0001ad1eb03a961210158998da626cd93ed972 Mon Sep 17 00:00:00 2001 From: Raphael Mitsch Date: Fri, 5 Jan 2024 13:02:07 +0100 Subject: [PATCH 5/5] Update formatting. 
--- website/docs/usage/large-language-models.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/website/docs/usage/large-language-models.mdx b/website/docs/usage/large-language-models.mdx index cb35696a353..c799e91f3fe 100644 --- a/website/docs/usage/large-language-models.mdx +++ b/website/docs/usage/large-language-models.mdx @@ -351,14 +351,14 @@ Tasks may support prompt sharding (for more info see the API docs on signatures for `generate_prompts` and `parse_responses` depend on whether they do. -_For tasks *not supporting* sharding:_ +For tasks **not supporting** sharding: | Task | Description | | | --------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | --- | | [`task.generate_prompts`](/api/large-language-models#task-nonsharding-generate-prompts) | Takes a collection of documents, and returns a collection of prompts, which can be of type `Any`. | | [`task.parse_responses`](/api/large-language-models#task-nonsharding-parse-responses) | Takes a collection of LLM responses and the original documents, parses the responses into structured information, and sets the annotations on the documents. | -_For tasks *supporting* sharding:_ +For tasks **supporting** sharding: | Task | Description | | | ------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --- |
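The map-reduce sharding flow these patches document can be sketched in plain Python. This is an illustrative toy only, not the `spacy-llm` API: the names here (`Doc`, `shard_text`, the `budget` parameter, whitespace token counting) are invented for the sketch, while the real interfaces are the `Protocol`s in `spacy_llm/ty.py`.

```python
from dataclasses import dataclass
from typing import Iterable, List

# The prompt template stays intact in every shard; only the doc text is split.
TEMPLATE = 'Estimate the sentiment of this text:\n"{text}"\nEstimated sentiment:'


@dataclass
class Doc:
    # Stand-in for spaCy's Doc - just enough state for this sketch.
    text: str
    sentiment: float = 0.0


def shard_text(text: str, budget: int) -> List[str]:
    # Naive token counting via whitespace split; spacy-llm makes the
    # counting strategy configurable.
    tokens = text.split()
    return [" ".join(tokens[i : i + budget]) for i in range(0, len(tokens), budget)]


def generate_prompts(docs: Iterable[Doc], budget: int) -> List[List[str]]:
    # Map step: one *list* of prompts per doc - a nested structure, as in
    # the ShardingLLMTask signature described above.
    return [
        [TEMPLATE.format(text=shard) for shard in shard_text(doc.text, budget)]
        for doc in docs
    ]


def parse_responses(docs: Iterable[Doc], responses: Iterable[Iterable[str]]) -> List[Doc]:
    # Reduce step: fuse per-shard scores back into one score per doc
    # (here a plain average, mirroring the sentiment example in the docs).
    docs = list(docs)
    for doc, doc_responses in zip(docs, responses):
        scores = [float(r) for r in doc_responses]
        doc.sentiment = sum(scores) / len(scores)
    return docs
```

With a 16-token text and a budget of 8 tokens per shard, `generate_prompts` yields two prompts for the single doc; feeding back one fake model score per shard lets `parse_responses` fuse them into a single sentiment value.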