
Support arbitrarily long docs #332

Merged
merged 56 commits into from
Dec 11, 2023
Conversation

@rmitsch (Collaborator) commented Oct 17, 2023

Description

Support arbitrarily long docs by implementing a map-reduce approach: split docs into shards, process each shard in a separate prompt, then fuse shards together using a consensus-finding algorithm.

Steps:

  1. Repeat until the entire doc has been processed:
    1. Estimate n_tokens in the rendered prompt
    2. If n_tokens is within bounds: use the complete doc as a single shard
    3. Otherwise:
      • Identify the split-off point
      • Split off the first shard
      • Repeat from 1.1 with the rest of the (not yet sharded) doc
  2. For each shard: execute the prompt, parse the response
  3. Fuse the shards into one doc instance by applying a consensus mechanism.
    Note: the consensus mechanism can be trivial (e.g. for NER: concatenate all .ents into the new doc) or
    not (e.g. for classification/regression tasks: what is a doc's class if there are three shards with
    three different textcat results?)
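The sharding loop in step 1 can be sketched roughly as follows. This is an illustrative sketch only, not spacy-llm's actual implementation; the helper callables `estimate_n_tokens` and `find_split_point` are hypothetical stand-ins for whatever token-estimation and split-point logic the task provides.

```python
def shard_doc(text, n_tokens_limit, estimate_n_tokens, find_split_point):
    """Split text into shards whose rendered prompts fit the token limit.

    estimate_n_tokens: hypothetical callable, text -> estimated token count.
    find_split_point: hypothetical callable, (text, limit) -> character index
        at which to split off the first shard.
    """
    shards = []
    remainder = text
    while remainder:
        if estimate_n_tokens(remainder) <= n_tokens_limit:
            # The whole remaining doc fits within bounds: one final shard.
            shards.append(remainder)
            break
        # Otherwise split off the first shard and repeat on the rest.
        split_at = find_split_point(remainder, n_tokens_limit)
        shards.append(remainder[:split_at])
        remainder = remainder[split_at:]
    return shards
```

Because each shard is a contiguous slice of the remainder, concatenating the shards reconstructs the original text exactly, which is what makes the later "reduce" step possible.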

Remaining todos

  • Mapping/reducing for individual tasks.
  • We currently require a context length, which is bound to the model name. As we now (or will soon) allow arbitrary model names, we have to consider how to cope with that - allow passing a context length argument? Assume infinite doc length and warn if the model's context length isn't known?
  • Add tests for overlong documents.
  • Currently we only transfer those custom attributes during splitting and reducing that are relevant to the task (e.g. _.summary for the summarization task) - this means that we lose custom attributes in a pipeline consisting of multiple LLM components writing to custom attributes. We need to address that - as of now, tasks are not aware of other tasks in the pipeline. Maybe we should transfer all available custom attributes?
  • LLMTask vs. ShardableLLMTask?
  • Update merged in EL task
  • Add docs
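For the first todo (task-specific mapping/reducing), the two cases contrasted in the description above could look roughly like this. Both functions are hypothetical illustrations, not spacy-llm's actual API: NER admits a trivial concatenation, while textcat needs an actual consensus rule (score averaging is shown here as one possibility; majority voting over per-shard labels would be another).

```python
from collections import Counter

def reduce_ner(shard_ents):
    """NER: trivial consensus - concatenate the entities of all shards."""
    return [ent for ents in shard_ents for ent in ents]

def reduce_textcat(shard_scores):
    """Textcat: average each label's score across shards.

    shard_scores: one {label: score} dict per shard.
    """
    totals = Counter()
    for scores in shard_scores:
        totals.update(scores)
    n = len(shard_scores)
    return {label: total / n for label, total in totals.items()}
```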

Corresponding documentation PR

explosion/spaCy#13214

Types of change

New feature.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran all tests in tests and usage_examples/tests, and all new and existing tests passed. This includes
    • all external tests (i.e. pytest run with --external)
    • all tests requiring a GPU (i.e. pytest run with --gpu)
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@rmitsch rmitsch added feat/new New feature feat/task Feature: tasks labels Oct 17, 2023
@rmitsch rmitsch self-assigned this Oct 17, 2023
@rmitsch rmitsch added Test external Run external tests Test GPU Run GPU tests labels Nov 3, 2023
@rmitsch (Collaborator, Author) commented Nov 6, 2023

Plumbing is done, old tests are passing. Still TBD: add tests for overlong documents, implement mapping for individual tasks.

@rmitsch rmitsch marked this pull request as ready for review December 1, 2023 10:58
@rmitsch rmitsch mentioned this pull request Dec 1, 2023
@svlandeg (Member) left a comment:

Haven't been able to review everything yet, but left some comments and questions already to help me understand some of the design decisions.

@svlandeg (Member) left a comment:
This mostly looks good to me, and in particular the "reduce" behaviours of the tasks all seem sensible. It will certainly be worth the effort to have this!

@svlandeg (Member) commented Dec 7, 2023

I guess one optimization could be doing something like a "consistency" check, running the aggregated response once more through an LLM to optimize for fluency (thinking about Summarization & Translation tasks for instance). But I think that can be a future refinement.

@rmitsch (Collaborator, Author) commented Dec 7, 2023

> I guess one optimization could be doing something like a "consistency" check, running the aggregated response once more through an LLM to optimize for fluency (thinking about Summarization & Translation tasks for instance). But I think that can be a future refinement.

Agreed on both points. I can totally see how that would improve the result, but wouldn't include it in this PR.
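The consistency pass proposed above might look something like this: a single extra LLM call over the fused result. Purely a sketch of the idea as a possible future refinement, not part of this PR; `run_llm` is a hypothetical prompt-to-text callable standing in for whatever model is configured.

```python
def fuse_with_consistency_pass(shard_summaries, run_llm):
    """Concatenate per-shard summaries, then ask the LLM once more to
    rewrite the result fluently (the optional refinement discussed above).

    run_llm: hypothetical callable, prompt string -> response string.
    """
    draft = "\n".join(shard_summaries)
    prompt = (
        "Rewrite the following concatenated partial summaries as one "
        "fluent summary:\n" + draft
    )
    return run_llm(prompt)
```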

@svlandeg (Member) left a comment:
🥳

@svlandeg svlandeg merged commit a6515bf into develop Dec 11, 2023
10 checks passed
@svlandeg svlandeg deleted the feat/inf-doc-len branch December 11, 2023 14:27