
Support arbitrarily long docs #332

Merged
merged 56 commits into from
Dec 11, 2023
Conversation

@rmitsch (Collaborator) commented Oct 17, 2023

Description

Support arbitrarily long docs by implementing a map-reduce approach: split docs into shards, process each shard in a separate prompt, then fuse shards together using a consensus-finding algorithm.

Steps:

  1. Repeat until the entire doc has been processed:
    1. Estimate n_tokens in the rendered prompt
    2. If n_tokens is within bounds: use the complete doc as a single shard
    3. Otherwise:
      • Identify the split-off point
      • Split off the first shard
      • Repeat from 1.1 with the rest of the (not yet sharded) doc
  2. For each shard: execute the prompt, parse the response
  3. Fuse the shards into one doc instance by applying a consensus mechanism.
    Note: the consensus mechanism can be trivial (e.g. for NER: concatenate all .ents into the new doc) or
    not (e.g. for classification/regression tasks: what is a doc's class if there are three shards with
    three different textcat results?)
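The sharding loop in step 1 can be sketched roughly as follows. This is an illustrative sketch only, not spacy-llm's actual implementation; the helper callables `estimate_n_tokens` and `find_split_point` are hypothetical stand-ins for whatever token-estimation and split-point logic the task provides.

```python
def shard_doc(text, n_tokens_limit, estimate_n_tokens, find_split_point):
    """Split text into shards whose rendered prompts fit the token limit.

    estimate_n_tokens: hypothetical callable, text -> estimated token count.
    find_split_point: hypothetical callable, (text, limit) -> character index
        at which to split off the first shard.
    """
    shards = []
    remainder = text
    while remainder:
        if estimate_n_tokens(remainder) <= n_tokens_limit:
            # The whole remaining doc fits within bounds: one final shard.
            shards.append(remainder)
            break
        # Otherwise split off the first shard and repeat on the rest.
        split_at = find_split_point(remainder, n_tokens_limit)
        shards.append(remainder[:split_at])
        remainder = remainder[split_at:]
    return shards
```

Because each shard is a contiguous slice of the remainder, concatenating the shards reconstructs the original text exactly, which is what makes the later "reduce" step possible.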

Remaining todos

  • Mapping/reducing for individual tasks.
  • We currently require a context length, which is bound to the model name. As we now (or will soon) allow arbitrary model names, we have to consider how to cope with that - allow passing a context length argument? Assume infinite doc length and warn if the model's context length isn't known?
  • Add tests for overlong documents.
  • Currently we only transfer those custom attributes during splitting and reducing that are relevant to the task (e.g. _.summary for the summarization task) - this means that we lose custom attributes in a pipeline consisting of multiple LLM components writing to custom attributes. We need to address that - as of now, tasks are not aware of other tasks in the pipeline. Maybe we should transfer all available custom attributes?
  • LLMTask vs. ShardableLLMTask?
  • Update merged in EL task
  • Add docs
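For the first todo (task-specific mapping/reducing), the two cases contrasted in the description above could look roughly like this. Both functions are hypothetical illustrations, not spacy-llm's actual API: NER admits a trivial concatenation, while textcat needs an actual consensus rule (score averaging is shown here as one possibility; majority voting over per-shard labels would be another).

```python
from collections import Counter

def reduce_ner(shard_ents):
    """NER: trivial consensus - concatenate the entities of all shards."""
    return [ent for ents in shard_ents for ent in ents]

def reduce_textcat(shard_scores):
    """Textcat: average each label's score across shards.

    shard_scores: one {label: score} dict per shard.
    """
    totals = Counter()
    for scores in shard_scores:
        totals.update(scores)
    n = len(shard_scores)
    return {label: total / n for label, total in totals.items()}
```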

Corresponding documentation PR

explosion/spaCy#13214

Types of change

New feature.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran all tests in tests and usage_examples/tests, and all new and existing tests passed. This includes
    • all external tests (i.e. pytest run with --external)
    • all tests requiring a GPU (i.e. pytest run with --gpu)
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@rmitsch rmitsch added feat/new New feature feat/task Feature: tasks labels Oct 17, 2023
@rmitsch rmitsch self-assigned this Oct 17, 2023
@rmitsch rmitsch added Test external Run external tests Test GPU Run GPU tests labels Nov 3, 2023
@rmitsch (Collaborator, Author) commented Nov 6, 2023

Plumbing is done, old tests are passing. Still TBD: add tests for overlong documents, implement mapping for individual tasks.

@rmitsch rmitsch marked this pull request as ready for review December 1, 2023 10:58
@rmitsch rmitsch mentioned this pull request Dec 1, 2023
@svlandeg (Member) left a comment:

Haven't been able to review everything yet, but left some comments and questions already to help me understand some of the design decisions.

@svlandeg (Member) left a comment:
This mostly looks good to me, and in particular the "reduce" behaviours of the tasks all seem sensible. It will certainly be worth the effort to have this!

@svlandeg (Member) commented Dec 7, 2023

I guess one optimization could be doing something like a "consistency" check, running the aggregated response once more through an LLM to optimize for fluency (thinking about Summarization & Translation tasks for instance). But I think that can be a future refinement.

@rmitsch (Collaborator, Author) commented Dec 7, 2023

> I guess one optimization could be doing something like a "consistency" check, running the aggregated response once more through an LLM to optimize for fluency (thinking about Summarization & Translation tasks for instance). But I think that can be a future refinement.

Agreed on both points. I can totally see how that would improve the result, but wouldn't include it in this PR.
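The consistency pass proposed above might look something like this: a single extra LLM call over the fused result. Purely a sketch of the idea as a possible future refinement, not part of this PR; `run_llm` is a hypothetical prompt-to-text callable standing in for whatever model is configured.

```python
def fuse_with_consistency_pass(shard_summaries, run_llm):
    """Concatenate per-shard summaries, then ask the LLM once more to
    rewrite the result fluently (the optional refinement discussed above).

    run_llm: hypothetical callable, prompt string -> response string.
    """
    draft = "\n".join(shard_summaries)
    prompt = (
        "Rewrite the following concatenated partial summaries as one "
        "fluent summary:\n" + draft
    )
    return run_llm(prompt)
```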

@svlandeg (Member) left a comment:
🥳

@svlandeg svlandeg merged commit a6515bf into develop Dec 11, 2023
10 checks passed
@svlandeg svlandeg deleted the feat/inf-doc-len branch December 11, 2023 14:27