Support arbitrarily long docs #332
Conversation
# Conflicts:
#	spacy_llm/tasks/builtin_task.py
…done by spaCy's slicing directly.
Plumbing is done and the old tests are passing. Still TBD: add tests for overlong documents and implement the mapping for individual tasks.
Haven't been able to review everything yet, but left some comments and questions already to help me understand some of the design decisions.
Co-authored-by: Sofie Van Landeghem <[email protected]>
This mostly looks good to me, and in particular the "reduce" behaviours of the tasks all seem sensible. It will certainly be worth the effort to have this!
I guess one optimization could be doing something like a "consistency" check: running the aggregated response once more through an LLM to optimize for fluency (thinking about the Summarization & Translation tasks, for instance). But I think that can be a future refinement.
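For illustration, a rough sketch of what such a consistency pass could look like (the `prompt_llm` callable and `fuse_with_consistency_pass` helper are hypothetical, not part of spacy-llm or this PR):

```python
# Hypothetical sketch of the suggested "consistency" pass: after fusing
# the per-shard responses, run the aggregate through the LLM once more
# to smooth the seams (e.g. for the Summarization or Translation task).
from typing import Callable, List


def fuse_with_consistency_pass(
    shard_responses: List[str],
    prompt_llm: Callable[[str], str],  # assumed LLM callable, not a real API
) -> str:
    draft = "\n".join(shard_responses)
    prompt = (
        "The following text was produced piecewise from consecutive "
        "shards of one document. Rewrite it as a single fluent, "
        "consistent text without changing its meaning:\n\n" + draft
    )
    return prompt_llm(prompt)
```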
Co-authored-by: Sofie Van Landeghem <[email protected]>
Agreed on both points. I can totally see how that would improve the result, but wouldn't include it in this PR.
🥳
Description
Support arbitrarily long docs by implementing a map-reduce approach: split docs into shards, process each shard in a separate prompt, then fuse shards together using a consensus-finding algorithm.
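As a rough illustration of the overall shape (hypothetical helper names, not the actual spacy-llm API; per the commits above, the real sharding is done by spaCy's slicing directly, not by character counts):

```python
# Illustrative map-reduce sketch only: hypothetical helpers, not the
# actual spacy-llm implementation.
from typing import Callable, List


def split_into_shards(text: str, max_chars: int) -> List[str]:
    """Map step (simplified): cut the text into shards that fit the
    model's context window, splitting on whitespace where possible."""
    shards: List[str] = []
    while len(text) > max_chars:
        cut = text.rfind(" ", 0, max_chars)
        cut = cut if cut > 0 else max_chars
        shards.append(text[:cut])
        text = text[cut:].lstrip()
    if text:
        shards.append(text)
    return shards


def map_reduce(
    text: str,
    prompt_llm: Callable[[str], str],  # one LLM prompt per shard
    fuse: Callable[[List[str]], str],  # reduce step: task-specific consensus
    max_chars: int = 4000,
) -> str:
    responses = [prompt_llm(shard) for shard in split_into_shards(text, max_chars)]
    return fuse(responses)
```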
Steps:
Note: the consensus mechanism can be trivial (e.g. for NER: concatenate all `.ents` into a new doc) or not (e.g. for classification/regression tasks: what's a doc's class if there are three shards with three different textcat results?).
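To make the contrast concrete, here is a simplified sketch of both cases. These reduce functions are hypothetical and operate on plain offsets and score dicts rather than real spaCy `Doc` objects; the majority vote for textcat is just one possible consensus strategy, not the one this PR prescribes:

```python
# Simplified, hypothetical reduce functions for the two cases above.
from collections import Counter
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start char, end char, label)


def reduce_ner(shard_ents: List[List[Span]], shard_offsets: List[int]) -> List[Span]:
    """Trivial consensus: shift each shard's entities by the shard's
    start offset in the original doc and concatenate them."""
    fused: List[Span] = []
    for offset, spans in zip(shard_offsets, shard_ents):
        fused.extend((start + offset, end + offset, label) for start, end, label in spans)
    return fused


def reduce_textcat(shard_cats: List[Dict[str, float]]) -> str:
    """Non-trivial consensus, e.g. a majority vote over each shard's
    top category (ties broken arbitrarily here)."""
    votes = Counter(max(cats, key=cats.get) for cats in shard_cats)
    return votes.most_common(1)[0][0]
```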
Remaining todos
`LLMTask` vs. `ShardableLLMTask`?
Corresponding documentation PR
explosion/spaCy#13214
Types of change
New feature.
Checklist
I ran the tests in `tests` and `usage_examples/tests`, and all new and existing tests passed. This includes
- the external tests (`pytest` ran with `--external`)
- the GPU tests (`pytest` ran with `--gpu`)