Add RawTask executing docs ad verbatim #395

Merged
merged 65 commits into from
Dec 11, 2023
3aba660
Add context length info. Refactor BuiltinTask and models to facilitat…
rmitsch Oct 17, 2023
5699773
Merge branch 'develop' into feat/inf-doc-len
rmitsch Oct 17, 2023
4213372
Add token count estimator plumbing.
rmitsch Oct 17, 2023
f440ca4
Add plumbing for mapper and reducer.
rmitsch Oct 17, 2023
e47f762
Add ShardMapper prototype.
rmitsch Oct 18, 2023
89a5510
Integrating mapping into prompt generation workflow.
rmitsch Oct 19, 2023
086dec9
Update response parsing and component to support sharding (WIP).
rmitsch Oct 20, 2023
23718fc
Fix shard & prompt flow.
rmitsch Oct 27, 2023
7ce670d
Fix shard & prompt flow.
rmitsch Oct 27, 2023
0d75ea8
Remove todo comments.
rmitsch Oct 27, 2023
9da7098
Fix Anthropic, Cohere, NoOp model tests.
rmitsch Oct 27, 2023
0cb9afd
Merge branch 'develop' into feat/inf-doc-len
rmitsch Oct 30, 2023
f368412
Fix test_llm_pipe().
rmitsch Oct 31, 2023
b1f111d
Fix type checking test.
rmitsch Nov 3, 2023
44a2787
Fix span parsing tests.
rmitsch Nov 3, 2023
6d8cdc7
Fix internal tests.
rmitsch Nov 3, 2023
e712f41
Fix _CountTask.
rmitsch Nov 3, 2023
985fd68
Fix sentiment and summarization tasks and tests.
rmitsch Nov 3, 2023
98842a2
Fix Azure connection URL. Fix Model test pings.
rmitsch Nov 3, 2023
b54a3d9
Fix Lemma parsing.
rmitsch Nov 3, 2023
9bf365d
Start work on doc-to-shard property copying.
rmitsch Nov 3, 2023
dddfaab
Fix REL doc preprocessing.
rmitsch Nov 6, 2023
3af21b5
Remove comment on doc attribute handling during sharding, as this is …
rmitsch Nov 6, 2023
fee9ca7
Add reducer implementations.
rmitsch Nov 8, 2023
e508499
Implement outstanding task reducers.
rmitsch Nov 14, 2023
3218541
Resolve merge conflicts.
rmitsch Nov 14, 2023
c104387
Add shardable/non-shardable LLM task typing distinction. Add support …
rmitsch Nov 20, 2023
2c6d899
Merge branch 'develop' into feat/inf-doc-len
rmitsch Nov 21, 2023
2502c4d
Fix EL task.
rmitsch Nov 23, 2023
03055c5
Fix EL tokenization and highlighting partially.
rmitsch Nov 23, 2023
4e4a2cd
Fix tokenization and whitespaces for EL task.
rmitsch Nov 24, 2023
865acec
Fix merge conflicts.
rmitsch Nov 24, 2023
694d5da
Add new registry handlers (with context length and arbitrary model na…
rmitsch Nov 24, 2023
5295400
Add sharding test with simple count task.
rmitsch Nov 24, 2023
70e3643
Fix sharding algorithm.
rmitsch Nov 24, 2023
4321483
Add test with simple count task.
rmitsch Nov 27, 2023
ef6e738
Add context length as init arg in HF models.
rmitsch Nov 27, 2023
e3ff37d
Fix tests. Don't stringify IO lists if sharded.
rmitsch Nov 28, 2023
056730a
Fix tests.
rmitsch Nov 29, 2023
196c235
Add NER sharding test.
rmitsch Nov 29, 2023
1f51a4a
Add REL and sentiment sharding tests.
rmitsch Nov 29, 2023
e18b302
Add summary sharding tests.
rmitsch Nov 29, 2023
7c092ca
Add EL sharding task. Fix bug in shard mapper.
rmitsch Nov 29, 2023
358ba72
Fix REL error with RELExample parsing.
rmitsch Nov 29, 2023
0c96fb6
Use regex for punctuation in REL conversion.
rmitsch Nov 29, 2023
dc926bd
Maintain custom doc attributes, incl. test.
rmitsch Dec 1, 2023
5585174
Filter merge warnings in textcat reduction.
rmitsch Dec 1, 2023
1ae710c
Fix custom doc data merging.
rmitsch Dec 4, 2023
dc5efee
Add RawTask.
rmitsch Dec 5, 2023
dac8ae3
Fix task version.
rmitsch Dec 5, 2023
01cccdf
Add sharding test.
rmitsch Dec 5, 2023
e94b356
Update spacy_llm/models/langchain/model.py
rmitsch Dec 7, 2023
e68f5d3
Update spacy_llm/pipeline/llm.py
rmitsch Dec 7, 2023
f40bc88
Incorporate feedback.
rmitsch Dec 7, 2023
ac0559d
Move sharding compatibility warning to component constructor.
rmitsch Dec 7, 2023
1763821
Update spacy_llm/tasks/entity_linker/util.py
rmitsch Dec 7, 2023
ae2e837
Update spacy_llm/models/hf/base.py
rmitsch Dec 7, 2023
63367fa
Incorporate feedback.
rmitsch Dec 7, 2023
795f675
Update spacy_llm/tasks/raw/registry.py
rmitsch Dec 8, 2023
99fe286
Apply suggestions from code review
rmitsch Dec 8, 2023
e728e2c
Merge branch 'feat/inf-doc-len' into feat/raw-task
rmitsch Dec 8, 2023
4070efa
Fix tests.
rmitsch Dec 8, 2023
281998b
Remove boilerplate text in raw template.
rmitsch Dec 8, 2023
87776d3
Fix sharding test.
rmitsch Dec 8, 2023
191e42b
Merge branch 'develop' into feat/raw-task
rmitsch Dec 11, 2023
Fix tokenization and whitespaces for EL task.
rmitsch committed Nov 24, 2023
commit 4e4a2cdfac96558c11418a1fc8ab86c000e5ac76
58 changes: 48 additions & 10 deletions spacy_llm/tasks/entity_linker/task.py
@@ -1,4 +1,4 @@
from typing import Any, Callable, Dict, Iterable, List, Optional, Set, Tuple, Type
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Type

from spacy import Language, Vocab
from spacy.pipeline import EntityLinker
@@ -158,7 +158,11 @@ def _get_prompt_data(

return {
"mentions_str": ", ".join(
[mention.text for hc, mention in zip(has_cands, shard.ents) if hc]
[
f"*{mention.text}*"
for hc, mention in zip(has_cands, shard.ents)
if hc
]
),
"mentions": [ent.text for hc, ent in zip(has_cands, shard.ents) if hc],
"entity_descriptions": [
@@ -263,26 +267,39 @@ def highlight_ents_in_doc(
i_ent = 0
new_ent_idx: List[Tuple[int, int]] = []
token_texts: List[str] = []
spaces: List[bool] = []
to_highlight = i_ent in ents_to_highlight_idx
offset = 0

for token in doc:
if i_ent < len(ents_idx) and token.i == ents_idx[i_ent][1]:
if to_highlight:
token_texts.append("*")
spaces.append(spaces[-1])
spaces[-2] = False
offset += 1
i_ent += 1
to_highlight = i_ent in ents_to_highlight_idx
if i_ent < len(ents_idx) and token.i == ents_idx[i_ent][0]:
if to_highlight:
token_texts.append("*")
spaces.append(False)
offset += 1
new_ent_idx.append(
(ents_idx[i_ent][0] + offset, ents_idx[i_ent][1] + offset)
)
token_texts.append(token.text)
spaces.append(token.whitespace_ != "")

# Cover edge case of doc ending with entity, in which case we need to close the * wrapping.
if len(ents_to_highlight_idx) and doc.ents[
ents_to_highlight_idx[-1]
].end == len(doc):
token_texts.append("*")
spaces.append(False)

# Create doc with new tokens and entities.
highlighted_doc = Doc(doc.vocab, words=token_texts)
highlighted_doc = Doc(doc.vocab, words=token_texts, spaces=spaces)
highlighted_doc.ents = [
Span(
doc=highlighted_doc,
@@ -304,19 +321,40 @@ def unhighlight_ents_in_doc(doc: Doc) -> Doc:
doc (Doc): Doc whose entities are to be highlighted.
RETURNS (Doc): Doc with highlighted entities.
"""
highlight_idx: Set[int] = {ent.start - 1 for ent in doc.ents} | {
ent.end for ent in doc.ents
highlight_start_idx = {
ent.start - 1
for ent in doc.ents
if ent.start - 1 > 0 and doc[ent.start - 1].text == "*"
}
ent_idx = [
(ent.start - i * 2 - 1, ent.end - i * 2 - 1)
for i, ent in enumerate(doc.ents)
]
highlight_end_idx = {ent.end for ent in doc.ents if doc[ent.end].text == "*"}
highlight_idx = highlight_start_idx | highlight_end_idx

# Compute entity indices with removed highlights.
ent_idx: List[Tuple[int, int]] = []
offset = 0
for ent in doc.ents:
is_highlighted = ent.start - 1 in highlight_start_idx
ent_idx.append(
(ent.start + offset - is_highlighted, ent.end + offset - is_highlighted)
)
offset -= 2 * is_highlighted

# Create doc with new tokens and entities.
tokens = [
token
for token in doc
if not (token.i in highlight_idx and token.text == "*")
]
unhighlighted_doc = Doc(
doc.vocab,
words=[token.text for token in doc if token.i not in highlight_idx],
words=[token.text for token in tokens],
# Use original token space, if token doesn't appear after * highlight. If so, insert space unconditionally.
spaces=[
token.whitespace_ != "" or token.i + 1 in highlight_idx
for i, token in enumerate(tokens)
],
)

unhighlighted_doc.ents = [
Span(
doc=unhighlighted_doc,
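
Note on the index arithmetic in `unhighlight_ents_in_doc`: each highlighted entity is wrapped in two extra `*` tokens, so its span in the original doc is its highlighted span shifted left by one (its own opening marker) plus two for every highlighted entity before it. The sketch below reproduces only that offset loop on plain tuples, without spaCy. The function name `remove_highlight_offsets` and its arguments are hypothetical and not part of spacy-llm.

```python
from typing import List, Tuple


def remove_highlight_offsets(
    ents: List[Tuple[int, int]], is_highlighted: List[bool]
) -> List[Tuple[int, int]]:
    """Map entity spans in a highlighted token sequence back to original indices.

    ents: (start, end) token spans in the highlighted sequence, in document order.
    is_highlighted: whether the corresponding entity is wrapped in "*" tokens.
    """
    result: List[Tuple[int, int]] = []
    offset = 0
    for (start, end), highlighted in zip(ents, is_highlighted):
        shift = int(highlighted)
        # The entity's own opening marker sits in front of it, hence the extra -1.
        result.append((start + offset - shift, end + offset - shift))
        # Both markers of this entity disappear for everything that follows.
        offset -= 2 * shift
    return result


# Highlighted tokens: Alice goes to * Boston * to see the * Boston Celtics * game .
print(remove_highlight_offsets([(4, 5), (10, 12)], [True, True]))
```

With the sentence from `test_ent_highlighting()`, the two spans map back to tokens 3-4 and 7-9 of the original doc.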
6 changes: 3 additions & 3 deletions spacy_llm/tasks/templates/entity_linker.v1.jinja
@@ -21,7 +21,7 @@ MENTIONS: {{ example.mention_str }}
ENTITIES:
{%- for ent_descs in example.entity_descriptions -%}
{% set mention_i = loop.index0 %}
- For * {{ example.mentions[loop.index0] }} *:
- For *{{ example.mentions[loop.index0] }}*:
{%- for ent_desc in ent_descs -%}
{# whitespace #}
{{ example.entity_ids[mention_i][loop.index0] }}. {{ ent_desc }}
@@ -51,7 +51,7 @@ REASONING:
SOLUTION:
{%- for solution in example.solutions -%}
{# whitespace #}
* {{ example.mentions[loop.index0] }} * ::: <{{ solution }}>
*{{ example.mentions[loop.index0] }}* ::: <{{ solution }}>
{%- endfor -%}
{# whitespace #}
{# whitespace #}
@@ -69,7 +69,7 @@ MENTIONS: {{ mentions_str }}
ENTITIES:
{%- for ent_descs in entity_descriptions -%}
{% set mention_i = loop.index0 %}
- For * {{ mentions[loop.index0] }} *:
- For *{{ mentions[loop.index0] }}*:
{%- for ent_desc in ent_descs -%}
{# whitespace #}
{{ entity_ids[mention_i][loop.index0] }}. {{ ent_desc }}
11 changes: 9 additions & 2 deletions spacy_llm/tests/tasks/test_entity_linker.py
@@ -352,7 +352,7 @@ def make_doc() -> Doc:
nlp.components[0][1]._task._auto_nil = False
doc = nlp(make_doc())
assert (
f"- For * Foo *:n {EntityLinker.NIL}. {UNAVAILABLE_ENTITY_DESC}"
f"- For *Foo*:n {EntityLinker.NIL}. {UNAVAILABLE_ENTITY_DESC}"
in doc.user_data["llm_io"]["llm"]["prompt"].replace("\\", "")
)
assert doc.ents[0].kb_id_ == EntityLinker.NIL
@@ -678,9 +678,16 @@ def test_ent_highlighting():
]

assert (
EntityLinkerTask.highlight_ents_in_doc(doc)
EntityLinkerTask.highlight_ents_in_doc(doc).text
== "Alice goes to *Boston* to see the *Boston Celtics* game."
)
assert (
EntityLinkerTask.unhighlight_ents_in_doc(
EntityLinkerTask.highlight_ents_in_doc(doc)
).text
== doc.text
== text
)

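For readers without a spaCy environment, the whitespace handling that the round-trip assertion above exercises can be sketched on plain `words`/`spaces` lists. This is a simplified illustration, not the code from this commit: the helper names are hypothetical, entity spans are assumed to be sorted, non-empty and non-overlapping, and `unhighlight` treats every `*` token as a marker, whereas the real implementation checks entity positions.

```python
from typing import List, Tuple


def to_text(words: List[str], spaces: List[bool]) -> str:
    """Join tokens the way spaCy does: each token plus its optional trailing space."""
    return "".join(w + (" " if s else "") for w, s in zip(words, spaces))


def highlight(
    words: List[str], spaces: List[bool], ents: List[Tuple[int, int]]
) -> Tuple[List[str], List[bool]]:
    """Wrap each (start, end) token span in "*" marker tokens (end is exclusive)."""
    starts = {start for start, _ in ents}
    ends = {end for _, end in ents}
    out_words: List[str] = []
    out_spaces: List[bool] = []

    def close_marker() -> None:
        # The closing marker takes over the trailing space of the last entity token.
        out_words.append("*")
        out_spaces.append(out_spaces[-1])
        out_spaces[-2] = False

    for i, (word, space) in enumerate(zip(words, spaces)):
        if i in ends:
            close_marker()
        if i in starts:
            # The opening marker is glued to the first entity token.
            out_words.append("*")
            out_spaces.append(False)
        out_words.append(word)
        out_spaces.append(space)
    # Edge case: the sequence ends with an entity.
    if len(words) in ends:
        close_marker()
    return out_words, out_spaces


def unhighlight(
    words: List[str], spaces: List[bool]
) -> Tuple[List[str], List[bool]]:
    """Drop "*" tokens and hand their trailing space back to the preceding token."""
    out_words: List[str] = []
    out_spaces: List[bool] = []
    for word, space in zip(words, spaces):
        if word == "*":
            if out_spaces:
                out_spaces[-1] = out_spaces[-1] or space
            continue
        out_words.append(word)
        out_spaces.append(space)
    return out_words, out_spaces


words = "Alice goes to Boston to see the Boston Celtics game .".split()
spaces = [True] * 9 + [False, False]
h_words, h_spaces = highlight(words, spaces, [(3, 4), (7, 9)])
print(to_text(h_words, h_spaces))
print(to_text(*unhighlight(h_words, h_spaces)) == to_text(words, spaces))
```

The key point mirrored from the diff is that the closing marker inherits the whitespace of the token before it, so no stray spaces appear inside or after the `*...*` wrapping.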

@pytest.mark.external