Releases: stanfordnlp/stanza
Multilingual Coref
multilingual coref!
- Added models which cover several different languages: one for combined Germanic and Romance languages, one for the Slavic languages available in UDCoref #1406
new features
- streamlit visualizer for semgrex/ssurgeon #1396
- updates to the constituency parser ensemble #1387
- accuracy improvements to the IN_ORDER oracle #1391
- Split-only MWT model - cannot possibly hallucinate, as sometimes happens for OOV words. Currently for EN and HE #1417 #1419
- download_method=None now turns off HF downloads as well, for use in instances with no access to the internet #1408 #1399
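A minimal sketch of the offline setup described above. The helper names are hypothetical; the only Stanza pieces assumed are the Pipeline constructor and its download_method argument, and the models are assumed to already be in the local resources directory.

```python
def offline_pipeline_kwargs(lang="en", processors="tokenize,pos"):
    """Keyword arguments for a Pipeline that should never touch the network."""
    return {
        "lang": lang,
        "processors": processors,
        # None asks Stanza not to download anything; per the note above,
        # this now also covers HF (transformer) downloads.
        "download_method": None,
    }


def build_offline_pipeline(**overrides):
    """Hypothetical helper: build a Pipeline from the kwargs above."""
    import stanza  # imported lazily so the kwargs helper works without stanza

    kwargs = offline_pipeline_kwargs()
    kwargs.update(overrides)
    return stanza.Pipeline(**kwargs)
```

Calling build_offline_pipeline(lang="he") would then be expected to fail fast if a required model is missing locally, rather than attempting a download.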
new models
- Spanish combined models #1395
- Add IACLT knesset to the HE combined models
- NER based on IACLT
- XCL (Classical Armenian) models with word vectors from Caval
bugfixes
- update tqdm usage to remove some duplicate code: #1413 3de69ca
- long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: #1410
- Occasionally train the tokenizer with the sentence final punctuation of a batch removed. This helps the tokenizer avoid learning to tokenize the last character regardless of whether or not it is punctuation. This was also related to the Spanish tokenization issue 56350a0
- actually include the visualization: #1421 thank you @bollwyvl
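The tokenizer training change above (occasionally removing the final punctuation of a batch) can be illustrated roughly as follows. This is a sketch under assumptions, not Stanza's training code: the function name, the batch-as-list-of-strings representation, and the default probability are all made up for illustration.

```python
import random
import unicodedata


def maybe_drop_final_punct(batch, prob=0.1, rng=random):
    """Occasionally strip the sentence-final punctuation at the end of a batch.

    `batch` is a list of raw text strings; a new list is always returned.
    """
    if not batch or not batch[-1]:
        return list(batch)
    last = batch[-1]
    # Unicode categories starting with "P" are punctuation.
    is_punct = unicodedata.category(last[-1]).startswith("P")
    if is_punct and rng.random() < prob:
        return list(batch[:-1]) + [last[:-1].rstrip()]
    return list(batch)
```

Seeing some batches that end without punctuation gives the model evidence that the last character is not always its own token.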
Old English, MWT improvements, and better memory management of Peft
Add an Old English pipeline, improve the handling of MWT for cases that should be easy, and improve the memory management of our usage of transformers with adapters.
Old English
MWT improvements
- Fix words ending with -nna split into MWT stanfordnlp/handparsed-treebank@2c48d40 #1366
- Fix MWT for English splitting into weird words by enforcing that the pieces add up to the whole (which is always the case in the English treebanks) #1371 #1378
- Mark start_char and end_char on an MWT if it is composed of exactly its subwords 2384089 #1361
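The last two items can be pictured with a small check: a split is accepted only when the pieces concatenate to the original token, and in that case character offsets can be assigned to each piece. The function below is a hypothetical illustration of that idea, not the code from the linked changes.

```python
def mwt_word_offsets(token_text, start_char, pieces):
    """Return [(piece, start_char, end_char), ...] when the pieces add up
    exactly to token_text; return None when they do not."""
    if "".join(pieces) != token_text:
        return None
    offsets = []
    cursor = start_char
    for piece in pieces:
        offsets.append((piece, cursor, cursor + len(piece)))
        cursor += len(piece)
    return offsets
```

For example, mwt_word_offsets("cannot", 10, ["can", "not"]) yields offsets for both pieces, while a proposed split such as ["can", "knot"] is rejected.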
Peft memory management
- Previous versions were loading multiple copies of the transformer in order to use adapters. To save memory, we can use Peft's capacity to attach multiple adapters to the same transformer instead as long as they have different names. This allows for loading just one copy of the entire transformer when using a Pipeline with several finetuned models. huggingface/peft#1523 #1381 #1384
Other bugfixes and minor upgrades
- Fix crash when trying to load previously unknown language #1360 381736f
- Check that sys.stderr has isatty before manipulating it with tqdm, in case sys.stderr was monkeypatched: d180ae0 #1367
- Try to avoid OOM in the POS in the Pipeline by reducing its max batch length 4271813
- Fix usage of gradient checkpointing & a weird interaction with Peft (thanks to @Jemoka) 597d48f
Other upgrades
- Add * to the list of functional tags to drop in the constituency parser, helping Icelandic annotation 57bfa8b #1356 (comment)
- Can train depparse without using any of the POS columns, especially useful if training a cross-lingual parser: 4048cae 15b136b
PEFT Integration (with bugfixes)
Integrating PEFT into several different annotators
We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate model.
The biggest gains observed are with the constituency parser and the sentiment classifier.
Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.
Model improvements
- POS trained with split optimizer for transformer & non-transformer - unfortunately, did not find settings which consistently improved results #1320
- Sentiment trained with peft on the transformer: noticeably improves results for each model. SST scores go from 68 F1 w/ charlm, to 70 F1 w/ transformer, to 74-75 F1 with finetuned or Peft finetuned transformer. #1335
- NER also trained with peft: unfortunately, no consistent improvements to scores #1336
- depparse includes peft: no consistent improvements yet #1337 #1344
- Dynamic oracle for top-down constituent parser scheme. Noticeable improvement in the scores for the topdown parser #1341
- Constituency parser uses peft: this produces significant improvements, close to the full benefit of finetuning the entire transformer when training constituencies. Example improvement, 87.01 to 88.11 on ID_ICON dataset. #1347
- Scripts to build a silver dataset for the constituency parser with filtering of sentences based on model agreement among the sub-models for the ensembles used. Preliminary work indicates an improvement in the benefits of the silver trees, with more work needed to find the optimal parameters used to build the silver dataset. #1348
- Lemmatizer ignores goeswith words when training: eliminates words which are a single word, labeled with a single lemma, but split into two words in the UD training data. Typical example would be split email addresses in the EWT training set. #1346 #1345
Features
- Include SpacesAfter annotations on words in the CoNLL output of documents: #1315 #1322
- Lemmatizer operates in caseless mode if all of its training data was caseless. Most relevant to the UD Latin treebanks. #1331 #1330
- wandb support for coref #1338
- Coref annotator breaks length ties using POS if available #1326 c4c3de5
Bugfixes
- Using a proxy with download_resources_json was broken: #1318 #1317 Thank you @ider-zh
- Fix deprecation warnings for escape sequences: #1321 #1293 Thank you @sterliakov
- Coref training rounding error #1342
- Top-down constituency models were broken for datasets which did not use ROOT as the top level bracket... this was only DA_Arboretum in practice #1354
- V1 of chopping up some longer texts into shorter texts for the transformers to get around length limits. No idea if this actually produces reasonable results for words after the token limit. #1350 #1294
- Coref prediction off-by-one error for short sentences, was falsely throwing an exception at sentence breaks: #1333 #1339 f1fbaaa
- Clarify error when a language is only partially handled: da01644 #1310
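For the item above about chopping longer texts to fit transformer length limits, the simplest form of the idea is fixed-size windows over the token sequence. This sketch is an assumption about the general approach; the function name is hypothetical and the actual change in #1350 may split differently.

```python
def chop_into_windows(tokens, max_len):
    """Split a token sequence into consecutive windows of at most max_len."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
```

Each window can then be encoded separately and the outputs concatenated, at the cost of losing context across window boundaries, which matches the caveat about result quality past the token limit.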
Additional 1.8.1 Bugfixes
v1.7.0: Neural coref!
Neural coref processor added!
Conjunction-Aware Word-Level Coreference Resolution
https://arxiv.org/abs/2310.06165
original implementation: https://github.com/KarelDO/wl-coref/tree/master
Updated form of Word-Level Coreference Resolution
https://aclanthology.org/2021.emnlp-main.605/
original implementation: https://github.com/vdobrovolskii/wl-coref
If you use Stanza's coref module in your work, please be sure to cite both of the above papers.
Special thanks to vdobrovolskii, who graciously agreed to allow for integration of his work into Stanza, to @KarelDO for his support of his training enhancement, and to @Jemoka for the LoRA PEFT integration, which makes the finetuning of the transformer based coref annotator much less expensive.
Currently there is one model provided, a transformer based English model trained from OntoNotes. The provided model is currently based on Electra-Large, as that is more harmonious with the rest of our transformer architecture. When we have LoRA integration with POS, depparse, and the other processors, we will revisit the question of which transformer is most appropriate for English.
Future work includes ZH and AR models from OntoNotes, additional language support from UD-Coref, and lower cost non-transformer models
Interface change: English MWT
English now has an MWT model by default. Text such as "won't" is now marked as a single token, split into two words, "will" and "not". Previously it was expected to be tokenized into two pieces, but the Sentence object containing that text would not have a single Token object connecting the two pieces. See https://stanfordnlp.github.io/stanza/mwt.html and https://stanfordnlp.github.io/stanza/data_objects.html#token for more information.
Code that iterates over sentence.words will continue to work as before, but iterating over sentence.tokens will now produce one object for an MWT such as "won't", "cannot", "Stanza's", etc.
Pipeline creation will not change, as MWT is automatically (but not silently) added at Pipeline creation time if the language and package include MWT.
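To make the token/word distinction concrete, here is a small sketch that walks a sentence and pairs each token with its words. It relies only on the tokens, words, and text attributes; the sentence built here is a stand-in made of SimpleNamespace objects, not real Stanza output, so the example runs without any models.

```python
from types import SimpleNamespace


def token_word_pairs(sentence):
    """Return (token_text, [word_text, ...]) for each token in a sentence."""
    return [(token.text, [word.text for word in token.words])
            for token in sentence.tokens]


# Stand-in objects shaped like a sentence containing one MWT ("cannot").
can, not_, go = (SimpleNamespace(text=t) for t in ("can", "not", "go"))
sentence = SimpleNamespace(
    tokens=[SimpleNamespace(text="cannot", words=[can, not_]),
            SimpleNamespace(text="go", words=[go])],
    words=[can, not_, go],
)

pairs = token_word_pairs(sentence)
```

Here the sentence has two tokens but three words, so code that assumed a one-to-one mapping between tokens and words needs to account for the MWT.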
Other updates
- NetworkX representation of enhanced dependencies. Allows for easier usage of Semgrex on enhanced dependencies - searching over enhanced dependencies requires CoreNLP >= 4.5.6 #1295 #1298
- Sentence ending punct tags improved for English to avoid labeling non-punct as punct (and POS is switched to using a DataLoader) #1000 #1303
- Optional rewriting of MWT after the MWT processing step - will give the user more control over fixing common errors. Although we still encourage posting issues on github so we can fix them for everyone! #1302
- Remove deprecated output methods such as conll_as_string and doc2conll_text. Use "{:C}".format(doc) instead e01650f
- Mixed OntoNotes and WW NER model for English is now the default. Future versions may include CoNLL 2003 and CoNLL++ data as well.
- Sentences now have a doc_id field if the document they are created from has a doc_id. 8e2201f
- Optional processors added in cases where the user may not want the model we have run by default. For example, conparse for Turkish (limited training data) or coref for English (the only available model is the transformer model) 3d90d2b
Updated requirements
- Support dropped for python 3.6 and 3.7. The peft module used for finetuning the transformer used in the coref processor does not support those versions.
- Added peft as an optional dependency to transformer based installations
- Added networkx as a dependency for reading enhanced dependencies. Added toml as a dependency for reading the coref config.
Multiple default models and a combined EN NER model
V1.6.1 is a patch of a bug in the Arabic POS tagger.
We also mark Python 3.11 as supported in the setup.py classifiers. This will be the last release that supports Python 3.6.
Multiple model levels
The package parameter for building the Pipeline now has three default settings:
- default, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
- default-fast, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
- default-accurate, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome
Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into -fast and -accurate versions for each UD dataset.
PR: #1287
Multiple output heads for one NER model
The NER models now can learn multiple output layers at once.
Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to crosstrain the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18 class tagset, even though the WorldWide training data is only 8 classes.
Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:
model                 OntoNotes  WorldWide
original ontonotes    88.71      69.29
simplify-separate     88.24      75.75
simplify-connected    88.32      75.47
We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages ontonotes-combined_nocharlm, ontonotes-combined_charlm, and ontonotes-combined_electra-large.
Future plans include using multiple NER datasets for other models as well.
Other features
- Postprocessing of proposed tokenization possible with dependency injection on the Pipeline (ty @Jemoka). When creating a Pipeline, you can now provide a callable via the tokenize_postprocessor parameter, and it can adjust the candidate list of tokens to change the tokenization used by the rest of the Pipeline #1290
- Finetuning for transformers in the NER models: have not yet found helpful settings, though 45ef544
- SE and SME should both represent Northern Sami, a weird case where UD didn't use the standard 2 letter code #1279 88cd0df
- charlm for PT (improves accuracy on non-transformer models): c10763d
- build models with transformers for a few additional languages: MR, AR, PT, JA 45b3875 0f3761e c55472a c10763d
Bugfixes
- V1.6.1 fixes a bug in the Arabic POS model which was an unfortunate side effect of the NER change to allow multiple tag sets at once: b56f442
- Scenegraph CoreNLP connection needed to be checked before sending messages: stanfordnlp/CoreNLP#1346 (comment) c71bf3f
- run_ete.py was not correctly processing the charlm, meaning the whole thing wouldn't actually run 16f29f3
- Chinese NER model was pointing to the wrong pretrain #1285 82a0215
v1.5.1: charlm & transformer integration in depparse
Features
depparse can have transformer as an embedding ee171cd
Lemmatizer can remember word,pos it has seen before with a flag #1263 a87ffd0
Scoring scripts for Flair and spaCy NER models (requires the appropriate packages, of course) 63dc212 c42aed5 eab0623
SceneGraph connection for the CoreNLP client d21a95c
Update constituency parser to reduce the learning rate on plateau. Fiddling with the learning rates significantly improves performance f753a4f
Tokenize [] based on () rules if the original dataset doesn't have [] in it 063b4ba
Attempt to finetune the charlm when building models (have not found effective settings for this yet) 048fdc9
Add the charlm to the lemmatizer - this will not be the default, since it is slower, but it is more accurate e811f52 66add6d f086de2
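The lemmatizer item above (remembering word,pos pairs it has already seen) amounts to caching predictions keyed on the pair. The class below is a hypothetical sketch of that idea with a stand-in predict function; it does not reflect the actual flag name or internals of Stanza's lemmatizer.

```python
class RememberingLemmatizer:
    """Wrap a (word, pos) -> lemma function with an optional cache."""

    def __init__(self, predict, remember=True):
        self.predict = predict    # the slow model call (stand-in here)
        self.remember = remember  # plays the role of the flag
        self.seen = {}

    def lemmatize(self, word, pos):
        key = (word, pos)
        if self.remember and key in self.seen:
            return self.seen[key]
        lemma = self.predict(word, pos)
        if self.remember:
            self.seen[key] = lemma
        return lemma
```

With remember=True, a repeated (word, pos) pair is answered from the dictionary instead of rerunning the model.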
Bugfixes
Forgot to include the lemmatizer in CoreNLP 4.5.3, now in 4.5.4 4dda14b bjascob/LemmInflect#14 (comment)
prepare_ner_dataset was always creating an Armenian pipeline, even for non-Armenian languages 78ff85c
Fix an empty bulk_process throwing an exception 5e2d15d #1278
Unroll the recursion in the Tarjan part of the Chuliu-Edmonds algorithm - should remove stack overflow errors e0917b0
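To show what unrolling the recursion in Tarjan's algorithm looks like in general, here is a self-contained iterative version of Tarjan's strongly connected components using an explicit work stack. It is an illustrative sketch, not the code from e0917b0; the function name and the dict-of-lists graph format are assumptions.

```python
def tarjan_scc(graph):
    """Iterative Tarjan SCC. `graph` maps node -> list of successor nodes."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 0
    for root in graph:
        if root in index:
            continue
        index[root] = lowlink[root] = counter
        counter += 1
        stack.append(root)
        on_stack.add(root)
        # Each work item is (node, iterator over its remaining successors).
        work = [(root, iter(graph.get(root, ())))]
        while work:
            node, successors = work[-1]
            descended = False
            for nxt in successors:
                if nxt not in index:
                    # "Recursive call": push the child and process it first.
                    index[nxt] = lowlink[nxt] = counter
                    counter += 1
                    stack.append(nxt)
                    on_stack.add(nxt)
                    work.append((nxt, iter(graph.get(nxt, ()))))
                    descended = True
                    break
                if nxt in on_stack:
                    lowlink[node] = min(lowlink[node], index[nxt])
            if descended:
                continue
            # "Return": all successors handled, propagate lowlink to the parent.
            work.pop()
            if work:
                parent = work[-1][0]
                lowlink[parent] = min(lowlink[parent], lowlink[node])
            if lowlink[node] == index[node]:
                component = []
                while True:
                    member = stack.pop()
                    on_stack.discard(member)
                    component.append(member)
                    if member == node:
                        break
                sccs.append(component)
    return sccs
```

Because the traversal state lives in the work list rather than on the call stack, a long chain of nodes no longer runs into Python's recursion limit.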
Minor updates
Put NER and POS scores on one line to make it easier to grep for: da2ae33 8c4cb04
Switch all pretrains to use a name which indicates their source, rather than the dataset they are used for: d1c68ed and many others
Pipeline uses torch.no_grad() for a slight speed boost 36ab82e
Generalize save names, which eventually allows for putting transformer, charlm or nocharlm in the save name - this lets us distinguish different complexities of model cc08458 for constituency, and others for the other models
Add the model's flags to the --help for the run scripts, such as 83c0901 7c171dd 8e1d112
Remove the dependency on six 6daf971 (thank you @BLKSerene)
New Models
VLSP constituency 500435d
VLSP constituency -> tagging cb0f22d
CTB 5.1 constituency f2ef62b
Add support for CTB 9.0, although those models are not distributed yet 1e3ea8a
Added an Indonesian charlm
Indonesian constituency from ICON treebank #1218
All languages with pretrained charlms now have an option to use that charlm for dependency parsing
French combined models out of GSD, ParisStories, Rhapsodie, and Sequoia ba64d37
UD 2.12 support 4f987d2