Skip to content

Multiple default models and a combined EN NER model

Compare
Choose a tag to compare
@AngledLuffa AngledLuffa released this 03 Oct 05:11
· 612 commits to main since this release

Multiple model levels

The package parameter for building the Pipeline now has three default settings:

  • default, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
  • default-fast, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
  • default-accurate, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into -fast and -accurate versions for each UD dataset.

PR: #1287

addresses #1259 and #1284

Multiple output heads for one NER model

The NER models now can learn multiple output layers at once.

#1289

Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to crosstrain the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18 class tagset, even though the WorldWide training data is only 8 classes.

Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:

original ontonotes on worldwide:   88.71  69.29
simplify-separate                  88.24  75.75
simplify-connected                 88.32  75.47

We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages ontonotes-combined_nocharlm, ontonotes-combined_charlm, and ontonotes-combined_electra-large.

Future plans include using multiple NER datasets for other models as well.

Other features

  • Postprocessing of proposed tokenization possible with dependency injection on the Pipeline (ty @Jemoka). When creating a Pipeline, you can now provide a callable via the tokenize_postprocessor parameter, and it can adjust the candidate list of tokens to change the tokenization used by the rest of the Pipeline #1290

  • Finetuning for transformers in the NER models: have not yet found helpful settings, though 45ef544

  • SE and SME should both represent Northern Sami, a weird case where UD didn't use the standard 2 letter code #1279 88cd0df

  • charlm for PT (improves accuracy on non-transformer models): c10763d

  • build models with transformers for a few additional languages: MR, AR, PT, JA 45b3875 0f3761e c55472a c10763d

Bugfixes