This page gives an overview of the analyses available in the Sparv Pipeline and the Sparv plugins developed by Språkbanken.
annotations are the names of the annotations as they are used in the export.annotations section of the corpus config file (read more about this in the Sparv user manual). Note that the annotations usually have shorter names in the corpus exports.
annotators are the names of the annotation functions (including their module names) that produce the annotations. They can be run directly with the sparv run-rule [annotator] command. In most cases this is not necessary, however, because the annotation functions needed to produce all annotations listed in the corpus config file are executed automatically when running sparv run.
The following analyses are available in Sparv for contemporary Swedish. For various reasons not all of these are used by us at Språkbanken for analysing the corpora in Korp.
-
Sentence segmentation with PunktSentenceTokenizer
- description: Texts are split into sentences.
- model: punkt-nltk-svenska.pickle
- method: The model is built with NLTK's PunktTrainer and trained on StorSUC. The segmentation is done with NLTK's PunktSentenceTokenizer.
- annotations:
segment.sentence
: sentence segments
- annotators:
segment.sentence
-
Tokenization
- description: Sentence segments are split into tokens.
- model:
- configuration file bettertokenizer.sv
- word list bettertokenizer.sv.saldo-tokens built upon SALDO's morphology (it is built automatically by Sparv)
- method: Tokenizer using regular expressions and lists of words containing special characters and common abbreviations. Sparv's version is custom-made for Swedish but it is possible to configure it for other languages.
- annotations:
segment.token
: token segments
- annotators:
segment.tokenize
-
POS-tagging with Stanza
- description: Sentence segments are analysed to enrich tokens with part-of-speech tags and morphosyntactic information.
- tool: Stanza
- model: https://spraakbanken.gu.se/resurser/stanzamorph
- tagset:
- annotations:
<token>:stanza.pos
: part-of-speech tag
<token>:stanza.msd
: morphosyntactic tag
<token>:stanza.ufeats
: universal features
- annotators:
stanza.msdtag
-
Translation from SUC to UPOS
- description: SUC POS tags are translated to UPOS. Not used by default because the translations are not very reliable.
- model: Method has no model. A translation table is used.
- tagset: Universal POS tags
- annotations:
<token>:misc.upos
: UPOS (universal POS tags)
- annotators:
misc.upostag
-
POS-tagging with Hunpos
- description: Sentence segments are analysed to enrich tokens with part-of-speech tags and morphosyntactic information. No longer used by default because Stanza's POS-tagging yields better results.
- tool: Hunpos
- model: suc3_suc-tags_default-setting_utf8.model
- method: The model is trained on SUC 3.0.
- tagset: SUC MSD tags
- annotations:
<token>:hunpos.msd
: morphosyntactic tag
<token>:hunpos.pos
: part-of-speech tag
- annotators:
hunpos.msdtag
hunpos.postag
-
Dependency parsing with Stanza
- description: Sentence segments are analysed to enrich tokens with dependency information.
- tool: Stanza
- model: https://spraakbanken.gu.se/resurser/stanzasynt
- tagset: Mamba-Dep
- annotations:
<token>:stanza.ref
: the token position within the sentence
<token>:stanza.dephead_ref
: dependency head, the ref of the word which the current word modifies or depends on
<token>:stanza.deprel
: dependency relation, the relation of the current word to its dependency head
- annotators:
stanza.dep_parse
stanza.make_ref
-
Dependency parsing with MaltParser
- description: Sentence segments are analysed to enrich tokens with dependency information. No longer used by default because Stanza's dependency parsing yields better results.
- tool: MaltParser
- model: swemalt
- method: The model is trained on Svensk trädbank.
- tagset: Mamba-Dep
- annotations:
<token>:malt.ref
: the token position within the sentence
<token>:malt.dephead_ref
: dependency head, the ref of the word which the current word modifies or depends on
<token>:malt.deprel
: dependency relation, the relation of the current word to its dependency head
- annotators:
malt.annotate
malt.make_ref
-
Phrase structure parsing
- description: Mamba-Dep dependencies produced by the dependency analysis are converted to phrase structures. Not used in Korp due to incompatibility with Corpus Workbench.
- model: Method has no model.
- annotations:
phrase_structure.phrase
: phrase segments
phrase_structure.phrase:phrase_structure.name
: name of the phrase segment
phrase_structure.phrase:phrase_structure.func
: function of the phrase segment
- annotators:
phrase_structure.annotate
-
SALDO-based analyses
- description: Tokens and their POS tags are looked up in the SALDO lexicon in order to enrich them with more information.
- model: SALDO morphology
- tagset: SALDO tags for lemgrams
- annotations:
<token>:saldo.baseform
: citation forms
<token>:saldo.lemgram
: lemgrams, identifying the inflectional table
<token>:saldo.sense
: identifies senses in SALDO
- annotators:
saldo.annotate
-
Citation form analysis with Stanza
- description: Sentence segments are analysed to enrich tokens with citation forms. Not used in Korp, where citation forms are produced by SALDO instead.
- tool: Stanza
- model: https://spraakbanken.gu.se/resurser/stanzasynt
- annotations:
<token>:stanza.baseform
: citation form
- annotators:
stanza.annotate_swe
-
Sense disambiguation
- description: SALDO IDs from the <token>:saldo.sense attribute are enriched with likelihoods.
- tool: Sparv wsd
- documentation: Running the Koala word sense disambiguators
- model:
- annotations:
<token>:wsd.sense
: identifies senses in SALDO along with their likelihoods
- annotators:
wsd.annotate
-
Compound analysis with SALDO
- description: Tokens and their POS tags are looked up in the SALDO lexicon in order to enrich them with compound information. More information (in Swedish) is found in the FAQ. Citation forms are enriched in this analysis.
- model:
- annotations:
<token>:saldo.complemgram
: compound lemgrams including a comparison score
<token>:saldo.compwf
: compound word forms
<token>:saldo.baseform2
: citation form
- annotators:
saldo.compound
-
Sentiment analysis with SenSALDO
- description: Tokens and their SALDO IDs are looked up in SenSALDO in order to enrich them with sentiments.
- model: SenSALDO
- annotations:
<token>:sensaldo.sentiment_label
: sentiment
<token>:sensaldo.sentiment_score
: sentiment value
- annotators:
sensaldo.annotate
-
Named entity recognition with HFST-SweNER
- description: Sentence segments are analysed and enriched with named entities.
- tool: hfst-SweNER
- model: included in the tool
- references:
- tagset: HFST-SweNER tags
- annotations:
swener.ne
: named entity segment
swener.ne:swener.name
: text in the entire named entity segment
swener.ne:swener.ex
: named entity (name expression, numerical expression or time expression)
swener.ne:swener.type
: named entity type
swener.ne:swener.subtype
: named entity subtype
- annotators:
swener.annotate
-
Readability metrics
- description: Documents are analysed in order to enrich them with readability metrics.
- model: Method has no model.
- annotations:
<text>:readability.lix
: the Swedish readability metric LIX, läsbarhetsindex
<text>:readability.ovix
: the Swedish readability metric OVIX, ordvariationsindex
<text>:readability.nk
: the Swedish readability metric nominalkvot (noun ratio)
- annotators:
readability.lix
readability.ovix
readability.nominal_ratio
-
Lexical classes
- description: Tokens are looked up in Blingbring and SweFN in order to enrich them with information about their lexical classes. Documents are then enriched with information about lexical classes based on which classes are common for the tokens within them.
- model:
- annotations:
<token>:lexical_classes.blingbring
: lexical class from the Blingbring resource per token
<token>:lexical_classes.swefn
: frames from Swedish FrameNet (SweFN) per token
<text>:lexical_classes.blingbring
: lexical class from the Blingbring resource per document
<text>:lexical_classes.swefn
: frames from Swedish FrameNet (SweFN) per document
- annotators:
lexical_classes.blingbring_words
lexical_classes.swefn_words
lexical_classes.blingbring_text
lexical_classes.swefn_text
-
Geotagging
- description: Sentences (and paragraphs, if present) are enriched with the place names occurring within them, along with their geographic coordinates. This is based on the place names found by the named entity tagger. Geographical coordinates are looked up in the GeoNames database.
- model: GeoNames
- annotations:
<sentence>:geo.geo_context
: places and their coordinates occurring within the sentence
<paragraph>:geo.geo_context
: places and their coordinates occurring within the paragraph
- annotators:
geo.contextual
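As a rough illustration of the readability metrics listed above, the following Python sketch computes LIX and OVIX from their commonly cited definitions. It is not Sparv's implementation: the function names, the input format (lists of word tokens with punctuation already removed) and the handling of edge cases are assumptions made for this example.

```python
import math


def lix(sentences):
    """LIX = words per sentence + percentage of long words (more than 6 characters).

    `sentences` is a list of sentences, each a list of word tokens
    (assumption: punctuation has already been filtered out).
    """
    words = [word for sentence in sentences for word in sentence]
    if not sentences or not words:
        return float("nan")
    long_words = sum(1 for word in words if len(word) > 6)
    return len(words) / len(sentences) + 100 * long_words / len(words)


def ovix(words):
    """OVIX = log(tokens) / log(2 - log(types) / log(tokens)).

    Returns NaN where the formula is undefined (fewer than two tokens,
    or every token being a distinct type).
    """
    tokens = len(words)
    types = len({word.lower() for word in words})
    if tokens < 2 or types == tokens:
        return float("nan")
    return math.log(tokens) / math.log(2 - math.log(types) / math.log(tokens))


if __name__ == "__main__":
    # Hypothetical sample text, already segmented and tokenized.
    sample = [["Det", "är", "en", "läsbarhetsmetrik"], ["Den", "fungerar"]]
    print("LIX:", lix(sample))
    print("OVIX:", ovix([word for sentence in sample for word in sentence]))
```

Both functions work on plain token lists, so they could in principle be fed from the sentence and token segments described earlier; how Sparv itself filters tokens before counting is not covered here.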
All analyses for contemporary Swedish are also available for Swedish from the 1800s. In addition, some analyses have been adapted specifically for this variety:
-
POS-tagging with Hunpos (adapted for Swedish from the 1800s)
- description: Sentence segments are analysed to enrich tokens with part-of-speech tags and morphosyntactic information.
- tool: Hunpos
- model:
- suc3_suc-tags_default-setting_utf8.model
- a word list along with the words' morphosyntactic information generated from the Dalin morphology and the Swedberg morphology
- method: The model is trained on SUC 3.0.
- tagset: SUC MSD tags
- annotations:
<token>:hunpos.msd
: morphosyntactic tag
<token>:hunpos.pos
: part-of-speech tag
- annotators:
hunpos.msdtag_hist
hunpos.postag
-
Lexicon-based analyses
- description: Tokens and their POS tags are looked up in different lexicons in order to enrich them with more information.
- model:
- tagset: SALDO tags (for lemgrams)
- annotations:
<token>:hist.baseform
: citation forms
<token>:hist.sense
: identifies senses in SALDO
<token>:hist.lemgram
: lemgrams, identifying the inflectional table
<token>:hist.diapivot
: SALDO lemgrams from the diapivot model
<token>:hist.combined_lemgrams
: SALDO lemgram, combined from SALDO, Dalin, Swedberg and the diapivot model
- annotators:
hist.annotate_saldo
hist.diapivot_annotate
hist.combine_lemgrams
All analyses for contemporary Swedish are available for this language variety. However, we do not recommend using them, because the spelling often differs too much for the results to be satisfactory. At Språkbanken we use the following analyses for texts written in Old Swedish:
-
Sentence segmentation and tokenization (same analyses as for contemporary Swedish)
-
Spelling variations
- description: Tokens are looked up in a model to get common spelling variations.
- model: model for Old Swedish spelling variations
- annotations:
<token>:hist.spelling_variants
: possible spelling variations for the token
- annotators:
hist.spelling_variants
-
Lexicon-based analyses
- description: Tokens and their POS tags are looked up in different lexicons in order to enrich them with more information.
- model:
- tagset: SALDO tags for lemgrams
- annotations:
<token>:hist.baseform
: citation forms
<token>:hist.lemgram
: lemgrams, identifying the inflectional table
<token>:hist.diapivot
: SALDO lemgrams from the diapivot model
<token>:hist.combined_lemgrams
: SALDO lemgram, combined from SALDO, Dalin, Swedberg and the diapivot model
- annotators:
hist.annotate_saldo_fsv
hist.diapivot_annotate
hist.combine_lemgrams
-
Homograph sets
- description: A set of possible POS tags is extracted from the lemgram annotation.
- model: Method has no model.
- tagset: POS tags from the SUC MSD tag set
- annotations:
<token>:hist.homograph_set
: possible part-of-speech tags for the token
- annotators:
hist.extract_pos
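The extraction step described for homograph sets can be pictured with a short Python sketch. It assumes that lemgrams follow the pattern form..pos.index (for example springa..vb.1) and uses a tiny, hypothetical mapping from SALDO part-of-speech codes to SUC tags. The function names and the mapping are illustrative only and are not taken from hist.extract_pos.

```python
# Hypothetical and deliberately incomplete mapping from SALDO POS codes
# to SUC POS tags; the real annotator may use a different table.
SALDO_TO_SUC = {"nn": "NN", "vb": "VB", "av": "JJ", "ab": "AB"}


def pos_from_lemgram(lemgram):
    """Return the POS code embedded in a lemgram such as 'springa..vb.1'."""
    _form, separator, rest = lemgram.partition("..")
    if not separator:
        return None  # not a well-formed lemgram
    return rest.split(".")[0] or None


def homograph_set(lemgrams):
    """Collect the sorted set of SUC tags implied by a token's lemgrams."""
    tags = set()
    for lemgram in lemgrams:
        pos = pos_from_lemgram(lemgram)
        if pos in SALDO_TO_SUC:
            tags.add(SALDO_TO_SUC[pos])
    return sorted(tags)


if __name__ == "__main__":
    # Illustrative input: one token with a verb and a noun reading.
    print(homograph_set(["springa..vb.1", "springa..nn.1"]))
```

Lemgrams with an unknown POS code are silently skipped in this sketch; a real pipeline would need a complete mapping or an explicit fallback.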
Sparv supports analyses for a number of different languages. A list of which languages are supported and what analysis tools are available can be found here.
-
Analyses from TreeTagger
- description: Tokenised sentence segments are analysed to enrich tokens with more information.
- tool: TreeTagger
- model: Different language-dependent parameter files are used. Please check the TreeTagger web site for more information.
- tagset:
- Different language-dependent POS tag sets are used. Please check the TreeTagger web page for more information.
- Universal POS tags
- annotations:
<token>:treetagger.baseform
: citation form
<token>:treetagger.pos
: part-of-speech tag, may include morphosyntactic information
<token>:treetagger.upos
: universal part-of-speech tags, translated from <token>:treetagger.pos
- annotators:
treetagger.annotate
-
Analyses from FreeLing
- description: Entire documents are analysed with FreeLing for sentence segmentation, tokenization and enrichment with other information. FreeLing does not use the same permissive licence as Sparv. Installation of the Sparv FreeLing plugin is necessary.
- tool: FreeLing
- model: Models for different languages are included in the tool.
- tagset:
- Different language-dependent POS tagsets (often EAGLES). Please check the FreeLing documentation for more information.
- Universal POS tags
- annotations:
freeling.sentence
: sentence segments from FreeLing
freeling.token
: token segments from FreeLing
freeling.token:freeling.baseform
: citation form
freeling.token:freeling.pos
: part-of-speech tag, often including some morphosyntactic information
freeling.token:freeling.upos
: universal part-of-speech tags
freeling.token:freeling.ne_type
: named entity type (only available for some languages)
- annotators:
freeling.annotate or freeling.annotate_full (depending on the language)
-
Analyses from Stanford Parser (for English)
- description: Entire documents are analysed with Stanford Parser for sentence segmentation, tokenization and enrichment with other information.
- tool: Stanford Parser
- model: included in the tool
- tagset:
- annotations:
stanford.sentence
: sentence segments from Stanford Parser
stanford.token
: token segments from Stanford Parser
stanford.token:stanford.baseform
: citation form
stanford.token:stanford.pos
: part-of-speech tag
stanford.token:stanford.upos
: universal part-of-speech tags
stanford.token:stanford.ne_type
: named entity type
stanford.token:stanford.ref
: the token position within the sentence
stanford.token:stanford.dephead_ref
: dependency head, the ref of the word which the current word modifies or depends on
stanford.token:stanford.deprel
: dependency relation, the relation of the current word to its dependency head
- annotators:
stanford.annotate
stanford.make_ref
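The ref, dephead_ref and deprel annotations together encode a dependency tree. The Python sketch below shows one way to rebuild that tree from per-token values. The token dictionaries, the sample sentence with its relation labels, and the convention that the root token has an empty dephead_ref are assumptions made for illustration; they are not actual output from Stanford Parser or Sparv.

```python
from collections import defaultdict

# Hypothetical token records; the keys mirror the annotation names above.
TOKENS = [
    {"ref": "1", "word": "The", "dephead_ref": "2", "deprel": "det"},
    {"ref": "2", "word": "dog", "dephead_ref": "3", "deprel": "nsubj"},
    {"ref": "3", "word": "chases", "dephead_ref": "", "deprel": "root"},
    {"ref": "4", "word": "cats", "dephead_ref": "3", "deprel": "obj"},
]


def build_tree(tokens):
    """Return (roots, children); children maps a head ref to its dependents' refs."""
    children = defaultdict(list)
    roots = []
    for token in tokens:
        if token["dephead_ref"]:
            children[token["dephead_ref"]].append(token["ref"])
        else:
            # Assumption: an empty head reference marks the root of the sentence.
            roots.append(token["ref"])
    return roots, dict(children)


def render(tokens, ref, children, depth=0):
    """Return the subtree below `ref` as a list of indented lines."""
    by_ref = {token["ref"]: token for token in tokens}
    token = by_ref[ref]
    lines = ["  " * depth + f'{token["word"]} ({token["deprel"]})']
    for child in children.get(ref, []):
        lines.extend(render(tokens, child, children, depth + 1))
    return lines


if __name__ == "__main__":
    roots, children = build_tree(TOKENS)
    for root in roots:
        print("\n".join(render(TOKENS, root, children)))
```

The same approach applies to the ref, dephead_ref and deprel annotations from the other parsers on this page, since they share the same three-part encoding.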