This is a Japanese text corpus consisting of Wikipedia articles with several layers of linguistic annotation:
morphology, named entities, dependencies, predicate-argument structures (including zero anaphora),
and coreference.
For the annotation guidelines, see the manuals in the doc directory of the ku-nlp/KWDLC repository.
- knp/: the corpus annotated with morphology, named entities, dependencies, predicate-argument structures, and coreferences
- org/: the raw corpus
- id/: document id files providing the train/dev/test split
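As a rough illustration of how the id/ files might be combined with knp/, the sketch below reads a list of document ids and maps each one to a .knp path. Several things here are assumptions rather than documented guarantees: the file name `<split>.id`, the one-id-per-line layout, and the rule that the subdirectory is the first eight characters of the id (inferred from a path such as knp/wiki0010/wiki00100176.knp). The helper names are hypothetical.

```python
from pathlib import Path


def read_doc_ids(id_file: Path) -> list[str]:
    """Read document ids, assuming one id per line (blank lines are skipped)."""
    with open(id_file, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def knp_path(corpus_root: Path, doc_id: str) -> Path:
    """Map a document id to its .knp file.

    Assumes the subdirectory is the first 8 characters of the id,
    e.g. wiki00100176 -> knp/wiki0010/wiki00100176.knp.
    """
    return corpus_root / "knp" / doc_id[:8] / f"{doc_id}.knp"


def split_paths(corpus_root: Path, split: str) -> list[Path]:
    """Return the .knp paths of one split; 'id/<split>.id' is an assumed name."""
    ids = read_doc_ids(corpus_root / "id" / f"{split}.id")
    return [knp_path(corpus_root, doc_id) for doc_id in ids]
```

If the actual id files are named or laid out differently, only `split_paths` and `read_doc_ids` need adjusting.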
Split | Documents | Sentences | Morphemes | Named entities | Predicates | Coreferring mentions |
---|---|---|---|---|---|---|
train | 2,080 | 4,841 | 118,289 | 8,144 | 32,034 | 26,852 |
dev | 100 | 248 | 6,353 | 423 | 1,702 | 1,435 |
test | 200 | 455 | 11,123 | 801 | 2,875 | 2,533 |
total | 2,380 | 5,544 | 135,765 | 9,368 | 36,611 | 30,820 |
Annotations of this corpus are given in the following format (a.k.a. the KNP format).
# S-ID:wiki000010000-1
* 2D
+ 3D
太郎 たろう 太郎 名詞 6 人名 5 * 0 * 0
は は は 助詞 9 副助詞 2 * 0 * 0
* 2D
+ 2D
京都 きょうと 京都 名詞 6 地名 4 * 0 * 0
+ 3D <NE:ORGANIZATION:京都大学>
大学 だいがく 大学 名詞 6 普通名詞 1 * 0 * 0
に に に 助詞 9 格助詞 1 * 0 * 0
* -1D
+ -1D <rel type="ガ" target="太郎" sid="w201106-0000010001-1" id="0"/><rel type="ニ" target="大学" sid="w201106-0000010001-1" id="2"/>
行った いった 行く 動詞 2 * 0 子音動詞カ行促音便形 3 タ形 10
EOS
A description of this format can be found in the documentation of KWDLC.
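To make the structure of the example above more concrete, here is a minimal sketch of a reader for a single sentence in this format, using only the Python standard library. It assumes that `*` lines open a phrase, `+` lines open a base phrase carrying a dependency such as `3D` plus optional `<NE:...>` and `<rel .../>` tags, and that every other line up to `EOS` is a morpheme whose first four fields are surface form, reading, lemma, and part of speech. The names `BasePhrase` and `parse_sentence` are hypothetical, and the sketch ignores many details of the real format; it is not a replacement for the KWDLC documentation.

```python
import re
from dataclasses import dataclass, field

PHRASE_RE = re.compile(r"^\* (-?\d+)([A-Z])")        # e.g. "* 2D"
BASE_PHRASE_RE = re.compile(r"^\+ (-?\d+)([A-Z])")   # e.g. "+ 3D <NE:...>"
NE_RE = re.compile(r"<NE:([^:>]+):([^>]+)>")
REL_RE = re.compile(r'<rel type="([^"]+)" target="([^"]+)"')


@dataclass
class BasePhrase:
    head: int        # index of the head base phrase, -1 for the root
    dep_type: str    # dependency type letter, e.g. "D"
    named_entities: list = field(default_factory=list)  # (category, text)
    rels: list = field(default_factory=list)            # (type, target)
    morphemes: list = field(default_factory=list)       # (surface, reading, lemma, pos)


def parse_sentence(knp_text: str) -> list[BasePhrase]:
    """Collect the base phrases of one sentence (simplified sketch)."""
    base_phrases = []
    for raw in knp_text.splitlines():
        line = raw.strip()
        # Skip blanks, the sentence id comment, phrase lines, and the terminator.
        if not line or line == "EOS" or line.startswith("# S-ID:") or PHRASE_RE.match(line):
            continue
        match = BASE_PHRASE_RE.match(line)
        if match:
            base_phrases.append(
                BasePhrase(
                    head=int(match.group(1)),
                    dep_type=match.group(2),
                    named_entities=NE_RE.findall(line),
                    rels=REL_RE.findall(line),
                )
            )
            continue
        # Anything else is treated as a morpheme of the current base phrase.
        fields = line.split(" ")
        if base_phrases and len(fields) >= 4:
            base_phrases[-1].morphemes.append(tuple(fields[:4]))
    return base_phrases


SAMPLE = """\
# S-ID:wiki000010000-1
* 2D
+ 3D
太郎 たろう 太郎 名詞 6 人名 5 * 0 * 0
は は は 助詞 9 副助詞 2 * 0 * 0
* 2D
+ 2D
京都 きょうと 京都 名詞 6 地名 4 * 0 * 0
+ 3D <NE:ORGANIZATION:京都大学>
大学 だいがく 大学 名詞 6 普通名詞 1 * 0 * 0
に に に 助詞 9 格助詞 1 * 0 * 0
* -1D
+ -1D <rel type="ガ" target="太郎" sid="w201106-0000010001-1" id="0"/><rel type="ニ" target="大学" sid="w201106-0000010001-1" id="2"/>
行った いった 行く 動詞 2 * 0 子音動詞カ行促音便形 3 タ形 10
EOS
"""

for index, bp in enumerate(parse_sentence(SAMPLE)):
    surface = "".join(m[0] for m in bp.morphemes)
    print(index, surface, "-> head", bp.head, bp.named_entities, bp.rels)
```

Run on the sample sentence, this yields four base phrases; the last one (行った) is the root and carries the ガ and ニ relations pointing at 太郎 and 大学.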
Note: You can use rhoknp to intuitively access annotations from Python without understanding the syntax of this format.
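from rhoknp import Document

# Load one annotated document from the knp/ directory
with open("knp/wiki0010/wiki00100176.knp", encoding="utf-8") as f:
    document = Document.from_knp(f.read())

# Iterate over every morpheme in the document
for morpheme in document.morphemes:
    ...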
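- Masatsugu Hangyo, Daisuke Kawahara, Sadao Kurohashi. Building and Analyzing a Diverse Document Leads Corpus Annotated with Semantic Relations (in Japanese), Journal of Natural Language Processing, Vol.21, No.2, pp.213-248, 2014. https://doi.org/10.5715/jnlp.21.213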
Language Media Processing Lab, Kyoto University (contact at nlp.ist.i.kyoto-u.ac.jp)
- Nobuhiro Ueda <ueda at nlp.ist.i.kyoto-u.ac.jp>
If you have any questions about or problems with this corpus, please email <nl-resource at nlp.ist.i.kyoto-u.ac.jp>.
This corpus is licensed under CC BY-SA 4.0. https://creativecommons.org/licenses/by-sa/4.0/