Update README
amir-zeldes committed Mar 1, 2018
1 parent 669f24d commit 865bfa7
Showing 1 changed file with 13 additions and 2 deletions.
15 changes: 13 additions & 2 deletions README.md
@@ -1,7 +1,18 @@
# GUM
Repository for the Georgetown University Multilayer Corpus (GUM)

-This repository contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from eight text types (interviews, news, travel guides, how-to guides, academic writing, biographies, fiction and forum discussions). The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.
+This repository contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from eight text types:
+
+* interviews
+* news
+* travel guides
+* how-to guides
+* academic writing
+* biographies
+* fiction
+* online forum discussions
+
+The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.

## A note about reddit data
For one of the eight text types in this corpus, reddit forum discussions, plain text data is not supplied. To obtain this data, please run _build/process_reddit.py, then run _build/build_gum.py. This, and all data, is provided with absolutely no warranty; users agree to use the data under the license with which it is provided, and reddit data is subject to reddit's terms and conditions. See README_reddit.md for more details.
@@ -12,7 +23,7 @@ To cite this corpus, please refer to the following article:
Zeldes, Amir (2017) "The GUM Corpus: Creating Multilayer Resources in the Classroom". Language Resources and Evaluation 51(3), 581–612.

## Directories
-The corpus is downloadable in multiple formats. Not all formats contain all annotations. The most complete XML representation is in PAULA XML, and the easiest way to search in the corpus is using ANNIS. Other formats may be useful for other purposes. See website for more details.
+The corpus is downloadable in multiple formats. Not all formats contain all annotations. The most complete XML representation is in [PAULA XML](https://www.sfb632.uni-potsdam.de/en/paula.html), and the easiest way to search in the corpus is using [ANNIS](http://corpus-tools.org/annis). Other formats may be useful for other purposes. See website for more details.

**NB: reddit data is not included in top folders - consult README_reddit.md to add it**
