Merge pull request #174 from amir-zeldes/dev
V9.2.0
amir-zeldes authored Nov 10, 2023
2 parents b153503 + 5967358 commit 3b0ab7d
Showing 2,538 changed files with 1,017,224 additions and 504,783 deletions.
31 changes: 23 additions & 8 deletions README.md
@@ -17,7 +17,7 @@ This repository contains release versions of the Georgetown University Multilaye
* textbooks
* vlogs

The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: https://gucorpling.org/gum.
The corpus is created as part of the course LING-4427 (Computational Corpus Linguistics) at Georgetown University. For more details see: https://gucorpling.org/gum.

## A note about Reddit data

@@ -76,6 +76,21 @@ If you are using the OntoNotes schema version of the coreference annotations (a.
address = {Bangkok, Thailand}
```

For papers focusing on named entities or entity linking (Wikification), please cite this paper instead:

```
@inproceedings{lin-zeldes-2021-wikigum,
title = {{W}iki{GUM}: Exhaustive Entity Linking for Wikification in 12 Genres},
author = {Jessica Lin and Amir Zeldes},
booktitle = {Proceedings of The Joint 15th Linguistic Annotation Workshop (LAW) and
3rd Designing Meaning Representations (DMR) Workshop (LAW-DMR 2021)},
year = {2021},
address = {Punta Cana, Dominican Republic},
url = {https://aclanthology.org/2021.law-1.18},
pages = {170--175},
}
```

For a full list of contributors please see [the corpus website](https://gucorpling.org/gum).

## Directories
@@ -85,18 +100,18 @@ The corpus is downloadable in multiple formats. Not all formats contain all anno
**NB: Reddit data in top folders does not include the base text forms - consult README_reddit.md to add it**

* _build/ - The [GUM build bot](https://gucorpling.org/gum/build.html) and utilities for data merging and validation
* annis/ - The entire merged corpus, with all annotations, as a relANNIS 3.3 corpus dump, importable into [ANNIS](http://corpus-tools.org/annis)
* annis/ - The entire merged corpus (excl. Reddit), with all annotations, as a relANNIS 3.3 corpus dump, importable into [ANNIS](http://corpus-tools.org/annis)
* const/ - Constituent trees with function labels and PTB POS tags in the PTB bracketing format (automatic parser output from gold POS with functions projected from gold dependencies)
* coref/ - Entity and coreference annotation in two formats:
* conll/ - CoNLL shared task tabular format (with Wikification but no bridging or split antecedent annotations)
* tsv/ - WebAnno .tsv format, including entity type, salience and information status annotations, Wikification, bridging, split antecedent and singleton entities
* ontogum/ - alternative version of coreference annotation in CoNLL, tsv and CoNLL-U formats following OntoNotes guidelines (see Zhu et al. 2021)
* dep/ - Dependency trees using Universal Dependencies, enriched with metadata, summaries, sentence types, speaker information, enhanced dependencies, entities, information status, salience, centering, coreference, bridging, Wikification, XML markup, morphological tags and Universal POS tags according to the UD standard
* paula/ - The entire merged corpus in standoff [PAULA XML](https://github.com/korpling/paula-xml), with all annotations
* rst/ - Rhetorical Structure Theory analyses
* rstweb/ - full .rs3 format data as used by RSTTool and rstWeb (recommended)
* dep/ - Dependency trees using Universal Dependencies, enriched with metadata, summaries, sentence types, speaker information, enhanced dependencies, entities, information status, salience, centering, coreference, bridging, Wikification, XML markup, morphological tags/segmentation, CxG constructions, discourse relations/connectives/signals, and Universal POS tags according to the UD standard
* paula/ - The entire merged corpus (excl. Reddit) in standoff [PAULA XML](https://github.com/korpling/paula-xml), with all annotations
* rst/ - Enhanced Rhetorical Structure Theory (RST++) analyses
* rstweb/ - full .rs4 format data as used by RSTTool and rstWeb, with secondary edges + relation signals (recommended)
* lisp_nary/ - n-ary lisp trees (.dis format)
* lisp_binary/ - binarized lisp trees (.dis format)
* dependencies/ - a converted RST dependency representation (.rsd format)
* disrpt/ - plain segmentation and relation-per-line data formats following the DISRPT shared task specification
* xml/ - vertical XML representations with 1 token or tag per line, metadata, summaries and tab delimited lemmas and POS tags (extended VVZ style, vanilla, UPOS and CLAWS5, as well as dependency functions), compatible with the IMS Corpus Workbench (a.k.a. TreeTagger format).
* disrpt/ - plain segmentation, connective detection and relation-per-line data formats following the DISRPT shared task specification
* xml/ - vertical XML representations with 1 token or tag per line, metadata, summaries and tab delimited lemmas, morphological segmentation and POS tags (extended VVZ style, vanilla, UPOS and CLAWS5, as well as dependency functions), compatible with the IMS Corpus Workbench (a.k.a. TreeTagger format).
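The enriched dep/ CoNLL-U files listed above can be read with a few lines of standard Python. Below is a minimal sketch assuming the usual 10-column CoNLL-U layout; the sample sentence is invented for illustration, not taken from the corpus:

```python
# Minimal CoNLL-U reader sketch: extracts (token, UPOS) pairs per sentence.
# The sample string below is a hypothetical two-token sentence, not real GUM data.
def read_conllu(text):
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):        # metadata comments (e.g. # sent_id)
            continue
        else:
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue                  # skip multiword-token and empty-node rows
            current.append((cols[1], cols[3]))  # FORM, UPOS columns
    if current:
        sentences.append(current)
    return sentences

sample = ("# sent_id = example-1\n"
          "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\t_\n"
          "2\tworld\tworld\tNOUN\tNN\t_\t1\tvocative\t_\t_\n")
print(read_conllu(sample))  # [[('Hello', 'INTJ'), ('world', 'NOUN')]]
```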
32 changes: 21 additions & 11 deletions _build/build_gum.py
@@ -41,9 +41,13 @@ def setup_directories(gum_source, gum_target):
parser.add_argument("--pepper_only",action="store_true", help="Just rerun pepper on generated targets")
parser.add_argument("--skip_ptb_labels",action="store_true", help="Skip projecting function labels to PTB trees")
parser.add_argument("--skip_ontogum",action="store_true", help="Skip building OntoGUM version of coref data")
parser.add_argument("--no_secedges",action="store_true", help="No RST++ secedges in conllu")
parser.add_argument("--no_signals",action="store_true", help="No RST++ signals in conllu")
parser.add_argument("--corpus_name",action="store", default="GUM", help="Corpus name / document prefix")

options = parser.parse_args()

corpus_name = options.corpus_name
build_dir = os.path.dirname(os.path.realpath(__file__))
pepper_home = build_dir + os.sep + "utils" + os.sep + "pepper" + os.sep
pepper_tmp = pepper_home + "tmp" + os.sep
@@ -95,7 +99,7 @@ def setup_directories(gum_source, gum_target):
######################################
## Step 2: propagate annotations
######################################
from utils.propagate import enrich_dep, enrich_xml, compile_ud, tt2vanilla
from utils.propagate import enrich_dep, enrich_xml, compile_ud, tt2vanilla, fix_gw_tags
from utils.repair_tsv import fix_tsv, make_ontogum
from utils.repair_rst import fix_rst

@@ -195,6 +199,7 @@ def check_diff(xml, ptb, docname):
# Check and potentially correct POS tags and lemmas based on pooled annotations
#proof(gum_source)

conn_data = {}
if not options.pepper_only:
# Token and sentence border adjustments
print("\nAdjusting token and sentence borders:\n" + "="*37)
@@ -209,14 +214,14 @@ def check_diff(xml, ptb, docname):
# Adjust rst/ files:
# * refresh token strings in case of inconsistency
# * note that segment borders are not automatically adjusted around xml/ <s> elements
fix_rst(gum_source, gum_target, reddit=reddit)
conn_data = fix_rst(gum_source, gum_target, reddit=reddit)

# Add annotations to xml/:
# * add CLAWS tags in fourth column
# * add fifth column after lemma containing tok_func from dep/
# * add Centering Theory transition types to sentences
print("\n\nEnriching XML files:\n" + "="*23)
enrich_xml(gum_source, gum_target, centering_data, add_claws=options.claws, reddit=reddit)
enrich_xml(gum_source, gum_target, centering_data, add_claws=options.claws, reddit=reddit, corpus=corpus_name)

# Add annotations to dep/:
# * fresh token strings, POS tags and lemmas from xml/
@@ -246,7 +251,9 @@ def check_diff(xml, ptb, docname):
# * udapi does not support Python 2, meaning punctuation will be attached to the root if using Python 2
# * UD morphology generation relies on parses already existing in <target>/const/
print("\nCompiling Universal Dependencies version:\n" + "=" * 40)
compile_ud(pepper_tmp, gum_target, pre_annotated, reddit=reddit)
compile_ud(pepper_tmp, gum_target, pre_annotated, reddit=reddit, corpus=corpus_name)

fix_gw_tags(gum_target, reddit=reddit)

if not options.skip_ontogum:
# Create OntoGUM data (OntoNotes schema version of coref annotations)
@@ -288,7 +295,7 @@ def check_diff(xml, ptb, docname):

# Create Pepper staging area in utils/pepper/tmp/
dirs = [('xml','xml','xml','', ''),('dep','ud','conllu','', os.sep + "ud" + os.sep + "not-to-release"),
('rst'+os.sep+'rstweb','rst','rs3','',''),('rst'+os.sep+'dependencies','rsd','rsd','',''),
('rst'+os.sep+'rstweb','rst','rs[34]','',''),('rst'+os.sep+'dependencies','rsd','rsd','',''),
('tsv','tsv','tsv','coref' + os.sep,''),('const','const','ptb','','')]
for dir in dirs:
files = []
@@ -348,19 +355,22 @@ def check_diff(xml, ptb, docname):
if not options.skip_ontogum:
if options.no_pepper:
sys.__stdout__.write("\ni Not adding entity information to UD parses in OntoGUM version since Pepper conversion was skipped\n")
add_entities_to_conllu(gum_target,reddit=reddit,ontogum=True)
else:
add_entities_to_conllu(gum_target,reddit=reddit,ontogum=True)
add_bridging_to_conllu(gum_target,reddit=reddit)
add_bridging_to_conllu(gum_target,reddit=reddit,corpus=corpus_name)

sys.__stdout__.write("\no Added entities, coreference and bridging to UD parses\n")

add_rsd_to_conllu(gum_target,reddit=reddit)
add_rsd_to_conllu(gum_target,reddit=reddit,ontogum=True)
add_xml_to_conllu(gum_target,reddit=reddit)
add_xml_to_conllu(gum_target,reddit=reddit,ontogum=True)
add_rsd_to_conllu(gum_target,reddit=reddit,output_signals=not options.no_signals,output_secedges=not options.no_secedges)
if not options.skip_ontogum:
add_rsd_to_conllu(gum_target,reddit=reddit,ontogum=True,output_signals=not options.no_signals,output_secedges=not options.no_secedges)
add_xml_to_conllu(gum_target,reddit=reddit,corpus=corpus_name)
if not options.skip_ontogum:
add_xml_to_conllu(gum_target,reddit=reddit,ontogum=True,corpus=corpus_name)

sys.__stdout__.write("\no Added discourse relations and XML tags to UD parses\n")

make_disrpt(reddit=reddit)
make_disrpt(conn_data,reddit=reddit)

sys.__stdout__.write("\no Created DISRPT shared task discourse relation formats in target rst/disrpt/\n")
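The new build options introduced in this diff follow standard argparse conventions; the negated `--no_*` flags are flipped into positive output switches when passed downstream. A self-contained sketch reproducing just the three new flags (the real script defines many more):

```python
# Sketch of the newly added build_gum.py options, mirroring the argparse
# calls in the diff. parse_args() is given an explicit argv for illustration.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--no_secedges", action="store_true", help="No RST++ secedges in conllu")
parser.add_argument("--no_signals", action="store_true", help="No RST++ signals in conllu")
parser.add_argument("--corpus_name", action="store", default="GUM", help="Corpus name / document prefix")

options = parser.parse_args(["--no_signals"])
# Downstream, the negated flags become positive output switches:
output_signals = not options.no_signals    # False here
output_secedges = not options.no_secedges  # True here
print(options.corpus_name, output_signals, output_secedges)  # GUM False True
```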
4 changes: 2 additions & 2 deletions _build/src/const/GUM_academic_census.ptb
@@ -1,4 +1,4 @@
(ROOT (NP (CD 1) (NN Introduction)))
(ROOT (NP (LS 1) (NN Introduction)))

(ROOT
(S
@@ -457,7 +457,7 @@
(IN to)
(NP
(NP (DT a) (JJ particular) (NN field))
(PRN (-LRB- [) (CD 12) (NN –) (CD 16) (-RRB- ])))))
(PRN (-LRB- [) (CD 12) (SYM –) (CD 16) (-RRB- ])))))
(, ,)
(CC and)
(VP
4 changes: 2 additions & 2 deletions _build/src/const/GUM_academic_huh.ptb
@@ -825,7 +825,7 @@
(NP (NNP Extract) (CD 2) (NNP Siwu))
(PRN (-LRB- -LRB-) (NP (NNP Ghana)) (-RRB- -RRB-)))
(-LRB- [)
(SYM Maize1_1017013)
(NP (NNP Maize1_1017013))
(-RRB- ])))

(ROOT
@@ -834,7 +834,7 @@
(NP (NNP Extract) (CD 3) (NNP Lao))
(PRN (-LRB- -LRB-) (NP (NNP Laos)) (-RRB- -RRB-)))
(-LRB- [)
(NP (SYM CONV_050815c_03.10))
(NP (NNP CONV_050815c_03.10))
(-RRB- ])))

(ROOT
2 changes: 1 addition & 1 deletion _build/src/const/GUM_bio_moreau.ptb
@@ -76,7 +76,7 @@
(NP (JJ French) (NN pronunciation))
(: :)
(-LRB- [)
(NP (SYM ʒan) (SYM mɔʁo))
(NP (NNP ʒan) (NNP mɔʁo))
(-RRB- ])
(: ;)
(NP
4 changes: 2 additions & 2 deletions _build/src/const/GUM_conversation_christmas.ptb
@@ -688,7 +688,7 @@

(ROOT (INTJ (UH Oh) (. .)))

(ROOT (FRAG (UH Thanks) (NP (NNP Judy)) (. .)))
(ROOT (FRAG (NNS Thanks) (NP (NNP Judy)) (. .)))

(ROOT (S (INTJ (UH Oh)) (NP (DT that)) (VP (VBZ 's)) (: —)))

@@ -934,7 +934,7 @@

(ROOT (INTJ (UH Hm) (. .)))

(ROOT (INTJ (UH Oh) (UH good) (. .)))
(ROOT (INTJ (UH Oh) (JJ good) (. .)))

(ROOT (INTJ (UH Yeah) (. .)))

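Retagging fixes like the UH → NNS and UH → JJ corrections above are easy to audit mechanically once the leaf (tag, token) pairs are pulled out of the bracketing. A small sketch in pure Python (no external parser assumed), using the corrected tree from this file as its example:

```python
import re

# Extract (tag, token) leaf pairs from a Penn Treebank bracketing.
# A leaf is an innermost "(TAG token)" pair; neither part may contain
# whitespace or parentheses, so escaped brackets like -LRB- still match.
def ptb_leaves(tree):
    return re.findall(r"\(([^\s()]+) ([^\s()]+)\)", tree)

tree = "(ROOT (FRAG (NNS Thanks) (NP (NNP Judy)) (. .)))"
print(ptb_leaves(tree))  # [('NNS', 'Thanks'), ('NNP', 'Judy'), ('.', '.')]
```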
2 changes: 1 addition & 1 deletion _build/src/const/GUM_conversation_erasmus.ptb
@@ -994,7 +994,7 @@

(ROOT
(S
(NP (DT Such) (NNS arguments))
(NP (JJ Such) (NNS arguments))
(, ,)
(PP (IN on) (NP (DT the) (NN contrary)))
(, ,)
2 changes: 1 addition & 1 deletion _build/src/const/GUM_conversation_family.ptb
@@ -1242,7 +1242,7 @@
(VBD was)
(ADJP
(RB so)
(VBN worried)
(JJ worried)
(PP (IN about) (NP (NNP Jonathan))))))))
(. .)))

2 changes: 1 addition & 1 deletion _build/src/const/GUM_fiction_giants.ptb
@@ -682,7 +682,7 @@
(S
(S
(NP (PRP$ her) (NNS parents))
(DT both)
(ADVP (RB both))
(VP (VBD worked) (PP (IN in) (NP (DT the) (NNS vineyards)))))
(, ,)
(CC and)
2 changes: 1 addition & 1 deletion _build/src/const/GUM_fiction_pag.ptb
@@ -1452,7 +1452,7 @@
(VP
(VBZ 's)
(SBAR
(JJ like)
(IN like)
(, ,)
(S
(NP (PRP$ your) (NN mom) (CC and) (NN dad))
2 changes: 1 addition & 1 deletion _build/src/const/GUM_fiction_sneeze.ptb
@@ -597,7 +597,7 @@
(VP
(VBZ yacks)
(PRT (RB on))
(PP (RB like) (NP (DT that)))
(PP (IN like) (NP (DT that)))
(PP (IN in) (NP (DT an) (NN exam))))))
(, ,)
(NP (PRP I))
2 changes: 1 addition & 1 deletion _build/src/const/GUM_fiction_time.ptb
@@ -1012,7 +1012,7 @@
(VP
(VBD was)
(VP
(VP (VBN corrugated))
(VP (JJ corrugated))
(CC and)
(VP
(VBN ornamented)
2 changes: 1 addition & 1 deletion _build/src/const/GUM_fiction_veronique.ptb
@@ -714,7 +714,7 @@
(S
(ADVP (RB Sometimes))
(NP (EX there))
(VP (VBD was) (NP (NN gunfire)) (ADVP (IN outside))))
(VP (VBD was) (NP (NN gunfire)) (ADVP (RB outside))))
(CC and)
(S
(NP (PRP we))
2 changes: 1 addition & 1 deletion _build/src/const/GUM_interview_chomsky.ptb
@@ -418,7 +418,7 @@

(ROOT
(S
(ADVP (UH Now))
(ADVP (RB Now))
(NP (PRP it))
(VP
(VBZ is)
4 changes: 2 additions & 2 deletions _build/src/const/GUM_interview_mckenzie.ptb
@@ -84,7 +84,7 @@
(S
(-LRB- [)
(NP
(NP (DT The) (NN Spitfire) (NN Tournament))
(NP (DT The) (NNP Spitfire) (NNP Tournament))
(PP (IN in) (NP (NNP Canada))))
(-RRB- ])
(VP
@@ -101,7 +101,7 @@
(NP
(NP (DT a) (NN tournament))
(SBAR
(WHNP (IN that))
(WHNP (WDT that))
(S
(NP (PRP I))
(VP
4 changes: 2 additions & 2 deletions _build/src/const/GUM_interview_messina.ptb
@@ -35,7 +35,7 @@
(S
(NP (PRP I))
(VP
(VBP m)
(VBP 'm)
(RB not)
(NP
(NP (NNP Michael) (NNP Jackson))
@@ -718,7 +718,7 @@
(WHNP (WDT that))
(S
(VP
(VBZ s)
(VBZ 's)
(VP
(VBN battled)
(ADVP (RB hard))
2 changes: 1 addition & 1 deletion _build/src/const/GUM_interview_shalev.ptb
@@ -1,6 +1,6 @@
(ROOT
(S
(NP (NN Wikinews))
(NP (NNP Wikinews))
(VP
(VBZ interviews)
(NP
2 changes: 1 addition & 1 deletion _build/src/const/GUM_news_expo.ptb
@@ -519,7 +519,7 @@
(VP
(VBG collecting)
(NP (NNS comics))
(PP (NN on) (HYPH -) (RB site))))))
(PP (IN on) (HYPH -) (RB site))))))
(, ,)
(NP (DT the) (NN group))
(VP
2 changes: 1 addition & 1 deletion _build/src/const/GUM_news_ie9.ptb
@@ -299,7 +299,7 @@
(PP (IN over) (NP (DT the) (JJ past) (CD 30) (NNS days)))
(NP (PRP it))
(VP
(VBZ s)
(VBZ 's)
(VP
(VBN been)
(UCP
2 changes: 1 addition & 1 deletion _build/src/const/GUM_news_imprisoned.ptb
@@ -358,7 +358,7 @@
(VP
(VBG working)
(PP
(PP (IN from) (NP (CD 6:00am)))
(PP (IN from) (NP (CD 6:00) (NN am)))
(PP (IN to) (NP (NN midnight))))))
(CC and)
(VP
6 changes: 3 additions & 3 deletions _build/src/const/GUM_news_nasa.ptb
@@ -926,7 +926,7 @@
(S
(NP (PRP I))
(VP
(VBP m)
(VBP 'm)
(ADJP
(RB deeply)
(VBN disappointed)
@@ -1075,7 +1075,7 @@
(, ,)
(NP (PRP I))
(VP
(VBP m)
(VBP 'm)
(ADJP
(VBN disappointed)
(SBAR
@@ -1269,7 +1269,7 @@
(S
(NP (PRP I))
(VP
(VBP m)
(VBP 'm)
(ADJP
(RB deeply)
(VBN disappointed)
2 changes: 1 addition & 1 deletion _build/src/const/GUM_news_warhol.ptb
@@ -2075,7 +2075,7 @@
(, ,)
(NP
(NP (JJ many) (NNS people))
(SBAR (S (NP (NN Wikinews)) (VP (VBD observed)))))
(SBAR (S (NP (NNP Wikinews)) (VP (VBD observed)))))
(VP
(VP
(VBD took)