Merge pull request #32 from amir-zeldes/dev
V4.0.0
amir-zeldes authored Mar 1, 2018
2 parents e39c2d0 + f4d15a8 commit fdd5479
Showing 2,157 changed files with 2,325,222 additions and 1,649,428 deletions.
340 changes: 46 additions & 294 deletions LICENSE.txt

Large diffs are not rendered by default.

24 changes: 20 additions & 4 deletions README.md
@@ -1,15 +1,31 @@
# gum
# GUM
Repository for the Georgetown University Multilayer Corpus (GUM)

This repository contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from four text types (interviews, news, travel guides, instructional texts). The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.
This repository contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from eight text types:

* interviews
* news
* travel guides
* how-to guides
* academic writing
* biographies
* fiction
* online forum discussions

The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.

## A note about reddit data
For one of the eight text types in this corpus, reddit forum discussions, plain text data is not supplied. To obtain this data, please run `_build/process_reddit.py`, then run `_build/build_gum.py`. This data, like all data in this repository, is provided with absolutely no warranty; users agree to use the data under the license with which it is provided, and reddit data is subject to reddit's terms and conditions. See [README_reddit.md](README_reddit.md) for more details.

## Citing
To cite this corpus, please refer to the following article:

Zeldes, Amir (2017) "The GUM Corpus: Creating Multilayer Resources in the Classroom". Language Resources and Evaluation 51(3), 581–612.

## Directories
The corpus is downloadable in multiple formats. Not all formats contain all annotations. The most complete XML representation is in PAULA XML, and the easiest way to search in the corpus is using ANNIS. Other formats may be useful for other purposes. See website for more details.
The corpus is downloadable in multiple formats. Not all formats contain all annotations. The most complete XML representation is in [PAULA XML](https://www.sfb632.uni-potsdam.de/en/paula.html), and the easiest way to search in the corpus is using [ANNIS](http://corpus-tools.org/annis). Other formats may be useful for other purposes. See website for more details.

**NB: reddit data is not included in the top-level folders; consult README_reddit.md to add it**

* _build/ - The [GUM build bot](https://corpling.uis.georgetown.edu/gum/build.html) and utilities for data merging and validation
* annis/ - The entire merged corpus, with all annotations, as a relANNIS 3.3 corpus dump, importable into [ANNIS](http://corpus-tools.org/annis)
@@ -19,7 +35,7 @@ The corpus is downloadable in multiple formats. Not all formats contain all anno
* tsv/ - WebAnno .tsv format, including entity and information status annotations, bridging and singleton entities
* dep/ - Dependency trees of two kinds:
* stanford/ - Original Stanford Typed Dependencies (manually corrected) in the CoNLLX 10 column format with extended PTB POS tags (following TreeTagger/Amalgam, e.g. tags like VVZ), as well as speaker and sentence type annotations
* ud/ - Universal Dependencies data, automatically converted from the Stanford Typed Dependency data, enriched with automatic morphological tags and Universal POS tags according to the UD standard
* ud/ - Universal Dependencies data, automatically converted from the gold Stanford Typed Dependency data, enriched with automatic morphological tags and Universal POS tags according to the UD standard
* paula/ - The entire merged corpus in standoff [PAULA XML](https://www.sfb632.uni-potsdam.de/en/paula.html), with all annotations
* rst/ - Rhetorical Structure Theory analyses in .rs3 format as used by RSTTool and rstWeb (spaces between words correspond to the tokenization in the rest of the corpus)
* xml/ - vertical XML representations with 1 token or tag per line and tab-delimited lemmas and POS tags (extended, VVZ style, vanilla and CLAWS5, as well as dependency functions), compatible with the IMS Corpus Workbench (a.k.a. TreeTagger format).
25 changes: 25 additions & 0 deletions README_reddit.md
@@ -0,0 +1,25 @@
# Data from reddit

For one of the text types in this corpus, reddit forum discussions, plain text data is not supplied in this repository. To obtain this data, please follow the instructions below.

## Annotations

Documents in the reddit subcorpus are named GUM_reddit_* (e.g. GUM_reddit_superman) and are *not* included in the root folder with all annotation layers. The annotations for the reddit subcorpus can be found together with all other document annotations in `_build/src/`. Token representations in these files are replaced with underscores, while the annotations themselves are included in the files. To compile the corpus including reddit data, you must obtain the underlying texts.

## Obtaining underlying reddit text data

To recover reddit data, use the API provided by the script `_build/process_reddit.py`. If you have your own credentials for use with the Python reddit API wrapper (praw) and Google BigQuery, you should include them in two files, `praw.txt` and `key.json`, in `_build/utils/get_reddit/`. For this to work, you must have the praw and bigquery libraries installed for Python (e.g. via pip). You can then run `python _build/process_reddit.py` to recover the data, and proceed to the next step, re-building the corpus.

Alternatively, if you can't use praw/bigquery, the script `_build/process_reddit.py` will offer to download the data for you by proxy. To do this, run the script and confirm that you will only use the data according to the terms and conditions determined by reddit, and for non-commercial purposes. The script will then download the data for you; if the download is successful, you can continue to the next step and re-build the corpus.
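
For orientation, the praw/bigquery route might look like the following sketch (the pip package name for the bigquery client is not fixed here and is an assumption to adjust to your environment):

```bash
# Credentials expected by process_reddit.py (praw/bigquery route):
#   _build/utils/get_reddit/praw.txt
#   _build/utils/get_reddit/key.json

# praw and a bigquery client library must be installed, e.g. via pip
pip install praw

# Restore the underscored reddit tokens in _build/src/ ("add" is the default mode)
python _build/process_reddit.py -m add

# The tokens can later be re-replaced with underscores if required
python _build/process_reddit.py -m del
```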

## Rebuilding the corpus with reddit data

To compile all projected annotations and produce all formats not included in `_build/src/`, you will need to run the GUM build bot: `python _build/build_gum.py`. This process is described in detail at https://corpling.uis.georgetown.edu/gum/build.html, but summarized instructions follow.

At a minimum, you can run `python _build/build_gum.py` with no options. This will produce basic formats in `_build/target/`, but skip generating fresh constituent parses, CLAWS5 tags and the Universal Dependencies version of the dependency data. To include these you will need the following (an example invocation is shown after the list):

* CLAWS5: use option -c and ensure that utils/paths.py points to an executable for the TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). The CLAWS 5 parameter file is already included in utils/treetagger/lib/, and tags are auto-corrected by the build bot based on gold PTB tags.
* Constituent parses: option -p; ensure that paths.py correctly points to your installation of the Stanford Parser/CoreNLP
* Universal Dependencies: option -u; ensure that paths.py points to CoreNLP, and that you have installed udapi and depedit (`pip install udapi; pip install depedit`). Note that this only works with Python 3.
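
The example invocation referenced above is sketched below; it assumes the relevant paths in `utils/paths.py` have been configured and that reddit tokens have already been restored if reddit documents should be included:

```bash
# Basic build: produces the standard formats in _build/target/
python _build/build_gum.py

# Fuller build: fresh constituent parses (-p), CLAWS5 tags (-c)
# and the Universal Dependencies version (-u, Python 3 only)
python _build/build_gum.py -p -c -u
```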

If you run into problems building the corpus, feel free to report an issue via GitHub or contact us via e-mail.
59 changes: 41 additions & 18 deletions _build/build_gum.py
@@ -13,6 +13,7 @@

PY2 = sys.version_info[0] < 3


def setup_directories(gum_source, gum_target):
    if not os.path.exists(gum_source):
        raise IOError("Source file directory " + gum_source + " not found.")
@@ -32,6 +33,7 @@ def setup_directories(gum_source, gum_target):
parser.add_argument("-s",dest="source",action="store",help="GUM build source directory", default="src")
parser.add_argument("-p",dest="parse",action="store_true",help="Whether to reparse constituents")
parser.add_argument("-c",dest="claws",action="store_true",help="Whether to reassign claws5 tags")
parser.add_argument("-u",dest="unidep",action="store_true",help="Whether to create a Universal Dependencies version")
parser.add_argument("-v",dest="verbose_pepper",action="store_true",help="Whether to print verbose pepper output")
parser.add_argument("-n",dest="no_pepper",action="store_true",help="No pepper conversion, just validation and file fixing")
parser.add_argument("-i",dest="increment_version",action="store",help="A new version number to assign",default="DEVELOP")
@@ -41,25 +43,40 @@ def setup_directories(gum_source, gum_target):
gum_source = os.path.abspath(options.source)
gum_target = os.path.abspath(options.target)


if gum_source[-1] != os.sep:
    gum_source += os.sep
if gum_target[-1] != os.sep:
    gum_target += os.sep
setup_directories(gum_source,gum_target)


######################################
## Step 1:
######################################
# validate input for further steps
from utils.validate import validate_src
from utils.validate import validate_src, check_reddit

print("="*20)
print("Validating files...")
print("="*20 + "\n")

validate_src(gum_source)
reddit = check_reddit(gum_source)
if not reddit:
    print("Could not find restored tokens in reddit documents.")
    print("Abort conversion or continue without reddit? (You can restore reddit tokens using process_reddit.py)")
    try:
        # for python 2
        response = raw_input("[A]bort/[C]ontinue> ")
    except NameError:
        # for python 3
        response = input("[A]bort/[C]ontinue> ")
    if response.upper() != "C":
        print("Aborting build.")
        sys.exit()
else:
    print("Found reddit source data")
    print("Including reddit data in build")

validate_src(gum_source, reddit=reddit)

######################################
## Step 2: propagate annotations
@@ -78,31 +95,31 @@ def setup_directories(gum_source, gum_target):
# * generates vanilla tags in CPOS column from POS
# * creates speaker and s_type comments from xml/
print("\nEnriching Dependencies:\n" + "="*23)
enrich_dep(gum_source, gum_target)
enrich_dep(gum_source, gum_target, reddit)

# Add annotations to xml/:
# * add CLAWS tags in fourth column
# * add fifth column after lemma containing tok_func from dep/
print("\nEnriching XML files:\n" + "="*23)
enrich_xml(gum_source, gum_target, options.claws)
enrich_xml(gum_source, gum_target, add_claws=options.claws, reddit=reddit)

# Token and sentence border adjustments
print("\nAdjusting token and sentence borders:\n" + "="*37)
# Adjust tsv/ files:
# * refresh and re-merge token strings in case they were mangled by WebAnno
# * adjust sentence borders to match xml/ <s>-tags
fix_tsv(gum_source,gum_target)
fix_tsv(gum_source, gum_target, reddit=reddit)

# Adjust rst/ files:
# * refresh token strings in case of inconsistency
# * note that segment borders are not automatically adjusted around xml/ <s> elements
fix_rst(gum_source,gum_target)
fix_rst(gum_source, gum_target, reddit=reddit)

# Create fresh constituent parses in const/ if desired
# (either reparse or use dep2const conversion, e.g. https://github.com/ikekonglp/PAD)
if options.parse:
    print("\nRegenerating constituent trees:\n" + "="*30)
    const_parse(gum_source,gum_target)
    const_parse(gum_source, gum_target, reddit=reddit)
else:
    sys.stdout.write("\ni Skipping fresh parse for const/\n")
    if not os.path.exists(gum_target + "const"):
@@ -117,12 +134,13 @@ def setup_directories(gum_source, gum_target):
# * UD punctuation guidelines are enforced using udapi, which must be installed to work
# * udapi does not support Python 2, meaning punctuation will be attached to the root if using Python 2
# * UD morphology generation relies on parses already existing in <target>/const/
print("\nCreating Universal Dependencies version:\n" + "=" * 40)
if PY2:
print("WARN: Running on Python 2 - consider upgrading to Python 3. ")
print(" Punctuation behavior in the UD conversion relies on udapi ")
print(" which does not support Python 2. All punctuation will be attached to sentence roots.\n")
create_ud(gum_target)
if options.unidep:
print("\nCreating Universal Dependencies version:\n" + "=" * 40)
if PY2:
print("WARN: Running on Python 2 - consider upgrading to Python 3. ")
print(" Punctuation behavior in the UD conversion relies on udapi ")
print(" which does not support Python 2. All punctuation will be attached to sentence roots.\n")
create_ud(gum_target, reddit=reddit)

## Step 3: merge and convert source formats to target formats
if options.no_pepper:
@@ -132,10 +150,15 @@ def setup_directories(gum_source, gum_target):

# Create Pepper staging area in utils/pepper/tmp/
pepper_home = "utils" + os.sep + "pepper" + os.sep
dirs = [('xml','xml',''),('dep','conll10',''),('rst','rs3',''),('tsv','tsv','coref' + os.sep),('const','ptb','')]
dirs = [('xml','xml','', ''),('dep','conll10','', os.sep + "stanford"),('rst','rs3','',''),('tsv','tsv','coref' + os.sep,''),('const','ptb','','')]
for dir in dirs:
    dir_name, extension, prefix = dir
    files = glob(gum_target + prefix + dir_name + os.sep + "*" + extension)
    files = []
    dir_name, extension, prefix, suffix = dir
    files_ = glob(gum_target + prefix + dir_name + suffix + os.sep + "*" + extension)
    for file_ in files_:
        if not reddit and "reddit_" in file_:
            continue
        files.append(file_)
    pepper_tmp = pepper_home + "tmp" + os.sep
    if not os.path.exists(pepper_tmp + dir_name + os.sep + "GUM" + os.sep):
        os.makedirs(pepper_tmp + dir_name + os.sep + "GUM" + os.sep)
33 changes: 33 additions & 0 deletions _build/process_reddit.py
@@ -0,0 +1,33 @@
import io, os, sys
from utils.get_reddit.fetch_text import run_fetch
from argparse import ArgumentParser
from utils.get_reddit.underscores import underscoring, deunderscoring

PY3 = sys.version_info[0] == 3

if not PY3:
    reload(sys)
    sys.setdefaultencoding('utf8')

if __name__ == "__main__":

    parser = ArgumentParser()
    parser.add_argument("-m","--mode",action="store",choices=["add","del"],default="add",help="Whether to add reddit token data or delete it")

    options = parser.parse_args()

    src_dir = os.path.dirname(os.path.realpath(__file__)) + os.sep + "src" + os.sep

    if options.mode == "add":

        textdic = run_fetch()

        deunderscoring(src_dir, textdic)

        print("Completed fetching reddit data.")
        print("You can now use build_gum.py to produce all annotation layers.")

    elif options.mode == "del":

        underscoring(src_dir)
        print("Tokens in reddit data have been replaced with underscores.")