This repo includes code for converting trees in the Universal Dependencies (UD) formalism into the syntactic formalism of the Cambridge Grammar of the English Language (CGEL). CGEL gold data in the repo is annotated by Brett Reynolds (@brettrey3 on Twitter, who also runs @DailySyntaxTree).
The resulting dataset has two portions: a small set of sentences with both gold CGEL and UD trees, and a larger set of trees from EWT with complete CGEL silver parses.
The gold data resides in 4 files:
datasets/{twitter.cgel, twitter_ud.conllu}
: CGEL gold trees from Twitter with corresponding UD trees (silver from Stanza then manually corrected by Nathan Schneider)
datasets/{ewt.cgel, ewt_ud.conllu}
: UD gold trees from EWT train set, with corresponding CGEL trees (manually annotated by Brett Reynolds)
Both portions were revised with the aid of consistency-checking scripts.
Other subdirectories contain older/silver versions of the trees.
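For a quick look at the UD side of the gold data, the `.conllu` files can be read with plain Python. The sketch below is not part of this repo: `read_conllu` and `FIELDS` are hypothetical names, and the only assumption is the standard CoNLL-U layout (ten tab-separated columns per token, `#` comment lines, a blank line between sentences). The sample sentence is made up for illustration.

```python
# Minimal CoNLL-U reader (illustrative sketch, not the repo's own loader).
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Split CoNLL-U text into sentences; each sentence is a list of token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        tokens.append(dict(zip(FIELDS, cols)))
    if tokens:
        # The file may not end with a blank line.
        sentences.append(tokens)
    return sentences


# Made-up sample in CoNLL-U layout, built with explicit tabs.
SAMPLE = "\n".join([
    "# text = I run.",
    "\t".join(["1", "I", "I", "PRON", "PRP", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "run", "run", "VERB", "VBP", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", ".", "_", "2", "punct", "_", "_"]),
    "",
])

sentences = read_conllu(SAMPLE)
for tok in sentences[0]:
    print(tok["id"], tok["form"], tok["deprel"], tok["head"])
```

To apply this to the gold data, pass the contents of a file such as `datasets/twitter_ud.conllu` (read as UTF-8) to `read_conllu` in place of `SAMPLE`.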
To load the CGEL trees for scripting, use the cgel.py library.
cgel.py
: library that implements classes for CGEL trees and the nodes within them, incl. helpful functions for printing and processing trees in PENMAN notation
clausetype.py
: enriches UD trees with CGEL clause type features
constituent.py
:
graph.py
: generates figures for papers
parse_forest.py
: parses original trees made by Brett Reynolds in LaTeX using the forest package into machine-readable format
parse.py
: ditto but for older trees using the parsetree package
ud_to_cgel.py
: converts UD trees (from English EWT treebank) to CGEL format using rule-based system
validate_trees.py
: script to check the well-formedness of trees
Folders
analysis/
: scripts for analysing the datasets
conversions/
: contains outputs and logs from ud_to_cgel.py
convertor/
: includes conversion rules in DepEdit script, with a simple Flask web interface for local testing in the browser (English text > automatic UD w/ Stanza > CGEL)
datasets/
: all the final output datasets, incl. gold UD for the gold CGEL data (more detailed description TBD)
figures/
: figures generated by graph.py
trees/
: input trees in LaTeX