
BrettRey/cgel

 
 


cgel

This repo includes code for converting Universal Dependencies (UD) trees into the syntactic formalism of the Cambridge Grammar of the English Language (CGEL). The gold CGEL data in the repo is annotated by Brett Reynolds (@brettrey3 on Twitter, who also runs @DailySyntaxTree).

Datasets

The resulting dataset has two portions: a small set of sentences with both gold CGEL and UD trees, and a larger set of trees from EWT with complete silver CGEL parses.

The gold data resides in four files:

  • datasets/{twitter.cgel, twitter_ud.conllu}: CGEL gold trees from Twitter with corresponding UD trees (silver from Stanza then manually corrected by Nathan Schneider)
  • datasets/{ewt.cgel, ewt_ud.conllu}: UD gold trees from EWT train set, with corresponding CGEL trees (manually annotated by Brett Reynolds)

Both portions were revised with the aid of consistency-checking scripts.

Other subdirectories contain older/silver versions of the trees.

To load the CGEL trees for scripting, use the cgel.py library.
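
The UD side of the gold data (the .conllu files) is in the standard tab-separated CoNLL-U format, so it can be read without extra dependencies. The following is a minimal sketch, not part of this repo: the function name `read_conllu` and the inline sample sentence are made up for illustration, and multiword-token and empty-node lines are kept as ordinary rows rather than handled specially.

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dicts.

    Hypothetical helper for illustration; it is not defined in cgel.py.
    """
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        else:
            sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:
        yield sentence


# Invented two-token sample; real input would come from a file such as
# datasets/twitter_ud.conllu opened with encoding="utf-8".
SAMPLE = (
    "# text = I agree\n"
    "1\tI\tI\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tagree\tagree\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["id"], token["form"], token["deprel"], token["head"])
```

Because `read_conllu` accepts any iterable of lines, an open file handle can be passed to it directly in place of the sample string.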

Structure

  • cgel.py: library that implements classes for CGEL trees and the nodes within them, incl. helpful functions for printing and processing trees in PENMAN notation
  • clausetype.py: enriches UD trees with CGEL clause type features
  • constituent.py:
  • graph.py: generates figures for papers
  • parse_forest.py: parses original trees made by Brett Reynolds in LaTeX using the forest package into machine-readable formats
  • parse.py: like parse_forest.py, but for older trees typeset with the parsetree package
  • ud_to_cgel.py: converts UD trees (from English EWT treebank) to CGEL format using rule-based system
  • validate_trees.py: script to check the well-formedness of trees
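
To give a feel for PENMAN-style bracketing of a constituency tree, here is a small self-contained sketch. It does not use cgel.py: the `ToyNode` class, the `to_penman` function, the `:t` attribute for token text, and the heavily simplified example tree are all assumptions made for illustration, and the library's actual classes and output may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyNode:
    """Hypothetical node: a category, the function it bears in its parent,
    and an optional token string for lexical nodes."""
    category: str
    function: Optional[str] = None
    text: Optional[str] = None
    children: List["ToyNode"] = field(default_factory=list)


def to_penman(node: ToyNode, depth: int = 0) -> str:
    """Render a node and its descendants as indented, bracketed text."""
    indent = "    " * depth
    prefix = f":{node.function} " if node.function else ""
    out = f"{indent}{prefix}({node.category}"
    if node.text is not None:
        out += f' :t "{node.text}"'
    for child in node.children:
        out += "\n" + to_penman(child, depth + 1)
    return out + ")"


# Simplified toy analysis of "I agree" (not a gold CGEL tree).
tree = ToyNode("Clause", children=[
    ToyNode("NP", "Subj", children=[ToyNode("N", "Head", text="I")]),
    ToyNode("VP", "Head", children=[ToyNode("V", "Head", text="agree")]),
])

print(to_penman(tree))
```

Each child is introduced by its function label (`:Subj`, `:Head`) followed by a parenthesised category, so nesting depth mirrors constituent structure.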

Folders

  • analysis/: scripts for analysing the datasets
  • conversions/: contains outputs and logs from ud_to_cgel.py
  • convertor/: conversion rules written as a DepEdit script, plus a simple Flask web interface for local testing in the browser (English text → automatic UD parse with Stanza → CGEL)
  • datasets/: all the final output datasets, incl. gold UD for the gold CGEL data (more detailed description TBD)
  • figures/: figures generated by graph.py
  • trees/: input trees in LaTeX

