This repository contains an implementation of the parsers described in Arc-swift: A Novel Transition System for Dependency Parsing. If you use arc-swift in your work, please cite us with the BibTeX item below.
@inproceedings{qi2017arcswift,
  title={Arc-swift: A Novel Transition System for Dependency Parsing},
  author={Qi, Peng and Manning, Christopher D.},
  booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},
  year={2017}
}
Running the parsers requires TensorFlow 1.0 or above. The data preparation script also requires a working Java 8 installation to run Stanford CoreNLP. Other Python dependencies are listed in requirements.txt, and can be installed via pip by running
pip install -r requirements.txt
We provide code for converting the Wall Street Journal section of Penn Treebank into Stanford Dependencies.
To use the code, you first need to obtain the corresponding parse trees from the LDC and make the standard train/dev/test split (Sections 02-21 for training, 22 for development, and 23 for testing). Copy the splits to utils/ptb3-wsj and name them ptb3-wsj-train.trees, ptb3-wsj-dev.trees, and ptb3-wsj-test.trees, respectively, then run the following scripts:
./setup_corenlp.sh 3.3
./convert_splits_to_depparse.sh
The first script downloads Stanford CoreNLP v3.3.0 into this directory, which is needed by the second script to convert the Penn Treebank parse trees to dependency parses.
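The resulting .conllx files are in the 10-column CoNLL-X format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), one token per line with a blank line between sentences. The fragment below is a made-up illustration only; columns are tab-separated in the real files, and the exact labels depend on the converter version.

```
1   The       _   DT    DT    _   2   det     _   _
2   market    _   NN    NN    _   3   nsubj   _   _
3   rallied   _   VBD   VBD   _   0   root    _   _
4   .         _   .     .     _   3   punct   _   _
```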
To make the parse trees usable for training, we need to keep only projective trees in the training set. To do this, go to utils/ and run
python filter_nonproj.py ptb3-wsj/ptb3-wsj-train.conllx ../src/ptb3-wsj-train.conllx
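For context, a tree is non-projective when, with the root placed at position 0, some of its arcs cross; filter_nonproj.py drops such trees from the training data. Below is a minimal sketch of that check, purely for illustration (it is not the repository's implementation):

```python
# Illustrative projectivity check for CoNLL-X trees.
# This is NOT the repository's filter_nonproj.py, just a sketch.

def is_projective(heads):
    """heads[i] is the 1-based head of token i+1 (0 denotes the root)."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            # Two arcs cross iff exactly one endpoint of one arc lies
            # strictly inside the span of the other.
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

def sentences(conllx_path):
    """Yield the list of heads for each sentence in a CoNLL-X file."""
    heads = []
    with open(conllx_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                if heads:
                    yield heads
                heads = []
            else:
                heads.append(int(line.split("\t")[6]))  # column 7 = HEAD
    if heads:
        yield heads

# Example: count non-projective trees in the training data.
# n_bad = sum(not is_projective(h) for h in sentences("ptb3-wsj/ptb3-wsj-train.conllx"))
```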
The dev and test sets shouldn't be altered, so we just copy them directly to the src directory for later use.
cp ptb3-wsj/ptb3-wsj-dev.conllx ../src
cp ptb3-wsj/ptb3-wsj-test.conllx ../src
As a final step in data preparation, we need to create a file that maps all dependency arc types and all part-of-speech (POS) types to integers. This can be achieved by running the following script under utils/
./create_mappings.sh ptb3-wsj/ptb3-wsj-train.conllx > ../src/mappings-ptb.txt
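Conceptually, the mappings file just assigns a distinct integer id to every POS tag and dependency label observed in the training data; the snippet below sketches that idea only (the actual file format is whatever create_mappings.sh writes):

```python
# Rough sketch of the idea behind the mappings file: give each POS tag
# and dependency label seen in training a stable integer id.
# The real mappings file is produced by create_mappings.sh, not this snippet.
import sys

pos_ids, rel_ids = {}, {}
with open(sys.argv[1]) as f:  # e.g. ptb3-wsj/ptb3-wsj-train.conllx
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 8:
            pos_ids.setdefault(fields[4], len(pos_ids))  # POSTAG column
            rel_ids.setdefault(fields[7], len(rel_ids))  # DEPREL column

print("%d POS tags, %d dependency labels" % (len(pos_ids), len(rel_ids)))
```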
To train the parser, we also need to create the oracle sequence of transitions for our parsers to follow. To do this, go to src/ and run
python gen_oracle_seq.py ptb3-wsj-train.conllx train.ASw.seq --transsys ASw --mappings mappings-ptb.txt
Here, mappings-ptb.txt is the mappings file we just created, ASw stands for arc-swift, and train.ASw.seq is the output file containing oracle transitions for the training data.
The processing steps for Universal Dependencies trees are very similar to those for Penn Treebank parses, except that conversion to dependency parses is not necessary. The Universal Dependencies v1.3 data used in the paper can be found here.
To train the parsers, you might want to download the pretrained GloVe vectors. For the experiments in the paper, we used the 100-dimensional embeddings trained on Wikipedia and Gigaword (glove.6B.zip). Download and unzip the GloVe files in src/, and to train the arc-swift parser, simply run
mkdir ASw_models
python train_parser.py ptb3-wsj-train.conllx --seq_file train.ASw.seq --wordvec_file glove.6B.100d.txt --vocab_file vocab.pickle --feat_file ptb_ASw_feats.pickle --model_dir ASw_models --transsys ASw --mappings_file mappings-ptb.txt
Note that if you're using a GPU, you might want to set CUDA_VISIBLE_DEVICES to tell TensorFlow which GPU device to use. vocab.pickle and ptb_ASw_feats.pickle are files that the training code will automatically generate and reuse, should you want to train the parser more than once with the same data. For other arguments the training code supports, run
python train_parser.py -h
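As an example of the CUDA_VISIBLE_DEVICES setting mentioned above, pinning training to a single GPU (the device index 0 below is just for illustration) would look like

CUDA_VISIBLE_DEVICES=0 python train_parser.py ptb3-wsj-train.conllx --seq_file train.ASw.seq --wordvec_file glove.6B.100d.txt --vocab_file vocab.pickle --feat_file ptb_ASw_feats.pickle --model_dir ASw_models --transsys ASw --mappings_file mappings-ptb.txt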
To train parsers for other transition systems, simply replace the --transsys argument with the short name of the transition system you are interested in (see the table and example below).
Short name | Transition system
---|---
ASw | arc-swift
ASd | arc-standard
AES | arc-eager-Shift
AER | arc-eager-Reduce
AH | arc-hybrid
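For instance, to train an arc-hybrid parser, regenerate the oracle with AH and substitute it in the training command. The sequence file, feature file, and model directory names below are just placeholders that follow the same naming pattern as above:

python gen_oracle_seq.py ptb3-wsj-train.conllx train.AH.seq --transsys AH --mappings mappings-ptb.txt
mkdir AH_models
python train_parser.py ptb3-wsj-train.conllx --seq_file train.AH.seq --wordvec_file glove.6B.100d.txt --vocab_file vocab.pickle --feat_file ptb_AH_feats.pickle --model_dir AH_models --transsys AH --mappings_file mappings-ptb.txt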
To train the parsers on the Universal Dependencies English Treebank, --epoch_multiplier 3 should also be used to reproduce the training settings described in the paper.
To evaluate the trained parsers, run
python eval_parser.py ptb3-wsj-dev.conllx --vocab_file vocab.pickle --model_dir ASw_models --transsys ASw --eval_dataset dev --mappings_file mappings-ptb.txt
This will generate output files in the model directory with names like ASw_models/dev_eval_beam_1_output_epoch0.txt, which contain the predicted dependency parses.
We also provide a Python implementation of labelled and unlabelled attachment score evaluation. The interface is very similar to that of the official CoNLL script; simply run
cut -d" " -f1-6 ptb3-wsj-dev.conllx | paste - ASw_models/dev_eval_beam_1_output_epoch0.txt > system_pred.conllx
python eval.py -g ptb3-wsj-dev.conllx -s system_pred.conllx
where -g stands for the gold file and -s stands for the system prediction. Note that the delimiter in the call to cut is a tab character. By default the script removes punctuation according to the gold Penn Treebank POS tags of the tokens. To generate results compatible with the official CoNLL evaluation script, make sure to use --language conll.
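For reference, the unlabelled attachment score (UAS) is the fraction of tokens whose predicted head matches the gold head, and the labelled attachment score (LAS) additionally requires the dependency label to match. The sketch below illustrates that computation only; it is not the repository's eval.py and ignores the punctuation handling described above.

```python
# Minimal UAS/LAS sketch over two aligned CoNLL-X files.
# Illustration only: the repository's eval.py additionally handles
# punctuation filtering and the CoNLL-compatible evaluation mode.

def read_parses(path):
    """Return one list of (head, deprel) pairs per sentence."""
    sents, cur = [], []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if cur:
                    sents.append(cur)
                cur = []
            else:
                fields = line.split("\t")
                cur.append((fields[6], fields[7]))  # HEAD, DEPREL columns
    if cur:
        sents.append(cur)
    return sents

def attachment_scores(gold_file, system_file):
    total = uas = las = 0
    for g_sent, s_sent in zip(read_parses(gold_file), read_parses(system_file)):
        for (gh, gr), (sh, sr) in zip(g_sent, s_sent):
            total += 1
            if gh == sh:
                uas += 1
                if gr == sr:
                    las += 1
    return uas / float(total), las / float(total)

print(attachment_scores("ptb3-wsj-dev.conllx", "system_pred.conllx"))
```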
All work contained in this package is licensed under the Apache License, Version 2.0. See the included LICENSE file.