This code accompanies the thesis on embedding-based extractive summarization from Blendle Research, written by Lucas de Haas. It can be used to reproduce all experimental results exactly, and therefore includes implementations of several summarization algorithms that were not previously available.
Set the summarization function(s) in summarizer.py, and then run main.py to output results.
Some files are not included:
- The Google word2vec model is not included in this repo and must be downloaded separately; main.py expects it in models/word2vec/google/ and needs it to run out-of-the-box.
- The DUC-2002 and TAC-2008 datasets are not included, as access can only be granted by NIST (see NIST for more information on obtaining access).
- The Opinosis dataset is included, and main.py is configured to run on this dataset by default.
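
Before running main.py, it can help to confirm that the word2vec model is where the code expects it. The following is a minimal sketch, not part of this repo: the helper name `model_dir_ready` is hypothetical, and since the model's file name is not specified here, the check only looks for any file inside models/word2vec/google/.

```python
from pathlib import Path

# Directory named in this README; the model's file name is not specified there,
# so this sketch only checks that the directory holds at least one file.
MODEL_DIR = Path("models/word2vec/google")


def model_dir_ready(model_dir=MODEL_DIR):
    """Return True if model_dir exists and contains at least one regular file.

    Hypothetical helper for illustration; main.py does not call it.
    """
    model_dir = Path(model_dir)
    return model_dir.is_dir() and any(p.is_file() for p in model_dir.iterdir())
```

Calling `model_dir_ready()` from the repo root returns False until the model has been downloaded and placed in that directory.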
Dependencies:
- python >= 3.5
- pythonrouge
- regex
- scipy
- networkx
- gensim
- xmltodict
- numpy
- pattern
- nltk
- beautifulsoup4
- scikit_learn
- torch
- permute
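
To see which of the packages listed above are missing from the current environment, a small check along these lines may be useful. It is a sketch under assumptions: the helper name `missing_packages` is hypothetical, and the import names on the right-hand side of the mapping (notably `bs4` for beautifulsoup4 and `sklearn` for scikit_learn) are the usual ones for these packages rather than something this repo specifies.

```python
import importlib.util

# Package name as listed above -> module name used in an import statement.
# The import names are assumptions based on the packages' usual conventions.
REQUIRED = {
    "pythonrouge": "pythonrouge",
    "regex": "regex",
    "scipy": "scipy",
    "networkx": "networkx",
    "gensim": "gensim",
    "xmltodict": "xmltodict",
    "numpy": "numpy",
    "pattern": "pattern",
    "nltk": "nltk",
    "beautifulsoup4": "bs4",
    "scikit_learn": "sklearn",
    "torch": "torch",
    "permute": "permute",
}


def missing_packages(required=REQUIRED):
    """Return the listed package names whose module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]
```

An empty list from `missing_packages()` means every module was found; it does not check package versions or the Python version itself.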