A repository for the scripts I write to make corpus linguistics (the analysis of bodies of text) easier.
- This allows you to:
- Create a dictionary that maps each part of speech (POS) to a dictionary of the words carrying that POS and their frequencies (see the sketch below)
- Sort that dictionary by frequency or alphabetically by word
- Store that dictionary in a spreadsheet
- Dependencies:
- openpyxl
- You need tagged txt files (e.g. TagAnt or TreeTagger output) to use this
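A minimal sketch of the idea, assuming one token per line in `word<TAB>POS<TAB>lemma` format (the TagAnt/TreeTagger default); the function names here are illustrative, not the repo's actual API:

```python
from collections import defaultdict
from openpyxl import Workbook

def pos_frequency_dict(tagged_path):
    """Build {POS: {word: count}} from a tagged txt file.

    Assumes one token per line as word<TAB>POS<TAB>lemma;
    adjust the split if your tagger writes a different layout.
    """
    freq = defaultdict(lambda: defaultdict(int))
    with open(tagged_path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) >= 2:
                word, pos = parts[0].lower(), parts[1]
                freq[pos][word] += 1
    return freq

def save_to_spreadsheet(freq, out_path, by_frequency=True):
    """Write one worksheet per POS, sorted by frequency or alphabetically."""
    wb = Workbook()
    wb.remove(wb.active)  # drop the default empty sheet
    for pos, words in freq.items():
        ws = wb.create_sheet(title=pos[:31])  # Excel caps sheet names at 31 chars
        key = (lambda kv: -kv[1]) if by_frequency else (lambda kv: kv[0])
        for word, count in sorted(words.items(), key=key):
            ws.append([word, count])
    wb.save(out_path)
```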
- This allows you to:
- Get the abstract nouns in a corpus by searching for various suffixes, as listed on the following page: https://learningenglishgrammar.wordpress.com/suffixes/suffixes-and-how-they-form-abstract-nouns/ (see the sketch below)
- Sort that dictionary by frequency or alphabetically by word
- Store that dictionary in a spreadsheet
- Dependencies:
- openpyxl
- You need wordlist files created using AntConc (http://www.laurenceanthony.net/software/antconc/)
- Note:
- I have done nothing to correct for false positives (words that end with the suffixes I'm looking for but that aren't abstract nouns). If you find false positives, please let me know on the issues page, and I'll make a list of them.
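For illustration, a sketch of the suffix matching, assuming AntConc's default `rank<TAB>frequency<TAB>word` wordlist export; the suffix list is abridged from the page linked above, and the function name is hypothetical:

```python
# Abridged from the suffix list on the linked page; the script's full
# list may differ, and no false-positive filtering is attempted.
ABSTRACT_SUFFIXES = ("ness", "ity", "tion", "sion", "ment",
                     "ance", "ence", "ship", "dom", "hood")

def abstract_noun_candidates(wordlist_path):
    """Collect {word: frequency} for words ending in an abstract-noun suffix.

    Assumes rank<TAB>frequency<TAB>word lines, AntConc's default
    wordlist layout; adjust the indices for other exports.
    """
    hits = {}
    with open(wordlist_path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) < 3 or not parts[1].isdigit():
                continue  # skip headers, blanks, and malformed lines
            frequency, word = int(parts[1]), parts[2].lower()
            if word.endswith(ABSTRACT_SUFFIXES):
                hits[word] = frequency
    return hits
```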
- This allows you to:
- Remove all HTML tags from a file
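A minimal sketch of the approach, using only the standard library (the function name is illustrative):

```python
import re

def strip_html_tags(in_path, out_path):
    """Delete everything between < and > in a file.

    This blunt regex matches the script's stated goal, but it will also
    remove literal < ... > sequences that are not real tags.
    """
    with open(in_path, encoding="utf-8") as f:
        text = f.read()
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(re.sub(r"<[^>]+>", "", text))
```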
- Corpus linguistics programs that use nltk and treetagger but are not dependent on AntConc or TagAnt (no preprocessing required)
- This allows you to:
- Calculate and store word frequencies
- Run keyword analysis (a keyness sketch follows this list)
- Tag parts of speech and lemmatize (see the tagging sketch after the dependency list)
- Aggregate data from multiple files (specify files individually or give a directory)
- Import the functions into your own Python code, or use the graphical interface in nlp_gui.py
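As a sketch of the frequency and keyword steps: NLTK's `FreqDist` handles the counting, and keyness between a target and a reference corpus is commonly computed with Dunning's log-likelihood. Whether these scripts use exactly this statistic is an assumption, and the function name is hypothetical:

```python
import math
from nltk import FreqDist, word_tokenize  # requires nltk.download('punkt') once

def keyness(target_text, reference_text):
    """Rank target-corpus words by Dunning log-likelihood keyness."""
    t_fd = FreqDist(w.lower() for w in word_tokenize(target_text))
    r_fd = FreqDist(w.lower() for w in word_tokenize(reference_text))
    n1, n2 = t_fd.N(), r_fd.N()
    scores = {}
    for word, a in t_fd.items():
        b = r_fd[word]
        e1 = n1 * (a + b) / (n1 + n2)  # expected count in target
        e2 = n2 * (a + b) / (n1 + n2)  # expected count in reference
        ll = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0))
        scores[word] = ll
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```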
- Dependencies:
- nltk - main functionality
- openpyxl - for spreadsheets
- treetagger and treetaggerwrapper - part of speech tagging
- wxPython - for graphical interface
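And a sketch of the tagging/lemmatizing step through treetaggerwrapper, assuming a working TreeTagger installation that the wrapper can locate (e.g. via its TAGDIR option); the function name is illustrative:

```python
import treetaggerwrapper

def tag_and_lemmatize(text, lang="en"):
    """Return (word, POS, lemma) triples for a text via TreeTagger."""
    tagger = treetaggerwrapper.TreeTagger(TAGLANG=lang)
    tags = treetaggerwrapper.make_tags(tagger.tag_text(text))
    # make_tags yields Tag namedtuples; NotTag entries (odd tokens)
    # lack a lemma field, so filter them out.
    return [(t.word, t.pos, t.lemma) for t in tags if hasattr(t, "lemma")]
```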
- Note:
- Some functions are for my own personal use and will not work for you unless your filesystem is identical to mine.