A repository for the scripts I write to make corpus linguistics (the analysis of bodies of text) easier.
- This allows you to:
- Create a dictionary that maps each part of speech (POS) to a dictionary of the words carrying that POS and their frequencies (see the sketch below)
- Sort that dictionary by frequency or alphabetically by word
- Store that dictionary in a spreadsheet
- Dependencies:
- openpyxl
- You need tagged txt files (e.g. TagAnt or TreeTagger output) to use this
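A minimal sketch of the idea, assuming one token per line in `word<TAB>POS<TAB>lemma` format (the TagAnt/TreeTagger default); the function names here are illustrative, not the repo's actual API:

```python
from collections import defaultdict
from openpyxl import Workbook

def pos_frequency_dict(tagged_path):
    """Build {POS: {word: count}} from a tagged txt file.

    Assumes one token per line as word<TAB>POS<TAB>lemma;
    adjust the split if your tagger writes a different layout.
    """
    freq = defaultdict(lambda: defaultdict(int))
    with open(tagged_path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) >= 2:
                word, pos = parts[0].lower(), parts[1]
                freq[pos][word] += 1
    return freq

def save_to_spreadsheet(freq, out_path, by_frequency=True):
    """Write one worksheet per POS, sorted by frequency or alphabetically."""
    wb = Workbook()
    wb.remove(wb.active)  # drop the default empty sheet
    for pos, words in freq.items():
        ws = wb.create_sheet(title=pos[:31])  # Excel caps sheet names at 31 chars
        key = (lambda kv: -kv[1]) if by_frequency else (lambda kv: kv[0])
        for word, count in sorted(words.items(), key=key):
            ws.append([word, count])
    wb.save(out_path)
```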
- This allows you to:
- Get the abstract nouns in a corpus by searching for various suffixes, as listed on the following page: https://learningenglishgrammar.wordpress.com/suffixes/suffixes-and-how-they-form-abstract-nouns/ (see the sketch below)
- Sort that dictionary by frequency or alphabetically by word
- Store that dictionary in a spreadsheet
- Dependencies:
- openpyxl
- You need wordlist files created using AntConc (http://www.laurenceanthony.net/software/antconc/)
- Note:
- I have done nothing to correct for false positives (words that end with the suffixes I'm looking for but that aren't abstract nouns). If you find false positives, please let me know on the issues page, and I'll make a list of them.
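For illustration, a sketch of the suffix matching, assuming AntConc's default `rank<TAB>frequency<TAB>word` wordlist export; the suffix list is abridged from the page linked above, and the function name is hypothetical:

```python
# Abridged from the suffix list on the linked page; the script's full
# list may differ, and no false-positive filtering is attempted.
ABSTRACT_SUFFIXES = ("ness", "ity", "tion", "sion", "ment",
                     "ance", "ence", "ship", "dom", "hood")

def abstract_noun_candidates(wordlist_path):
    """Collect {word: frequency} for words ending in an abstract-noun suffix.

    Assumes rank<TAB>frequency<TAB>word lines, AntConc's default
    wordlist layout; adjust the indices for other exports.
    """
    hits = {}
    with open(wordlist_path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) < 3 or not parts[1].isdigit():
                continue  # skip headers, blanks, and malformed lines
            frequency, word = int(parts[1]), parts[2].lower()
            if word.endswith(ABSTRACT_SUFFIXES):
                hits[word] = frequency
    return hits
```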
- This allows you to:
- Remove all HTML tags from a file
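A minimal sketch of the approach, using only the standard library (the function name is illustrative):

```python
import re

def strip_html_tags(in_path, out_path):
    """Delete everything between < and > in a file.

    This blunt regex matches the script's stated goal, but it will also
    remove literal < ... > sequences that are not real tags.
    """
    with open(in_path, encoding="utf-8") as f:
        text = f.read()
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(re.sub(r"<[^>]+>", "", text))
```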
- Corpus linguistics programs that use nltk and treetagger but are not dependent on AntConc or TagAnt (no preprocessing required)
- This allows you to:
- Calculate and store word frequencies
- Run keyword analysis (a keyness sketch follows this list)
- Tag parts of speech and lemmatize (see the tagging sketch after the dependency list)
- Aggregate data from multiple files (specify files individually or give a directory)
- Import the functions into your own Python code, or use the graphical interface in nlp_gui.py
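As a sketch of the frequency and keyword steps: NLTK's `FreqDist` handles the counting, and keyness between a target and a reference corpus is commonly computed with Dunning's log-likelihood. Whether these scripts use exactly this statistic is an assumption, and the function name is hypothetical:

```python
import math
from nltk import FreqDist, word_tokenize  # requires nltk.download('punkt') once

def keyness(target_text, reference_text):
    """Rank target-corpus words by Dunning log-likelihood keyness."""
    t_fd = FreqDist(w.lower() for w in word_tokenize(target_text))
    r_fd = FreqDist(w.lower() for w in word_tokenize(reference_text))
    n1, n2 = t_fd.N(), r_fd.N()
    scores = {}
    for word, a in t_fd.items():
        b = r_fd[word]
        e1 = n1 * (a + b) / (n1 + n2)  # expected count in target
        e2 = n2 * (a + b) / (n1 + n2)  # expected count in reference
        ll = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0))
        scores[word] = ll
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```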
- Dependencies:
- nltk - main functionality
- openpyxl - for spreadsheets
- treetagger and treetaggerwrapper - part of speech tagging
- wxPython - for graphical interface
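And a sketch of the tagging/lemmatizing step through treetaggerwrapper, assuming a working TreeTagger installation that the wrapper can locate (e.g. via its TAGDIR option); the function name is illustrative:

```python
import treetaggerwrapper

def tag_and_lemmatize(text, lang="en"):
    """Return (word, POS, lemma) triples for a text via TreeTagger."""
    tagger = treetaggerwrapper.TreeTagger(TAGLANG=lang)
    tags = treetaggerwrapper.make_tags(tagger.tag_text(text))
    # make_tags yields Tag namedtuples; NotTag entries (odd tokens)
    # lack a lemma field, so filter them out.
    return [(t.word, t.pos, t.lemma) for t in tags if hasattr(t, "lemma")]
```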
- Note:
- Some functions are for my own personal use and will not work for you unless your filesystem is identical to mine.