Corpus-Linguistics

A repository to store the scripts I write to make corpus linguistics (analyzing bodies of texts) work easier.

POS Spreadsheet

  • This allows you to:
    • Build a dictionary that maps each part of speech (POS) to its own dictionary of the words carrying that tag and their frequencies
    • Sort each of those word dictionaries by frequency or alphabetically by word
    • Write the result to a spreadsheet
  • Dependencies:
    • openpyxl
    • Requires POS-tagged .txt files as input (a minimal sketch of the idea follows this list)
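
The sketch below is an illustration only, not the repository's actual code: it assumes TreeTagger/TagAnt-style tagged input with one token per line (word, POS tag, lemma separated by tabs), and the function and file names are placeholders.

```python
# Illustrative sketch only; "sample_tagged.txt" and the function names are placeholders.
import re
from collections import defaultdict
from openpyxl import Workbook

def pos_frequencies(tagged_path):
    """Map each POS tag to a {word: frequency} dictionary."""
    pos_dict = defaultdict(lambda: defaultdict(int))
    with open(tagged_path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            word, pos = parts[0].lower(), parts[1]
            pos_dict[pos][word] += 1
    return pos_dict

def write_spreadsheet(pos_dict, out_path, by_frequency=True):
    """Write one worksheet per POS, sorted by frequency or alphabetically by word."""
    wb = Workbook()
    wb.remove(wb.active)  # drop the default empty sheet
    for pos in sorted(pos_dict):
        # Excel sheet names forbid \ / * ? : [ ] and are capped at 31 characters
        title = re.sub(r'[\\/*?:\[\]]', "_", pos)[:31] or "UNTAGGED"
        ws = wb.create_sheet(title=title)
        ws.append(["word", "frequency"])
        key = (lambda kv: -kv[1]) if by_frequency else (lambda kv: kv[0])
        for word, freq in sorted(pos_dict[pos].items(), key=key):
            ws.append([word, freq])
    wb.save(out_path)

if __name__ == "__main__":
    write_spreadsheet(pos_frequencies("sample_tagged.txt"), "pos_frequencies.xlsx")
```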

abstract_nouns

remove_html_tags

  • This allows you to:
    • Remove all HTML tags from a file (a minimal sketch follows)
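
A minimal sketch of the idea using only the standard library, not the repository's script; the file names are placeholders.

```python
# Minimal sketch; "page.html" / "page.txt" are placeholder file names.
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect text content only, discarding every HTML tag."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def remove_html_tags(in_path, out_path):
    stripper = TagStripper()
    with open(in_path, encoding="utf-8") as f:
        stripper.feed(f.read())
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("".join(stripper.chunks))

if __name__ == "__main__":
    remove_html_tags("page.html", "page.txt")
```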

Standalone corpus linguistics

  • Corpus linguistics programs that use nltk and treetagger directly and do not depend on AntConc or TagAnt (no preprocessing required)
  • This allows:
    • Calculating and storing word frequencies (see the sketch after this list)
    • Keyword analysis
    • Part of speech tagging and lemmatizing
    • Aggregating data from multiple files (individually specify or give a directory)
    • Importing the functions into your own Python code, or using the graphical interface in nlp_gui.py
  • Dependencies:
    • nltk - main functionality
    • openpyxl - for spreadsheets
    • treetagger and treetaggerwrapper - part of speech tagging
    • wxPython - for graphical interface
  • Note:
    • Some functions are for my own personal use and will not work for you unless your filesystem is identical to mine.
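
As a hedged illustration of the word-frequency and aggregation steps only (this is not the module's actual API; the directory path is a placeholder), NLTK's FreqDist can accumulate counts over every .txt file in a directory:

```python
# Illustration of the frequency/aggregation idea with NLTK directly;
# not this repository's API. Requires the NLTK "punkt" tokenizer data
# (nltk.download("punkt")); "corpus/" is a placeholder directory.
import os
from nltk import FreqDist
from nltk.tokenize import word_tokenize

def directory_frequencies(directory):
    """Aggregate one word-frequency distribution over every .txt file in a directory."""
    freq = FreqDist()
    for name in os.listdir(directory):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(directory, name), encoding="utf-8") as f:
            tokens = word_tokenize(f.read().lower())
        freq.update(t for t in tokens if t.isalpha())  # keep alphabetic tokens only
    return freq

if __name__ == "__main__":
    for word, count in directory_frequencies("corpus/").most_common(20):
        print(f"{word}\t{count}")
```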
