training tools #6

Open
eastein opened this issue Nov 8, 2011 · 1 comment

eastein commented Nov 8, 2011

Optional argument to chatlight: a dictionary file, ordered by counted occurrences in a specific corpus. In JSON format like so:

[
["hello", 34],
["world", 56]
]

Each "tuple" is a word and its incidence count in a corpus of text that's considered representative. Words that are considered valid but did not appear in the corpus should be kept as records with an incidence of 0, not dropped.
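To make the format concrete, here is a minimal sketch of reading such a file into a word-to-count mapping. The name `load_dictionary` and the validation rules are my assumptions, not anything chatlight defines; the third record is a made-up example of a zero-incidence word.

```python
import json

def load_dictionary(text):
    """Parse the JSON dictionary text into a {word: count} mapping.

    Hypothetical helper: the name and the checks are assumptions.
    """
    counts = {}
    for word, count in json.loads(text):
        if not isinstance(word, str) or not isinstance(count, int) or count < 0:
            raise ValueError("bad record: %r" % ([word, count],))
        # Zero-incidence words are kept, so "valid but unseen in the corpus"
        # stays distinguishable from "not in the dictionary at all".
        counts[word] = count
    return counts

corpus_counts = load_dictionary('[["hello", 34], ["world", 56], ["zymurgy", 0]]')
```

Keeping the 0 records means a later lookup can tell a known-but-rare word from an unknown one.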

Given this file and a second path for state (an absolute filename, in a directory that already exists, that JSON is read from and written to), record the incidence in chat of non-categorized words.
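A rough sketch of the recording step, under several assumptions: the state file is a flat JSON object of word to count, words are split with a simple regex, and "categorized" is represented by a hypothetical set passed in by the caller (the issue does not say how chatlight decides that). `record_chat_words` is an illustrative name only.

```python
import json
import os
import re
import tempfile

def record_chat_words(state_path, message, categorized):
    """Add counts for words in `message` not in `categorized` to the state file.

    Hypothetical helper; the state layout {word: count} is an assumption.
    """
    if not os.path.isabs(state_path):
        raise ValueError("state path must be absolute")
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = {}
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    for word in re.findall(r"[a-z']+", message.lower()):
        if word not in categorized:
            state[word] = state.get(word, 0) + 1
    # Write to a temp file in the same (already existing) directory, then
    # rename it over the old state so a crash cannot leave a partial file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(state_path))
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, state_path)
    return state
```

Calling it once per chat line with the same `state_path` accumulates counts across runs.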

Another utility should exist that takes the same two files and lists the non-categorized words seen in chat, ordered by some hybrid (TBD) of rarity in the corpus and commonality in chat.
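Since the hybrid is still TBD, this is only one possible placeholder: score each word as its chat count divided by one plus its corpus count, so words that are common in chat but rare in the corpus float to the top. The function name, the formula, and treating words missing from the dictionary as count 0 are all assumptions.

```python
def rank_words(corpus_counts, chat_counts):
    """Order chat words by a placeholder score: chat count / (1 + corpus count).

    Hypothetical ranking; the real hybrid is TBD in this issue.
    """
    def score(word):
        # Words absent from the dictionary are treated like zero-incidence
        # words here, which is an assumption, not a decision from the issue.
        return chat_counts[word] / (1.0 + corpus_counts.get(word, 0))
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(chat_counts, key=lambda w: (-score(w), w))

ranked = rank_words(
    {"hello": 34, "world": 56, "zymurgy": 0},
    {"hello": 10, "zymurgy": 3, "world": 1},
)
```

Swapping in a different score only means changing the inner `score` function.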


eastein commented Dec 1, 2011

Mike Katsevman's suggestion:

Dump as tf-idf, which will guess conversation topics.
Use the top collocations with stopwords removed, i.e. the most common pairs or triples of words. Give a human the top 10 and let them categorize.

This is a paraphrase of what I think he told me, so grain of salt.
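For reference, a small sketch of both ideas as I understand them, using only the standard library. The stopword list, the tokenizer, and the function names are illustrative assumptions, and each chat message is treated as one "document" for tf-idf, which may not be what was meant.

```python
import math
import re
from collections import Counter

# Illustrative stopword list only; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it"}

def tokenize(text):
    """Lowercase, split into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def top_collocations(messages, n=2, limit=10):
    """Return the `limit` most common n-word runs across all messages.

    Stopwords are removed first, so words on either side of a stopword
    count as adjacent.
    """
    counts = Counter()
    for message in messages:
        words = tokenize(message)
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts.most_common(limit)

def tf_idf(messages):
    """Per message, score each word as term frequency * log(N / document frequency)."""
    docs = [Counter(tokenize(m)) for m in messages]
    df = Counter(word for doc in docs for word in doc)
    total = len(docs)
    return [
        {w: (c / sum(doc.values())) * math.log(total / df[w]) for w, c in doc.items()}
        for doc in docs
    ]

messages = ["the red light is on", "red light off", "turn the light off"]
pairs = top_collocations(messages, n=2, limit=2)
scores = tf_idf(messages)
```

A word that shows up in every message gets a tf-idf of zero, so only words that set a message apart score highly. Passing `n=3` to `top_collocations` gives triples instead of pairs.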
