training tools #6

eastein · 2011-11-08T05:48:57Z

Optional argument to chatlight: a dictionary file, ordered by counted occurrences in a specific corpus. In JSON format like so:

[
["hello", 34],
["world", 56]
]

Where the "tuples" are a word and its incidence count in a corpus of text that's considered representative. Words that are considered valid but did not appear in the corpus should be kept as records with an incidence of 0, not dropped.
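A minimal sketch of reading that format, assuming Python and assuming the whole file fits in memory. The name `parse_dictionary` is hypothetical and not part of chatlight; the validation rules are a guess at what "valid record" should mean.

```python
import json

def parse_dictionary(text):
    """Parse the JSON list of [word, count] records into a {word: count} dict."""
    counts = {}
    for record in json.loads(text):
        # Each record is expected to be a two-element [word, count] list.
        word, count = record
        if not isinstance(word, str) or not isinstance(count, int) or count < 0:
            raise ValueError("bad record: %r" % (record,))
        counts[word] = count
    return counts

# "zymurgy" stands in for a valid word that never appeared in the corpus.
sample = '[["hello", 34], ["world", 56], ["zymurgy", 0]]'
dictionary = parse_dictionary(sample)
```

Keeping the zero-count record means a lookup can tell "valid but unseen in the corpus" apart from "not in the dictionary at all".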

Given this file and a second path to keep state in (an absolute filename, in a directory that already exists, that JSON is read from and written to), record the incidence in chat of non-categorized words.
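One way the recording step might look, as a sketch only. It assumes the state file is a flat JSON object of word to count, that the caller supplies the set of already-categorized words, and that lowercased alphabetic runs are good enough tokens. `record_line` and the state layout are hypothetical; the issue does not define either.

```python
import json
import os
import re

def record_line(state_path, line, categorized):
    """Add one chat line's non-categorized words to the JSON state file."""
    # Start from the existing state, or from empty on first use.
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = {}
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    for word in re.findall(r"[a-z']+", line.lower()):
        if word not in categorized:
            state[word] = state.get(word, 0) + 1
    with open(state_path, "w") as f:
        json.dump(state, f)
    return state
```

Rewriting the file on every line is simple but slow for busy channels; batching writes would be an obvious refinement.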

Another utility should exist that takes the same two files and lists the non-categorized words seen in chat, ordered by some hybrid (TBD) of rarity in the corpus and commonality in chat.
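Since the hybrid is still TBD, the following is just one candidate to make the idea concrete: chat count divided by corpus count plus one, so words that are frequent in chat but rare in the corpus float to the top. `rank_words` is a hypothetical name, and treating a word missing from the dictionary as corpus count 0 is an assumption.

```python
def rank_words(chat_counts, corpus_counts):
    """Order chat words by chat_count / (corpus_count + 1), highest first."""
    def score(word):
        # The +1 avoids division by zero for zero-incidence words.
        return chat_counts[word] / (corpus_counts.get(word, 0) + 1)
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(chat_counts, key=lambda w: (-score(w), w))

chat = {"lamp": 4, "hello": 4, "zymurgy": 1}
corpus = {"hello": 34, "lamp": 1, "zymurgy": 0}
ranking = rank_words(chat, corpus)
```

Here "hello" ranks last despite matching "lamp" in chat, because it is common in the corpus.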

eastein · 2011-12-01T06:58:17Z

Mike Katsevman's suggestion:

Dump as tf-idf, which will guess conversation topics.
Use top collocations with stopwords removed, i.e. the most common pairs or triples of words. Give a human the top 10 and let them categorize.
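A rough sketch of both suggestions, under several assumptions: the stopword list is a tiny placeholder, a "document" for tf-idf is whatever unit of chat the caller passes in, and the tf-idf variant is the plain `tf * log(N / df)` form. All names here are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it"}

def tokenize(text):
    """Lowercase, split into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def top_collocations(lines, n=2, k=10):
    """Return the k most common n-word sequences (pairs by default)."""
    counts = Counter()
    for line in lines:
        # Stopwords are removed first, so words they separated become adjacent.
        words = tokenize(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common(k)

def tf_idf(documents):
    """Return one {word: score} dict per document, scored as tf * log(N / df)."""
    tokenized = [tokenize(d) for d in documents]
    df = Counter()
    for words in tokenized:
        df.update(set(words))  # document frequency: count each word once per document
    n_docs = len(documents)
    scores = []
    for words in tokenized:
        tf = Counter(words)
        scores.append({w: (c / len(words)) * math.log(n_docs / df[w])
                       for w, c in tf.items()})
    return scores
```

Calling `top_collocations(lines, n=3)` gives triples instead of pairs; the top 10 of either could then be handed to a human to categorize.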

This is a paraphrase of what I think he told me, so grain of salt.

eastein mentioned this issue Jan 7, 2012

determine coherency of conversational subject #13

Open
