Implement service to return term counts #24

lintool · 2013-04-17T16:05:23Z

We need a service to return term counts within a certain time interval. Need to decide:

Actual implementation (separate service? squeeze into current service?)
Granularity?
Just unigrams? Arbitrary n-grams as well?
Impact on efficiency?

lintool · 2013-04-17T16:11:35Z

We might not even need a service:

if we stored term counts in this way: termid -> [ vector of counts... ]
we can definitely post the file publicly, separately distribute term to termid mapping

the format would be pretty much identical to the google books datasets
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
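A minimal sketch of what one record in such a file could look like, assuming a tab between the termid and a space-separated vector of per-interval counts. The exact layout and the function names are hypothetical; nothing here has been agreed on in this thread.

```python
def format_counts_line(term_id, counts):
    """Serialize one record as 'termid<TAB>c0 c1 c2 ...' (hypothetical layout)."""
    return f"{term_id}\t{' '.join(str(c) for c in counts)}"


def parse_counts_line(line):
    """Parse a line written by format_counts_line back into (termid, counts)."""
    term_id, _, rest = line.rstrip("\n").partition("\t")
    return int(term_id), [int(c) for c in rest.split()]
```

The term to termid mapping would live in a separate file, so this file alone reveals counts but not the terms themselves.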

amjedbj · 2013-04-17T16:22:39Z

We need to return, for each term:

tf: frequency of the term in the document
df: number of documents that contain the term (Time aware)
cf: frequency of the term in the dataset (Time aware)
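The three quantities can be sketched as follows. This assumes "time aware" means that only documents published at or before a cutoff time are counted, which is one possible reading; the data layout (a list of `(timestamp, tokens)` pairs) and the function name are hypothetical.

```python
def term_stats(docs, term, doc_index, cutoff):
    """Return (tf, df, cf) for `term`.

    docs:      list of (timestamp, tokens) pairs (assumed layout)
    doc_index: position of the document whose tf is wanted
    cutoff:    only documents with timestamp <= cutoff count toward df and cf
    """
    # tf: occurrences of the term in the one document of interest
    tf = docs[doc_index][1].count(term)
    # restrict the collection to what was published up to the cutoff
    visible = [tokens for ts, tokens in docs if ts <= cutoff]
    # df: number of visible documents containing the term at least once
    df = sum(1 for tokens in visible if term in tokens)
    # cf: total occurrences of the term across visible documents
    cf = sum(tokens.count(term) for tokens in visible)
    return tf, df, cf
```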

stewhdcs · 2013-04-17T23:48:17Z

It would probably be useful to do this for unigrams and bigrams. The size of the file could be reduced by filtering out terms with low frequency over the whole collection, or per 'bucket' period.

We can specify buckets every N hours from the start of the corpus. N = 4/6/12 hours would probably be more than enough. At least with a smaller than necessary interval, people can easily aggregate intervals together as necessary using integer division on the bucket offset.
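A minimal sketch of the bucketing and of the aggregation by integer division on the bucket offset, assuming timestamps are epoch seconds. The function names are made up for illustration.

```python
def bucket_offset(timestamp, corpus_start, hours_per_bucket):
    """Index of the N-hour bucket that a timestamp (epoch seconds) falls into."""
    return int(timestamp - corpus_start) // (hours_per_bucket * 3600)


def aggregate(counts, factor):
    """Merge every `factor` consecutive fine-grained buckets into one coarser bucket."""
    merged = [0] * ((len(counts) + factor - 1) // factor)
    for offset, count in enumerate(counts):
        # integer division maps fine offsets 0..factor-1 to coarse bucket 0, and so on
        merged[offset // factor] += count
    return merged
```

For example, counts stored in 4-hour buckets can be turned into 12-hour buckets with `aggregate(counts, 3)`.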

We would also need the background model of document frequencies in each bucket so we can compute term probabilities as well.
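As a rough sketch of that computation, assuming the background model is simply the number of documents in each bucket and the term vector holds per-bucket document frequencies (both are assumptions about the eventual data), the per-bucket probability is a plain ratio:

```python
def term_probabilities(term_df, docs_per_bucket):
    """Per-bucket probability that a document contains the term.

    term_df:         per-bucket document frequency of one term
    docs_per_bucket: per-bucket total number of documents (the background model)
    """
    if len(term_df) != len(docs_per_bucket):
        raise ValueError("vectors must cover the same buckets")
    # empty buckets get probability 0.0 instead of dividing by zero
    return [df / n if n else 0.0 for df, n in zip(term_df, docs_per_bucket)]
```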

amjedbj · 2013-04-19T13:17:06Z

What about tweet and term statistics of the current index? Some IR baselines (e.g. Okapi BM25) require collection statistics such as average tweet length. This is a non-exhaustive list of index stats:

tf_t,d: Frequency of query term t in tweet d
pos_t,d: Position of query term t in tweet d
len_d: Number of terms in tweet d
N: Number of tweets in the current index (Time aware)
N_s,e: Number of published tweets in the time interval [s,e]
T: Number of terms in the current index (Time aware)
df_t: Number of tweets that contain term t (Time aware)
cf_t: Number of occurrences of term t in the index (Time aware)
sum(len_d): Sum of tweet length in the current index (Time aware)
avg(len_d): Average tweet length in the current index (Time aware)
max(len_d): Maximum tweet length in the current index (Time aware)
max(tf_t,d): Maximum term frequency in the current index (Time aware)

Some of this data is not reproducible on the client side unless the same tokenizer and stemmer are used.
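To illustrate why these statistics matter, here is a sketch of how a client could score one query term with Okapi BM25 from tf_t,d, df_t, N, len_d and avg(len_d). It uses the non-negative idf variant (with the +1 inside the logarithm) and the common defaults k1 = 1.2 and b = 0.75; those choices and the function name are assumptions, not something the service defines.

```python
import math


def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 contribution of a single query term to one tweet's score."""
    # idf from collection-level statistics: N and df_t
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # length normalization needs len_d and avg(len_d)
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)
```

Summing this value over all query terms gives the score of a tweet for the query.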

I defined some Thrift structs for data encoding. Optional fields must be implemented on the client side.
(see https://github.com/amjedbj/twitter-tools/blob/prototype-lintool/src/main/thrift/twittertools.thrift)

What do you think?

Latifa-AlMarri · 2013-06-26T23:45:59Z

I went through the API in the Git repository and couldn't find any code to obtain collection statistics (for example term tf, term idf, etc.).

Any Help?

milesefron · 2013-06-27T02:37:52Z

We are integrating these items into the API currently. They should be included soon.
-Miles

