Implement service to return term counts #24
Comments
We might not even need a service: if we stored term counts as termid -> [ vector of counts... ], the format would be pretty much identical to the Google Books datasets.
We need to return for each term:
It would probably be useful to do this for unigrams and bigrams. The size of the file could be reduced by filtering out terms with low frequencies over the whole collection, or per 'bucket' period. We can specify buckets every N hours from the start of the corpus. N = 4, 6, or 12 hours would probably be more than enough. With a smaller-than-necessary interval, people can easily aggregate intervals together as necessary using integer division on the bucket offset. We would also need the background model of document frequencies in each bucket so we can compute term probabilities as well.
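To make the bucketing idea concrete, here is a rough sketch in Python. It is only an illustration, not code from this repository: the function names, the 4-hour default, and the reading of "term probability" as the term's per-bucket document frequency divided by the number of documents in that bucket are all assumptions.

```python
BUCKET_HOURS = 4  # assumed bucket width N; the thread suggests 4, 6, or 12


def bucket_of(timestamp, corpus_start, bucket_hours=BUCKET_HOURS):
    """Return the bucket offset for a Unix timestamp given in seconds."""
    return int(timestamp - corpus_start) // (bucket_hours * 3600)


def aggregate(counts, factor):
    """Merge every `factor` consecutive buckets via integer division on the offset."""
    merged = [0] * ((len(counts) + factor - 1) // factor)
    for offset, count in enumerate(counts):
        merged[offset // factor] += count
    return merged


def term_probabilities(term_counts, bucket_totals):
    """Per-bucket relative frequency of a term; 0.0 where a bucket has no documents."""
    return [c / t if t else 0.0 for c, t in zip(term_counts, bucket_totals)]


# Hypothetical 4-hour counts for one term, merged into 12-hour buckets.
four_hour_counts = [3, 0, 1, 5, 2, 2]
print(aggregate(four_hour_counts, 3))  # [4, 9]
```

Storing the finest interval and letting clients call something like `aggregate` keeps the published file format simple, at the cost of a larger file.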
What about tweet and term statistics of the current index? Some IR baselines require collection statistics such as average tweet length (e.g. Okapi BM25). This is a non-exhaustive list of index stats:
Some of this data is not reproducible on the client side unless the same tokenizer and stemmer are used. I defined some Thrift structs for data encoding. Optional fields must be implemented on the client side. What do you think?
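As an illustration of why these statistics matter, here is a minimal Okapi BM25 sketch in Python. It assumes the client can obtain the collection size, the average tweet length, and per-term document frequencies. `CollectionStats` and its field names are hypothetical (they are not the Thrift structs mentioned above), and the IDF uses the common non-negative variant `log(1 + (N - df + 0.5) / (df + 0.5))`.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CollectionStats:
    """Hypothetical container for the index statistics a client would need."""
    num_docs: int             # number of tweets in the index
    avg_doc_len: float        # average tweet length in tokens
    doc_freq: Dict[str, int]  # term -> number of tweets containing it


def bm25(query: List[str], doc: List[str], stats: CollectionStats,
         k1: float = 1.2, b: float = 0.75) -> float:
    """Score one tokenized tweet against a tokenized query with Okapi BM25."""
    score = 0.0
    doc_len = len(doc)
    for term in set(query):
        tf = doc.count(term)
        if tf == 0:
            continue  # terms absent from the tweet contribute nothing
        df = stats.doc_freq.get(term, 0)
        idf = math.log(1.0 + (stats.num_docs - df + 0.5) / (df + 0.5))
        # Length normalization is where the average tweet length is needed.
        norm = tf + k1 * (1.0 - b + b * doc_len / stats.avg_doc_len)
        score += idf * tf * (k1 + 1.0) / norm
    return score
```

The scores only match a server-side ranking if `query` and `doc` are tokenized and stemmed the same way as the index, which is the reproducibility concern raised here.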
I went through the API in the Git repository and I couldn’t find any code to obtain collection statistics (for example: term tf, term idf, etc.). Any help?
We are currently integrating these items into the API. They should be included soon.
On Jun 26, 2013, at 18:46, Latifa [email protected] wrote:
We need a service to return time counts within a certain interval. Need to decide: