
TREC 2015 Microblog Track: Real-Time Filtering Task Guidelines

The goal of the real-time filtering task in the microblog track is to explore technologies for monitoring a stream of social media posts with respect to a user's interest profile. Note that the conception of an interest profile is different from a typical ad hoc query because there isn't an actual information need. Instead, the goal is for a system to push (i.e., recommend, suggest) interesting content to a user.

The notion of "what's interesting" can be better operationalized by considering two concrete task models:

Scenario A: Push notifications on a mobile phone. Content that is identified as interesting by a system based on the user's interest profile might be shown to the user as a notification on his or her mobile phone. The expectation is that such notifications are triggered a relatively short time after the content is generated. It is assumed that the notification messages are relatively short.

Scenario B: Periodic email digest. Content that is identified as interesting by a system based on the user's interest profile might be aggregated into an email digest that is periodically sent to a user. It is assumed that each item of content is relatively short; one might think of these as "personalized headlines".

Since in both scenarios it is assumed that the content items delivered to the users are relatively short, "interestingness" can be operationalized as the degree to which the user would want to learn more about the content (e.g., read a longer story).

Evaluation Setup

In the real-time filtering task, the content items are tweets. During the evaluation period, participants' systems will "listen" to Twitter's live tweet sample stream and identify interesting tweets with respect to users' interest profiles (the equivalent of "topics" in other TREC tracks). For convenience, we will refer to this as "tracking" the interest profile. The evaluation time period is as follows:

  • Evaluation start: Monday, July 20, 2015, 00:00:00 UTC
  • Evaluation end: Wednesday, July 29, 2015, 23:59:59 UTC

Note that the evaluation time period is in UTC. Track participants are responsible for translating UTC into their local time to align with the evaluation start and end times.

The final submissions are due (via NIST's upload site) on Thursday, July 30, 2015. This effectively means that you have until the organizers start work on the morning of July 31 (~7am, US east coast time) to finish processing and uploading your results. No extensions will be granted!

NIST will develop on the order of 250 interest profiles, which the participants will be responsible for tracking. After the evaluation period, based on post hoc analysis, NIST will select a set of approximately 50 topics that will actually be assessed via the standard pooling methodology. This procedure anticipates challenges in topic development, in that NIST assessors will need to, in some sense, "predict the future" (or at least anticipate future "interesting" events that might occur during the evaluation period). It is expected that many of the interest profiles might prove to be unusable for a variety of reasons (e.g., too few relevant documents). To the extent possible, the assessor who created the topic will assess the tweets.

It is anticipated that NIST will release a couple of sample interest profiles at the end of May.

It is anticipated that the complete interest profiles will be released in late June.

IMPORTANT NOTE: This means that during the evaluation period, track participants must maintain a running system that continuously monitors the tweet sample stream. The track organizers will provide boilerplate code and reference baselines, but it is the responsibility of each individual team to run their systems (and cope with crashes, network glitches, power outages, etc.). A starting point for boilerplate code for sampling the public Twitter stream can be found here.
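For concreteness, here is a minimal sketch of what such a continuously running listener might look like, assuming the tweepy 3.x streaming API and placeholder credentials; the boilerplate code linked above remains the authoritative starting point.

```python
# Minimal sketch: listen to Twitter's public sample stream and archive
# English tweets as JSON lines. Assumes the tweepy 3.x API; the
# credential values below are placeholders.
import json
import tweepy

CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_SECRET = "..."


class SampleListener(tweepy.StreamListener):
    def on_status(self, status):
        # The track only considers English tweets.
        if getattr(status, "lang", None) == "en":
            with open("sample-stream.jsonl", "a") as out:
                out.write(json.dumps(status._json) + "\n")

    def on_error(self, status_code):
        # Returning False on HTTP 420 disconnects instead of hammering
        # the endpoint while rate-limited.
        return status_code != 420


auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
stream = tweepy.Stream(auth=auth, listener=SampleListener())
stream.sample()  # statuses/sample: the public ~1% sample stream
```

In practice, a process like this should be wrapped in restart logic (e.g., supervised by cron or systemd), since crashes and network glitches over a ten-day evaluation period are to be expected.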

Each system will record interesting tweets that are identified with respect to each interest profile. This information will be stored in a plain text file (format details to be specified later), which will constitute a "run" that is uploaded to NIST servers for assessment after the evaluation period. Note that although systems are expected to conform to the constraints imposed by the evaluation scenarios below, there is no enforcement mechanism because the selected tweets are submitted to NIST as a batch after the evaluation period ends.

Scenario A: Push notifications on a mobile phone. A system for this scenario (a "type A" system) is allowed to return a maximum of 10 tweets per day per interest profile (and may choose to return fewer than ten tweets). If a system returns more than 10 tweets for a given day (per interest profile), NIST will ignore all but the first 10 tweets for that day (for that interest profile). For each tweet, the system shall record the tweet id as well as the timestamp of when the push notification was putatively sent. The evaluation metric (more below) will penalize the gap between the tweet time and the notification time.

Additional commentary: The 10 tweets per day limit is set to realistically model user fatigue in mobile push notifications. However, this has implications for evaluation reusability in terms of coverage of judgments (more later). Note that in this design we are not modeling real-world constraints such as "don't send users notifications in the middle of the night". This simplification is intentional.

Scenario B: Periodic email digest. A system for this scenario (a "type B" system) will identify a batch of up to 100 ranked interesting tweets per day (per interest profile) that are putatively delivered to the user. For simplicity, all tweets from 00:00:00 to 23:59:59 are valid candidates for a particular day. It is expected that systems will compute the results in a relatively short amount of time after the day ends (e.g., at most a few hours), but this constraint will not be enforced.

For both scenarios, systems should only consider tweets in English.

IMPORTANT: Treatment of retweets. All retweets that are explicitly identified as such (in the JSON) will be normalized to the underlying tweet (that was retweeted) and evaluated as such. That is, if the system returns a retweet, it will be treated as if the system had returned the underlying tweet that was retweeted (including how the temporal delay penalty is computed). This means that any additional commentary in the retweet (beyond the content of the underlying tweet) will be ignored.
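As an illustration of how a system might apply the same normalization on its end, a retweet in the raw Twitter JSON carries a retweeted_status field pointing at the underlying tweet; the helper names in this sketch are ours, not part of any official tooling.

```python
def normalize_retweet(tweet_json):
    """Map a retweet to the underlying tweet that was retweeted;
    non-retweets pass through unchanged."""
    return tweet_json.get("retweeted_status", tweet_json)


def result_tweet_id(tweet_json):
    """The tweet id that should actually be reported in a run."""
    return normalize_retweet(tweet_json)["id_str"]
```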

Note that this treatment of retweets is different from previous years, where retweets were essentially treated as non-relevant by fiat. The rationale for this change is that we would like the retweet signal (i.e., number of retweets) to be available to system participants. If we don't treat retweets this way, then if a system observes a highly-retweeted tweet, it would only be a valid result if the original underlying tweet was part of the sample stream, which is unlikely due to the sampling.

This, of course, sweeps under the rug many issues, for example, interesting commentary that might be part of the retweet text, or the interestingness of the act of the retweet itself (e.g., A itself isn't interesting per se, but the fact that B retweeted A is interesting). However, at this stage, this is the best compromise we could come up with.

Evaluation Metrics

All tweets returned by systems in both scenarios will be pooled and each tweet will be judged independently by NIST assessors with respect to the user's interest profile. There will be a single judgment pool for both scenarios. It is expected that scenario A runs will contribute to the pool in their entirety; scenario B runs will contribute to the pool up to a depth determined by NIST after receiving all submissions, based on available resources and other factors. Each tweet will be assessed on a four-point scale: spam/junk, not interesting, somewhat interesting, very interesting.

Non-English tweets will be marked as spam/junk by fiat. If a tweet contains a mixture of English and non-English content, discretion will be left to the assessor. As with previous TREC microblog evaluations, assessors will examine links embedded in tweets, but will not explore any additional external content beyond those.

Redundancy. After the standard pooling assessment procedure described above, we will run the clustering protocol from the tweet timeline generation (TTG) task from last year. That is, all tweets will be grouped into semantic clusters of tweets that say the same thing. Systems will only get credit for returning one tweet from each cluster.

For more details, consult the TREC 2014 Microblog track overview, which describes the TTG evaluation methodology.

Evaluation of scenario B (email digest). The score of each run will be computed as follows: for each topic, the list of tweets returned per day will be treated as a ranked list and from this NDCG@k will be computed (where k will be determined after the evaluation is complete, based in part on the pool depth; it is expected that k will be relatively small). The score of a topic is the average of the NDCG@k scores across all days in the evaluation period. The score of the run is the average over all topics.

Evaluation of scenario A (mobile push notification). The score for a topic on a particular day will be computed using two temporally-discounted gain measures (explained below). The score of the topic will be the average of the daily scores across all days in the evaluation period. The score of a run will be the average of the scores across topics.

The first metric is expected latency-discounted gain (ELG) from the temporal summarization track:

(1 / # tweets returned) x sum { Gain(tweet) }

More details:

  • Spam/junk and not-interesting tweets receive a gain of 0.
  • Somewhat interesting tweets receive a gain of 0.5.
  • Very interesting tweets receive a gain of 1.0.
  • Only the first tweet from each cluster receives any credit.
  • A latency penalty is applied to all tweets: the latency penalty is computed as MAX(0, (100 - delay)/100), where the delay is the time elapsed (in minutes, rounded down) between the tweet creation time and the putative time the tweet is delivered. That is, if the system delivers a relevant tweet within a minute of the tweet being created, the system receives full credit. Credit decays linearly such that after 100 minutes, the system receives no credit even if the tweet were relevant.

ELG will be the primary metric. The secondary metric will be normalized cumulative gain (nCG):

(1 / Z) x sum { Gain(tweet) }

where Z is the maximum possible gain (given the 10 tweet per day limit). The gain of each individual tweet will be computed as above.
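The following sketch illustrates how the two metrics might be computed for a single topic-day. It assumes that gains have already been de-duplicated (only the first returned tweet of each cluster keeps a non-zero gain), that all times are in epoch seconds, and that "computed as above" for nCG includes the latency penalty; the function names are illustrative, not the official scoring code.

```python
def latency_penalty(tweet_time, delivery_time):
    """MAX(0, (100 - delay) / 100), with delay in whole minutes."""
    delay_minutes = max(0, delivery_time - tweet_time) // 60
    return max(0.0, (100 - delay_minutes) / 100.0)


def elg(results, gains):
    """Expected latency-discounted gain for one topic-day.

    results: list of (tweet_id, tweet_time, delivery_time)
    gains:   dict mapping tweet_id -> gain (0, 0.5, or 1.0)
    """
    if not results:
        return 0.0  # empty-day corner cases are handled by the caller
    total = sum(gains.get(tid, 0.0) * latency_penalty(t, d)
                for tid, t, d in results)
    return total / len(results)


def ncg(results, gains, max_gain):
    """Normalized cumulative gain: same discounted gains, normalized
    by the maximum gain attainable under the 10-tweet daily limit."""
    if max_gain == 0:
        return 0.0
    total = sum(gains.get(tid, 0.0) * latency_penalty(t, d)
                for tid, t, d in results)
    return total / max_gain
```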

Additional Details and "Corner Cases": For simplicity, the clustering protocol will be applied to all tweets during the evaluation period, so it is possible that a cluster crosses multiple days. However, if t1 and t2 are in the same cluster but on different days, t2 will still be considered redundant if you've already returned t1.

If there are no interesting tweets for a given day, the system should stay quiet - this accurately models the task. This brings up the related point of how to score a day when there are no interesting tweets and/or the system returns nothing. We can break it down into two cases:

There are interesting tweets for that day.

  • system returns zero tweets - gets score of zero
  • system returns 1+ tweets - score as normal

There are NO interesting tweets for that day.

  • system returns zero tweets - gets score of one (perfect)
  • system returns 1+ tweets - score as normal

This means that an empty run that never returns anything may have a non-zero score, depending on how sparse the profile is.
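A sketch of how these empty-day rules might wrap a metric such as the hypothetical elg helper from the earlier sketch (again, illustrative rather than official scoring code):

```python
def daily_score(results, gains, metric):
    """Apply the empty-day corner cases for a single topic-day.

    results: the tweets the system returned for this topic-day
    gains:   judged gains for this topic-day; no positive gain means
             there were no interesting tweets that day
    metric:  a scoring function such as elg above
    """
    day_has_interesting = any(g > 0 for g in gains.values())
    if not day_has_interesting:
        # Staying quiet on a day with nothing interesting is perfect.
        return 1.0 if not results else metric(results, gains)
    if not results:
        # Interesting tweets existed but nothing was pushed.
        return 0.0
    return metric(results, gains)
```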

Specification of Interest Profiles

Interest profiles to be used in the 2015 track will look like traditional TREC topic statements: three fields, with the "title" containing a few keywords, the "description" containing a one-sentence statement of the information need, and the "narrative" containing a paragraph-length description of the information need. Systems can use any or all of the fields in their runs.
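For example, assuming the profiles are distributed in classic TREC topic markup (<top> blocks containing <num>, <title>, <desc>, and <narr> fields; the actual distribution format is up to NIST), a simple parser might look like:

```python
import re

# Strip the "Number:" / "Description:" / "Narrative:" labels that
# classic TREC topic files often include inside the tags.
LABEL = re.compile(r"^(?:Number|Description|Narrative):\s*", re.IGNORECASE)


def parse_profiles(text):
    """Yield one dict per <top> block with num, title, desc, and narr."""
    for block in re.findall(r"<top>(.*?)</top>", text, flags=re.DOTALL):
        profile = {}
        for tag in ("num", "title", "desc", "narr"):
            m = re.search(r"<%s>(.*?)(?=<|\Z)" % tag, block, flags=re.DOTALL)
            value = m.group(1).strip() if m else ""
            profile[tag] = LABEL.sub("", value)
        yield profile
```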

Additional commentary: Previously, the interest profiles were going to be as follows:

Interest profiles will be a combination of a narrative and a few (<10) sample Tweets from past collections that would be relevant to that narrative. A given profile will focus on only a single topic in the traditional sense (i.e., a real user profile would probably consist of a set of these profiles). The narratives will reflect a generic interest on the part of the putative user, and relevant tweets might be specific instances thereof. For example, we might have an interest profile about IR, representing an interest in tweets about information retrieval, or a Downton Abbey profile, representing interest in tweets about the TV show.

However, as Ellen summarized in an email to the track mailing list on May 28, 2015:

When I started developing sample profiles, I realized that what we had intended to use as a profile was unlikely to work in that it would be very difficult for systems to be successful. Past history from the filtering track of yore showed that the initial performance of filtering systems is quite poor as the system needs to retrieve at least some irrelevant documents to learn to distinguish between relevant and irrelevant. The Microblog track had proposed using a short topic statement and a few examples of past relevant Tweets, but there is no possibility of active learning in the track. Since the few example Tweets are guaranteed to be an incomplete, highly biased sample of relevant Tweets, I contend systems would just flounder with no real chance of creating a cogent model of the relevant space.

... Microblog topics from 2011-2013 are superficially like title-only topics, and the 2014 topics with the description field are superficially like title-description topics, so could be used as training for some purposes. But note the similarity is only syntactic since the underlying information needs being expressed are different in previous years' topics from this year's. In previous years, the information need was generally quite narrow, usually asking about some specific event that had taken place. In 2015, the needs are to represent long-standing interests, so the statement of information need will be broader, though individual relevant Tweets will still likely be about specific instances. For example, a profile might express an interest in reports of flooding, and a relevant Tweet would be about the recent Houston storms.

A few other (earlier) proposed ideas that were ultimately rejected include:

  • A set of queries from previous TREC Microblog tracks. One can treat this as a "query log" and take advantage of previous test collections for training.

  • A set of Twitter accounts representing content that the user is interested in. The downside here is a dependence on the Twitter interest graph, which is difficult to crawl due to limitations of the Twitter API. For this evaluation, we would like to place the emphasis on the textual content of the content stream (as opposed to graph features).

Run Submission

Each team will be allowed to submit up to three runs for scenario A and three runs for scenario B.

Systems for either scenario A or scenario B should be categorized into three different types based on the amount of human involvement:

  • Automatic Runs: In this condition, system development must conclude prior to downloading the interest profiles. The system must operate without human input before and during the evaluation period. Note that it is acceptable for a system to perform processing on interest profiles (for example, query expansion) before the evaluation period, but such processing cannot involve human input.

  • Manual Preparation: In this condition, the system must operate without human input during the evaluation period, but human involvement is acceptable before the evaluation period (i.e., after downloading the interest profile). Examples of manual preparation might be the following: after downloading the interest profiles, a human examines them to enrich the original profile with custom keywords, or performs relevance judgments on a related collection to train a classifier. However, once the evaluation period begins, no further human involvement is permissible.

  • Manual Intervention: In this condition, there are no limitations on human involvement before or during the evaluation period. Crowd-sourcing judgments, human-in-the-loop search, etc. are all acceptable.

When submitting a run, you will be asked to designate its type. All types of systems are welcome; in particular, manual preparation and manual intervention runs will help us understand human performance in this task and enrich the judgment pool.

Submission Format

Scenario A runs should be formatted as a plain text file, where each line has the following fields:

topic_id tweet_id delivery_time runtag

The first two fields are straightforward. The delivery_time is the putative time that the push notification was sent, in epoch seconds, i.e., when the system identified the tweet as interesting. From this value the temporal penalty will be computed. The runtag is a unique alphanumeric identifier for your run, usually a combination of your organization name and the name of the algorithm.

You can find a handy human interface to convert epoch seconds to human readable time (and vice versa) here. Note that the Twitter Tools package (i.e., this repo) uses epoch time, so you can consult some of the related code here.

Note that if there are more than ten tweets for a day (given by the delivery_time), all but the first ten on that day will be ignored.
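As an illustration, a minimal helper for appending one scenario A decision to a run file might look like the following; the runtag value is a placeholder.

```python
import time


def record_push(run_file, topic_id, tweet_id, runtag="myGroup-runA1"):
    """Append one scenario A line: topic_id tweet_id delivery_time runtag.
    The delivery time is recorded in epoch seconds at the moment the
    tweet is selected (time.gmtime() converts it back to UTC)."""
    delivery_time = int(time.time())
    with open(run_file, "a") as out:
        out.write("%s %s %d %s\n" % (topic_id, tweet_id, delivery_time, runtag))
```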

Scenario B runs should be formatted as a plain text file, where each line has the following fields:

YYYYMMDD topic_id Q0 tweet_id rank score runtag

Basically, this is just the standard TREC format prepended with a date in the format YYYYMMDD indicating the date the results were generated. "Q0" is a verbatim string that is part of the legacy TREC format (i.e., keep it as is). The rank field is the rank of the result, starting from one; score is its score.

This format allows us to easily manipulate the runs and pass them on to existing scoring scripts to compute NDCG, MAP, etc. on a per-day basis. Please make sure that rank and score are consistent, i.e., rank 1 has the highest score, rank 2 has the second highest score, etc. Otherwise, ties will be broken arbitrarily during scoring.
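A corresponding sketch for writing one day's ranked list in the scenario B format, sorting by score so that ranks and scores stay consistent (the runtag is again a placeholder):

```python
def write_digest(run_file, date_yyyymmdd, topic_id, scored_tweets,
                 runtag="myGroup-runB1", k=100):
    """Append up to k lines in the format:
    YYYYMMDD topic_id Q0 tweet_id rank score runtag

    scored_tweets: list of (tweet_id, score) pairs for that day.
    """
    ranked = sorted(scored_tweets, key=lambda pair: pair[1], reverse=True)[:k]
    with open(run_file, "a") as out:
        for rank, (tweet_id, score) in enumerate(ranked, start=1):
            out.write("%s %s Q0 %s %d %.4f %s\n" %
                      (date_yyyymmdd, topic_id, tweet_id, rank, score, runtag))
```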

Unresolved Issues

Repeatability: It is unclear how the evaluation could be made repeatable for participants that did not take part in the original track. It would be possible for us (the organizers) to capture the entire content of the stream and develop an API that replays it, much in the same way that the search API was used in TREC 2013 and 2014. It is unclear, however, whether such an API would violate Twitter's terms of service, since the client of the API could simply save all tweets.

TREC participants, of course, could save the contents of the stream for their own future use.

Another possible option is to publish the tweet ids (since the collection will be relatively small) so that others can download the tweets using the crawler from TREC 2011/2012.

Reusability: It is unclear if the nature of the interest profiles will be conducive to creating judgments pools that are sufficiently well-populated to fairly evaluate future systems that did not participate in the original evaluation.