RepLab 2013 – An evaluation campaign for Online Reputation Management Systems

1. About RepLab

RepLab is a competitive evaluation exercise for Online Reputation Management systems. Like the first RepLab, held in 2012, this second campaign is organized as an activity of CLEF, and the results of the exercise will be discussed at the CLEF 2013 conference in Valencia, Spain, 23-26 September.

RepLab 2013 will focus on the task of monitoring the reputation of entities (companies, organizations, celebrities, etc.) on Twitter. The monitoring task for analysts consists of searching the stream of tweets for potential mentions of the entity, filtering those that do refer to the entity, detecting topics (i.e., clustering tweets by subject) and ranking them based on the degree to which they signal reputation alerts (i.e., issues that may have a substantial impact on the reputation of the entity).

2. Tasks

The RepLab 2013 task is defined, accordingly, as (multilingual) topic detection combined with priority ranking of the topics, as input for reputation monitoring experts. The detection of polarity for reputation (does the tweet have negative/positive implications for the reputation of the entity?) is an essential step to assign priority, and will be evaluated as a standalone subtask.
Participants are welcome to present systems that attempt the full monitoring task (filtering + topic detection + topic ranking) or modules that contribute only partially to solving the problem. Possible modules are related to the following components of the whole reputation management task:
  1. Filtering. Systems will be asked to determine which tweets are related to the entity and which are not, for instance, distinguishing between tweets that contain the word "Stanford" referring to Stanford University and filtering out tweets about Stanford as a place. Manual annotations will be provided with two possible values: related/unrelated.
  2. Polarity for Reputation classification. The goal will be to decide if the tweet content has positive or negative implications for the company's reputation. Manual annotations are: positive/negative/neutral.
  3. Topic Detection: Systems will be asked to cluster related tweets about the entity by topic with the objective of grouping together tweets referring to the same subject. 
  4. Assigning priority. The full task involves detecting the relative priority of topics. So as to be able to evaluate priority independently from the clustering task, we will evaluate the subtask of predicting the priority of the cluster a tweet belongs to.  
It will be possible to present systems that address only filtering, only polarity identification, only topic detection or only priority assignment. The organization will provide baseline components for all four subtasks. This way any participant will be able to take part in the full task regardless of where their particular contribution lies. Evaluation results will be provided for the full task and for each of the four subtasks listed above.
Details on the polarity for reputation and topic detection tasks follow.
Polarity for reputation is substantially different from standard sentiment analysis:
  • First, when analyzing polarity for reputation, both facts and opinions have to be considered. For instance, "Barclays plans additional job cuts in the next two years" is a fact with negative implications for reputation. Therefore, systems will not be explicitly asked to classify tweets as factual vs. opinionated: the goal is to find polarity for reputation, that is, what implications a piece of information might have on the reputation of a given entity, regardless of whether the content is opinionated or not.
  • Second, negative sentiments do not always imply negative polarity for reputation and vice versa. For instance, "R.I.P. Michael Jackson. We'll miss you" has a negative associated sentiment (sadness, deep sorrow), but a positive implication for the reputation of Michael Jackson.
And the other way around, a tweet such as "I LIKE IT..... NEXT...MITT ROMNEY...Man sentenced for hiding millions in Swiss bank account", has a positive sentiment (joy about a sentence) but has a negative implication for the reputation of Mitt Romney.
As for the topic detection + topic ranking process, a three-valued classification will be applied to assess the priority of each entity-related topic: alert (the topic deserves immediate attention of reputation managers), mildly relevant (the topic contributes to the reputation of the entity but does not require immediate attention) and unimportant (the topic can be neglected from a reputation management perspective). Some of the factors that play a role in the priority assessments are:
  • Polarity. Topics with polarity (and, in particular, with negative polarity, where action is needed) usually have more priority.
  • Centrality. A high priority topic is very likely to have the company as the main focus of the content.
  • User's authority. A topic promoted by an influential user (for example, in terms of number of followers or expertise) has better chances of receiving high priority.

3. Data

RepLab 2013 uses Twitter data in English and Spanish. The balance between both languages depends on the availability of data for each of the entities included in the dataset.
The corpus consists of a collection of tweets referring to a selected set of 61 entities from four domains: automotive, banking, universities and music/artists. Crawling was performed from 1 June 2012 to 31 December 2012 using the entity's canonical name as query. For each entity, at least 2,200 tweets were collected: the first 700 are used as the training set, and the rest as the test set. The corpus also comprises additional background tweets for each entity (up to 50,000, with a large variability across entities). Note that the final number of available tweets in these sets may be lower, since some posts may have been deleted by the users: in order to respect Twitter's terms of service, the organizers do not provide the contents of the tweets. The tweet identifiers can be used to retrieve the texts of the posts, similarly to the mechanism used in the TREC Microblog Track in 2011 and 2012.
The training and test data sets are manually labelled by annotators who are trained and guided by experts in online reputation management. Each tweet in the training and test sets is annotated as follows:
  • RELATED/UNRELATED: the tweet is/is not about the entity
  • POSITIVE/NEUTRAL/NEGATIVE: the information contained in the tweet has positive/neutral/negative implications for the entity's reputation.
  • Identifier of the topic (cluster) the tweet belongs to. 
  • ALERT/MILDLY_IMPORTANT/UNIMPORTANT: the priority of the topic (cluster) the tweet belongs to.
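Under this annotation scheme, a single labelled tweet can be pictured as a small record. The field names below are illustrative only, not the official distribution format:

```python
from dataclasses import dataclass

@dataclass
class AnnotatedTweet:
    """One labelled tweet; field names are illustrative, not the official schema."""
    tweet_id: str    # identifier used to re-download the text (contents are not distributed)
    entity_id: str   # one of the 61 monitored entities
    related: bool    # RELATED (True) / UNRELATED (False)
    polarity: str    # "POSITIVE" / "NEUTRAL" / "NEGATIVE"
    topic_id: str    # identifier of the topic (cluster) the tweet belongs to
    priority: str    # "ALERT" / "MILDLY_IMPORTANT" / "UNIMPORTANT"

# A hypothetical annotated tweet:
example = AnnotatedTweet("201212310001", "stanford", True,
                         "NEUTRAL", "admissions_2013", "MILDLY_IMPORTANT")
```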

4. Evaluation Measures

Evaluation measures will be provided for the full task and for two salient subtasks: filtering and polarity for reputation.

4.1. Full monitoring task: filtering + topic detection + topic priority

For this combined task we will use the same measures as in RepLab 2012, Reliability and Sensitivity, which are described in Amigó, E., Gonzalo, J., Verdejo, F.: Reliability and Sensitivity: Generic Evaluation Measures for Document Organization Tasks. Technical Report, UNED (2012).
In essence, these measures consider two types of binary relationships between pairs of items: relatedness - two items belong to the same cluster - and priority - one item has more priority than the other. Reliability is defined as precision of binary relationships predicted by the system with respect to those that derive from the gold standard; and Sensitivity is similarly defined as recall of relationships.
For the evaluation of the full task, we consider clustering relationships (a pair of tweets belong to the same topic) and four levels of priority relationships: the first three levels are given by the alert/mildly_important/unimportant annotation, and the set of tweets discarded in the filtering step is considered a fourth level with the lowest priority.
Note that Polarity for reputation is not considered explicitly, but implicitly, as it is one of the factors in the priority ranking.
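A toy sketch of these pairwise relationships may help. Here Reliability is simplified to the precision, and Sensitivity to the recall, of the set of predicted relationship pairs (the official definition in Amigó et al. averages per item); all function and variable names are illustrative:

```python
def relationship_pairs(clusters, priority):
    """Binary relationships among a set of items.
    clusters: item -> cluster id; priority: item -> int (higher = more urgent).
    Returns ('rel', a, b) pairs (same cluster, unordered) and
    ('pri', a, b) pairs (a has strictly higher priority than b)."""
    items = sorted(clusters)
    pairs = set()
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if clusters[a] == clusters[b]:
                pairs.add(('rel', a, b))
            if priority[a] > priority[b]:
                pairs.add(('pri', a, b))
            elif priority[b] > priority[a]:
                pairs.add(('pri', b, a))
    return pairs

def reliability_sensitivity(gold_pairs, sys_pairs):
    """Simplified Reliability (precision) and Sensitivity (recall) over pairs."""
    if not sys_pairs or not gold_pairs:
        return 0.0, 0.0
    correct = len(gold_pairs & sys_pairs)
    return correct / len(sys_pairs), correct / len(gold_pairs)
```

For instance, a gold standard that clusters t1 and t2 together and ranks both above t3 can be compared against a system that clusters t2 with t3 instead: only the priority relationship between t1 and t3 is shared, giving Reliability and Sensitivity of 1/3 each.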

4.2. Filtering subtask

This step is meant to discard noise in the input stream and retain the information about the analyzed entity. We will use the same Reliability and Sensitivity measures described for the monitoring problem above; in the case of filtering, they are equivalent to the product of precision scores over positive and negative classes (reliability) and the product of recall scores (sensitivity). The property that makes them particularly suitable for the filtering problem is that they are strict with respect to standard measures, i.e., a high value according to Reliability and Sensitivity implies a high value in all standard measures (which will also be reported for reference).
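For the binary filtering case, the reduction stated above (Reliability as the product of per-class precisions, Sensitivity as the product of per-class recalls) can be sketched directly; the function name and input format are assumptions:

```python
def filtering_r_s(gold, system):
    """Reliability/Sensitivity for binary filtering.
    gold, system: dicts tweet_id -> True (related) / False (unrelated).
    R = product of per-class precisions, S = product of per-class recalls."""
    prec, rec = [], []
    for cls in (True, False):
        sys_cls = {t for t, v in system.items() if v == cls}
        gold_cls = {t for t, v in gold.items() if v == cls}
        hits = len(sys_cls & gold_cls)
        prec.append(hits / len(sys_cls) if sys_cls else 0.0)
        rec.append(hits / len(gold_cls) if gold_cls else 0.0)
    return prec[0] * prec[1], rec[0] * rec[1]
```

Because each measure multiplies scores over both classes, a system that ignores one class entirely scores zero, which is what makes the measures strict with respect to standard precision/recall.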

4.3. Polarity for reputation subtask

Identifying polarity for reputation implies predicting the outcome of a variable (polarity). We can evaluate systems by measuring the correlation between two variables: the output of the system and the ground truth. If we stick to three possible values only (positive/neutral/negative), we can also see the task as a ternary classification problem, and report precision/recall on each of the classes. We can also report accuracy (proportion of cases where the system guesses the right class), although this is of little help if the neutral class dominates. We will report Pearson correlation, precision/recall on positive and negative classes, and accuracy.
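The reported measures can be sketched in a few lines, coding the three labels as -1/0/1 for the Pearson correlation; the function name and label coding are illustrative assumptions:

```python
from statistics import mean

def polarity_metrics(gold, system):
    """Accuracy, precision/recall on POSITIVE and NEGATIVE, and Pearson
    correlation with labels coded NEGATIVE=-1, NEUTRAL=0, POSITIVE=1.
    gold, system: equal-length lists of label strings."""
    code = {"NEGATIVE": -1, "NEUTRAL": 0, "POSITIVE": 1}
    acc = mean(g == s for g, s in zip(gold, system))
    per_class = {}
    for cls in ("POSITIVE", "NEGATIVE"):
        tp = sum(g == s == cls for g, s in zip(gold, system))
        sys_n, gold_n = system.count(cls), gold.count(cls)
        per_class[cls] = (tp / sys_n if sys_n else 0.0,   # precision
                          tp / gold_n if gold_n else 0.0)  # recall
    # Pearson correlation over the numeric codes
    x = [code[g] for g in gold]
    y = [code[s] for s in system]
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    pearson = cov / (sx * sy) if sx and sy else 0.0
    return acc, per_class, pearson
```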

4.4. Topic detection subtask

We use Reliability and Sensitivity, which for this clustering problem are equivalent to BCubed Precision and Recall. 
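BCubed Precision averages, over all items, the fraction of each item's system cluster that shares its gold cluster; BCubed Recall does the same with the roles reversed. A minimal sketch (function name and input format are assumptions):

```python
def bcubed(gold, system):
    """BCubed precision and recall for a clustering.
    gold, system: dicts item -> cluster id, over the same items."""
    items = list(gold)
    p_scores, r_scores = [], []
    for a in items:
        sys_cluster = [b for b in items if system[b] == system[a]]
        gold_cluster = [b for b in items if gold[b] == gold[a]]
        # items that share a's cluster in BOTH the system and the gold standard
        correct = sum(1 for b in sys_cluster if gold[b] == gold[a])
        p_scores.append(correct / len(sys_cluster))
        r_scores.append(correct / len(gold_cluster))
    return sum(p_scores) / len(items), sum(r_scores) / len(items)
```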

4.5. Priority assignment subtask

In order to evaluate priority assignment independently from the clustering step, we consider individual tweets and assign them the priority of the topic they belong to. Reliability and Sensitivity are then applied with three levels of priority (alert, mildly_important, unimportant). 

5. Baselines

Baseline outputs are provided for all RepLab subtasks: filtering, polarity annotation, topic detection (clustering) and priority assignment.

The baseline is a simple (memory-based) supervised system that matches each tweet in the test set with the most similar tweet in the training set, and assumes that the annotations - for all subtasks - in the tweet from the training set are also valid for the tweet in the test set. Tweet similarity is computed using Jaccard distance and a straightforward bag-of-words representation of the tweets.
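The nearest-neighbour baseline described above can be sketched as follows. Tokenization here is naive lowercased whitespace splitting, an assumption; the annotations object stands in for the bundle of filtering/polarity/topic/priority labels:

```python
def jaccard(a, b):
    """Jaccard similarity between two bags of words (as sets)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def baseline_annotate(test_tweet, training):
    """Copy the annotations of the most similar training tweet.
    training: list of (text, annotations) pairs."""
    words = set(test_tweet.lower().split())
    best = max(training,
               key=lambda t: jaccard(words, set(t[0].lower().split())))
    return best[1]
```

Applied to a test tweet such as "new lab at stanford university", with a training set containing one related and one unrelated Stanford tweet, the baseline copies the labels of whichever training tweet shares the most words.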

You can use these baseline outputs to assemble a full RepLab 2013 system in combination with your subtask systems.

Note that:

  1. The filtering output contains annotations for all tweets. 
  2. The polarity output also contains annotations for all tweets, given that annotating unrelated tweets is not penalized (they are simply ignored in the evaluation).
  3. For the rest of subtasks (topic detection and priority), only those tweets judged relevant by the baseline filtering are included in the baseline output.

6. Important dates

  • April 15: Release of training and test data
  • June 3 (extended from May 27): System results due
  • June 12 (extended from June 5): Official results released
  • June 22 (extended from June 15): Deadline for paper submission
  • September 23-26: CLEF 2013 Conference in Valencia, Spain

7. How to submit runs?

Here are the instructions to submit your runs (please note that the deadline is June 3 and cannot be further postponed).  

  1. What is the number of runs allowed?

    Each group is allowed to send up to 10 runs per subtask: filtering, polarity, topic_detection and priority_detection, plus 10 runs for the full_task.

  2. How to format your submission?
    • Each group must pick a group id (alphanumeric string, preferably short).
    • All runs must be packed in a directory named replab2013-<group-id>
    • Inside this directory, each run should be in a separate directory named <group_id>_<subtask>_<run_id>, where run_id is a number between 1 and 10. For instance: replab2013-UNED/UNED_full_task_2. The directory will contain one file per subtask included in the run, with up to four files ("filtering", "polarity", "topic_detection", "priority_detection").
    • Files must follow the specifications of the evaluation package distributed with the data.

  3. How to submit?

    The compressed directory with your runs must be sent to the organizers as a single file (preferably as a download URL), together with a separate file (see the MS Excel template) containing metadata about your runs.

  4. How to prepare your paper for the workshop notes

    Each group must prepare one paper describing all experiments in all subtasks, following the formatting guidelines on the CLEF 2013 website. If you feel that your work should be split into more than one report (in cases of disjoint experiments with disjoint authors, for instance), please contact the lab organizers.

8. Organizers

RepLab is an activity sponsored by the EU project LiMoSINe.
Lab organizers are:
  1. Adolfo Corujo (Llorente & Cuenca, Madrid)
  2. Julio Gonzalo (UNED, Madrid)
  3. Edgar Meij (Yahoo! Research)
  4. Maarten de Rijke (U. of Amsterdam)
Steering Committee:
  • Eugene Agichtein, Emory University, USA
  • Alexandra Balahur, JRC, Italy
  • Krisztian Balog, U. Stavanger, Norway
  • Donna Harman, NIST, USA
  • Eduard Hovy, ISI/USC, USA
  • Radu Jurca, Google, Switzerland
  • Jussi Karlgren, Gavagai/SICS, Sweden
  • Mounia Lalmas, Yahoo! Research, Spain
  • Jochen Leidner, Thomson Reuters, Switzerland
  • Bing Liu, U. Illinois at Chicago, USA
  • Alessandro Moschitti, U. Trento, Italy
  • Miles Osborne, U. Edinburgh, UK
  • Hans Uszkoreit, U. Saarbrücken, Germany
  • James Shanahan, Boston U., USA
  • Belle Tseng, Yahoo!, USA
  • Julio Villena, Daedalus/U. Carlos III, Spain
For more information, visit the RepLab website.