RepLab 2012 -- An evaluation campaign for Online Reputation Management Systems

While traditional reputation analysis was based mostly on manual analysis (clipping from media, surveys, etc.), the key value of online media comes from the ability to process, understand and aggregate potentially huge streams of facts and opinions about a company or individual. Information to be mined includes answers to questions such as: What is the general state of opinion about a company/individual in online media? What are its perceived strengths and weaknesses as compared to its peers/competitors? How is the company positioned with respect to its strategic market? Can incoming threats to its reputation be detected early enough to be neutralized before they effectively affect reputation?

In this context, Natural Language Processing plays a key, enabling role and we are already witnessing an unprecedented demand for text mining software in this area. Note that while the area of opinion mining has made significant advances in the last few years, most tangible progress has been focused on products. However, mining and understanding opinions about companies and individuals is, in general, a much harder and less understood problem.

RepLab is a competitive evaluation exercise for Online Reputation Management systems. The aim of RepLab is to bring together the Information Access research community with representatives from the Online Reputation Management industry, with the goals of (i) establishing a five-year roadmap that includes a description of the language technologies required in terms of resources, algorithms, and applications; (ii) specifying suitable evaluation methodologies and metrics; and (iii) developing test collections that enable systematic comparison of algorithms and reliable benchmarking of commercial systems. The first RepLab campaign is an activity of CLEF 2012, and the results of the exercise will be discussed at the CLEF conference in Rome (17-20 September) -- see http://clef2012.org for details.

RepLab 2012

RepLab 2012 session at CLEF 2012

The 2012 overview paper can be downloaded from here. The RepLab 2012 overview presentation can be found here.

When

  • Tuesday, Sept 18 2012, 10.40-18.10

What

  • Lunch
  • Break
  • 16.40-17.00: Plans for 2013
  • 17.00-18.10: Discussion

Scenarios

RepLab focuses on two scenarios for online reputation management:

  • A "profiling" scenario, which consists of mining the reputation of a company as it distills from online media. Adequate profiling implies harvesting documents from many online sources, and annotating them for reputation-related aspects, e.g.: Is the document related to the company? If so, what are the dimensions of the company affected by the document content (commercial, institutional, social, financial, ...)? Does the content have positive or negative implications for the reputation of the company along those dimensions?
  • A "monitoring" scenario, where the main goal is early alerting on issues that may damage (or, in general, alter) the reputation of a company, institution, person, brand, product, etc. Monitoring implies a frequent inspection of recent online information, and therefore Twitter is a key source for this task.

In its first year, RepLab will address two pilot tasks on companies and Twitter data, each targeting one of the above scenarios. A distinctive feature of both tasks is that manual annotations will be provided by online reputation management experts from a major Public Relations consultancy (Llorente & Cuenca). In other words, data will not only serve to evaluate systems, but also to understand the concept of reputation from the perspective of professional practitioners.

RepLab 2012 Tasks

Profiling task

Systems will be asked to work on Twitter data (tweets containing a company name, for several company names) and annotate two kinds of information on tweets:

  1. Ambiguity: Is the tweet related to the company? (for instance, a tweet containing the word "subway" may refer to the fast food company or to the underground city transport). Manual assessments will be provided by reputation management experts, with three possible values: relevant/irrelevant/undecidable. Tweets annotated as relevant/irrelevant will be used to evaluate systems.
  2. Polarity for Reputation: Does the tweet content have positive or negative implications for the company's reputation? Manual assessments will be: positive/negative/neutral/undecidable. Tweets in the first three categories will be used to assess systems' performance.

Note that polarity for reputation is substantially different from standard sentiment analysis:

  • First, both facts and opinions may have polarity for reputation. For instance, "Lehman Brothers goes bankrupt" is a fact with negative implications for reputation. Therefore, systems will not be explicitly asked to classify tweets as factual vs. opinionated: the goal is finding polarity for reputation, regardless of whether the content is opinionated or not.
  • Second, negative sentiments do not always imply negative polarity for reputation. For instance, "R.I.P. Michael Jackson. We'll miss you" has a negative associated sentiment (sadness), but a positive implication for the reputation of Michael Jackson.

Monitoring task

Systems will receive a stream of tweets containing the name of an entity, and their goal is to (i) cluster the most recent tweets thematically, and (ii) assign relative priorities to the clusters. A cluster with high priority represents a topic which may affect the reputation of the entity and deserves immediate attention.

Manual assessments will consist of:

  1. Suitable topical clusters for the tweets.
  2. Our reputation experts are using four explicit priority levels: alert > average priority > low priority > irrelevant (not about the company).
  3. In addition, there is one more implicit priority level which comes from the "other" cluster. This cluster is used for tweets that are about the company, but do not qualify as topics and are negligible for the sake of monitoring purposes.

    Therefore, in the gold standard there may be up to five priority levels:

    alert > average priority > low priority > tweets in the "other" cluster > irrelevant

  4. A four-level graded assessment of the priority of each cluster: irrelevant (the topic does not refer to the company), low priority, average (the topic does refer to the company but does not demand urgent attention) or alert (the topic deserves immediate attention).

These annotations will be used to evaluate the output of the systems, which is expected to be a rank of clusters containing topically similar tweets. Some of the factors that may play a role in the priority assessments are:

  • Novelty. Monitoring is focused on early discovery of issues that might affect the reputation of the client (the company in RepLab data); in general, already known issues are less likely to fire an alert.
  • Polarity. Topics with polarity (and, in particular, with negative polarity, where action is needed) usually have more priority.
  • Centrality. A high priority topic is very likely to have the company as the main focus of the content ("centrality" corresponds to the classical notion of relevance in Document Retrieval).
  • Trendiness (actual and potential). Topics with a lot of Twitter activity are more likely to have high priority. Note that experts also try to estimate how a topic will evolve in the near future (for instance, it may involve a modest number of tweets, but from people who are experts in the topic and have a large number of followers). A topic likely to become a trend is particularly suitable to become an alert (and therefore to receive a high priority).

Note, however, that the priority of a topic is determined by online reputation experts according to their expertise and intuitions; therefore, priority assessments will not always necessarily have a direct, predictable relationship with the factors above.

Data

Both tasks will use Twitter data in English and Spanish. The balance between both languages will depend on the availability of data for each of the companies included in the dataset.

Trial data

Trial data is already available upon registration (see section 6). It consists of:

  • 30,000 tweets crawled per company name, for six companies (Apple, Lufthansa, Alcatel, Armani, Marriott, Barclays) using the company name as query, in English and Spanish.
  • For each company's timeline, 300 tweets (approximately in the middle of the timeline) have been manually annotated by reputation management experts. This is the "labeled" dataset. The rest (around 15,000 unannotated tweets before and after the annotated set, for each company) is the "background" dataset and has not been annotated.
  • Manual annotations consist of the following:

    For the profiling task, each tweet is annotated with two fields: related (is the tweet about the company?) and polarity for reputation (does the tweet content have negative/neutral/positive implications for the company's reputation?).

    For the monitoring task, tweets are clustered topically (using topic labels), and clusters are annotated for priority (does the cluster topic demand urgent attention from the point of view of reputation management?).

Test data

Test data (released by the end of May 2012) will have the same format as the trial data, for a different set of 25 companies.

Evaluation Measures

Evaluation will be performed on tweets which are still accessible on the last day of the submission deadline (July 15).

Measures for the Monitoring Task

The monitoring task is a problem that combines clustering (detecting sets of tweets that refer to the same topic) with ranking: topics/clusters have to be ranked by priority, so that topics that require immediate attention appear at the top of the ranking.

There is no standard measure that combines clustering with ranking. We will use a novel pair of measures, Reliability and Sensitivity, described in a technical report available here.

In essence, these measures consider two types of binary relationships between pairs of items: relatedness -- two items belong to the same cluster -- and priority -- one item has more priority than the other. Reliability is defined as precision of binary relationships predicted by the system with respect to those that derive from the gold standard; and Sensitivity is similarly defined as recall of relationships.

When only clustering relationships are considered, Reliability and Sensitivity are equivalent to BCubed Precision and Recall, which have been shown to be preferable to other clustering evaluation measures (see Amigó et al. 2009 for details).
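For the clustering-only case, BCubed Precision and Recall can be sketched as follows. This is a minimal illustration and not the official evaluation script: the function name bcubed and the dictionary-based input (tweet id mapped to a cluster label) are assumptions made for the example, and it covers neither overlapping clusters nor the priority relationships that Reliability and Sensitivity also take into account.

```python
from collections import defaultdict


def bcubed(system, gold):
    """Return (precision, recall) for two flat clusterings.

    `system` and `gold` map each item id to a cluster label.
    Only items present in both mappings are scored (an assumption).
    """
    items = [i for i in system if i in gold]
    if not items:
        return 0.0, 0.0

    def group(assignment):
        # cluster label -> set of items carrying that label
        groups = defaultdict(set)
        for i in items:
            groups[assignment[i]].add(i)
        return groups

    sys_groups = group(system)
    gold_groups = group(gold)

    precision = recall = 0.0
    for i in items:
        same_system = sys_groups[system[i]]
        same_gold = gold_groups[gold[i]]
        shared = len(same_system & same_gold)  # always counts i itself
        precision += shared / len(same_system)
        recall += shared / len(same_gold)
    return precision / len(items), recall / len(items)


gold = {"a": 1, "b": 1, "c": 2, "d": 2}
system = {"a": "x", "b": "x", "c": "x", "d": "y"}
p, r = bcubed(system, gold)
```

Each item contributes the fraction of its system cluster that also shares its gold cluster (precision) and the fraction of its gold cluster that was kept together by the system (recall); the scores are then averaged over items.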

Measures for the Profiling Task

The profiling task requires solving name ambiguity and annotating polarity for reputation.

As for Polarity for Reputation, it consists of predicting the outcome of a variable (polarity). We can evaluate systems by measuring the correlation between two variables: the output of the system and the ground truth. If we stick to three possible values only (positive/neutral/negative), we can also see the task as a ternary classification problem, and report precision/recall on each of the classes. We can also report accuracy (proportion of cases where the system guesses the right class), but this is of little help in our case because the neutral class dominates. We will report Pearson correlation, precision/recall on positive and negative classes, and accuracy.
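The polarity measures listed above can be illustrated with a short sketch. It is not the official scorer: the numeric coding of the labels (-1, 0, 1) used for the Pearson correlation, the convention of returning 0.0 when one variable is constant, and the names POLARITY_VALUE, pearson and polarity_report are all assumptions made for this example.

```python
import math

# Assumed numeric coding of the three polarity labels.
POLARITY_VALUE = {"negative": -1, "neutral": 0, "positive": 1}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined here; 0.0 is our convention
    return cov / math.sqrt(var_x * var_y)


def polarity_report(system, gold):
    """Accuracy, Pearson, and (precision, recall) for positive/negative."""
    report = {
        "accuracy": sum(s == g for s, g in zip(system, gold)) / len(gold),
        "pearson": pearson([POLARITY_VALUE[s] for s in system],
                           [POLARITY_VALUE[g] for g in gold]),
    }
    for label in ("positive", "negative"):
        predicted = sum(s == label for s in system)
        actual = sum(g == label for g in gold)
        hits = sum(s == g == label for s, g in zip(system, gold))
        report[label] = (hits / predicted if predicted else 0.0,
                         hits / actual if actual else 0.0)
    return report


# A system that always answers "neutral" on neutral-dominated data.
gold = ["neutral"] * 8 + ["negative", "positive"]
baseline = polarity_report(["neutral"] * 10, gold)
```

In this toy example the all-neutral baseline reaches an accuracy of 0.8 while its correlation and its precision/recall on the positive and negative classes are all zero, which is why accuracy alone is of little help when the neutral class dominates.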

As for name ambiguity resolution, it is a filtering problem: we want to discard noise in the input stream and retain the information about the company. Typical evaluation measures for filtering are utility, Lam (logistic average misclassification), Precision and Recall on each class, etc. These metrics have different properties and give different system rankings (in fact there is little correlation between them); and there are no compelling reasons to prefer one of them for our problem. We will use the same Reliability and Sensitivity measures described for the monitoring problem above; in the case of filtering, they are equivalent to the product of precisions over positive and negative classes (reliability) and the product of recalls (sensitivity). The property that makes them particularly suitable for the filtering problem is that they are strict with respect to standard measures, i.e., a high value according to Reliability and Sensitivity implies a high value in all standard measures (which will also be reported for reference).
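For the filtering case, the equivalence stated above (Reliability as the product of per-class precisions, Sensitivity as the product of per-class recalls) can be sketched as below. The function names are hypothetical, and treating precision or recall as 0.0 when a class is never predicted or never occurs is an assumption of this example rather than something taken from the technical report.

```python
def class_precision_recall(system, gold, label):
    """Precision and recall of `label`, given parallel lists of labels."""
    predicted = sum(s == label for s in system)
    actual = sum(g == label for g in gold)
    hits = sum(s == label and g == label for s, g in zip(system, gold))
    precision = hits / predicted if predicted else 0.0  # assumed convention
    recall = hits / actual if actual else 0.0           # assumed convention
    return precision, recall


def filtering_reliability_sensitivity(system, gold):
    """Products of precisions and recalls over the "yes" and "no" classes."""
    p_yes, r_yes = class_precision_recall(system, gold, "yes")
    p_no, r_no = class_precision_recall(system, gold, "no")
    return p_yes * p_no, r_yes * r_no


gold = ["yes", "yes", "yes", "no"]
reliability, sensitivity = filtering_reliability_sensitivity(
    ["yes", "yes", "no", "no"], gold)
accept_all = filtering_reliability_sensitivity(["yes"] * 4, gold)
```

Because both measures are products, a system that does poorly on either class is penalized: under the conventions assumed here, the accept-everything baseline scores zero on both measures even though its precision on the "yes" class is 0.75.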

A combined score will also be reported; for the combined score, a tweet will be considered correct if both classifications (ambiguity and polarity) are correctly inferred.

Submission Instructions

Monitoring Task

Each group is allowed to submit up to five runs.

Each run consists of a directory named replab2012_monitoring_[group id]_[run] with one file per entity that conforms to the format specified in ./trial/RepLab-Trial-Corpus-README.txt and where

- "group id" is the identifier for the research group, and must be an alphanumeric string ([A-Z|0-9]+)

- "run" is a number between 1 and 5

The run directory must be zipped and e-mailed to julio@lsi.uned.es before July 15, 23:00 GMT. Preferably, all runs should be attached to a single email with subject "RepLab 2012 monitoring task submission, [group id]".

The format for each file (corresponding to an entity and a run) is:

- The name of the file that corresponds to each entity is the entity id (e.g. "RL2012E06" corresponds to Gas Natural)

- Each file contains one line per level of priority (ranked in descending order of priority). Each line contains a list of clusters separated with " | ". Each cluster is a list of tweet identifiers separated by whitespace. The relative ordering between tweets in the same cluster is irrelevant.

One example of directory name could be "replab2012_monitoring_USpringfield_1". Files under this directory would be named "RL2012E06", "RL2012E07", etc., and the first two lines of a file could look like this:

175706618909032449 175705440368332802 | 175707077644263424 175694761225748480 175698423759118337 175706429808844801 175701506396401664 175701229207425024 175695382012104705 | 175639547135270913 | 175703142225281024 175639626629918721 175709825236348928 175701997117390849 175709322582568960 175697136808239104 175697148581650433 175708783224438784 175702308909363201 175639589686480896 175639618484580353 175701872026456064 175709649864097794 175702382125129728 175701202812682241 175639632468377600 175705278665326593 175708743827329025 175709842563006466 175639588935704576 175709823869001728 175708742199934977 175700039312752640 175703015746048001 175709530523582464 175707932539879424 175708462490206209 175707260281028608 175639590848311296 175701305988362240 175703023597780994 175706928708722688 175705923644424192 175706396128575488 175703211301281792 175709257856073729 175709841552179202 175702850171703297 175709824435224576 175706916092264448 175709093443551232 175639629989552128 175709828981854210
175703350636068864 175706888166576129 175639606228811777 | 175704319390265345 175639565934145536 175701989542473728 175696878099369984 175639576407314432 175701313772986368 175639563782459392 175639579167170560 175698683617230848 175701598817878019 175639588130406400 175699490311909376 175704881942892545 175639611446525953 175696611685572608 175705282477965314 175639618950139904 175705172775944192 175705615933521920 175702946447757312 175639562561929216 175704532544794625 175705745617195010 175639564898152449 175707071814184961 175704931691544576 175704702879666176 175706554606170113 175708126660661248 175702769997586435 | 175705839137587201 175706819308691456 175701570359537664 175701880813535232 175703820234539009 175639534959210496 175639617448574976 175699887155978240 175639575274856448 175702309706276864 175709403536834560 175697416069201920 175709327284371457 175708185699684352 175708937843253248 175706527674548224 175639539669405696 175639621928095745 175709826419138561 175705116756803588
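The file format above can be read and written with a few lines of code. The sketch below is only an illustration of the described layout (one line per priority level, clusters separated by " | ", tweet identifiers separated by whitespace); the function names are hypothetical, and keeping tweet identifiers as strings is a choice made for this example.

```python
def parse_monitoring_file(text):
    """Return priority levels, highest first.

    Each level is a list of clusters; each cluster is a list of
    tweet identifiers kept as strings.
    """
    levels = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        clusters = [chunk.split() for chunk in line.split("|")]
        levels.append([cluster for cluster in clusters if cluster])
    return levels


def format_monitoring_file(levels):
    """Serialize priority levels back into the submission format."""
    lines = (" | ".join(" ".join(cluster) for cluster in level)
             for level in levels)
    return "\n".join(lines) + "\n"


sample = ("175706618909032449 175705440368332802 | 175639547135270913\n"
          "175703350636068864\n")
levels = parse_monitoring_file(sample)
```

Here levels[0] holds the two clusters of the highest priority level and levels[1] the single cluster of the next one; format_monitoring_file(levels) reproduces the original text.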

Profiling Task

[NOTE: we have changed the output format with respect to trial data in order to simplify the evaluation process. Evaluation scripts distributed with the trial data will not work on this output format. Apologies for the inconvenience!]

Each group is allowed to submit up to five runs. Runs must be e-mailed to julio@lsi.uned.es before July 15, 23:00 GMT. Preferably, all runs should be attached to a single email with subject "RepLab 2012 profiling task submission, [group id]".

Each run consists of a single file named replab2012_profiling_[group id]_[run] containing one line per tweet, with the following format:

entity_id tweet_id related polarity

where:

- "group id" is the identifier for the research group, and must be an alphanumeric string ([A-Z|0-9]+). If you are submitting results for both the monitoring and profiling tasks, please use the same group id for both.

- "run" is a number between 1 and 5

- "entity_id" is the identifier of the company as used in ./test/replabl2012_test_entities_tsv (e.g. "RL2012E06" corresponds to Gas Natural)

- "tweet_id" is the id of the tweet as listed in the entity files under ./test/unlabeled/tweets_info (e.g. "175706618909032449")

- "related" is either "yes" (the tweet mentions the company) or "no" (the tweet does not mention the company)

- "polarity" is either "negative", "neutral" or "positive"

- columns are separated by a single whitespace
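Before submitting, each line of a run file can be checked against the format above. The following is a minimal sketch and not an official validator: the function name parse_profiling_line is hypothetical, and the check that tweet_id consists only of digits is an assumption based on the example identifier.

```python
RELATED_VALUES = {"yes", "no"}
POLARITY_VALUES = {"negative", "neutral", "positive"}


def parse_profiling_line(line):
    """Split one run line into (entity_id, tweet_id, related, polarity).

    Raises ValueError if the line does not follow the expected format.
    """
    fields = line.rstrip("\n").split(" ")  # single-space separated columns
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
    entity_id, tweet_id, related, polarity = fields
    if not tweet_id.isdigit():  # assumption: ids are purely numeric
        raise ValueError(f"tweet_id is not numeric: {tweet_id!r}")
    if related not in RELATED_VALUES:
        raise ValueError(f"unknown value for related: {related!r}")
    if polarity not in POLARITY_VALUES:
        raise ValueError(f"unknown value for polarity: {polarity!r}")
    return entity_id, tweet_id, related, polarity


row = parse_profiling_line("RL2012E06 175706618909032449 yes neutral")
```

A line with a missing column, a doubled space, or a label outside the allowed values raises ValueError, so format problems surface before the run is e-mailed.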

Note that we will provide evaluation measures for three tasks with this file: filtering (using the "related" field), polarity (using the "polarity" field) and the full task (using both fields).

IMPORTANT: In the evaluation of polarity, we will only consider the output of the system for the true relevant tweets. That means that systems will not be penalized for assigning polarity to irrelevant tweets.

You can choose to submit results only for one of the subtasks (filtering only or polarity only):

- If you want to give results for the filtering task, the file must be named replab2012_related_[group id]_[run] and the format of each line will be:

entity_id tweet_id related

- If you want to give results for the polarity field only, the file must be named replab2012_polarity_[group id]_[run] and the format of each line will be:

entity_id tweet_id polarity

Note that the limit of 5 runs applies to the sum of full profiling runs, filtering-only runs, and polarity-only runs.

Important dates

April 12 Release of trial data & complete guidelines
June 8 Release of test data
July 12 23:00 GMT, System results due
July 16 Official results released
August 17 Deadline for paper submission
September 17-20 CLEF 2012 Conference in Rome

Organizers

RepLab is an activity sponsored by EU project Limosine (http://limosine-project.eu). Lab organizers are:

Adolfo Corujo, Llorente & Cuenca (acorujo@llorenteycuenca.com)
Julio Gonzalo, UNED (julio@lsi.uned.es)
Edgar Meij, University of Amsterdam (edgar.meij@uva.nl)
Maarten de Rijke, University of Amsterdam (derijke@uva.nl)

Steering Committee

Eugene Agichtein, Emory University, USA
Alexandra Balahur, JRC, Italy
Krisztian Balog, NTNU, Norway
Raymond Franz, Trendlight, The Netherlands
Donna Harman, NIST, USA
Eduard Hovy, ISI/USC, USA
Radu Jurca, Google, Switzerland
Jussi Karlgren, Gavagai/SICS, Sweden
Mounia Lalmas, Yahoo! Research, Spain
Jochen Leidner, Thomson Reuters, Switzerland
Bing Liu, U. Illinois at Chicago, USA
Alessandro Moschitti, U. Trento, Italy