On this page you can find the information needed to reproduce the Robust-WSD CLEF 2008 and 2009 results:
- the topics
- the relevance judgements
- instructions to obtain the documents and/or ready-made indexes
Note that, unfortunately, we cannot release the target documents (GH95, LA94) because of licensing restrictions. If you have participated in previous ad-hoc tasks at CLEF, you probably already have those documents. Otherwise, there are the following possibilities:
- Obtain the documents through ELDA, purchasing the CLEF Test Suite for the CLEF 2000-2003 Campaigns – Evaluation Package (at a very low price for research institutions)
- Use the unordered set of words in each document, from which we have removed positional information so that the originals cannot be reconstructed. Both a text version and Lucene indexes are available for download below.
Each compressed file contains the following topic subsets:
- topics.es.1ST: Spanish disambiguated topics
- topics.en.NUS: English disambiguated topics (NUS team)
- topics.en.UBC: English disambiguated topics (UBC team)
Topics are annotated by word sense disambiguation (WSD) systems. Spanish topics have been annotated with a first-sense heuristic. English topics have been annotated with the following word sense disambiguation systems:
- Agirre, Eneko & Lopez de Lacalle, Oier (2007). UBC-ALM: Combining k-NN with SVD for WSD. Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval 2007). pp. 341-345. Prague, Czech Republic.
- Chan, Yee Seng, & Ng, Hwee Tou, & Zhong, Zhi (2007). NUS-PT: Exploiting Parallel Texts for Word Sense Disambiguation in the English All-Words Tasks. Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval 2007). pp. 253-256. Prague, Czech Republic.
See the DTD for the XML layout of the topics.
The WSD data is based on WordNet version 1.6. In order to expand from WordNet synset numbers to words in English and Spanish, you will need the following:
- The English WordNet version 1.6 available from here
- The Spanish WordNet, available free for research from here
Relevance judgements: train / test
Documents as unordered bag of words: readme / file
To allow the experiments to be replicated, we provide the words, lemmas and WSD information for each document. The word occurrences are presented without positional information. The Lucene indexes are also available below.
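To illustrate what removing positional information means, here is a minimal Python sketch that reduces a token sequence to an unordered bag of words. The function name and the sample tokens are hypothetical and chosen only for illustration; the actual layout of the distributed file is described in its readme.

```python
from collections import Counter


def to_bag_of_words(tokens):
    """Return term -> frequency counts, discarding token order.

    Hypothetical helper for illustration; not part of the released data.
    """
    return Counter(tokens)


# Two token sequences with different word order yield the same bag,
# so the original text cannot be recovered from the counts alone.
doc_a = ["bank", "river", "bank", "water"]
doc_b = ["water", "bank", "bank", "river"]

bag_a = to_bag_of_words(doc_a)
bag_b = to_bag_of_words(doc_b)
```

Term frequencies survive this reduction, so frequency-based ranking functions can still be applied, whereas phrase or proximity queries cannot.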
Indexes in Lucene: readme / file
These Lucene indexes have been generously provided by José Ramón Pérez and Hugo Zaragoza. Note that we have removed positional information so that the originals cannot be reconstructed. Still, these indexes produce the same results as those reported in their paper.
You can use trec_eval for scoring.
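As a rough illustration of what such scoring computes, the following Python sketch calculates average precision for one topic. It assumes the relevance judgements follow the usual TREC qrels layout (topic, iteration, document id, relevance, separated by whitespace), which should be checked against the actual files. The function names, topic id and document ids are hypothetical, and the sketch is not a replacement for the scoring tool itself.

```python
def parse_qrels(lines):
    """Map each topic id to the set of document ids judged relevant.

    Assumes four whitespace-separated fields per line:
    topic, iteration, document id, relevance.
    """
    relevant = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docno, rel = parts
        if int(rel) > 0:
            relevant.setdefault(topic, set()).add(docno)
    return relevant


def average_precision(ranked_docs, relevant_docs):
    """Average of precision values at the ranks of relevant documents."""
    if not relevant_docs:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked_docs, start=1):
        if docno in relevant_docs:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_docs)


# Hypothetical judgements and ranking for a single topic.
qrels = parse_qrels([
    "T1 0 DOC-A 1",
    "T1 0 DOC-B 0",
    "T1 0 DOC-C 1",
])
ranking = ["DOC-A", "DOC-B", "DOC-C"]
ap = average_precision(ranking, qrels["T1"])
```

Averaging this value over all topics gives mean average precision (MAP).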