*SEM 2013 SHARED TASK

STS - Semantic Textual Similarity:
A Unified Framework for the Evaluation of Modular Semantic Components

CORE DATASET


This set of files describes the CORE DATASET for the main track of
the *SEM 2013 SHARED TASK on Semantic Textual Similarity.

The trial dataset contains the following:

  00-README.txt             this file
  correlation.pl            evaluation script for a single dataset
  correlation-all.pl        evaluation script for all datasets
  correct-output.pl         data integrity script
  sts2012-paper.pdf         paper describing the STS 2012 task
  sts2012-train.tgz         train data from STS 2012
  sts2012-test.tgz          test data from STS 2012

  STS.input.headlines.txt   tab-separated sample input file with sentence pairs
  STS.input.OnWN.txt           "
  STS.input.FNWN.txt           "
  STS.input.SMT.txt            "

  STS.gs.headlines.txt      tab-separated sample gold standard
  STS.gs.OnWN.txt              "
  STS.gs.FNWN.txt              "
  STS.gs.SMT.txt               "

  STS.output.headlines.txt  tab-separated sample output
  STS.output.OnWN.txt          "
  STS.output.FNWN.txt          "
  STS.output.SMT.txt           "


Introduction
------------

Given two sentences of text, s1 and s2, the systems participating in
this task should compute how similar s1 and s2 are, returning a
similarity score and an optional confidence score.

The trial dataset comprises the 2012 dataset, which can be used to
develop and train systems. In addition, we include sample data for
the core test datasets, which come from the following sources:

1) news headlines (headlines)
2) mapping of lexical resources (OnWN and FNWN)
3) evaluation of machine translation (SMT)

The trial data is a small subset of the sentence pairs that will be
used as test data, with dummy gold standard scores. The goal of these
samples is to give participants an idea of the kind of sentences that
will occur in each of the test datasets. Note that the OnWN and FNWN
test sets will be smaller, around 50% of the size of the other two
datasets (headlines and SMT), so we will have approximately the same
number of pairs for each of the three kinds of data.

The datasets have been derived as follows:

- STS.input.headlines.txt: we used headlines mined from several news
  sources by European Media Monitor using their RSS feed.
  http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html
  5 pairs of sentences.

- STS.input.OnWN.txt: the sentences are sense definitions from
  WordNet and OntoNotes. 5 pairs of sentences.

- STS.input.FNWN.txt: the sentences are sense definitions from
  WordNet and FrameNet. 5 pairs of definitions. Note that some
  FrameNet definitions involve more than one sentence.

- STS.input.SMT.txt: this SMT dataset comes from DARPA GALE HTER and
  HyTER, where one sentence is an MT output and the other is a
  reference translation. The reference is generated from human
  post-editing (provided by LDC), is an original human reference
  (provided by LDC), or is a human-generated reference based on FSMs,
  as described in (Dreyer and Marcu, NAACL 2012). 5 pairs of
  sentences, where the reference comes from post-edited translations.

The sentence pairs have been manually tagged with a number from 0 to
5, as defined below (cf. Gold Standard section). In this trial
dataset, this is just a dummy number which you can ignore.

NOTE: Participant systems should NOT use the following datasets to
develop or train their systems:

- OntoNotes - WordNet sense-aligned definitions.
- DARPA GALE HTER and HyTER datasets.


Input format
------------

The input file consists of two fields separated by a tab:

- first sentence (does not contain tabs)
- second sentence (does not contain tabs)

Please check any of the STS.input.*.txt files.


Gold Standard
-------------

The gold standard contains a score between 0 and 5 for each pair of
sentences, with the following interpretation:

(5) The two sentences are completely equivalent, as they mean the
    same thing.

      The bird is bathing in the sink.
      Birdie is washing itself in the water basin.

(4) The two sentences are mostly equivalent, but some unimportant
    details differ.

      In May 2010, the troops attempted to invade Kabul.
      The US army invaded Kabul on May 7th last year, 2010.

(3) The two sentences are roughly equivalent, but some important
    information differs or is missing.

      John said he is considered a witness but not a suspect.
      "He is not a suspect anymore." John said.

(2) The two sentences are not equivalent, but share some details.

      They flew out of the nest in groups.
      They flew into the nest together.

(1) The two sentences are not equivalent, but are on the same topic.

      The woman is playing the violin.
      The young lady enjoys listening to the guitar.

(0) The two sentences are on different topics.

      John went horse back riding at dawn with a whole group of friends.
      Sunrise at dawn is a magnificent view to take in if you wake up
      early enough for it.

Format: the gold standard file consists of one single field per line:

- a number between 0 and 5

The gold standard for the test data will be assembled using
Mechanical Turk, gathering 5 scores per sentence pair. The gold
standard score will be the average of those 5 scores. In this trial
dataset, the scores are just dummy numbers which you can ignore.

Please check any of the STS.gs.*.txt files.
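For illustration only (this snippet is not part of the release), a
minimal Python sketch that loads an input file together with its gold
standard file; the helper name load_pairs and the UTF-8 encoding are
assumptions, not part of the task definition:

  # Sketch: read tab-separated sentence pairs and gold standard
  # scores into (sentence1, sentence2, score) triples.
  def load_pairs(input_path, gs_path):
      with open(input_path, encoding="utf-8") as f:
          pairs = [line.rstrip("\r\n").split("\t") for line in f if line.strip()]
      with open(gs_path, encoding="utf-8") as f:
          scores = [float(line.strip()) for line in f if line.strip()]
      assert len(pairs) == len(scores), "input and gold standard must align"
      return [(s1, s2, sc) for (s1, s2), sc in zip(pairs, scores)]

  if __name__ == "__main__":
      for s1, s2, sc in load_pairs("STS.input.SMT.txt", "STS.gs.SMT.txt"):
          print(sc, "|", s1, "|", s2)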
Answer format
-------------

The answer format is similar to the gold standard format, but
includes an optional confidence score. Each line has two fields
separated by a tab:

- a number between 0 and 5 (the similarity score)
- a number between 0 and 100 (the confidence score)

The use of confidence scores is experimental, and it is not required
for the official score.

Please check the sample STS.output.*.txt files, which always return
2.5 with confidence 100.

The output file needs to conform to the above specifications. Files
which do not follow them will be automatically removed from the
evaluation. Please check that your answer files are in the correct
format using the following script:

  $ ./correct-output.pl STS.output.SMT.txt
  Output file is OK!

In addition to printing errors and a final message on standard
error, the script returns 0 if the file is correct, and another
value if it is not.


Scoring
-------

The official score is based on the average of the Pearson
correlations across datasets. The use of confidence scores is
experimental, and it is not required for the official score.

For instance, the following script returns the correlation for a
single dataset:

  $ ./correlation.pl STS.gs.SMT.txt STS.output.SMT.txt
  Pearson: -0.30979

The following script returns the weighted average over all datasets
in the current directory, where the weights depend on the number of
pairs in each dataset:

  $ ./correlation-all.pl .
  STS.output.headlines.txt
  Pearson: -0.30979
  STS.output.OnWN.txt
  Pearson: -0.30979
  STS.output.FNWN.txt
  Pearson: -0.30979
  STS.output.SMT.txt
  Pearson: -0.30979

  Mean: -0.30979
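For illustration only, the following Python sketch mirrors what the
evaluation computes, assuming a plain Pearson correlation per dataset
and a pair-count-weighted mean over datasets; it is a sketch, not the
official scripts, and correlation.pl / correlation-all.pl remain the
reference implementation:

  import math

  def pearson(xs, ys):
      # Plain Pearson correlation between two equal-length score lists.
      n = len(xs)
      mx, my = sum(xs) / n, sum(ys) / n
      cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      vx = sum((x - mx) ** 2 for x in xs)
      vy = sum((y - my) ** 2 for y in ys)
      if vx == 0 or vy == 0:
          return float("nan")  # constant scores: correlation undefined
      return cov / math.sqrt(vx * vy)

  def load_scores(path):
      # Both gold standard and answer files carry the score in the
      # first tab-separated field.
      with open(path, encoding="utf-8") as f:
          return [float(line.split("\t")[0]) for line in f if line.strip()]

  total, weighted = 0, 0.0
  for name in ["headlines", "OnWN", "FNWN", "SMT"]:
      gs = load_scores("STS.gs.%s.txt" % name)
      out = load_scores("STS.output.%s.txt" % name)
      r = pearson(gs, out)
      print("STS.output.%s.txt\nPearson: %.5f" % (name, r))
      total += len(gs)
      weighted += r * len(gs)
  print("Mean: %.5f" % (weighted / total))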
Participation in the task
-------------------------

Participant teams will be allowed to submit three runs at most.

NOTE: Participant systems should NOT use the following datasets to
develop or train their systems:

- the data from the OntoNotes / WordNet mapping (or the FrameNet /
  WordNet mapping)


Other
-----

Please check http://ixa2.si.ehu.es/sts for more details.

We recommend that potential participants join the task mailing list:

  http://groups.google.com/group/STS-semeval


Authors
-------

Eneko Agirre
Daniel Cer
Mona Diab
Aitor Gonzalez-Agirre
Weiwei Guo
German Rigau


Acknowledgements
----------------

The WordNet-FrameNet mappings are funded by NSF 11-536 CRI planning
award for LexLink (Christiane Fellbaum, Collin Baker, Martha Palmer
and Orin Hargraves) and by NSF award CRI:CI-ADDO-EN 0855271
(Christiane Fellbaum and Collin Baker). We are grateful for their
sharing of the data.

We are grateful to the OntoNotes team for sharing the OntoNotes to
WordNet mappings (Hovy et al. 2006).

We thank DARPA and LDC for providing the SMT data.


References
----------

Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and
Ralph Weischedel. 2006. OntoNotes: The 90% Solution. In Proceedings
of the Human Language Technology Conference of the North American
Chapter of the ACL.

Markus Dreyer and Daniel Marcu. 2012. HyTER: Meaning-Equivalent
Semantics for Translation Evaluation. In Proceedings of the 2012
Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies.