STSbenchmark
STS Benchmark dataset and companion datasets
The STS Benchmark comprises a selection of the English datasets used in the STS tasks organized in the context of SemEval between 2012 and 2017. The selection includes text from image captions, news headlines and user forums.
In order to provide a standard benchmark for comparing meaning representation systems in future years, we organized the data into train, development and test sets. The development set can be used to develop and tune the hyperparameters of a system, and the test set should be used only once, for the final evaluation. Some authors (e.g. Arora et al. 2017; Mu et al. 2017; Wieting et al. 2016) report results across different years, with a mixture of genres and training conditions. The STS Benchmark instead provides a standard setup for training, development and testing on three selected genres (news, captions, forums).
See the SemEval 2017 task paper (Section 8 here).
Download here.
Also included in SentEval.
The benchmark comprises 8628 sentence pairs. This is the breakdown according to genres and train-dev-test splits:
| genre | train | dev | test | total |
|---|---|---|---|---|
| news | 3299 | 500 | 500 | 4299 |
| caption | 2000 | 625 | 625 | 3250 |
| forum | 450 | 375 | 254 | 1079 |
| total | 5749 | 1500 | 1379 | 8628 |
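The splits can be read with a few lines of Python. Below is a minimal sketch of a loader, assuming the file names and tab-separated column layout of the distribution (genre, file, year, id, score, sentence1, sentence2); check the README of your copy, as field counts can vary across lines.

```python
import csv

def load_sts(path):
    """Load one STS Benchmark split from a tab-separated file.

    Assumes the column layout of the distribution:
    genre, file, year, id, score, sentence1, sentence2.
    """
    pairs = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) < 7:
                continue  # skip malformed or truncated lines
            genre, fname, year, _, score, s1, s2 = row[:7]
            pairs.append({
                "genre": genre,
                "file": fname,
                "year": year,
                "score": float(score),  # gold similarity in [0, 5]
                "sentence1": s1,
                "sentence2": s2,
            })
    return pairs

train = load_sts("sts-train.csv")
dev = load_sts("sts-dev.csv")
test = load_sts("sts-test.csv")
print(len(train), len(dev), len(test))  # expected: 5749 1500 1379
```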
For reference, this is the breakdown according to the original names and task years of the datasets:
| genre | file | years | train | dev | test |
|---|---|---|---|---|---|
| news | MSRpar | 2012 | 1000 | 250 | 250 |
| news | headlines | 2013-16 | 1999 | 250 | 250 |
| news | deft-news | 2014 | 300 | 0 | 0 |
| captions | MSRvid | 2012 | 1000 | 250 | 250 |
| captions | images | 2014-15 | 1000 | 250 | 250 |
| captions | track5.en-en | 2017 | 0 | 125 | 125 |
| forum | deft-forum | 2014 | 450 | 0 | 0 |
| forum | answers-forums | 2015 | 0 | 375 | 0 |
| forum | answer-answer | 2016 | 0 | 0 | 254 |
Results
We report systems in two tables: the first covers systems that use neural representation models alone; the second covers feature-engineered and mixed systems.
For each system we further detail two traits:
- Sentence representation model used in the system:
- Independent: systems that are solely based on a pair of sentence representations computed independently of one another (a minimal example is sketched after this list)
- Other: systems that also use interactions between sentences (e.g. alignments, attention or other features like word overlap)
- Amount of supervision used by the system:
- Unsupervised: systems that do not use any STS train or development data (can include transfer learning, or resources like WordNet or PPDB)
- Dev: systems that only use the STS benchmark development data (weakly supervised)
- Train: systems that only use the STS benchmark training and development data (fully supervised)
- Unconstrained: systems that, in addition to the STS Benchmark, use other STS training data such as SICK or SemEval 2015 Task 1
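To make the Independent/Unsupervised category concrete, here is a minimal sketch of such a baseline (not one of the systems listed below): each sentence is embedded on its own by averaging pre-trained word vectors, and the pair is scored by the cosine of the two embeddings. The embeddings file format and the simple whitespace tokenizer are illustrative assumptions.

```python
import numpy as np

def load_word_vectors(path):
    """Load whitespace-separated word vectors: word dim1 dim2 ..."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed(sentence, vectors):
    """Independent sentence representation: mean of known word vectors."""
    words = [vectors[w] for w in sentence.lower().split() if w in vectors]
    return np.mean(words, axis=0) if words else None

def similarity(s1, s2, vectors):
    """Score a pair by the cosine of two independently computed embeddings."""
    a, b = embed(s1, vectors), embed(s2, vectors)
    if a is None or b is None:
        return 0.0
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```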
| Sentence Representation | Supervision | Paper | Comments | Dev | Test |
|---|---|---|---|---|---|
| Independent | Unsupervised | Pennington et al. 14 | GloVe ¹ | 52.4 | 40.6 |
| | | Joulin et al. 16 | fastText ¹ | 65.3 | 53.6 |
| | | Salle et al. 16 | LexVec ¹ | 68.9 | 55.8 |
| | | Mikolov et al. 13 | word2vec skip-gram ¹ | 70.0 | 56.5 |
| | | Duma and Menzel 17 | Doc2Vec (SEF@UHH ³) | 61.6 | 59.2 |
| | | Pham et al. 15 | C-PHRASE ¹ | 74.3 | 63.9 |
| | | Le and Mikolov 14; Lau and Baldwin 16 | PV-DBOW paragraph vectors (Doc2Vec DBOW) ¹ | 72.2 | 64.9 |
| | | Wieting et al. 16b | Charagram (uses PPDB) ² | 76.8 | 71.6 |
| | | Wieting et al. 16a | Paragram-Phrase (uses PPDB) ² | 73.9 | 73.2 |
| | | Conneau et al. 17 | InferSent (bi-LSTM trained on SNLI) ² | 80.1 | 75.8 |
| | | Pagliardini et al. 17 | Sent2vec ¹ | 78.7 | 75.5 |
| | | Yang et al. 18 | Conversation response prediction + SNLI | 81.4 | 78.2 |
| | | Ethayarajh et al. 18 | Unsupervised SIF on ParaNMT vectors ² | 84.2 | 79.5 |
| | Dev | Arora et al. 17 | SIF on GloVe vectors ¹ | 80.1 | 72.0 |
| | | Wieting et al. 17 | GRAN (uses SimpWiki) ² | 81.8 | 76.4 |
| | Train | Tai et al. 15 | LSTM ¹ | 75.0 | 70.5 |
| | | Tai et al. 15 | BiLSTM ¹ | 76.0 | 71.1 |
| | | Tai et al. 15 | Dependency Tree-LSTM ¹ | 76.0 | 71.2 |
| | | Tai et al. 15 | Constituency Tree-LSTM ¹ | 77.0 | 71.9 |
| | | Yang et al. 18 | Conversation response prediction + SNLI | 83.5 | 80.8 |
| Other | Train | Yang 17 | CNN (HCTI ³) | 83.4 | 78.4 |
| Sentence Representation | Supervision | Paper | Comments | Dev | Test |
|---|---|---|---|---|---|
| Other | Train | Al-Natsheh et al. 17 | mixed ensemble (UDL ³) | 79.0 | 72.4 |
| | | Maharjan et al. 17 | mixed ensemble (DT_TEAM ³) | 83.0 | 79.2 |
| | | Wu et al. 17 | WordNet+Alignment+Embeddings (BIT ³) | 82.9 | 80.9 |
| | | Tian et al. 17 ² | mixed ensemble (ECNU ³) | 84.7 | 81.0 |
Notes:
¹ Software trained and tested by us (see details)
² Results reported by personal communication
³ SemEval 2017 participant team
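The Dev and Test columns in both tables are Pearson correlation with the gold scores, scaled by 100 (see Section 8 of the task paper for evaluation details). A minimal scoring helper along those lines, reusing the loader and the averaged-vector baseline sketched above; `scipy` is assumed to be installed.

```python
from scipy.stats import pearsonr

def evaluate(pairs, predict):
    """Pearson r x 100 between gold scores and system predictions."""
    gold = [p["score"] for p in pairs]
    pred = [predict(p["sentence1"], p["sentence2"]) for p in pairs]
    return 100.0 * pearsonr(gold, pred)[0]

# Example with the averaged-vector baseline sketched earlier:
# vectors = load_word_vectors("glove.txt")  # illustrative path
# print(evaluate(dev, lambda s1, s2: similarity(s1, s2, vectors)))
```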
Companion
The companion datasets to the STS Benchmark comprise the rest of the English datasets used in the STS tasks organized by us in the context of SemEval between 2012 and 2017.
We collated two datasets: one with sentence pairs related to machine translation evaluation, and another with the remaining datasets, which can be used for domain adaptation studies.
Download here.
For reference, this is the breakdown according to the original names and task years of the datasets:
MT-related datasets: sts-mt.csv
| file | years | pairs |
|---|---|---|
| SMTnews | 2012 | 399 |
| SMTeuroparl | 2012 | 1293 |
| postediting | 2016 | 244 |
Other datasets: sts-other.csv
| file | years | pairs |
|---|---|---|
| OnWN | 2012 | 750 |
| OnWN | 2013 | 561 |
| OnWN | 2014 | 750 |
| FNWN | 2013 | 189 |
| tweet-news | 2014 | 750 |
| belief | 2015 | 375 |
| plagiarism | 2016 | 230 |
| question-question | 2016 | 209 |
Note: the 2013 SMT dataset is available through the LDC only.
Reference
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia (2017). SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017).