STSbenchmark
STS Benchmark dataset and companion datasets
The STS Benchmark comprises a selection of the English datasets used in the STS tasks organized in the context of SemEval between 2012 and 2017. The selection includes text from image captions, news headlines and user forums.
In order to provide a standard benchmark for comparing meaning representation systems in future years, we organized the data into train, development and test sets. The development set can be used to develop and tune the hyperparameters of a system, and the test set should be used only once, for the final evaluation. Some authors (e.g. Arora et al. 2017; Mu et al. 2017; Wieting et al. 2016) report results across different years, with a mixture of genres and training conditions. The STS Benchmark instead provides a standard setup for training, development and testing on three selected genres (news, captions, forums).
See the SemEval 2017 task paper (Section 8 here).
Download here.
Also included in SentEval.
The benchmark comprises 8628 sentence pairs. This is the breakdown according to genres and train-dev-test splits:
| genre | train | dev | test | total |
|---|---|---|---|---|
| news | 3299 | 500 | 500 | 4299 |
| caption | 2000 | 625 | 625 | 3250 |
| forum | 450 | 375 | 254 | 1079 |
| total | 5749 | 1500 | 1379 | 8628 |
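The splits can be read with a few lines of Python. Below is a minimal sketch of a loader, assuming the file names and tab-separated column layout of the distribution (genre, file, year, id, score, sentence1, sentence2); check the README of your copy, as field counts can vary across lines.

```python
import csv

def load_sts(path):
    """Load one STS Benchmark split from a tab-separated file.

    Assumes the column layout of the distribution:
    genre, file, year, id, score, sentence1, sentence2.
    """
    pairs = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) < 7:
                continue  # skip malformed or truncated lines
            genre, fname, year, _, score, s1, s2 = row[:7]
            pairs.append({
                "genre": genre,
                "file": fname,
                "year": year,
                "score": float(score),  # gold similarity in [0, 5]
                "sentence1": s1,
                "sentence2": s2,
            })
    return pairs

train = load_sts("sts-train.csv")
dev = load_sts("sts-dev.csv")
test = load_sts("sts-test.csv")
print(len(train), len(dev), len(test))  # expected: 5749 1500 1379
```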
For reference, this is the breakdown according to the original names and task years of the datasets:
| genre | file | years | train | dev | test |
|---|---|---|---|---|---|
| news | MSRpar | 2012 | 1000 | 250 | 250 |
| news | headlines | 2013-16 | 1999 | 250 | 250 |
| news | deft-news | 2014 | 300 | 0 | 0 |
| captions | MSRvid | 2012 | 1000 | 250 | 250 |
| captions | images | 2014-15 | 1000 | 250 | 250 |
| captions | track5.en-en | 2017 | 0 | 125 | 125 |
| forum | deft-forum | 2014 | 450 | 0 | 0 |
| forum | answers-forums | 2015 | 0 | 375 | 0 |
| forum | answer-answer | 2016 | 0 | 0 | 254 |
Results
We report systems in two tables: the first covers systems that use neural representation models alone; the second covers feature-engineered and mixed systems.
For each system we further detail two traits:
- Sentence representation model used in the system:
- Independent: systems that are solely based on a pair of sentence representations computed independently of one another (a minimal example is sketched after this list)
- Other: systems that also use interactions between sentences (e.g. alignments, attention or other features like word overlap)
- Amount of supervision used by the system:
- Unsupervised: systems that do not use any STS train or development data (can include transfer learning, or resources like WordNet or PPDB)
- Dev: systems that only use the STS benchmark development data (weakly supervised)
- Train: systems that only use the STS benchmark training and development data (fully supervised)
- Unconstrained: systems that, in addition to the STS Benchmark, use other STS training data such as SICK or SemEval 2015 Task 1
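To make the Independent/Unsupervised category concrete, here is a minimal sketch of such a baseline (not one of the systems listed below): each sentence is embedded on its own by averaging pre-trained word vectors, and the pair is scored by the cosine of the two embeddings. The embeddings file format and the simple whitespace tokenizer are illustrative assumptions.

```python
import numpy as np

def load_word_vectors(path):
    """Load whitespace-separated word vectors: word dim1 dim2 ..."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed(sentence, vectors):
    """Independent sentence representation: mean of known word vectors."""
    words = [vectors[w] for w in sentence.lower().split() if w in vectors]
    return np.mean(words, axis=0) if words else None

def similarity(s1, s2, vectors):
    """Score a pair by the cosine of two independently computed embeddings."""
    a, b = embed(s1, vectors), embed(s2, vectors)
    if a is None or b is None:
        return 0.0
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```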
| Sentence Representation | Supervision | Paper | Comments | Dev | Test |
|---|---|---|---|---|---|
| Independent | Unsupervised | Pennington et al. 14 | GloVe ¹ | 52.4 | 40.6 |
| | | Joulin et al. 16 | fastText ¹ | 65.3 | 53.6 |
| | | Salle et al. 16 | LexVec ¹ | 68.9 | 55.8 |
| | | Mikolov et al. 13 | word2vec skip-gram ¹ | 70.0 | 56.5 |
| | | Duma and Menzel 17 | Doc2Vec (SEF@UHH ³) | 61.6 | 59.2 |
| | | Pham et al. 15 | C-PHRASE ¹ | 74.3 | 63.9 |
| | | Le and Mikolov 14; Lau and Baldwin 16 | PV-DBOW paragraph vectors (Doc2Vec DBOW) ¹ | 72.2 | 64.9 |
| | | Wieting et al. 16b | Charagram (uses PPDB) ² | 76.8 | 71.6 |
| | | Wieting et al. 16a | Paragram-Phrase (uses PPDB) ² | 73.9 | 73.2 |
| | | Conneau et al. 17 | InferSent (bi-LSTM trained on SNLI) ² | 80.1 | 75.8 |
| | | Pagliardini et al. 17 | Sent2vec ¹ | 78.7 | 75.5 |
| | | Yang et al. 18 | Conversation response prediction + SNLI | 81.4 | 78.2 |
| | | Ethayarajh et al. 18 | Unsupervised SIF on ParaNMT vectors ² | 84.2 | 79.5 |
| | Dev | Arora et al. 17 | SIF on GloVe vectors ¹ | 80.1 | 72.0 |
| | | Wieting et al. 17 | GRAN (uses SimpWiki) ² | 81.8 | 76.4 |
| | Train | Tai et al. 15 | LSTM ¹ | 75.0 | 70.5 |
| | | Tai et al. 15 | BiLSTM ¹ | 76.0 | 71.1 |
| | | Tai et al. 15 | Dependency Tree-LSTM ¹ | 76.0 | 71.2 |
| | | Tai et al. 15 | Constituency Tree-LSTM ¹ | 77.0 | 71.9 |
| | | Yang et al. 18 | Conversation response prediction + SNLI | 83.5 | 80.8 |
| Other | Train | Yang 17 | CNN (HCTI ³) | 83.4 | 78.4 |
| Sentence Representation | Supervision | Paper | Comments | Dev | Test |
|---|---|---|---|---|---|
| Other | Train | Al-Natsheh et al. 17 | mixed ensemble (UDL ³) | 79.0 | 72.4 |
| | | Maharjan et al. 17 | mixed ensemble (DT_TEAM ³) | 83.0 | 79.2 |
| | | Wu et al. 17 | WordNet+Alignment+Embeddings (BIT ³) | 82.9 | 80.9 |
| | | Tian et al. 17 ² | mixed ensemble (ECNU ³) | 84.7 | 81.0 |
Notes:
¹ Software trained and tested by us (see details)
² Results reported by personal communication
³ SemEval 2017 participant team
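The Dev and Test columns in both tables are Pearson correlation with the gold scores, scaled by 100 (see Section 8 of the task paper for evaluation details). A minimal scoring helper along those lines, reusing the loader and the averaged-vector baseline sketched above; `scipy` is assumed to be installed.

```python
from scipy.stats import pearsonr

def evaluate(pairs, predict):
    """Pearson r x 100 between gold scores and system predictions."""
    gold = [p["score"] for p in pairs]
    pred = [predict(p["sentence1"], p["sentence2"]) for p in pairs]
    return 100.0 * pearsonr(gold, pred)[0]

# Example with the averaged-vector baseline sketched earlier:
# vectors = load_word_vectors("glove.txt")  # illustrative path
# print(evaluate(dev, lambda s1, s2: similarity(s1, s2, vectors)))
```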
Companion
The companion datasets to the STS Benchmark comprise the rest of the English datasets used in the STS tasks organized by us in the context of SemEval between 2012 and 2017.
We collated two datasets: one with sentence pairs related to machine translation evaluation, and another with the remaining datasets, which can be used for domain adaptation studies.
Download here.
For reference, this is the breakdown according to the original names and task years of the datasets:
MT-related datasets: sts-mt.csv
| file | years | pairs |
|---|---|---|
| SMTnews | 2012 | 399 |
| SMTeuroparl | 2012 | 1293 |
| postediting | 2016 | 244 |
Other datasets: sts-other.csv
| file | years | pairs |
|---|---|---|
| OnWN | 2012 | 750 |
| OnWN | 2013 | 561 |
| OnWN | 2014 | 750 |
| FNWN | 2013 | 189 |
| tweet-news | 2014 | 750 |
| belief | 2015 | 375 |
| plagiarism | 2016 | 230 |
| question-question | 2016 | 209 |
Note: the 2013 SMT dataset is available through the LDC only.
Reference
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia (2017). SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017).