STS benchmark reproducibility

From stswiki

All systems run by the organizers used canonical pre-trained models made available by the originator of each method (see Source section below), with the exception of PV-DBOW, which uses the model from Lau and Baldwin (2016), and InferSent, which was reported independently. When multiple pre-trained models are available for a method, we report results for the one with the best dev set performance. For each method, input sentences are preprocessed to closely match the tokenization of the pre-trained models (see Details section below). Default inference hyperparameters are used unless noted otherwise. The averaged word embedding baselines compute a sentence embedding by averaging word embeddings and then use cosine similarity to compute pairwise sentence similarity scores.
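The averaged word embedding baseline can be sketched as follows. The toy vectors here are illustrative stand-ins, not values from any of the pre-trained models listed below:

```python
import numpy as np

# Toy word vectors standing in for a pre-trained embedding model
# (illustrative values only, not from any real model).
embeddings = {
    "a":    np.array([0.1, 0.3, 0.2]),
    "dog":  np.array([0.9, 0.1, 0.4]),
    "cat":  np.array([0.8, 0.2, 0.5]),
    "runs": np.array([0.2, 0.7, 0.1]),
}

def sentence_embedding(tokens):
    """Average the vectors of the in-vocabulary tokens."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    """Cosine similarity between two sentence embeddings."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

s1 = sentence_embedding("a dog runs".split())
s2 = sentence_embedding("a cat runs".split())
print(cosine(s1, s2))
```

The pairwise similarity score for a sentence pair is just the cosine between the two averaged vectors; no training is involved beyond the word embeddings themselves.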

Source
sent2vec: trained model sent2vec_twitter_unigrams;

SIF: Wikipedia-trained word frequencies enwiki_vocab_min200.txt, embeddings from lexvec.commoncrawl.300d.W+C.pos.vectors, first 15 principal components removed, α = 0.001; dev experiments varied α, the number of principal components removed, and whether GloVe, LexVec, or Word2Vec word embeddings were used;
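The SIF configuration above can be sketched as follows: each word vector is weighted by α/(α + p(w)) before averaging, and the top principal component(s) of the resulting sentence matrix are removed. The random vectors and unigram probabilities below are toy stand-ins for the LexVec embeddings and the enwiki frequency file, and `sif_embed` is a hypothetical helper, not the reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy vocabulary: random 5-d vectors and made-up unigram probabilities.
vocab = ["the", "dog", "cat", "runs", "sleeps"]
word_vecs = {w: rng.standard_normal(5) for w in vocab}
word_prob = {"the": 0.05, "dog": 1e-4, "cat": 1e-4, "runs": 5e-5, "sleeps": 5e-5}

def sif_embed(sentences, alpha=1e-3, n_pc=1):
    """Weighted average a/(a + p(w)) per word, then removal of the top
    principal component(s) of the sentence-embedding matrix."""
    X = np.array([
        np.mean([word_vecs[t] * (alpha / (alpha + word_prob[t]))
                 for t in s if t in word_vecs], axis=0)
        for s in sentences
    ])
    # Principal components via SVD of the (uncentered) sentence matrix.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    pc = Vt[:n_pc]                       # (n_pc, d)
    return X - X @ pc.T @ pc             # remove projections onto the PCs

sents = [["the", "dog", "runs"], ["the", "cat", "runs"], ["the", "dog", "sleeps"]]
E = sif_embed(sents)
print(E.shape)  # (3, 5)
```

The frequent word "the" gets a weight of roughly α/p(w) ≈ 0.02, while the rare content words keep weights near 1, so the average is dominated by informative words even before the principal-component removal.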


PV-DBOW: AP-NEWS-trained model apnews_dbow.tgz;

LexVec: embeddings lexvec.commoncrawl.300d.W.pos.vectors.gz;

FastText: Wikipedia-trained embeddings from wiki.en.vec;

Paragram: ~wieting/, embeddings trained on PPDB and tuned to WS353 from Paragram-WS353;

GloVe: Wikipedia- and Gigaword-trained 300-dimensional embeddings;

Word2vec: p/word2vec/, Google News-trained embeddings from GoogleNews-vectors-negative300.bin.gz.

Details
sent2vec: results shown here are tokenized with tweetTokenize.py; contrasting dev experiments used the other tokenizer distributed with sent2vec;

LexVec: numbers were converted into words, all punctuation was removed, and the text was lowercased;
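A minimal sketch of this preprocessing, assuming digit-by-digit spelling as a toy stand-in for full number verbalization (the `preprocess` helper is hypothetical, not the pipeline actually used):

```python
import string

# Digit-to-word map; a real pipeline would verbalize whole numbers.
DIGITS = dict(zip("0123456789",
                  ["zero", "one", "two", "three", "four",
                   "five", "six", "seven", "eight", "nine"]))

def preprocess(text):
    """Lowercase, drop all punctuation, and spell out digits one by one."""
    out = []
    for ch in text.lower():
        if ch in DIGITS:
            out.append(" " + DIGITS[ch] + " ")   # isolate digits as words
        elif ch in string.punctuation:
            continue                              # drop punctuation entirely
        else:
            out.append(ch)
    return " ".join("".join(out).split())         # normalize whitespace

print(preprocess("Dr. Smith paid $45!"))  # dr smith paid four five
```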

FastText: to our knowledge, the tokenizer and preprocessing used for the pre-trained FastText embeddings are not publicly described. We use the following heuristics to preprocess and tokenize sentences for FastText: numbers are converted into words, text is lowercased, and finally prefixed, suffixed, and infixed punctuation is recursively removed from each token that does not match an entry in the model's lexicon;
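The recursive punctuation-stripping step can be sketched as below. The toy `lexicon` stands in for the pre-trained model's vocabulary, and the number-to-word and lowercasing steps are omitted for brevity:

```python
import string

# Toy lexicon standing in for the pre-trained model's vocabulary.
lexicon = {"don't", "stop", "believing", "mid-air"}

def strip_punct(token):
    """Recursively remove prefixed, suffixed, and infixed punctuation
    from a token until it matches the lexicon (or no punctuation remains)."""
    if token in lexicon or not token:
        return [token] if token else []
    if token[0] in string.punctuation:        # prefixed punctuation
        return strip_punct(token[1:])
    if token[-1] in string.punctuation:       # suffixed punctuation
        return strip_punct(token[:-1])
    for i, ch in enumerate(token):            # infixed punctuation
        if ch in string.punctuation:
            return strip_punct(token[:i]) + strip_punct(token[i + 1:])
    return [token]

print(strip_punct('"don\'t,'))  # ["don't"] -- in-lexicon, quote/comma stripped
print(strip_punct("mid-air!"))  # ['mid-air'] -- suffix stripped, hyphen kept
```

Because the lexicon check happens before any stripping, in-vocabulary tokens such as "don't" and "mid-air" keep their internal punctuation intact.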

Paragram: the Joshua (Post, 2015) pipeline to preprocess and tokenize English text;

C-PHRASE, GloVe, PV-DBOW & SIF: PTB tokenization provided by Stanford CoreNLP (Manning et al., 2014) with post-processing based on dev OOVs;

Word2vec: as with FastText, to our knowledge the preprocessing for the pre-trained Word2vec embeddings is not publicly described. We use the following heuristics for the Word2vec experiment: all numbers longer than a single digit are converted into '#' characters (e.g., 24 → ##), then prefixed, suffixed, and infixed punctuation is recursively removed from each token that does not match an entry in the model's lexicon.
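These two steps can be sketched as follows, with a toy `lexicon` standing in for the Word2vec model's vocabulary (the helper names are hypothetical):

```python
import string

# Toy stand-in for the pre-trained model's vocabulary.
lexicon = {"##", "said", "U.S."}

def mask_number(tok):
    """Multi-digit numbers become runs of '#': 24 -> ##, 2017 -> ####."""
    return "#" * len(tok) if tok.isdigit() and len(tok) > 1 else tok

def strip_oov_punct(tok):
    """Recursively strip prefix/suffix/infix punctuation from tokens
    that are not in the model's lexicon."""
    if tok in lexicon or not tok:
        return [tok] if tok else []
    if tok[0] in string.punctuation:
        return strip_oov_punct(tok[1:])
    if tok[-1] in string.punctuation:
        return strip_oov_punct(tok[:-1])
    for i, ch in enumerate(tok):
        if ch in string.punctuation:
            return strip_oov_punct(tok[:i]) + strip_oov_punct(tok[i + 1:])
    return [tok]

tokens = ["U.S.", "said,", "24"]
masked = [mask_number(t) for t in tokens]              # number masking first
out = [p for t in masked for p in strip_oov_punct(t)]  # then OOV stripping
print(out)  # ['U.S.', 'said', '##']
```

Note that "U.S." survives untouched because it matches the lexicon before any stripping, while the trailing comma on "said," is removed only because the raw token is out of vocabulary.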