STS benchmark reproducibility
All systems run by the organizers used canonical pre-trained models made available by the originator of each method (see Source section below), with the exception of PV-DBOW, which uses the model from Lau and Baldwin (2016), and InferSent, which was reported independently. When multiple pre-trained models are available for a method, we report results for the one with the best dev set performance. For each method, input sentences are preprocessed to closely match the tokenization of the pre-trained models (see Details section below). Default inference hyperparameters are used unless noted otherwise. The averaged word embedding baselines compute a sentence embedding by averaging word embeddings and then use cosine similarity to compute pairwise sentence similarity scores.
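The averaged word embedding baseline can be illustrated with a minimal sketch. The function names, the toy two-dimensional vectors, the choice to skip out-of-vocabulary tokens, and the score of 0 for sentences with no known tokens are all assumptions made for illustration; they are not taken from the organizers' code.

```python
import math

def average_embedding(tokens, vectors):
    """Average the vectors of in-vocabulary tokens; return None if none are known."""
    # Assumption: out-of-vocabulary tokens are simply skipped.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def sentence_similarity(tokens_a, tokens_b, vectors):
    """Score a sentence pair by the cosine of their averaged word embeddings."""
    emb_a = average_embedding(tokens_a, vectors)
    emb_b = average_embedding(tokens_b, vectors)
    if emb_a is None or emb_b is None:
        return 0.0  # assumption: pairs with an empty embedding score 0
    return cosine(emb_a, emb_b)

# Hypothetical toy vectors; a real run would load a pre-trained embedding file.
toy_vectors = {
    "a": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "dog": [0.0, 1.0],
}
```

In practice `vectors` would be a lookup built from one of the pre-trained embedding files, and the token lists would come from the method-specific preprocessing.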
sent2vec: https://github.com/epfml/sent2vec, trained model sent2vec_twitter_unigrams;
SIF: https://github.com/epfml/sent2vec, Wikipedia trained word frequencies from enwiki_vocab_min200.txt, embeddings from https://github.com/alexandres/lexvec (lexvec.commoncrawl.300d.W+C.pos.vectors), first 15 principal components removed, α = 0.001; dev experiments varied α, the number of principal components removed, and whether GloVe, LexVec, or Word2Vec word embeddings were used;
PV-DBOW: https://github.com/jhlau/doc2vec, AP-NEWS trained apnews_dbow.tgz;
LexVec: https://github.com/alexandres/lexvec, embeddings lexvec.commoncrawl.300d.W.pos.vectors.gz;
FastText: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md, Wikipedia trained embeddings from wiki.en.vec;
Paragram: http://ttic.uchicago.edu/~wieting/, embeddings trained on PPDB and tuned to WS353 from Paragram-WS353;
GloVe: https://nlp.stanford.edu/projects/glove/, Wikipedia and Gigaword trained 300 dim. embeddings from glove.6B.zip;
Word2vec: https://code.google.com/archive/p/word2vec/, Google News trained embeddings from GoogleNews-vectors-negative300.bin.gz.
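The SIF settings listed above (α and the number of principal components removed) can be made concrete with a short sketch. It assumes the commonly used formulation of smooth inverse frequency weighting, in which each word vector is weighted by α / (α + p(w)) and the projections onto the top principal directions of the sentence embedding matrix are then subtracted. The function name, the toy data, and the use of an uncentered SVD are assumptions for illustration, not the organizers' implementation.

```python
import numpy as np

def sif_embeddings(sentences, vectors, word_prob, alpha=0.001, n_components=15):
    """Weighted-average sentence embeddings with top principal directions removed."""
    dim = len(next(iter(vectors.values())))
    emb = np.zeros((len(sentences), dim))
    for i, tokens in enumerate(sentences):
        # Assumption: tokens lacking a vector or a frequency estimate are skipped.
        known = [t for t in tokens if t in vectors and t in word_prob]
        if not known:
            continue
        weights = np.array([alpha / (alpha + word_prob[t]) for t in known])
        vecs = np.array([vectors[t] for t in known])
        emb[i] = weights @ vecs / len(known)
    # Cannot remove more directions than the matrix has rows or columns.
    k = min(n_components, *emb.shape)
    if k > 0:
        # Rows of vt are the principal directions (uncentered, an assumption).
        _, _, vt = np.linalg.svd(emb, full_matrices=False)
        pcs = vt[:k]
        emb = emb - emb @ pcs.T @ pcs
    return emb

# Hypothetical toy inputs; real runs would use pre-trained vectors and
# unigram probabilities estimated from a word frequency file.
toy_vectors = {
    "the": [1.0, 0.2, 0.0],
    "cat": [0.1, 1.0, 0.3],
    "dog": [0.2, 0.9, 0.4],
    "runs": [0.0, 0.3, 1.0],
}
toy_prob = {"the": 0.05, "cat": 0.001, "dog": 0.001, "runs": 0.002}
toy_sentences = [["the", "cat"], ["the", "dog", "runs"], ["cat", "runs"], ["dog"]]
```

With only a handful of sentences, removing 15 components would remove everything, so a toy run would pass a small value such as `n_components=1`. Pairwise similarity would then be the cosine of the resulting rows.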
sent2vec: results shown here use text tokenized by tweetTokenize.py; contrasting dev experiments used wikiTokenize.py, both distributed with sent2vec.
LexVec: numbers were converted into words, all punctuation was removed, and text was lowercased;
FastText: To our knowledge, the tokenizer and preprocessing used for the pre-trained FastText embeddings are not publicly described. We use the following heuristics to preprocess and tokenize sentences for FastText: numbers are converted into words, text is lowercased, and finally prefixed, suffixed, and infixed punctuation is recursively removed from each token that does not match an entry in the model’s lexicon;
Paragram: the Joshua (Matt Post, 2015) pipeline was used to pre-process and tokenize English text;
C-PHRASE, GloVe, PV-DBOW & SIF: PTB tokenization provided by Stanford CoreNLP (Manning et al., 2014) with post-processing based on dev OOVs;
Word2vec: As with FastText, to our knowledge the preprocessing for the pre-trained Word2vec embeddings is not publicly described. We use the following heuristics for the Word2vec experiment: all numbers longer than a single digit have their digits replaced by ‘#’ (e.g., 24 → ##); then prefixed, suffixed, and infixed punctuation is recursively removed from each token that does not match an entry in the model’s lexicon.
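The digit masking and recursive punctuation removal heuristics can be sketched as follows. This is one possible reading of the description, not the organizers' code: the function names, whitespace tokenization, splitting a token at infixed punctuation (rather than joining the pieces), and excluding ‘#’ from the punctuation set so that masked numbers survive are all assumptions.

```python
import re
import string

# Assumption: '#' is kept out of the punctuation set so masked digits are not stripped.
PUNCT = set(string.punctuation) - {"#"}

def mask_digits(token):
    """Replace every run of two or more digits with one '#' per digit."""
    return re.sub(r"\d{2,}", lambda m: "#" * len(m.group()), token)

def strip_punctuation(token, lexicon):
    """Recursively remove punctuation from a token that is not in the lexicon."""
    if not token:
        return []
    if token in lexicon:
        return [token]
    if token[0] in PUNCT:  # prefixed punctuation
        return strip_punctuation(token[1:], lexicon)
    if token[-1] in PUNCT:  # suffixed punctuation
        return strip_punctuation(token[:-1], lexicon)
    for i, ch in enumerate(token):
        if ch in PUNCT:  # infixed punctuation: split and recurse (an assumption)
            left = strip_punctuation(token[:i], lexicon)
            right = strip_punctuation(token[i + 1:], lexicon)
            return left + right
    return [token]  # out-of-vocabulary token with no punctuation left to remove

def preprocess_word2vec(sentence, lexicon):
    """Whitespace-split a sentence, mask digits, then strip punctuation per token."""
    tokens = []
    for raw in sentence.split():
        tokens.extend(strip_punctuation(mask_digits(raw), lexicon))
    return tokens

# Hypothetical lexicon standing in for the vocabulary of a pre-trained model.
toy_lexicon = {"U.S.", "##", "well", "known", "cat"}
```

Because the lexicon check happens before any stripping, tokens such as `U.S.` keep their internal punctuation when the model has an entry for them. The FastText variant described above would lowercase the text and convert numbers into words instead of masking digits.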