STS benchmark reproducibility

All systems run by the organizers used canonical pre-trained models made available by the originator of each method (see the Source section below), with the exception of PV-DBOW, which uses the model from Lau and Baldwin (2016), and InferSent, which was reported independently. When multiple pre-trained models are available for a method, we report results for the one with the best dev set performance. For each method, input sentences are preprocessed to closely match the tokenization of the pre-trained models (see the Details section below). Default inference hyperparameters are used unless noted otherwise. The averaged word embedding baselines compute a sentence embedding by averaging word embeddings and then use cosine similarity to compute pairwise sentence similarity scores.
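
For concreteness, the following is a minimal sketch of the averaged word embedding baseline just described. It assumes a plain Python dict `embeddings` mapping tokens to NumPy vectors; the dict, the `dim` default, and the tokenization are placeholders, not the organizers' exact scripts.

  import numpy as np

  def average_embedding(tokens, embeddings, dim=300):
      # Average the vectors of all in-vocabulary tokens; zero vector if all are OOV.
      vecs = [embeddings[t] for t in tokens if t in embeddings]
      return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

  def cosine(u, v):
      # Cosine similarity, used as the pairwise sentence similarity score.
      denom = np.linalg.norm(u) * np.linalg.norm(v)
      return float(np.dot(u, v) / denom) if denom else 0.0

  # score = cosine(average_embedding(tokens_a, embeddings),
  #                average_embedding(tokens_b, embeddings))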

Source

sent2vec: https://github.com/epfml/sent2vec, trained model sent2vec_twitter_unigrams;

SIF: https://github.com/epfml/sent2vec Wikipedia trained word frequencies enwiki_vocab_min200.txt, https://github.com/alexandres/lexvec embeddings from lexvec.commoncrawl.300d.W+C.pos.vectors, first 15 principal components removed, α = 0.001; dev experiments varied α, the number of principal components removed, and whether GloVe, LexVec, or Word2vec word embeddings were used (see the sketch at the end of this section);

C-PHRASE: http://clic.cimec.unitn.it/composes/cphrase-vectors.html;

PV-DBOW: https://github.com/jhlau/doc2vec, AP-NEWS trained apnews_dbow.tgz;

LexVec: https://github.com/alexandres/lexvec, embeddings lexvec.commoncrawl.300d.W.pos.vectors.gz;

FastText: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md, Wikipedia trained embeddings from wiki.en.vec;

Paragram: http://ttic.uchicago.edu/~wieting/, embeddings trained on PPDB and tuned to WS353 from Paragram-WS353;

GloVe: https://nlp.stanford.edu/projects/glove/, Wikipedia and Gigaword trained 300 dim. embeddings from glove.6B.zip;

Word2vec: https://code.google.com/archive/p/word2vec/, Google News trained embeddings from GoogleNews-vectors-negative300.bin.gz.
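
As referenced in the SIF entry above, the following is a rough sketch of the SIF recipe (weighted averaging with α/(α + p(w)) followed by removal of the top principal components), not the reference SIF implementation. Here `embeddings` is a token-to-vector dict (e.g., built from the LexVec vectors) and `word_probs` is a token-to-unigram-probability dict (e.g., derived from the Wikipedia word frequency file); both are placeholders.

  import numpy as np

  def sif_embeddings(sentences, embeddings, word_probs, alpha=1e-3, n_components=15):
      # Weighted average of word vectors, weighting each word by alpha / (alpha + p(w)).
      dim = len(next(iter(embeddings.values())))
      X = np.zeros((len(sentences), dim))
      for i, tokens in enumerate(sentences):
          vecs = [embeddings[t] for t in tokens if t in embeddings]
          wts = [alpha / (alpha + word_probs.get(t, 0.0)) for t in tokens if t in embeddings]
          if vecs:
              X[i] = np.average(np.vstack(vecs), axis=0, weights=wts)
      # Remove the projection onto the first n_components principal directions of the
      # sentence-embedding matrix (15 components and alpha = 0.001 in the setting above).
      _, _, Vt = np.linalg.svd(X, full_matrices=False)
      pc = Vt[:n_components]
      return X - X @ pc.T @ pc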


Details

sent2vec: results shown here are tokenized by tweetTokenize.py; contrasting dev experiments used wikiTokenize.py, both distributed with sent2vec;

LexVec: numbers are converted into words, all punctuation is removed, and the text is lowercased;

FastText: since, to our knowledge, the tokenizer and preprocessing used for the pre-trained FastText embeddings are not publicly described, we use the following heuristics to preprocess and tokenize sentences for FastText: numbers are converted into words, text is lowercased, and finally prefixed, suffixed, and infixed punctuation is recursively removed from each token that does not match an entry in the model’s lexicon (see the sketch after this list);

Paragram: the Joshua (Matt Post, 2015) pipeline is used to preprocess and tokenize English text;

C-PHRASE, GloVe, PV-DBOW & SIF: PTB tokenization provided by Stanford CoreNLP (Manning et al., 2014) with post-processing based on dev OOVs;

Word2vec: similar to FastText, to our knowledge, the preprocessing for the pre-trained Word2vec embeddings is not publicly described. We use the following heuristics for the Word2vec experiment: each digit of a number longer than a single digit is converted into a ‘#’ (e.g., 24 → ##), then prefixed, suffixed, and infixed punctuation is recursively removed from each token that does not match an entry in the model’s lexicon.
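
The punctuation and digit heuristics described for FastText and Word2vec above might look roughly like the sketch below. The function names, the punctuation set, and the handling of tokens that remain out of vocabulary are assumptions, not the organizers' actual code.

  import string

  PUNCT = set(string.punctuation)

  def strip_punct_to_lexicon(token, lexicon):
      # Recursively remove prefixed, suffixed, and infixed punctuation until the
      # token matches an entry in the model's lexicon (or no punctuation remains).
      if not token or token in lexicon:
          return token
      if token[0] in PUNCT:                      # prefixed punctuation
          return strip_punct_to_lexicon(token[1:], lexicon)
      if token[-1] in PUNCT:                     # suffixed punctuation
          return strip_punct_to_lexicon(token[:-1], lexicon)
      for i, ch in enumerate(token):             # infixed punctuation
          if ch in PUNCT:
              return strip_punct_to_lexicon(token[:i] + token[i + 1:], lexicon)
      return token                               # OOV but punctuation-free: keep as-is

  def mask_digits(token):
      # Word2vec heuristic: numbers longer than a single digit become one '#'
      # per digit, e.g. 24 -> ##.
      return '#' * len(token) if token.isdigit() and len(token) > 1 else token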