UKB: Graph Based Word Sense Disambiguation and Similarity
UKB is a collection of programs for performing graph-based Word Sense
Disambiguation (WSD) and lexical similarity/relatedness using a pre-existing
knowledge base. UKB applies random walks, e.g. Personalized PageRank, on the
Knowledge Base (KB) graph to rank the vertices according to the given context. It includes tools to produce graphs from KBs like WordNet.
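The idea of ranking KB vertices with Personalized PageRank can be illustrated with a minimal sketch. This is not UKB's own code or file format: the function name, the toy graph and the sense labels below are all hypothetical, and the sketch assumes an undirected, unweighted graph with the teleport probability spread uniformly over the context vertices.

```python
from collections import defaultdict


def personalized_pagerank(edges, context, damping=0.85, iterations=30):
    """Rank the vertices of an undirected graph by power iteration.

    The random walker restarts only at the context vertices, so vertices
    close to the context receive more probability mass.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    vertices = sorted(neighbours)

    # Keep only the context vertices that actually occur in the graph.
    context = [v for v in context if v in neighbours]
    if not context:
        raise ValueError("no context vertex is present in the graph")

    # Teleport vector: uniform over the context, zero elsewhere.
    teleport = {v: 0.0 for v in vertices}
    for v in context:
        teleport[v] += 1.0 / len(context)

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) * teleport[v] for v in vertices}
        for u in vertices:
            # Every vertex has at least one neighbour, so no mass is lost.
            share = damping * rank[u] / len(neighbours[u])
            for v in neighbours[u]:
                new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: two senses of "bank" and a few related concepts.
TOY_EDGES = [
    ("bank#finance", "money"), ("bank#finance", "loan"), ("money", "loan"),
    ("bank#river", "water"), ("bank#river", "shore"), ("water", "shore"),
    ("money", "water"),
]

scores = personalized_pagerank(TOY_EDGES, context=["money"])
best_sense = max(["bank#finance", "bank#river"], key=scores.get)
print(best_sense)
```

In this sketch, disambiguating a word amounts to picking the candidate sense with the highest rank once the walk has been biased towards the context.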
UKB has been evaluated in several tasks:
- WSD using WordNet (English), out-of-the-box state-of-the-art results [17,8,1].
- WSD on specific domains 
- WSD on several languages (Basque, Bulgarian, Portuguese, Spanish) 
- WSD on the medical domain, using the UMLS meta-thesaurus [5,7,9]
- Word similarity using WordNet or Wikipedia graphs, state-of-the-art results [3,4,10]
- Named-Entity Disambiguation using the Wikipedia graph [10,13]
- Improvements on Information Retrieval [6,9,12] using WordNet or UMLS
- Word Embeddings produced with random walks on WordNet and
dimensionality reduction techniques, state-of-the-art in monolingual and cross-lingual word similarity
[11,14,16] (see downloads below)
UKB has been developed by the IXA
group at the University of the Basque Country.
The current version of UKB is 3.1. You can download it here
04.10.2018 UKB version 3.1 is
released, with out-of-the-box state-of-the-art knowledge-based WSD results. Click here to download UKB
(unix/linux version only).
12.01.2015 Word embeddings released (see below)
Click here to download older versions of UKB.
Please post any questions or problems you may have using
the following mailing list.
Source code repository
The git source code repository is at GitHub.
Using git, you can get the whole repository by running:
- git clone https://github.com/asoroa/ukb.git
- Click here to get graph relations of some versions of the English WordNet.
- Click here to get graph relations of some versions of the Spanish WordNet.
- Click here to get graph relations for English Wikipedia (04 April 2013 dump).
- English WordNet 3.0 plus gloss relations: here
- English WordNet 1.7 plus eXtended WordNet relations: here
- WordNet 3.0 ILI version with dictionaries in English, Spanish
and Basque: here
- English Wikipedia: here
- Basque Wikipedia: here
Monolingual word Embeddings:
- Precompiled Personalized PageRank vectors for all WordNet lemmas (around 1.2G), useful to speed up similarity calculations. Click here
- Embeddings for English WordNet 3.0 (plus gloss relations): text, binary
- Concatenated embedding for Text Corpora and English WordNet 3.0 (plus gloss relations): text
Bilingual word Embeddings:
- Bilingual Embeddings with Random Walks over Multilingual WordNets. Relevant material to reproduce the experiments in : here
- Visit this page for additional relations extracted by the people at the BulTreeBank Group within the QTLeap project.
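The precompiled Personalized PageRank vectors listed above are offered as a way to speed up similarity calculations. As a rough sketch of why precomputed vectors help, two words can be compared by taking the cosine of their vectors, with no graph traversal at query time. The cosine measure, the dictionary layout and all the numbers below are assumptions made for illustration; they are not read from the distributed files, whose format is not described here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up toy vectors: each word maps hypothetical concept ids to weights.
VECTORS = {
    "car": {"vehicle#n": 0.5, "wheel#n": 0.3, "road#n": 0.2},
    "automobile": {"vehicle#n": 0.6, "wheel#n": 0.2, "engine#n": 0.2},
    "banana": {"fruit#n": 0.7, "plant#n": 0.3},
}

related = cosine(VECTORS["car"], VECTORS["automobile"])
unrelated = cosine(VECTORS["car"], VECTORS["banana"])
print(related > unrelated)
```

Words whose vectors put weight on the same concepts score close to 1, while words with no concepts in common score 0.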
[1] Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for Word Sense Disambiguation. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2009). Athens, Greece.
[2] Eneko Agirre, Oier Lopez de Lacalle and Aitor Soroa. 2009. Knowledge-based WSD and specific domains: performing over supervised WSD. Proceedings of IJCAI. Pasadena, USA. (PDF)
[3] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca and Aitor Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. Proceedings of NAACL-HLT 09. Boulder, USA. (PDF)
[4] Eneko Agirre, Montse Cuadros, German Rigau and Aitor Soroa. 2010. Exploring Knowledge Bases for Similarity. Proceedings of LREC 2010. Valletta, Malta. (PDF)
[5] Eneko Agirre, Aitor Soroa and Mark Stevenson. 2010. Graph-based Word Sense Disambiguation of Biomedical Documents. Bioinformatics, Oxford University Press, Vol. 26(22), pp. 2889-2896.
[6] Arantxa Otegi, Xabier Arregi and Eneko Agirre. 2011. Query Expansion for IR using Knowledge-Based Relatedness. Proceedings of the 5th International Joint Conference on Natural Language Processing, pp. 1467-1471. Thailand. ISBN 978-974-466-564-5. (PDF)
[7] Mark Stevenson, Eneko Agirre and Aitor Soroa. 2012. Exploiting Domain Information for Word Sense Disambiguation of Medical Documents. Journal of the American Medical Informatics Association, Vol. 19, Issue 2, pp. 235-240. (DOI http://dx.doi.org/10.1136/amiajnl-2011-000415)
[8] Eneko Agirre, Oier Lopez de Lacalle and Aitor Soroa. 2013. Random Walks for Knowledge-Based Word Sense Disambiguation. Computational Linguistics. 40:1. ISSN
[9] David Martinez, Arantxa Otegi, Aitor Soroa and Eneko Agirre. 2014. Improving search over Electronic Health Records using UMLS-based query expansion through random walks. Journal of Biomedical Informatics, Vol. 51, pp. 100-106, Elsevier. (PDF)
[10] Eneko Agirre, Ander Barrena and Aitor Soroa. 2015. Studying the Wikipedia Hyperlink Graph for Relatedness and Disambiguation. http://arxiv.org/abs/1503.01655 (See README for instructions to replicate results)
[11] Josu Goikoetxea, Eneko Agirre and Aitor Soroa. 2015. Random Walks and Neural Network Language Models on Knowledge Bases. Proceedings of NAACL-HLT. Denver, USA. (PDF, WordNet embeddings in text format, WordNet embeddings in binary format)
[12] Otegi A., Arregi X., Ansa O., Agirre E. 2015. Using Knowledge-Based Relatedness for Information Retrieval. Journal of Knowledge and Information Systems, Springer London, Vol. 44, Issue 3, pp. 689-718. (final, preprint)
[13] Ander Barrena, Aitor Soroa and Eneko Agirre. 2015. Combining Mention Context and Hyperlinks from Wikipedia for Named Entity Disambiguation. Proceedings of STARSEM 2015. (PDF)
[14] Josu Goikoetxea, Eneko Agirre and Aitor Soroa. 2016. Single or Multiple? Combining Word Representations Independently Learned from Text and WordNet. Proceedings of AAAI. Phoenix, USA. (PDF, Concatenated embeddings)
[15] Eneko Agirre, Steve Neale, Michal Novak, Arantxa Otegi, Joao Silva, Kiril Simov and Roman Sudarikov. 2015. Report on pilot version of LRTs enhanced to support advanced crosslingual ambiguity resolution. Deliverable D5.6, Version 1.4, QTLeap Project. (PDF)
[16] Josu Goikoetxea, Eneko Agirre and Aitor Soroa. 2018. Bilingual embeddings with random walks over multilingual wordnets. Knowledge-Based Systems (KNOSYS). (preprint, final, reproducibility material)
[17] Eneko Agirre, Oier Lopez de Lacalle and Aitor Soroa. 2018. The risk of sub-optimal use of Open Source NLP Software: UKB is inadvertently state-of-the-art in knowledge-based WSD. NLP-OSS workshop at ACL. (arXiv:1805.04277)
This work has been partially funded by the European Community in the
framework of ERA-NET CHIST-ERA Commission (project READERS)
and the European Commission (QTLEAP FP7-ICT-2013.4.1-610516).