UKB: Graph Based Word Sense Disambiguation and Similarity

UKB is a collection of programs for performing graph-based Word Sense Disambiguation (WSD) and lexical similarity/relatedness using a pre-existing knowledge base. UKB applies random walks, e.g. Personalized PageRank, on the Knowledge Base (KB) graph to rank the vertices according to the given context. It includes tools to produce graphs from KBs like WordNet.
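To make the idea of ranking vertices by context concrete, here is a minimal sketch of Personalized PageRank by power iteration on a toy graph. It is not UKB's implementation. The vertex names, the edges, the function name personalized_pagerank and the damping value of 0.85 are all assumptions chosen for illustration. The random walk restarts only at the vertices that represent the context, so vertices close to the context accumulate more rank.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Illustrative power-iteration Personalized PageRank (not UKB's code).

    graph: dict mapping each vertex to the list of its neighbours.
    seeds: dict mapping context vertices (all present in graph) to weights.
    Assumes every vertex has at least one neighbour (no dangling vertices).
    """
    total = sum(seeds.values())
    # Teleport distribution: probability mass only on the context vertices.
    teleport = {v: seeds.get(v, 0.0) / total for v in graph}
    rank = dict(teleport)
    for _ in range(iterations):
        # With probability (1 - damping) the walk restarts at a context vertex.
        new_rank = {v: (1.0 - damping) * teleport[v] for v in graph}
        # Otherwise it moves to a uniformly chosen neighbour.
        for v, neighbours in graph.items():
            share = damping * rank[v] / len(neighbours)
            for u in neighbours:
                new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical toy knowledge base: two senses of "bank" and related concepts.
toy_graph = {
    "bank#finance": ["money", "deposit"],
    "money": ["bank#finance", "deposit"],
    "deposit": ["bank#finance", "money", "shore"],
    "bank#river": ["water", "shore"],
    "water": ["bank#river", "shore"],
    "shore": ["bank#river", "water", "deposit"],
}

# The context mentions "money", so the walk is personalized on that vertex.
ranks = personalized_pagerank(toy_graph, seeds={"money": 1.0})

# Pick the sense of "bank" with the highest rank under this context.
best_sense = max(["bank#finance", "bank#river"], key=ranks.get)
```

In this toy setting the financial sense ends up with more rank than the river sense because it is directly connected to the context vertex, which is the intuition behind using the ranking for disambiguation.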

UKB has been evaluated in several tasks:

UKB has been developed by the IXA group at the University of the Basque Country.

The current version of UKB is 3.2. You can download it here (older versions are here).

News:

Mailing List

Please send any questions or problems you may have to the following mailing list: UKB mailing list

Source code repository

The source code repository is hosted on GitHub. Using git, you can get the whole repository by running:

Downloads

Selected graphs:
More graphs:
Word vectors:
Monolingual word embeddings:
Bilingual word embeddings:
External resources:

References

[1] Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for Word Sense Disambiguation. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2009). Athens, Greece. (PDF)

[2] Eneko Agirre, Oier Lopez de Lacalle and Aitor Soroa. 2009. Knowledge-based WSD and specific domains: performing over supervised WSD. Proceedings of IJCAI. Pasadena, USA. (PDF)

[3] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca and Aitor Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. Proceedings of NAACL-HLT 09. Boulder, USA. (PDF)

[4] Eneko Agirre, Montse Cuadros, German Rigau and Aitor Soroa. 2010. Exploring Knowledge Bases for Similarity. Proceedings of LREC 2010. Valletta, Malta. (PDF)

[5] Eneko Agirre, Aitor Soroa and Mark Stevenson. 2010. Graph-based Word Sense Disambiguation of Biomedical Documents. Bioinformatics, Vol. 26(22), pages 2889-2896. Oxford University Press. (DOI http://dx.doi.org/10.1093/bioinformatics/btq555)

[6] Arantxa Otegi, Xabier Arregi and Eneko Agirre. 2011. Query Expansion for IR using Knowledge-Based Relatedness. Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1467-1471. Thailand. ISBN 978-974-466-564-5. (PDF)

[7] Mark Stevenson, Eneko Agirre and Aitor Soroa. 2012. Exploiting Domain Information for Word Sense Disambiguation of Medical Documents. Journal of the American Medical Informatics Association. Vol. 19, Issue 2, pages 235-240. (DOI http://dx.doi.org/10.1136/amiajnl-2011-000415)

[8] Eneko Agirre, Oier Lopez de Lacalle and Aitor Soroa. 2013. Random Walks for Knowledge-Based Word Sense Disambiguation. Computational Linguistics. 40:1. ISSN 0891-2017. doi:10.1162/COLI_a_00164

[9] David Martinez, Arantxa Otegi, Aitor Soroa and Eneko Agirre. 2014. Improving search over Electronic Health Records using UMLS-based query expansion through random walks. Journal of Biomedical Informatics, vol. 51, pages 100-106, Elsevier. (PDF)

[10] Eneko Agirre, Ander Barrena and Aitor Soroa. 2015. Studying the Wikipedia Hyperlink Graph for Relatedness and Disambiguation. http://arxiv.org/abs/1503.01655 (See README for instructions to replicate results)

[11] Josu Goikoetxea, Eneko Agirre and Aitor Soroa. 2015. Random Walks and Neural Network Language Models on Knowledge Bases. Proceedings of NAACL-HLT. Denver, USA.  (PDF, WordNet embeddings in text format, WordNet embeddings in binary format)

[12] Otegi A., Arregi X., Ansa O. and Agirre E. 2015. Using Knowledge-Based Relatedness for Information Retrieval. Journal of Knowledge and Information Systems, Springer London, vol. 44, issue 3, pages 689-718. (final, preprint)

[13] Ander Barrena, Aitor Soroa and Eneko Agirre. 2015. Combining Mention Context and Hyperlinks from Wikipedia for Named Entity Disambiguation. Proceedings of STARSEM 2015. (PDF)

[14] Josu Goikoetxea, Eneko Agirre and Aitor Soroa. 2016. Single or Multiple? Combining Word Representations Independently Learned from Text and WordNet. Proceedings of AAAI. Phoenix, USA. (PDF, Concatenated embeddings)

[15] Eneko Agirre, Steve Neale, Michal Novak, Arantxa Otegi, Joao Silva, Kiril Simov and Roman Sudarikov. 2015. Report on pilot version of LRTs enhanced to support advanced crosslingual ambiguity resolution. Deliverable D5.6, Version 1.4, QTLeap Project. (PDF)

[16] Josu Goikoetxea, Eneko Agirre and Aitor Soroa. 2018. Bilingual embeddings with random walks over multilingual wordnets. Knowledge-Based Systems (KNOSYS). (preprint, final, reproducibility material)

[17] Eneko Agirre, Oier Lopez de Lacalle and Aitor Soroa. 2018. The risk of sub-optimal use of Open Source NLP Software: UKB is inadvertently state-of-the-art in knowledge-based WSD. NLP-OSS workshop at ACL (arXiv:1805.04277)

Acknowledgments

This work has been partially funded by the European Community in the framework of ERA-NET CHIST-ERA Commission (project READERS) and the European Commission (QTLEAP FP7-ICT-2013.4.1-610516).
