--------------------------------------------------------------------------
Topic signatures for all nominal senses of polysemous nouns in WordNet 1.6
--------------------------------------------------------------------------

LEMMATIZED AND FILTERED VERSION

----------
*CONTENTS:
----------

LICENSE.TXT
TOPICSIGNATURESLEM.README.TXT
signatures_lem.tar (See INSTALLATION below)

--------------
*INTRODUCTION:
--------------

This file describes the method used to build topic signatures for the
word senses of English nouns (monosemous and polysemous) in WordNet
version 1.6. We cover 98.2% of all nominal senses, and 93.5% of the
senses of English nouns from WordNet version 1.6.

The examples for each noun have been acquired using the monosemous
relative method. That is, to acquire examples for a polysemous noun we
use its monosemous relatives (hypernyms, hyponyms and synonyms) to
construct the queries. Thousands of snippets are retrieved using Google
to build context files. Google provides snippets that, if carefully
analyzed and filtered, provide the sentence context where the target
query is found. The raw snippets are provided in a separate file.

The method to construct the topic signatures proceeds as follows:

(a) We first lemmatize the examples collected from the web and organize
    them into collections, one collection per word sense.

(b) For each collection we extract the lemmas and their frequencies,
    and compare them with the data in the collections pertaining to the
    other word senses using the tf.idf measure.

(c) The words that have a distinctive frequency for one of the
    collections are collected in a list, which constitutes the topic
    signature for the respective word sense.

(d) The topic signatures for the word senses are filtered with the
    co-occurrence list of the target word taken from a balanced corpus
    such as the BNC.

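As an illustration of steps (b) and (c), the following is a minimal
sketch in Python. It is not the code used to build this resource. It
assumes a common tf.idf variant (raw frequency times log(N/df), with
each sense collection treated as one document); the exact weighting
used for the distributed files is not specified here. The function
name topic_signatures and the toy collections are hypothetical, and
the filtering of step (d) is not covered.

```python
import math
from collections import Counter


def topic_signatures(collections):
    """Map each sense to a list of (lemma, weight) pairs, heaviest first.

    `collections` maps a sense label to the list of lemmas found in the
    examples for that sense (one collection per word sense).
    Assumed weighting: tf * log(N / df), where N is the number of sense
    collections and df the number of collections containing the lemma.
    """
    n_senses = len(collections)
    counts = {sense: Counter(lemmas) for sense, lemmas in collections.items()}

    # Document frequency: in how many sense collections each lemma occurs.
    df = Counter()
    for freq in counts.values():
        df.update(freq.keys())

    signatures = {}
    for sense, freq in counts.items():
        weighted = [
            (lemma, tf * math.log(n_senses / df[lemma]))
            for lemma, tf in freq.items()
        ]
        # Sort by descending weight, then alphabetically for stable output.
        weighted.sort(key=lambda pair: (-pair[1], pair[0]))
        signatures[sense] = weighted
    return signatures


# Toy lemmatized collections for two senses (invented data).
toy = {
    "church.n.1": ["christian", "faith", "member", "faith", "building"],
    "church.n.2": ["building", "tower", "stone", "tower"],
}
signatures = topic_signatures(toy)
```

Under this variant, a lemma that occurs in every collection (here
"building") receives a weight of zero, while lemmas concentrated in a
single collection rise to the top of that sense's list.
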
--------------
*INSTALLATION:
--------------

- Size: 1.5 GB

- To install, type this command:

    $ tar -xvf signatures_lem.tar

- The files are ordered alphabetically: each file is stored in a
  subdirectory according to the noun's first character.

- Each file is named according to the following pattern:

    measure.BNCfilt.target_word.PoS.sense.txt.bz2

  for instance:

    tf_idf.BNCfilt.church.n.1.txt.bz2

  which corresponds to the topic signature of the first sense of
  church built using the tf.idf measure.

- Each file is compressed with bzip2 (http://sources.redhat.com/bzip2/)

- In order to keep the size of the corpus down, we keep the files
  compressed, and use the following Perl code to open them (-d
  decompresses, -c writes the result to standard output):

    open(I,"bzip2 -dc $file|") or die $!;

------------
*STATISTICS:
------------

Number of polysemous words:                              15,560
Number of constructed topic signatures:                  35,183
Average number of topic signatures per word:             2.26
Noun senses without examples:                            2,442 (6.5%)
Average size per signature (in words):                   1,007
Average size per signature (without zero-weight values): 629

--------
*CONTACT
--------

Oier Lopez de Lacalle: jibloleo@si.ehu.es

Published work containing results derived from use of this database
must contain an appropriate acknowledgement:

Agirre E., Lopez de Lacalle Lekuona O. 2004
Publicly available topic signatures for all WordNet nominal senses
Proceedings of the 4th International Conference on Language Resources
and Evaluation (LREC). Lisbon, Portugal.

This paper (and the ones below) is also available at
http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1078831291/publikoak/

SENSECORPUS HAS BEEN USED IN THE FOLLOWING RESEARCH:

Agirre E., Lopez de Lacalle Lekuona O. 2003
Clustering WordNet Word Senses
Proceedings of the Conference on Recent Advances in Natural Language
Processing (RANLP'03). Bulgaria.
A shorter version is published in Nicolas Nicolov, Kalina Bontcheva,
Galia Angelova and Ruslan Mitkov (2004). Recent Advances in Natural
Language Processing. John Benjamins Publishers, Amsterdam.

Agirre E., Alfonseca E. and Lopez de Lacalle O. 2004
Approximating hierarchy-based similarity for WordNet nominal synsets
using Topic Signatures
Proc. of the 2nd Global WordNet Conference. Brno, Czech Republic.
http://www.fi.muni.cz/gwc2004/proc/index.html © Masaryk University Brno
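
For users who do not work in Perl, the file naming pattern and the
compressed storage described in the installation section can also be
handled along the following lines. This is a minimal sketch in Python,
not part of the distribution: the helper names parse_signature_name
and read_signature_lines are hypothetical, the text encoding (UTF-8
with replacement of undecodable bytes) is an assumption, and lines are
returned unparsed because the internal line format is not described
here.

```python
import bz2
import re
from pathlib import Path

# Pattern from the installation section:
#   measure.BNCfilt.target_word.PoS.sense.txt.bz2
NAME_RE = re.compile(
    r"^(?P<measure>.+?)\.BNCfilt\.(?P<word>.+)"
    r"\.(?P<pos>[a-z])\.(?P<sense>\d+)\.txt\.bz2$"
)


def parse_signature_name(filename):
    """Split a signature file name into (measure, word, PoS, sense)."""
    match = NAME_RE.match(Path(filename).name)
    if match is None:
        raise ValueError(f"unexpected signature file name: {filename}")
    return (
        match["measure"],
        match["word"],
        match["pos"],
        int(match["sense"]),
    )


def read_signature_lines(path):
    """Yield the lines of a bzip2-compressed file without unpacking it on disk."""
    # Assumption: the text decodes as UTF-8; undecodable bytes are replaced.
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

For the example name given above, parse_signature_name returns the
measure "tf_idf", the word "church", the part of speech "n" and the
sense number 1.
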