----------------------------------------------------------------------
Sense corpus for all nominal senses of polysemous nouns in WordNet 1.6
----------------------------------------------------------------------

----------
*CONTENTS:
----------

LICENSE.TXT                License text
SENSECORPUS.README.TXT     This file
Monosemous_counts.txt      Detailed statistics about the sensecorpus of monosemous nouns
Polysemous_counts.txt      Detailed statistics about the sensecorpus of polysemous nouns
sensecorpus.tar            Sensecorpus (see INSTALLATION below)

--------------
*INTRODUCTION:
--------------

This file describes the method used to build sensecorpus, a set of
examples for the word senses of English nouns (monosemous and
polysemous) in WordNet version 1.6. We cover 98.2% of all nominal
senses, and 93.5% of the senses of polysemous nouns.

The examples for each noun have been acquired using the monosemous
relative method. That is, to acquire examples for a polysemous noun we
use its monosemous relatives (hypernyms, hyponyms and synonyms) to
construct the queries.

Thousands of snippets are retrieved using Google to build context
files. Google provides snippets that, if carefully analyzed and
filtered, provide the sentence context where the target query is found.
The raw snippets are provided in a separate file.

--------------------
*WHAT IS MONOSEMOUS?
--------------------

We say that a word is monosemous if it has a unique sense, that is, if
it has a single synset taking into account all its parts of speech. A
noun with a single nominal sense is therefore not considered monosemous
if it also has senses in other parts of speech.

Let's see an example. Looking at the output of WordNet for "8" we find:

<<
The noun 8 has 1 sense (first 1 from tagged texts)
1. eight, 8, VIII, eighter, eighter from Decatur, octad, ogdoad, octonary, octet -- (the cardinal number that is the sum of seven and one)

The adj 8 has 1 sense (first 1 from tagged texts)
1. eight, 8, viii -- (being one more than seven)
>>

Therefore, we do not consider "eight" monosemous.

---------
*METHODS
---------

Each monosemous relative has an associated number (roughly a distance):
the higher the distance, the less reliable the relative.

- 0: synonyms.
- 1: direct hyponyms.
- 2: hypernyms, indirect hyponyms (at distance 2).
- 3: siblings, indirect hyponyms (at distance 3).

Up to 1000 examples are retrieved for each monosemous term, when that
many are available.

For monosemous nouns we retrieved examples of the noun itself, and
nothing else.

For each word sense of a polysemous noun we retrieved examples of its
monosemous relatives. Note that some word senses do not have monosemous
relatives, and so have no associated examples in sensecorpus. These
constitute 6.5% of all word senses of polysemous nouns.

--------------
*INSTALLATION:
--------------

- Size: 7.1 GB (divided into eight chunks)

- After downloading all the chunks, join them back together so that the
  .tar file can be unpacked. Type this command:

    $ cat sensecorpusa* > sensecorpus.tar

  If it does not work, type (respecting the order):

    $ cat sensecorpusaa sensecorpusab sensecorpusac sensecorpusad sensecorpusae sensecorpusaf sensecorpusag sensecorpusah > sensecorpus.tar

- To install, type this command:

    $ tar -xvf sensecorpus.tar

- The files are ordered alphabetically: each file is stored in a
  subdirectory according to the first character of the noun.

- Each file is named according to the following pattern:

    PoS#target_word#sense.method.sentence.bz2

  for instance:

    n#art#3.0.txt.sentence.bz2

  which corresponds to the examples retrieved using synonyms (0) of the
  3rd sense of art.

- Each file is compressed with bzip2 (http://sources.redhat.com/bzip2/)

- To keep the size of the corpus down, we keep the files compressed and
  use the following Perl code to open them:

    open(I, "bzip2 -dc $file |") or die $!;

-----------
*STATISTICS:
-----------

- Monosemous nouns:
  · number of monosemous nouns: 91,884
  · number of examples (non-filtered): 62,745,293
  · examples (non-filtered) per monosemous noun: 682.9
  · number of examples (filtered): 14,664,798
  · examples (filtered) per monosemous noun: 159.6

- Polysemous nouns:
  · number of polysemous nouns: 15,835
  · number of senses: 37,678
  · number of examples (filtered): 149,231,586
  · number of examples (1 method): 2,796,300
  · number of examples (2 method): 11,710,483
  · number of examples (3 method): 12,466,774
  · number of examples (4 method): 122,258,029
  · average number of senses: 2.38
  · average number of examples per sense: 3960.7
  · average number of examples per sense (1 method): 74.2
  · average number of examples per sense (2 method): 310.8
  · average number of examples per sense (3 method): 330.9
  · average number of examples per sense (4 method): 3244.8
  · noun senses without examples: 2442 (6.5%)

--------
*CONTACT
--------

Oier Lopez de Lacalle: jibloleo@si.ehu.es

Published work containing results derived from use of this database
must contain an appropriate acknowledgement:

Agirre E., Lopez de Lacalle Lekuona O. 2004.
Publicly available topic signatures for all WordNet nominal senses.
Proceedings of the 4th International Conference on Language Resources
and Evaluation (LREC). Lisbon, Portugal.

This paper is also available at
http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1078831291/publikoak/

SENSECORPUS HAS BEEN USED IN THE FOLLOWING RESEARCH:

Agirre, Eneko and David Martinez. 2004.
Unsupervised WSD based on automatically retrieved examples: The
importance of bias.
Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP). Barcelona, Spain.
This paper is available at
http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1087372032/publikoak/emnlp.pdf

Agirre, Eneko and David Martinez. 2004.
The effect of bias on an automatically-built word sense corpus.
Proceedings of the 4th International Conference on Language Resources
and Evaluation (LREC).
This paper is available at
http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1080312412/publikoak/
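
As a usage sketch for the file naming pattern described above, the
shell function below splits a name such as n#art#3.0.txt.sentence.bz2
into part of speech, target word, sense number and method. The function
name parse_sense_file is hypothetical and not part of the distribution,
and the sketch assumes that every file name is shaped like that example.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with sensecorpus): split a file name
# shaped like n#art#3.0.txt.sentence.bz2 into its fields and print them
# as "PoS word sense method".
parse_sense_file() {
    name=${1##*/}        # drop any leading directory
    pos=${name%%#*}      # text before the first '#', e.g. n
    rest=${name#*#}      # text after the first '#'
    word=${rest%#*}      # text before the last '#', e.g. art
    tail=${rest##*#}     # text after the last '#', e.g. 3.0.txt.sentence.bz2
    sense=${tail%%.*}    # sense number, e.g. 3
    tail=${tail#*.}      # drop the sense number and its dot
    method=${tail%%.*}   # relative type, e.g. 0 for synonyms
    printf '%s %s %s %s\n' "$pos" "$word" "$sense" "$method"
}

parse_sense_file 'n#art#3.0.txt.sentence.bz2'   # prints: n art 3 0
```

The method field is the number assigned to each kind of monosemous
relative (0 for synonyms, up to 3 for siblings). The contents of a file
can then be streamed without unpacking it on disk by running
bzip2 -dc on it and piping the output to another command.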