----------------------------------------------------------------------
Sense corpus for all nominal senses of polysemous nouns in WordNet 1.6
----------------------------------------------------------------------


----------
*CONTENTS:
----------

LICENSE.TXT             License text
SENSECORPUS.README.TXT  This File
Monosemous_counts.txt	Detailed statistics about sensecorpus of monosemous nouns
Polysemous_counts.txt	Detailed statistics about sensecorpus of polysemous nouns
sensecorpus.tar         Sensecorpus (see INSTALLATION below)


--------------
*INTRODUCTION:
--------------

This file describes the method used to build sensecorpus, a set of examples
for the word senses of English nouns (monosemous and polysemous) in WordNet
version 1.6. We cover 98.2% of all nominal senses, and 93.5% of the senses
of polysemous nouns.

The examples for each noun have been acquired using the monosemous relative
method. That is, to acquire examples for a polysemous noun we use its
monosemous relatives (hypernyms, hyponyms and synonyms) to construct the
queries. Thousands of snippets are retrieved using Google to build context
files. Google provides snippets that, if carefully analyzed and filtered,
provide the sentence context in which the target query is found. The raw
snippets are provided in a separate file.


--------------------
*WHAT IS MONOSEMOUS?
--------------------

We say that a word is monosemous if it has a unique sense, that is, if
the word has a single synset when all its parts of speech are taken into
account.

For instance, a noun with a unique nominal sense is not considered
monosemous if it also has senses in other parts of speech. Let's look at
an example:

Looking at the WordNet output for "8", we find:

<< The noun 8 has 1 sense (first 1 from tagged texts)
                                               
   1. eight, 8, VIII, eighter, eighter from Decatur, octad, ogdoad,
   octonary, octet-- (the cardinal number that is the sum of seven and
   one)

   The adj 8 has 1 sense (first 1 from tagged texts)
                                                
   1. eight, 8, viii -- (being one more than seven)
>>

Therefore, we do not consider "eight" monosemous.
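
The definition above can be sketched in a few lines of Python. This is only
an illustration: the function name is_monosemous and the dictionary layout
(part of speech mapped to number of senses) are assumptions made for this
sketch and are not part of the sensecorpus distribution.

```python
def is_monosemous(senses_per_pos):
    """Return True if the word has exactly one sense over all parts of speech.

    senses_per_pos maps a part-of-speech label to the number of senses
    (synsets) that WordNet lists for the word in that part of speech.
    """
    return sum(senses_per_pos.values()) == 1


# "8" has one noun sense and one adjective sense, so it is not monosemous.
print(is_monosemous({"noun": 1, "adj": 1}))

# A hypothetical word with a single noun sense and no other senses.
print(is_monosemous({"noun": 1}))
```

Summing over all parts of speech is what excludes "8": each part of speech
contributes one sense, so the total is two.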

---------
*METHODS
---------

Each monosemous relative has an associated number (~ distance): the higher
the distance, the less reliable the relative.

  - 0: synonyms.
  - 1: direct hyponyms.
  - 2: hypernyms, indirect hyponyms (at distance 2).
  - 3: siblings, indirect hyponyms (at distance 3).
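
As an illustration, the method codes listed above can be kept in a small
lookup table. The names METHOD_RELATIVES and most_reliable_method are
hypothetical, and preferring the lowest available code is just one way a
user of the corpus might apply the reliability ordering.

```python
# Method codes as listed above; a lower code means a more reliable relative.
METHOD_RELATIVES = {
    0: "synonyms",
    1: "direct hyponyms",
    2: "hypernyms, indirect hyponyms (at distance 2)",
    3: "siblings, indirect hyponyms (at distance 3)",
}


def most_reliable_method(available):
    """Return the lowest known method code in `available`, or None."""
    valid = [code for code in available if code in METHOD_RELATIVES]
    return min(valid) if valid else None


# A sense that has examples for methods 1, 2 and 3 but no synonym examples.
best = most_reliable_method([3, 1, 2])
print(best, METHOD_RELATIVES[best])
```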

The number of examples retrieved for each monosemous term is 1000, when
available.
	
For monosemous nouns we have retrieved only the examples of the noun
itself.

For each word sense of polysemous nouns we have retrieved the examples of
its monosemous relatives. Note that some word senses do not have monosemous
relatives, and therefore have no associated examples in sensecorpus. These
constitute 6.5% of all word senses of polysemous nouns.


--------------
*INSTALLATION:
--------------
	
 - Size: 7.1 GB (divided into eight chunks)

 - After downloading all the chunks, join them back together so that the
   .tar file can be unpacked. Type this command:
    $ cat sensecorpusa* > sensecorpus.tar

   If that does not work, list the chunks explicitly, in this order:
    $ cat sensecorpusaa sensecorpusab sensecorpusac sensecorpusad sensecorpusae sensecorpusaf sensecorpusag sensecorpusah > sensecorpus.tar

 - To install, type this command:
    $ tar -xvf sensecorpus.tar

 - The files are organized alphabetically: each file is stored in
   a subdirectory according to the noun's first character.

 - Each file is named according to the following pattern: 

   PoS#target_word#sense.method.sentence.bz2

   for instance: n#art#3.0.txt.sentence.bz2, which corresponds to the
   examples retrieved using synonyms (0) of the 3rd sense of art.

 - Each file is compressed with bzip2 (http://sources.redhat.com/bzip2/)

 - In order to keep the size of the corpus down, we keep the files
   compressed, and use the following Perl code for opening files
   (bzip2 -dc decompresses to standard output and leaves the file intact):
      
      open(I, "-|", "bzip2", "-dc", $file) or die $!;
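
The installation steps above can also be sketched in Python. This is a
minimal sketch under stated assumptions, not a tool shipped with
sensecorpus: the function names join_chunks, parse_name and read_lines are
hypothetical, the ".txt" part of the file name is treated as optional
because it appears in the example name but not in the stated pattern, and
both the latin-1 encoding and the line-by-line reading are assumptions about
the file contents.

```python
import bz2
import re
import shutil
from pathlib import Path

# Matches names such as n#art#3.0.txt.sentence.bz2 (".txt" is optional here).
NAME_RE = re.compile(
    r"^(?P<pos>[^#]+)#(?P<word>.+)#(?P<sense>\d+)\.(?P<method>\d)"
    r"(?:\.txt)?\.sentence\.bz2$"
)


def join_chunks(chunk_dir, out_path):
    """Concatenate sensecorpusaa, sensecorpusab, ... in alphabetical order."""
    chunks = sorted(Path(chunk_dir).glob("sensecorpusa?"))
    with open(out_path, "wb") as out:
        for chunk in chunks:
            with open(chunk, "rb") as src:
                shutil.copyfileobj(src, out)
    return [chunk.name for chunk in chunks]


def parse_name(filename):
    """Split a corpus file name into (PoS, target word, sense, method)."""
    match = NAME_RE.match(Path(filename).name)
    if match is None:
        raise ValueError("unexpected file name: %s" % filename)
    return (
        match.group("pos"),
        match.group("word"),
        int(match.group("sense")),
        int(match.group("method")),
    )


def read_lines(path, encoding="latin-1"):
    """Yield the lines of a bzip2 file without unpacking it on disk."""
    with bz2.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")


print(parse_name("n#art#3.0.txt.sentence.bz2"))
```

Reading through bz2.open keeps the files compressed on disk, in the same
spirit as the Perl pipe above.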


-----------
*STATISTICS:
-----------

- Monosemous nouns:
	· number of monosemous nouns:                   91,884
	· number of examples (unfiltered):          62,745,293
	· average per monosemous noun (unfiltered):      682.9
	· number of examples (filtered):            14,664,798
	· average per monosemous noun (filtered):        159.6


- Polysemous nouns statistics:
	· number of polysemous nouns:                    15,835
	· number of senses:                              37,678
	· number of examples (filtered):            149,231,586
	· number of examples (1 method):              2,796,300
	· number of examples (2 method):             11,710,483
	· number of examples (3 method):             12,466,774
	· number of examples (4 method):            122,258,029
	· average number of senses:                           2.38
	· average number of examples per sense:            3960.7
	· average number of examples per sense (1 method):   74.2
	· average number of examples per sense (2 method):  310.8
	· average number of examples per sense (3 method):  330.9
	· average number of examples per sense (4 method): 3244.8
	· noun senses without examples:                    2442 (6.5%)
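
The averages for polysemous nouns follow from the totals listed above. The
short Python check below recomputes them; the variable names are chosen for
this sketch only, and the per-method labels are copied from the table as
they appear.

```python
nouns = 15_835
senses = 37_678
senses_without_examples = 2_442

# Example counts per method, labelled as in the table above.
examples = {
    "1 method": 2_796_300,
    "2 method": 11_710_483,
    "3 method": 12_466_774,
    "4 method": 122_258_029,
}

total = sum(examples.values())
print("total examples:", total)
print("senses per noun:", round(senses / nouns, 2))
print("examples per sense:", round(total / senses, 1))
for label, count in examples.items():
    print("examples per sense (%s):" % label, round(count / senses, 1))
print("senses without examples (%):",
      round(100 * senses_without_examples / senses, 1))
```

The four per-method counts add up to the overall total of 149,231,586
examples.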


--------
*CONTACT
--------

 Oier Lopez de Lacalle: jibloleo@si.ehu.es


Published work containing results derived from use of this database must
contain an appropriate acknowledgement:

Agirre E., Lopez de Lacalle Lekuona O. 2004
Publicly available topic signatures for all WordNet nominal senses
Proceedings of the 4th International Conference on Language Resources and
Evaluation (LREC). Lisbon, Portugal.

 This paper is also available at
 http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1078831291/publikoak/


SENSECORPUS HAS BEEN USED IN THE FOLLOWING RESEARCH:

Agirre, Eneko and David Martinez, 2004.
Unsupervised WSD based on automatically retrieved examples: The importance of bias
Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP). Barcelona, Spain.

  This paper is available at
  http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1087372032/publikoak/emnlp.pdf

Agirre, Eneko and David Martinez, 2004.
The effect of bias on an automatically-built word sense corpus
Proceedings of the 4th International Conference on Language Resources and
Evaluation (LREC)

  This paper is available at
  http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1080312412/publikoak/