---------------------------------------
Snippets for all nouns in WordNet 1.6
---------------------------------------
----------
*CONTENTS:
----------
LICENSE.TXT
SNIPPETS.README.TXT
snippets.tar (see installation section below)
--------------
*INTRODUCTION:
--------------
We have retrieved examples for all nouns (monosemous and polysemous)
from WordNet version 1.6.
We use the offline XML interface kindly provided by Google to
collect the examples that make up the snippet corpus. To avoid
retrieving full documents, which is time-consuming, we take the
context from the snippets returned by Google.
The snippets returned by Google (up to 1,000 examples per query) are
processed, and we try to extract sentences (or fragments of
sentences) containing the search term from the snippets. The sentence
(or fragment) is usually marked by three dots in the snippet. Some
potential sentences are discarded, according to the following
heuristic:
- length shorter than 6 words.
- the number of non-alphanumeric characters is greater than the
number of words divided by 2.
- the number of words in uppercase is greater than the number in
lowercase.
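The three discard rules above can be sketched in Python as follows. This
is only an illustration, not the code used to build the corpus. The
function name keep_sentence is hypothetical, and the README does not say
how words are tokenized, whether whitespace counts as a non-alphanumeric
character, or what exactly makes a word "uppercase", so the choices
below are assumptions.

```python
def keep_sentence(sentence: str) -> bool:
    """Return True if a candidate sentence survives the three discard rules."""
    # Assumption: words are whitespace-separated tokens.
    words = sentence.split()

    # Rule 1: discard candidates shorter than 6 words.
    if len(words) < 6:
        return False

    # Rule 2: discard if non-alphanumeric characters exceed (words / 2).
    # Assumption: whitespace is not counted as non-alphanumeric.
    non_alnum = sum(1 for ch in sentence if not ch.isalnum() and not ch.isspace())
    if non_alnum > len(words) / 2:
        return False

    # Rule 3: discard if uppercase words outnumber lowercase words.
    # Assumption: "uppercase" means all cased letters are capitals
    # (str.isupper) and "lowercase" means all are small (str.islower).
    upper = sum(1 for w in words if w.isupper())
    lower = sum(1 for w in words if w.islower())
    return upper <= lower
```

Under these assumptions a mixed-case word such as "The" counts as neither
uppercase nor lowercase, so ordinary capitalized sentences are kept.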
Using only the monosemous nouns we obtain a monosemous corpus,
which we can use to build topic signatures for all polysemous nouns
from WordNet version 1.6.
-------------------
*SNIPPETS EXAMPLES:
-------------------
The examples follow this format (fields separated by tabs):
{search_engine} {URL} {target_word_in_cgi_format} {snippets} {results_page_number}
Below we can see how a snippet looks:
google http://www.metmuseum.org/ %22art%22 The Metropolitan Museum of
Art Web site features information on upcoming museum
events, fine art exhibits, special exhibitions, the Met
collection and art ... 0
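A record in this format could be read with a few lines of Python. This
is a sketch under assumptions: each record sits on one physical line
(the example above is wrapped only for display), the snippet text
contains no tab characters, and the last field is an integer. The names
SnippetRecord, parse_record and decode_target are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import unquote_plus


class SnippetRecord(NamedTuple):
    search_engine: str
    url: str
    target_cgi: str    # target word in CGI format, e.g. %22art%22
    snippet: str
    results_page: int


def parse_record(line: str) -> SnippetRecord:
    """Split one tab-separated line into its five fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    engine, url, target_cgi, snippet, page = fields
    return SnippetRecord(engine, url, target_cgi, snippet, int(page))


def decode_target(target_cgi: str) -> str:
    """Turn the CGI form back into the plain target word."""
    # unquote_plus maps %22 to a double quote and '+' to a space;
    # the surrounding quotes are then stripped.
    return unquote_plus(target_cgi).strip('"')
```

For example, decode_target("%22systems+analysis%22") gives
"systems analysis".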
--------------
*INSTALLATION:
--------------
- Size: 4.7 GB
- After downloading all the chunks, you have to join them again in order
to be able to untar the .tar file. Type this command:
$ cat snippetsa* > snippets.tar
If it does not work, list the chunks explicitly, in this order:
$ cat snippetsaa snippetsab snippetsac snippetsad snippetsae > snippets.tar
- To install, type this command:
$ tar -xvf snippets.tar
- The files are organized alphabetically: each file is stored in
a subdirectory according to the noun's first character.
- Each file is named according to the following pattern:
target_word_in_cgi_format.PoS.snippets.txt.bz2
for instance: %22systems+analysis%22.n.snippets.txt.bz2
- Each file is compressed with bzip2 (http://sources.redhat.com/bzip2/)
- In order to keep the size of the corpus down, we keep the files
compressed, and use the following code in Perl for opening files
(the -dc options make bzip2 decompress to standard output):
open(I,"bzip2 -dc $file |") or die $!;
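The same idea, reading a file without unpacking it to disk, can be
sketched in Python with the standard bz2 module. This is an alternative
illustration, not part of the distribution. The function names
snippet_filename and read_snippets are hypothetical, the text encoding
of the files is not stated in this README (UTF-8 with replacement of
undecodable bytes is assumed here), and the subdirectory a file lives
in has to be supplied by the caller.

```python
import bz2
from urllib.parse import quote_plus


def snippet_filename(noun: str, pos: str = "n") -> str:
    """Build a file name following the naming pattern described above."""
    # quote_plus encodes the double quotes as %22 and spaces as '+'.
    return f"{quote_plus(chr(34) + noun + chr(34))}.{pos}.snippets.txt.bz2"


def read_snippets(path: str):
    """Yield the lines of one compressed snippet file, one at a time."""
    # Assumption: the files decode as UTF-8; bad bytes are replaced.
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

For example, snippet_filename("systems analysis") gives
%22systems+analysis%22.n.snippets.txt.bz2, matching the file name shown
above.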
------------
*STATISTICS:
------------
- Monosemous nouns:
. number of monosemous nouns: 91,884
. number of examples: 62,745,293
. size in words: 1,678,759,964
. average examples per m. noun: 682.87
- Polysemous nouns:
. number of polysemous nouns: 15,607
. number of examples: 13,869,675
. size in words: 376,335,658
. average examples per word: 888.68
---------
*CONTACT:
---------
Oier Lopez de Lacalle: jibloleo@si.ehu.es
Published work containing results derived from use of this database must
contain an appropriate acknowledgement:
Agirre E., Lopez de Lacalle Lekuona O. 2004
Publicly available topic signatures for all WordNet nominal senses
Proceedings of the 4th International Conference on Language Resources and
Evaluation (LREC). Lisbon, Portugal.
This paper is also available at
http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1078831291/publikoak/