---------------------------------------
Snippets for all nouns in WordNet 1.6
---------------------------------------

----------
*CONTENTS:
----------

LICENSE.TXT
SNIPPETS.README.TXT
snippets.tar (see installation section below)

--------------
*INTRODUCTION:
--------------

We have retrieved examples for all nouns (monosemous and polysemous)
from WordNet version 1.6. We use the offline XML interface kindly
provided by Google to collect examples for constructing the snippet
corpus. To avoid retrieving full documents, which is time consuming,
we take the context from the snippets returned by Google.

The snippets returned by Google (up to 1,000 examples per query) are
processed, and we try to extract from them the sentences (or fragments
of sentences) that contain the search term. The sentence (or fragment)
is usually marked by three dots in the snippet. A candidate sentence is
discarded if any of the following heuristics applies:

- its length is shorter than 6 words.
- the number of non-alphanumeric characters is greater than the number
  of words divided by 2.
- the number of words in uppercase is greater than the number of words
  in lowercase.

Using only the monosemous nouns we obtain a monosemous corpus, which
we can use to build topic signatures for all polysemous nouns in
WordNet version 1.6.

-------------------
*SNIPPETS EXAMPLES:
-------------------

The examples follow this format (fields separated by tabs):

{search_engine} {URL} {target_word_in_cgi_format} {snippets} {results_page_number}

Below we can see what a snippet looks like:

google http://www.metmuseum.org/ %22art%22 The Metropolitan Museum of Art Web site features information on upcoming museum
events, fine art exhibits, special exhibitions, the Met collection and art ... 0

--------------
*INSTALLATION:
--------------

- Size: 4.7 GB

- After downloading all the chunks, you have to join them again before
  you can untar the .tar file. Type this command:

    $ cat snippetsa* > snippets.tar

  If it does not work, type (respecting the order):

    $ cat snippetsaa snippetsab snippetsac snippetsad snippetsae > snippets.tar

- To install, type this command:

    $ tar -xvf snippets.tar

- The files are ordered alphabetically, so each file is stored in a
  subdirectory according to the noun's first character.

- Each file is named according to the following pattern:

    target_word_in_cgi_format.PoS.snippets.txt.bz2

  for instance:

    %22systems+analysis%22.n.snippets.txt.bz2

- Each file is compressed with bzip2 (http://sources.redhat.com/bzip2/)

- To keep the size of the corpus down, we keep the files compressed,
  and use the following Perl code for opening them (bzip2 -dc
  decompresses to standard output):

    open(I, "bzip2 -dc $file |") or die $!;

------------
*STATISTICS:
------------

- Monosemous nouns:
    · number of monosemous nouns: 91,884
    · number of examples: 62,745,293
    · size in words: 1,678,759,964
    · average examples per monosemous noun: 682.87

- Polysemous nouns:
    · number of polysemous nouns: 15,607
    · number of examples: 13,869,675
    · size in words: 376,335,658
    · average examples per polysemous noun: 888.68

---------
*CONTACT:
---------

Oier Lopez de Lacalle: jibloleo@si.ehu.es

Published work containing results derived from the use of this database
must contain an appropriate acknowledgement:

Agirre E., Lopez de Lacalle Lekuona O. 2004.
Publicly available topic signatures for all WordNet nominal senses.
Proceedings of the 4th International Conference on Language Resources
and Evaluation (LREC). Lisbon, Portugal.

This paper is also available at
http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1078831291/publikoak/
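
--------------
*USAGE SKETCH:
--------------

The record format, the CGI-style file names and the discard heuristics
described above can be illustrated with a short Python sketch. It is
not the code used to build the corpus, only one possible reading of
this README. The names Snippet, parse_line, decode_target,
keep_fragment and read_snippets are hypothetical. The sketch assumes
one record per line, that the text decodes as UTF-8 (undecodable bytes
are replaced), that whitespace does not count as a non-alphanumeric
character, and that a word is "in uppercase" when all of its letters
are capitals.

```python
import bz2
from typing import Iterator, NamedTuple
from urllib.parse import unquote_plus


class Snippet(NamedTuple):
    search_engine: str
    url: str
    target: str  # target word in CGI format, e.g. %22art%22
    text: str    # the snippet itself
    page: int    # results page number


def parse_line(line: str) -> Snippet:
    """Split one tab-separated record into its five fields."""
    engine, url, target, rest = line.rstrip("\r\n").split("\t", 3)
    # The page number is the last field; splitting from the right keeps
    # the snippet whole even if it happens to contain a tab.
    text, page = rest.rsplit("\t", 1)
    return Snippet(engine, url, target, text, int(page))


def decode_target(target: str) -> str:
    """Turn a CGI-format target such as %22systems+analysis%22 into plain text."""
    return unquote_plus(target).strip('"')


def keep_fragment(fragment: str) -> bool:
    """Return False if any of the three discard heuristics applies (assumed reading)."""
    words = fragment.split()
    if len(words) < 6:
        return False
    # Assumption: whitespace is not counted as non-alphanumeric.
    non_alnum = sum(1 for ch in fragment if not ch.isalnum() and not ch.isspace())
    if non_alnum > len(words) / 2:
        return False
    # Assumption: mixed-case words such as "Museum" count as neither.
    upper = sum(1 for w in words if w.isupper())
    lower = sum(1 for w in words if w.islower())
    return upper <= lower


def read_snippets(path: str) -> Iterator[Snippet]:
    """Yield the records of one bzip2-compressed snippet file without unpacking it."""
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)
```

For example, given the path of one of the .snippets.txt.bz2 files,
read_snippets(path) yields one Snippet per record, and
decode_target("%22systems+analysis%22") gives back the plain noun
"systems analysis". Reading through bz2.open keeps the files
compressed on disk, in the same spirit as the Perl one-liner.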