Resource inventory

This is an inventory of linguistic processing resources and tools. Its main goal is to list the resources and tools that are relevant to KNOW2.

Note for Know2 members:
* Keep updating the resources, marking the responsible group (EHU / UPC / UB / UoC)
* Add the necessary information
* If the resource is developed in KNOW2, give the experiment number
* There are three options for describing a resource:
  ** Write a brief description, plus a link.
  ** Reuse a description sheet written for another project (e.g. Clarin), in which case a link is enough.
  ** Build an "XX resource file". It is important that the name of that wiki page ends in "resource file"; otherwise it is not made public (it would be restricted to Know2 users).
* For questions and the like: Eli Comelles, elicomelles AT gmail DOT com.

tokenization and sentence boundaries

morphological analysis

bilingual terminology extractors

  • Multilingual
    • MCR
    • Terminology Extraction Suite, developed by Antoni Oliver (GRIAL/UOC) (available for download)
  • EU
    • The Basic Encyclopedic Dictionary of Science and Technology (BDST): The BDST is a specialized dictionary published online by Elhuyar. Each concept is represented by a terminological record, which includes all the information relating to that concept: the terms that convey the concept, the definition, the domain(s), etc. It currently contains 15,627 concepts.
    • BDST

named entity recognition and classification

shallow syntactic analysis

deep syntactic analysis

word sense disambiguation

treatment of referential expressions

semantic lexicons

  • multilingual
    • MCR: the Multilingual Central Repository (see the section "resources generated in KNOW") integrates, within the EuroWordNet framework and via the Interlingual Index:
      • five local wordnets and six versions of English WordNet,
      • WordNet Domains (Magnini and Cavaglià 2000),
      • new versions of the Base Concepts and the Top Concept Ontology (Álvez et al. 2008),
      • the SUMO ontology (Niles and Pease 2001),
      • and hundreds of thousands of automatically acquired semantic relations, making MCR roughly four times larger than WordNet 3.0 (934,771 vs. 235,402 unique semantic relations).
    • TCO: the Top Concept Ontology, an extension of WordNet in which nouns have been annotated with the semantic features defined in the EuroWordNet Top Concept Ontology. Since this extension is aligned to EuroWordNet's Interlingual Index, it can also be used to populate any other wordnet linked to it. This feature-annotated WordNet is intended to be useful for a large number of semantic NLP tasks.

It can be consulted at http://adimen.si.ehu.es/cgi-bin/wei/public/wei.consult.perl and downloaded from http://lpg.uoc.edu/files/wei-topontology.2.3.rar . See the section "resources generated in KNOW".
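
As a rough illustration of why alignment to the Interlingual Index (ILI) lets the Top Concept Ontology features populate other wordnets, the following minimal sketch copies features onto a local wordnet's synsets through their ILI links. All identifiers, the toy data and the function name `project_features` are hypothetical and chosen only for illustration; they do not reflect the actual MCR or TCO file formats.

```python
# Hypothetical toy data: ILI record id -> Top Concept Ontology features.
tco_by_ili = {
    "ili-0001": {"Animal", "Living"},
    "ili-0002": {"Artifact", "Object"},
}

# Hypothetical mapping from a local wordnet's synsets to ILI records.
local_to_ili = {
    "spa-perro-n-1": ["ili-0001"],
    "spa-mesa-n-1": ["ili-0002"],
    "spa-idea-n-1": ["ili-0999"],  # ILI record without TCO features here
}


def project_features(local_to_ili, tco_by_ili):
    """Copy TCO features onto local synsets through their ILI links."""
    projected = {}
    for synset, ili_ids in local_to_ili.items():
        features = set()
        for ili_id in ili_ids:
            # Unknown ILI records simply contribute no features.
            features |= tco_by_ili.get(ili_id, set())
        projected[synset] = features
    return projected


annotated = project_features(local_to_ili, tco_by_ili)
for synset, features in sorted(annotated.items()):
    print(synset, sorted(features))
```

In this sketch, a synset whose ILI record has no annotation ends up with an empty feature set rather than raising an error, which keeps the projection usable for partially linked wordnets.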

syntactically and semantically hand-tagged corpora

resources generated in KNOW

  • MCR: the Multilingual Central Repository integrates, within the EuroWordNet framework and via the Interlingual Index:
    • five local wordnets and six versions of English WordNet,
    • WordNet Domains (Magnini and Cavaglià 2000),
    • new versions of the Base Concepts and the Top Concept Ontology (Álvez et al. 2008),
    • the SUMO ontology (Niles and Pease 2001),
    • and hundreds of thousands of automatically acquired semantic relations, making MCR roughly four times larger than WordNet 3.0 (934,771 vs. 235,402 unique semantic relations).
  • TCO: the Top Concept Ontology, an extension of WordNet in which nouns have been annotated with the semantic features defined in the EuroWordNet Top Concept Ontology. Since this extension is aligned to EuroWordNet's Interlingual Index, it can also be used to populate any other wordnet linked to it. This feature-annotated WordNet is intended to be useful for a large number of semantic NLP tasks. It can be consulted at http://adimen.si.ehu.es/cgi-bin/wei/public/wei.consult.perl and downloaded from http://lpg.uoc.edu/files/wei-topontology.2.3.rar .
  • KnowNet: an extension of WordNet to which topical relations between synsets have been added. It is created automatically by semantically disambiguating small portions of Topic Signatures acquired from the web (Agirre and de la Calle, 2004), then connecting large sets of topically related concepts.
  • Selectional restrictions for English from PropBank (EHU)
  • Automatic Selection of Basic Level Concepts (BLC): a method for the automatic selection of BLC from WordNet. Basic Level Concepts are those concepts that are frequent and salient; they are neither overly general nor too specific.
  • EuSemCor: a 300,000-word semantically disambiguated corpus of Basque, using EuroWordNet as the sense inventory (under construction).
  • Basque WordNet: the Basque part of EuroWordNet.
  • (UB-UPC) Dependency grammars for Catalan and Spanish in Freeling: broad-coverage dependency grammars for Catalan and Spanish, available in the 2.1-alpha version of Freeling. Alternatively, development versions are available from the Freeling SVN.
  • Graph-based Word Sense Disambiguation
  • Graph-based word similarity, work in progress.
  • (UPC) Wikicorpus: Catalan, Spanish and English portions of the Wikipedia.
  • (UPC) FL-3.0 version, under advanced development. New features:
    • Full support for UTF-8, enabling Freeling development for languages with non-Latin alphabets.
    • Integration of a language detection module.
    • Integration of a generic SVM classification module, currently in a testing period for NER/NEC.
    • Development of a new, simpler and lighter feature structure.
  • (UPC) DCA Corpus: a freely available trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia and has been automatically enriched with linguistic information. The corpora have been annotated with lemma and part-of-speech information using the open source library Freeling. They have also been sense-annotated with the state-of-the-art Word Sense Disambiguation algorithm UKB.