ixaKat

...is a modular chain of Natural Language Processing tools for Basque, developed by the IXA NLP group of the University of the Basque Country.

Tools

The main characteristic of this modular chain for Basque is the deep morphosyntactic analysis carried out by its first tool, and the use of the resulting morphologically rich annotations by the subsequent linguistic processing tools.

It follows a modular design, and its processors are easy to use. See the full list of features below.

The tools are freely available and ready to use. Because the chain is modular, the tools can be combined in different orders, as long as the dependencies among them are respected. The guidelines for running each tool, together with its dependencies, are described on that tool's page. If you still have problems after following these guidelines, check the contact section of the tool's description page. You can also post on our forum.

If you use ixaKat in your research, please cite our paper as detailed here.

Four tools have been adapted and integrated into the chain so far, and others are being integrated and will be available soon. In the meantime, the linguistic processing obtained with the ixaKat chain can be extended with IXA pipes tools. IXA pipes is a modular set of multilingual natural language processing tools, some of which support Basque. Because both ixaKat and IXA pipes are modular and both use NAF as their input/output format, tools from both sets can interact in the same processing pipeline.
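As an illustration of how tools from both sets might be driven from a single script, here is a minimal Python sketch that feeds the output of each command into the next one. The function name run_chain and the two stand-in stages are hypothetical; they are not ixaKat or IXA pipes commands. The real command line for each tool is the one documented on its page.

```python
import subprocess
import sys


def run_chain(commands, text):
    """Feed `text` to the first command and pass each output to the next one.

    `commands` is a list of argument lists, one per stage. Stages run one
    after another here, not concurrently as in a shell pipeline.
    """
    data = text.encode("utf-8")
    for argv in commands:
        # check=True raises CalledProcessError if a stage exits with an error.
        result = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
        data = result.stdout
    return data.decode("utf-8")


if __name__ == "__main__":
    # Stand-in stages so the sketch runs anywhere (assumption: they only
    # imitate tools that read standard input and write standard output).
    upper = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    wrap = [sys.executable, "-c",
            "import sys; sys.stdout.write('<' + sys.stdin.read() + '>')"]
    print(run_chain([upper, wrap], "kaixo mundua"))
```

Replacing the stand-in entries with real tool invocations should be enough to mix stages, provided each stage accepts what the previous one produces.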

Features

It is a modular chain, so the tools can be picked and swapped, as long as they read and write the required data format via the standard streams. The processors interact like Unix pipes: each one takes standard input, does some linguistic processing, and produces standard output that feeds directly into the next one.

NAF is used to represent and pipe linguistic annotations. It is a linguistic annotation format designed for complex NLP pipelines. By default, the input and output of every tool is in NAF, except for the input of the first tool, ixa-pipe-pos-eu, which takes raw text.

Transmission of morphologically rich annotations among tools.

All the tools work with UTF-8 character encoding.

Publicly available tools (binary tarballs).

Distributed under a free software license, GPL v3.

The annotation chain can be applied to a single sentence, a paragraph or whole document.

The robustness of the chain is being tested through extensive processing.

Minimal installation or preparation effort in order to get started using the tools.

Using a simple command-line interface, running the chain is easy.
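
The stream-based, NAF-in and NAF-out contract listed above can be sketched with a small Python stage that reads a NAF document, logs a summary, and forwards the bytes unchanged. This is only an illustration and not part of ixaKat. It assumes the usual NAF layout (a NAF root element carrying xml:lang, wf elements in the text layer, term elements in the terms layer); the names summarize_naf, passthrough and SAMPLE are hypothetical, and the sample document is hand-written, not real tool output.

```python
import io
import sys
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def summarize_naf(naf_bytes):
    """Count word forms and terms in a NAF document given as UTF-8 bytes."""
    root = ET.fromstring(naf_bytes)
    return {
        "lang": root.get(XML_LANG),
        "word_forms": len(root.findall("./text/wf")),
        "terms": len(root.findall("./terms/term")),
    }


def passthrough(instream, outstream, log=sys.stderr):
    """Read one NAF document, log a summary, and forward the bytes unchanged."""
    data = instream.read()
    print(summarize_naf(data), file=log)
    outstream.write(data)


# Hand-written sample (assumption: simplified, for illustration only).
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<NAF xml:lang="eu" version="v3">'
    b'<text><wf id="w1" sent="1">Kaixo</wf><wf id="w2" sent="1">mundua</wf></text>'
    b'<terms><term id="t1" lemma="kaixo"><span><target id="w1"/></span></term></terms>'
    b'</NAF>'
)

if __name__ == "__main__":
    # In a real pipeline: passthrough(sys.stdin.buffer, sys.stdout.buffer)
    out = io.BytesIO()
    passthrough(io.BytesIO(SAMPLE), out)
```

Working on bytes leaves the UTF-8 encoding untouched, so a stage like this could sit between any two NAF-producing and NAF-consuming tools without altering what they exchange.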