LREC98. SALTMIL workshop. Review by Nicholas Ostler (SALTMIL-1998)

Workshop on Language Resources for European Minority Languages
Granada, Spain; 27 May 1998 (morning)

The old quip attributed to Uriel Weinreich, that a language is a dialect with an army and a navy, is being replaced in these progressive days: a language is a dialect with a dictionary, grammar, parser and a multi-million-word corpus of texts, and they'd better all be computer-tractable. When you've got all of those, get yourself a speech database, and your language will be poised to compete on terms of equality in the new Information Society.

This is in principle a much more accommodating doctrine than the old one: all that is necessary for the new attributes of linguistic dignity is some money, and a university team with an interest in your language, although (as with the army-and-navy test) it does help if there is some civic unity (to provide a centre of gravity for outliers) and if modern users of the language outnumber the records of its use by ancestors.

Still, even in these informatic days, there remains a sense of 'safety in numbers' for so-called "minority" languages, i.e. those which still fail the army test; and the half-hearted attitude in Ireland means that Irish tends to fall back into this category. (The navy part of the criterion was never serious: look at Hungarian and Czech.) So when ELRA announced a Conference of Language Resources and Engineering, a carnival of words and data-stores which possessed Granada in the last week of May this year, there was a clear role for a separate Workshop on Language Resources for European Minority Languages.

It was one of eight workshops attached to the Conference proper, but the only one which ranged over the whole gamut of resources: all the other workshops were restricted to one or other technical aspect of the vast task of gathering these materials, evaluating them, and making them accessible to those who need them. The unspoken premiss of the Workshop was: how do you go about gathering resources without patriotic help from a central government?

The implicit answers to the question were various. Not surprisingly, the more speakers your language has, the more options (and hence projects) there are likely to be.

Within the kingdom of Spain, Catalan and Galician are both regional languages, but they each have over 3 million speakers (Catalan, over 4 million), and they were represented by a variety of projects at the workshop: 6 for Catalan and 4 for Galician. (At this level of population they are only slightly less spoken than national languages such as Danish and Finnish, and much more than Icelandic's 230 thousand.) Telefonica, the Spanish telephone company, had gathered speech data for them, since they were so widely spoken in Spain, but for the gathering of textual resources they were each supported by their regional governments, the Generalitat de Catalunya and the Xunta de Galicia.

At the next level down, languages with half a million speakers (Welsh, Basque, Breton), the situation is various: we heard of three projects each for Basque and for Welsh, but only one for Breton. Basque has a speech database, and a framework for textual processing at all levels from morphology to semantic networks. The Spanish and Basque Governments had provided some funding to the university teams involved, essentially through the general mechanism for supporting research. Welsh is pretty much on a par (with a phonetic DB supported by one of the UK's Research Councils), and European funding was now in evidence (via the SpeechDat II consortium); a university department (Bangor) had various text processing resources to offer, but not a parser. Breton was the Cinderella of the trio, with only one project reported, KGB: this was centred very much on the use of computers for education. It is hard to avoid the conclusion that linguistic resource development tends to correlate with the level of political autonomy of a language community.

Irish, with a state to represent it but only 260 thousand speakers, was present in two projects, both at the same institution (ITE in Dublin), and both supported by the European Union: the level of linguistic sophistication was also still fairly low, with a tagged corpus and a web-based association to pool data with other minority languages on offer.

At even lower levels of population, attention switches to basic questions of establishing a workable standard, or finding a way to manage the variation that is present in the corpus. So Ladin of the Dolomites (with 35 thousand) struggles to emerge from its fissiparous dialects, with the SPELL project supported by the EU. And the 150 speakers of Cornish still have to make a painful choice between three artificial standards derived from pre-modern texts: there was, for 150 years, no living tradition to take these decisions out of the hands of the individual. The Cornish work appears to be the lonely effort of a single enthusiastic academic, without support.

The projects represented at the workshop were only from Western Europe: one might have hoped (though perhaps unrealistically) to see something of activities for Kashubian in Poland (200 thousand speakers) or Saami in Scandinavia (20 thousand). And even in Western Europe, the full variety was absent: no Scots Gaelic (89 thousand), Corsican (281 thousand in France) or Sardinian (lively, but uncounted). Perhaps with such small numbers one can expect only sporadic representation at any conference.

But we have already heard enough to see that in practice, central or regional government recognition is the key to real development of resources. Nevertheless the real cost of acquiring and preparing the data is falling all the time, as better equipped languages blaze the trails and metal the roads. And trans-national government, in the form of the European Union, can remedy some of the omissions of mean or un-self-confident nation states.

What is necessary is to follow one of these many excellent examples, and do the work. For in this field, at root, all languages are on a par. To apply a phrase of Rabbie Burns: a Tongue's a Tongue, for a' that.


Nicholas Ostler
Managing Director, Linguacubun Ltd
President, Foundation for Endangered Languages
Batheaston Villa, 172 Bailbrook Lane, Bath, BA1 7AA, England
Tel: +44-1225-85-2865
Fax: +44-1225-85-9258