News

Report on the 9th SaLTMiL Workshop (Reykjavik, 2014)


Report on the 9th SaLTMiL Workshop on “Free/open-Source Language Resources for the Machine Translation of Less-Resourced Languages”
by Mikel L. Forcada
Tuesday, 27 May 2014.
Reykjavik (Iceland)
(Proceedings)


The 9th SaLTMiL workshop on Free/open-source language resources for the machine translation of less-resourced languages, held as part of LREC 2014 in Reykjavík on May 27, 2014, from 09:30 to 13:30, was a very well attended event. About 40 people were present, more than the 31 attendees registered as of 22nd May, 2014.
After a brief welcoming address by Mikel Forcada, there were two oral sessions, separated by a coffee break. Both sessions ran very smoothly, with plenty of questions from the audience.

Iñaki Alegria, Unai Cabezon, Unai Fernandez de Betoño, Gorka Labaka, Aingeru Mayor, Kepa Sarasola and Arkaitz Zubiaga

Wikipedia and Machine Translation: killing two birds with one stone

Gideon Kotzé and Friedel Wolff
Experiments with syllable-based English-Zulu alignment

In the second session, chaired by Trond Trosterud, the paper by Matthew Marting and Kevin Unhammer was presented by Francis Tyers as the authors could not make it to Iceland.

Inari Listenmaa and Kaarel Kaljurand
Computational Estonian Grammar in Grammatical Framework

Matthew Marting and Kevin Unhammer
FST Trimming: Ending Dictionary Redundancy in Apertium

Hrvoje Peradin, Filip Petkovski and Francis Tyers
Shallow-transfer rule-based machine translation for the Western group of South Slavic languages

Alex Rudnick, Annette Rios Gonzales and Michael Gasser
Enhancing a Rule-Based MT System with Cross-Lingual WSD

The workshop ended with a 30-minute general discussion, moderated by Francis Tyers. Two main questions were posed to the audience:
  1. Is research in minority-language machine translation already mainstream?
  2. What are the main difficulties in building or putting together free/open-source language resources for small languages, and how should they be addressed? Are we pooling these resources correctly?
The audience was also invited to openly discuss other issues if necessary. Here is a detailed summary of what was discussed.

Question (1) quickly turned into a discussion on research about rule-based systems.
Lori Levin said that minority languages are becoming mainstream and researchers are publishing in journal venues, and that we need to educate people on the research issues related to rule-based language resources. She added: "It seems that tinkering with statistical models is research whereas tinkering with rules is not".
Robert Frederking said that it is hard to publish papers on rule-based machine translation, at least in America. Francis Tyers replied that it may be easier in Europe, perhaps because European and American funding objectives are different.
Someone [not identified in Mikel Forcada's notes: apologies!] said that linguistic research may help in facing some issues. Lori Levin said that the problem is partly linguistic and partly not, and the key is where to spend time in rule-based machine translation for maximum impact.
Francis Tyers mentions that the statistical machine translation community is strong partly because it uses standardized evaluation measures. Antonio Toral mentioned that most papers in machine translation conferences are on statistical machine translation, and that the improvements reported are usually less than one BLEU point, but added that, in view of the results of a workshop on machine translation evaluation held the day before, there is very little correlation between BLEU score and productivity gains. Maja Popović added that there is a tendency in statistical machine translation towards morphologically rich, under-resourced languages. Trond Trosterud is sceptical about research that does not take into account existing morphologies for these languages or does not aim at developing them. Maja Popović adds that it is better to have something than nothing and that all knowledge should be combined.
The role of linguists and their involvement is also discussed. Jonathan Washington explains his experience as a linguist getting involved in morphologies for Turkic languages and the issues faced. Lori Levin says that we should get more linguists involved, but that rule writing is a skill that not everyone has; many people think they can write rules, but we should educate linguists to be rule writers. On interaction between linguists and prominent statistical machine translation researchers, she says that the latter are very busy people and that it is quite hard to get into their schedule to discuss these issues.
Laurette Pretorius speaks from a South African context. She says that computational linguistics is not taught in South African universities, and stresses the importance of collaboration. She says that linguists should not be assigned the boring tasks, such as annotation tasks, but that they should be involved in the whole design. Francis Tyers also talks about the choice between doing tedious annotation and more interesting rule writing, and understands that people would prefer the latter. Mikel Forcada warns that linguists tend to get carried away by the low-frequency "jewels" of their languages and lose sight of the high-frequency "building blocks" needed for working systems.
Christian Buck returns to the fact that the statistical machine translation community has yearly "shoot-outs" (contests) where they can test their advancement, and that these contests do drive their research.
Jonathan Washington mentions that the computational perspective made him rethink many of the issues relating to Turkic languages. Mikel Forcada mentions that computational-linguistics descriptions are sometimes the best descriptions of a language, and cites the IXA Group's "computational morphology of Basque" and Elaine Uí Dhonnchadha's Irish morphology as the best descriptions of their languages' morphologies.
Lori Levin stresses the fact that linguists have to be trained to do the linguistic engineering. For instance, lexical-functional grammars may teach aspects such as modularity.
Mikel L. Forcada mentions two problems in rule-based research: one, that rule-based MT as a field is very fragmented after the pervasive irruption of statistical machine translation, and, as a result, we do not speak with one voice and use inconsistent terminologies which make it very difficult to articulate ourselves as a field. Another one is reproducibility: for our rule-based research to be reproducible we have to make it all available, and this naturally leads to free/open-source licensing.
Sjur Moshagen says that resources should be reusable in other language technologies. He explains that in Tromsø they had no option: they were building the only set of resources for Sámi languages, so the resources had to be reusable and linguists had to be trained in engineering issues. In fact, one of the uses was the Apertium machine translation systems for Sámi languages.

Sjur Moshagen opens question (2) about pooling and resource sharing and asks where would be a good place to take all these resources.
On reproducibility, it is discussed that commercial rule-based systems often do not even want to be mentioned by name and ask to be called "System A" or "System B" in contests. Some judge this to be unfortunate.
Mikel Forcada mentions that pooling should pay special attention to metadata describing how to use the resources. He says that this is a more difficult problem than licensing.
Friedel Wolff questions whether licensing is really an easy problem and talks about incompatibilities among licenses such as Creative Commons and the General Public License, and says that when it comes to licensing derived data, decisions may be far from trivial.

The discussion stops here and moderator Francis Tyers thanks everyone for the rich discussion. Mikel L. Forcada thanks everyone for attending and closes the Workshop.
 

Programme of the 9th SaLTMiL Workshop on “Free/open-Source Language Resources for the Machine Translation of Less-Resourced Languages”


A half-day workshop at LREC 2014
Tuesday, 27 May 2014.
Reykjavik (Iceland)

SALTMIL: http://ixa2.si.ehu.es/saltmil/
LREC 2014: http://lrec2014.lrec-conf.org/en/
Website: http://ixa2.si.ehu.es/saltmil/


Registration: http://lrec2014.lrec-conf.org/en/registration/

 

Workshop Programme
(Proceedings)

09:00 – 09:30 Welcoming address by Workshop co-chair Mikel L. Forcada

09:30 – 10:30 Oral papers

Iñaki Alegria, Unai Cabezon, Unai Fernandez de Betoño, Gorka Labaka, Aingeru Mayor, Kepa Sarasola and Arkaitz Zubiaga

Wikipedia and Machine Translation: killing two birds with one stone

Gideon Kotzé and Friedel Wolff
Experiments with syllable-based English-Zulu alignment

10:30 – 11:00 Coffee break

11:00 – 13:00 Oral papers

Inari Listenmaa and Kaarel Kaljurand
Computational Estonian Grammar in Grammatical Framework

Matthew Marting and Kevin Unhammer
FST Trimming: Ending Dictionary Redundancy in Apertium

Hrvoje Peradin, Filip Petkovski and Francis Tyers
Shallow-transfer rule-based machine translation for the Western group of South Slavic languages

Alex Rudnick, Annette Rios Gonzales and Michael Gasser
Enhancing a Rule-Based MT System with Cross-Lingual WSD

13:00 – 13:30 General discussion

13:30 Closing

Workshop Organizers

Mikel L. Forcada, Universitat d’Alacant, Spain
Kepa Sarasola, Euskal Herriko Unibertsitatea, Spain
Francis M. Tyers, UiT Norgga árktalaš universitehta, Norway

Workshop Programme Committee

Iñaki Alegria, Euskal Herriko Unibertsitatea, Spain
Lars Borin, Göteborgs Universitet, Sweden
Elaine Uí Dhonnchadha, Trinity College Dublin, Ireland
Mikel L. Forcada, Universitat d’Alacant, Spain
Michael Gasser, Indiana University, USA
Måns Huldén, Helsingin Yliopisto, Finland
Krister Lindén, Helsingin Yliopisto, Finland
Nikola Ljubešić, Sveučilište u Zagrebu, Croatia
Lluís Padró, Universitat Politècnica de Catalunya, Spain
Juan Antonio Pérez-Ortiz, Universitat d’Alacant, Spain
Felipe Sánchez-Martínez, Universitat d’Alacant, Spain
Kepa Sarasola, Euskal Herriko Unibertsitatea, Spain
Kevin P. Scannell, Saint Louis University, USA
Antonio Toral, Dublin City University, Ireland
Trond Trosterud, UiT Norgga árktalaš universitehta, Norway
Francis M. Tyers, UiT Norgga árktalaš universitehta, Norway

Introduction

The 9th International Workshop of the Special Interest Group on Speech and Language Technology for Minority Languages (SaLTMiL; see http://ixa2.si.ehu.es/saltmil/) will be held in Reykjavík, Iceland, on 27th May 2014, as part of the 2014 International Language Resources and Evaluation Conference (LREC); it is also framed as one of the activities of the European project Abu-Matran (http://www.abumatran.eu). Entitled "Free/open-source language resources for the machine translation of less-resourced languages", the workshop is intended to continue the series of SALTMIL/LREC workshops on computational language resources for minority languages, held in Granada (1998), Athens (2000), Las Palmas de Gran Canaria (2002), Lisbon (2004), Genoa (2006), Marrakech (2008), Valletta (2010) and Istanbul (2012), and is also expected to attract the audience of the Free Rule-Based Machine Translation workshops (2009, 2011, 2012).

The workshop aims to share information on language resources, tools and best practice, to save isolated researchers from starting from scratch when building machine translation for a less-resourced language. An important aspect will be the strengthening of the free/open-source language resources community, which can minimize duplication of effort and optimize development and adoption, in line with the LREC 2014 hot topic ‘LRs in the Collaborative Age’ (http://is.gd/LREChot).

Papers describe research and development in the following areas:

  • Free/open-source language resources for rule-based machine translation (dictionaries, rule sets)
  • Free/open-source language resources for statistical machine translation (corpora)
  • Free/open-source tools to annotate, clean, preprocess, convert, etc. language resources for machine translation
  • Machine translation as a tool for creating or enriching free/open-source language resources for less-resourced languages

 

   

Call for Papers: 9th SaLTMiL workshop on “Free/open-source language resources for the machine translation of less-resourced languages” at LREC 2014.


A full-day workshop at LREC 2014
Tuesday, 27 May 2014.
Reykjavik (Iceland)

SALTMIL: http://ixa2.si.ehu.es/saltmil/
LREC 2014: http://lrec2014.lrec-conf.org/en/
Website: http://ixa2.si.ehu.es/saltmil/
Paper submission: https://www.softconf.com/lrec2014/SaLTMiL/

The 9th International Workshop of the Special Interest Group on Speech and Language Technology for Minority Languages (SaLTMiL; see http://ixa2.si.ehu.es/saltmil/) will be held in Reykjavík, Iceland, on May 27, 2014, as part of the 2014 International Language Resources and Evaluation Conference (LREC); it is also framed as one of the activities of the European project Abu-Matran (http://www.abumatran.eu). Entitled "Free/open-source language resources for the machine translation of less-resourced languages", the workshop is intended to continue the series of SALTMIL/LREC workshops on computational language resources for minority languages, held in Granada (1998), Athens (2000), Las Palmas de Gran Canaria (2002), Lisbon (2004), Genoa (2006), Marrakech (2008), Valletta (2010) and Istanbul (2012), and is also expected to attract the audience of the Free Rule-Based Machine Translation workshops (2009, 2011, 2012).

The workshop aims to share information on language resources, tools and best practice, to save isolated researchers from starting from scratch when building machine translation for a less-resourced language. An important aspect will be the strengthening of the free/open-source language resources community, which can minimize duplication of effort and optimize development and adoption, in line with the LREC 2014 hot topic ‘LRs in the Collaborative Age’ (http://is.gd/LREChot).

The whole-day workshop will consist of short oral papers, a poster session preceded by a poster-boaster session (2 minutes, 2 slides per poster), and a round table.

Papers are invited that describe research and development in the following areas:

  • Free/open-source (FOS) language resources (LRs) for rule-based machine translation (dictionaries, rule sets)
  • FOS LRs for statistical machine translation (corpora)
  • FOS tools to annotate, clean, preprocess, convert, etc. LRs for machine translation
  • Machine translation as a tool for creating or enriching FOS LRs for less-resourced languages

Position papers and (web based) demonstrations will also be considered for presentation.

The best papers, as evaluated by the programme committee, will be presented orally and the remaining papers will be presented in poster format.

We expect short papers of max 6,000 words (up to 6 pages) describing research addressing one of the above topics, to be submitted as PDF documents by using the LREC 2014 START conference management system (https://www.softconf.com/lrec2014/SaLTMiL/).

Submissions should be anonymized. When submitting a paper through the START page, authors will be kindly asked to share the resources that have been used for the work described in their paper or that are the outcome of their research. For further information on this initiative, please refer to http://lrec2014.lrec-conf.org/en/calls-for-papers/lrec-2014-special-highlight/.

Submissions of papers should follow the same style as the papers for the main LREC conference (an Author's Kit made of specific guidelines and downloadable templates will be published on the conference web site in due time). All contributions will be included in the workshop proceedings (CD). They will also be published on the SALTMIL website.

The registration fees will be duly announced at the LREC 2014 site. Registration for the workshop will include a coffee break and the Proceedings of the Workshop. Registration will be handled by the LREC 2014 Secretariat.

 

Important dates

Deadline for paper submission: February 17, 2014 (extended from February 10, 2014)
Notification of acceptance sent: March 14, 2014 (postponed from March 3 and March 10, 2014)
Camera-ready paper due: March 21, 2014

Organizing committee


(1) Dr Francis M Tyers
Institutt for språkvitskap
Det humanistiske fakultet,
N-9037 Universitetet i Tromsø

(2) Dr Kepa Sarasola
Computer Science Faculty
Dept. of Computer Languages
The University of the Basque Country
P.K. 649 20080 DONOSTIA
Basque Country, Spain
Tel: +34 943 01 81 54
Fax: +34 943 21 93 06
http://ixa.si.ehu.es

(3) Prof Mikel L. Forcada
Dept. Llenguatges i Sistemes informàtics
Universitat d’Alacant
E-03071 Alacant (Spain)
Tel: +34 96 590 9776
Fax: +34 96 590 9326
http://www.dlsi.ua.es/~mlf

 

Programme Committee

Iñaki Alegria, Euskal Herriko Unibertsitatea, Spain
Lars Borin, Göteborgs Universitet, Sweden
Elaine Uí Dhonnchadha, Trinity College Dublin, Ireland
Mikel L. Forcada, Universitat d’Alacant, Spain
Michael Gasser, Indiana University, USA
Måns Huldén, Helsingin Yliopisto, Finland
Krister Lindén, Helsingin Yliopisto, Finland
Nikola Ljubešić, Sveučilište u Zagrebu, Croatia
Lluís Padró, Universitat Politècnica de Catalunya, Spain
Juan Antonio Pérez-Ortiz, Universitat d’Alacant, Spain
Felipe Sánchez-Martínez, Universitat d’Alacant, Spain
Kepa Sarasola, Euskal Herriko Unibertsitatea, Spain
Kevin P. Scannell, Saint Louis University, USA
Antonio Toral, Dublin City University, Ireland
Trond Trosterud, Universitetet i Tromsø, Norway
Francis M. Tyers, Universitetet i Tromsø, Norway

 

   

Report on the 8th LREC workshop. 2012


On May 22nd 2012, SALTMIL held, in collaboration with AfLaT, a full-day workshop on "Language technology for normalisation of less-resourced languages". This was a satellite workshop preceding the biennial LREC (Language Resources and Evaluation Conference) in Istanbul, Turkey.

The program started with the invited talk presented by Sjur Moshagen Nørstebø. This was then followed by two sessions of four oral presentations and a poster session with eight contributed poster papers. All the presentations and posters stimulated many questions and discussions.

At 17.30, after a brief presentation by Francis Tyers and Guy De Pauw, an interesting discussion took place on "Language technology for normalisation of less-resourced languages", and the workshop was then closed by thanking the audience for their participation throughout.

About forty-five people were present in total, from a wide range of countries, and representing work on a variety of less-resourced languages.

Additional materials related to this workshop are available:

  • Proceedings (pdf)
  • Slides of Sjur Moshagen Nørstebø's invited talk (pdf)
  • Posters and slides of some oral presentations. (zip)
   

LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced languages”



A full-day workshop at LREC 2012
Tuesday, 22 May 2012.
Lütfi Kirdar Istanbul Exhibition and Congress Centre, Istanbul, Turkey

SALTMIL: http://ixa2.si.ehu.es/saltmil/
AfLaT: http://AfLaT.org/
LREC 2012: http://www.lrec-conf.org/lrec2012/

WORKSHOP PROGRAMME

09:15–09:30 Welcome / Opening Session
09:30–10:30 Invited Talk: Sjur Moshagen Nørstebø. How to build language technology resources for the next 100 years
10:30–11:00 Coffee Break
11:00–13:00 Oral papers: Resource Creation

  • Elaine Uí Dhonnchadha, Alessio Frenda and Brian Vaughan, Issues in Designing a Spoken Corpus of Irish.
  • Wondwossen Mulugeta and Michael Gasser, Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming.
  • Kristín Bjarnadóttir, The Database of Modern Icelandic Inflection.
  • Fadoua Ataa Allah and Siham Boulaknadel, Natural Language Processing for Amazigh Language: Challenges and Future Directions

13:00–14:00 Lunch Break
14:00–16:00 Oral papers: Resource Use

  • Tommi A. Pirinen and Francis M. Tyers. Compiling Apertium morphological dictionaries with HFST and using them in HFST applications.
  • Borbála Siklósi, György Orosz, Attila Novák and Gábor Prószéky. Automatic structuring and correction suggestion system for Hungarian clinical records.
  • Linda Wiechetek. Constraint Grammar based Correction of Grammatical Errors for North Sámi.
  • Michael Gasser, Toward a Rule-Based System for English-Amharic Translation.

16:00–16:30 Coffee Break
16:30–17:30 Poster Session

  • Emmanuel Cartier and Paola Carrion Gonzalez, Technological Tools for Dictionary and Corpora Building for Minority Languages: Example of the French-based Creoles.
  • Denys Duchier, Brunelle Magnana Ekoukou, Yannick Parmentier, Simon Petitjean and Emmanuel Schang, Describing Morphologically-rich Languages using Metagrammars: a Look at Verbs in Ikota.
  • Tjerk Hagemeijer, Iris Hendrickx, Abigail Tiny and Haldane Amaro, A Corpus of Santomé.
  • Sigrún Helgadóttir, Ásta Svavarsdóttir, Eiríkur Rögnvaldsson, Kristín Bjarnadóttir and Hrafn Loftsson, The Tagged Icelandic Corpus (MÍM).
  • Laurette Pretorius and Sonja Bosch, Semi-automated extraction of morphological grammars for Nguni with special reference to Southern Ndebele.
  • Björn Gambäck, Tagging and Verifying an Amharic News Corpus.
  • Guy De Pauw, Gilles-Maurice de Schryver and Janneke van de Loo. Resource-Light Bantu Part-of-Speech Tagging.
  • Gulshan Dovudov, Vít Suchomel and Pavel Smerk, POS Annotated 50M Corpus of Tajik Language.


CONTEXT AND FOCUS

The 8th International Workshop of the ISCA Special Interest Group on Speech and Language Technology for Minority Languages (SALTMIL, http://ixa2.si.ehu.es/saltmil) and the 4th Workshop on African Language Technology (AfLaT2012) will be held as a joint effort in Istanbul, in May 2012, as part of the 2012 International Language Resources and Evaluation Conference (LREC 2012).

Entitled "Language technology for normalisation of less-resourced languages", the workshop is intended to continue the series of SALTMIL/LREC workshops on computational language resources for minority languages, held in Granada (1998), Athens (2000), Las Palmas de Gran Canaria (2002), Lisbon (2004), Genoa (2006), Marrakech (2008) and Malta (2010), and the series of AfLaT workshops, held in Athens (EACL2009), Malta (LREC2010) and Addis Ababa (AGIS11).

The Istanbul 2012 workshop aims to share information on tools and best practices, so that isolated researchers will not need to start from scratch. An important aspect will be the forming of personal contacts, which can minimize duplication of effort. There will be a balance between presentations of existing language resources, and more general presentations designed to give background information needed by all researchers.

While less-resourced languages and minority languages often struggle to find their place in a digital world dominated by only a handful of commercially interesting languages, a growing number of researchers are working on alleviating this linguistic digital divide, through localisation efforts, the development of BLARKs (basic language resource kits) and practical applications of human language technologies. The joint SALTMIL/AfLaT workshop on "Language technology for normalisation of less-resourced languages" provides a unique opportunity to connect these researchers and set up a common forum to meet and share the latest developments in the field.

ORGANIZERS (SALTMIL and AfLaT)

* Mikel L. Forcada (SALTMIL): Machine Translation Group, School of Computing, Dublin City University, Dublin, Ireland
* Guy De Pauw (AfLaT): CLiPS - Computational Linguistics Group, University of Antwerp, Antwerp, Belgium
* Gilles-Maurice de Schryver (AfLaT): African Languages and Cultures, TshwaneDJe HLT, South Africa & Ghent University, Belgium
* Kepa Sarasola (SALTMIL): Dept. of Computer Languages, University of the Basque Country
* Francis M. Tyers (SALTMIL): Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, Spain
* Peter Waiganjo Wagacha (AfLaT): School of Computing & Informatics, University of Nairobi, Nairobi, Kenya

PROGRAMME COMMITTEE

* Iñaki Alegria, University of the Basque Country
* Núria Bel, Universitat Pompeu Fabra, Barcelona, Spain
* Lars Borin, Göteborgs universitet, Sweden
* Sonja Bosch, University of South Africa, South Africa
* Khalid Choukri, ELRA/ELDA, France
* Mikel L. Forcada, Universitat d’Alacant
* Dafydd Gibbon, University of Bielefeld, Germany
* Girish Nath Jha, Jawaharlal Nehru University, India
* Hrafn Loftsson, Reykjavik University
* Guy De Pauw, CLiPS, Universiteit Antwerpen
* Laurette Pretorius, University of South Africa, South Africa
* Lori Levin, Carnegie Mellon University, USA
* Odetunji Odejobi, Obafemi Awolowo University, Nigeria
* Benoît Sagot, INRIA Paris Rocquencourt & Université Paris 7, France
* Felipe Sánchez-Martínez, Universitat d'Alacant
* Kepa Sarasola, University of the Basque Country
* Kevin Scannell, Saint Louis University, USA
* Gilles-Maurice de Schryver, Universiteit Gent
* Trond Trosterud, Universitetet i Tromsø, Norway
* Francis M. Tyers, Universitat d'Alacant
* Peter Waiganjo Wagacha, University of Nairobi


REGISTRATION

See the registration page on the LREC 2012 site.

   
