
Save Indonesian Ethnic Languages from Extinction

Semi-automatically create bilingual dictionaries among various Indonesian Ethnic Languages


Building dictionary development ecosystem for Low Resource Languages

Indonesia has a population of 221,398,286 and 707 living languages, which account for 57.8% of the Austronesian family and 30.7% of the languages in Asia [1]. There are 341 Indonesian ethnic languages facing varying degrees of endangerment (in trouble or dying), and some of their native speakers do not speak Bahasa Indonesia well because they live in remote areas. Unfortunately, 13 Indonesian ethnic languages are already extinct. To save low-resource languages such as the Indonesian ethnic languages from endangerment, we are trying to enrich their most basic language resource, the bilingual dictionary. Lately, low-resource languages have been receiving more attention from UNESCO, ELRA, ACM, and others.


These are the statistics for the languages, bilingual dictionaries, and publications of this project.




Created Bilingual Dictionaries

In the first experiment (2016), we created bilingual dictionaries for all combinations of 5 languages (Indonesian, Malay, Minangkabau, Sundanese, and Javanese). In the second experiment (2019), we added Banjarese and Palembang to the set. As the third experiment, planned for the end of 2020, we intend to expand the bilingual dictionaries of the original 5 languages to 4,000 translation pairs each.
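To give a sense of scale, the short sketch below enumerates the language pairs implied by "all combinations". It assumes that each dictionary covers an unordered pair of languages (so Indonesian-Malay and Malay-Indonesian count once). The count for seven languages is only what full coverage would require; the text does not state that every pair was built in the second experiment. The function name `language_pairs` is illustrative and not part of the project.

```python
from itertools import combinations

# Languages named in the first experiment (2016).
first_experiment = ["Indonesian", "Malay", "Minangkabau", "Sundanese", "Javanese"]

# Languages after Banjarese and Palembang were added (2019).
second_experiment = first_experiment + ["Banjarese", "Palembang"]


def language_pairs(languages):
    """Return every unordered pair of distinct languages.

    Assumption: one bilingual dictionary per unordered pair.
    """
    return list(combinations(languages, 2))


if __name__ == "__main__":
    for label, langs in [("5 languages", first_experiment),
                         ("7 languages", second_experiment)]:
        pairs = language_pairs(langs)
        print(f"{label}: {len(pairs)} possible bilingual dictionaries")
        for a, b in pairs:
            print(f"  {a} - {b}")
```

Under these assumptions, five languages give 10 pairs and seven languages give 21, which shows how quickly the number of dictionaries grows as languages are added.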


Collaborate with us!

Research Collaboration

We consider two factors in selecting the target languages: language similarity and number of speakers. To ensure that the created bilingual dictionaries will be useful to many users, we listed the top 10 Indonesian ethnic languages ranked by number of speakers and selected Javanese and Sundanese on that basis. To find and coordinate native speakers of those languages, we collaborated with Telkom University and University of Indonesia. Since our constraint-based approach works best on closely related languages, we selected Malay, Minangkabau, Palembang, and Banjarese based on their relatedness to Indonesian. To find and coordinate native speakers of those languages, we collaborated with Islamic University of Riau. Hence, we target 5 languages, i.e., Indonesian (ind), Malay (zlm), Minangkabau (min), Javanese (jav), and Sundanese (sun).

Our Team

We work together on computational linguistics, natural language processing, machine learning, and crowdsourcing approaches to enrich the language resources of Indonesian ethnic languages.

Arbi Haza Nasution

Head of Department

Department of Informatics Engineering, Universitas Islam Riau, Indonesia



Yohei Murakami

Associate Professor

College of Information Science and Engineering, Ritsumeikan University, Japan



Toru Ishida


Global Center for Science and Engineering, Waseda University, Japan



Totok Suhardijanto

Head of Department

Department of Linguistics, University of Indonesia, Jakarta, Indonesia



Research Activities and Publications

Language Similarity Clustering

Lexicostatistics and language similarity clusters are useful for computational linguistics research that depends on language similarity or cognate recognition. Nevertheless, there is no published lexicostatistic/language similarity cluster of Indonesian ethnic[…]


Bilingual Dictionary Induction

The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. The pivot language and cognate recognition approaches have been proven useful[…]


MDP Plan Optimizer

Creating a bilingual dictionary is the first crucial step in enriching low-resource languages. Especially for closely related ones, it has been shown that the constraint-based approach is useful for inducing bilingual[…]
