INDONESIAN LANGUAGE SPHERE PROJECT
Building a dictionary development ecosystem for Low-Resource Languages
Indonesia has a population of 221,398,286 and 707 living languages, which cover 57.8% of the Austronesian family and 30.7% of the languages in Asia [1]. There are 341 Indonesian ethnic languages facing various degrees of endangerment (in trouble or dying), and some of their native speakers do not speak Bahasa Indonesia well because they live in remote areas. Unfortunately, 13 Indonesian ethnic languages are already extinct. To save low-resource languages such as the Indonesian ethnic languages from endangerment, we are working to enrich their most basic language resource, the bilingual dictionary. Lately, low-resource languages have been receiving more attention from UNESCO, ELRA, ACM, and others.



Created Bilingual Dictionaries
In the first experiment (2016), we created all combinations of bilingual dictionaries from 5 languages (Indonesian, Malay, Minangkabau, Sundanese, and Javanese). In the second experiment (2019), we added Banjarese and Palembang to the family. As the third experiment, at the end of 2020, we plan to enrich the bilingual dictionaries of the original 5 languages to 4,000 translation pairs each.
Collaborate with us!
International Research Collaboration
We consider two factors in selecting the target languages: language similarity and number of speakers. To ensure that the created bilingual dictionaries will be useful to many users, we listed the top 10 Indonesian ethnic languages ranked by number of speakers and then selected Javanese and Sundanese on that basis. To find and coordinate native speakers of those languages, we collaborated with Telkom University and University of Indonesia. Since our constraint-based approach works best on closely related languages, we selected Malay, Minangkabau, Palembang, and Banjarese based on their relatedness to Indonesian. To find and coordinate native speakers of those languages, we collaborated with Islamic University of Riau. Hence, we target 5 languages: Indonesian (ind), Malay (zlm), Minangkabau (min), Javanese (jav), and Sundanese (sun).
Our Team
We work closely on computational linguistics, natural language processing, machine learning, and crowdsourcing approaches to enrich Indonesian Ethnic Languages.

Arbi Haza Nasution
Head of Department
Department of Informatics Engineering, Universitas Islam Riau, Indonesia
arbi[at]eng.uir.ac.id
website
Yohei Murakami
Associate Professor
College of Information Science and Engineering, Ritsumeikan University, Japan
yohei[at]fc.ritsumei.ac.jp
website
Toru Ishida
Professor
Global Center for Science and Engineering, Waseda University, Japan
toru.ishida[at]aoni.waseda.jp
website
Totok Suhardijanto
Head of Department
Department of Linguistics, University of Indonesia, Jakarta, Indonesia
totok.suhardijanto[at]ui.ac.id
website
Research Activities and Publications
Language Similarity Clustering
Lexicostatistic and language similarity clusters are useful for computational linguistics research that depends on language similarity or cognate recognition. Nevertheless, there are no published lexicostatistic/language similarity clusters of Indonesian ethnic[…]
Bilingual Dictionary Induction
The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. The pivot language and cognate recognition approaches have been proven useful[…]
MDP Plan Optimizer
Creating a bilingual dictionary is the first crucial step in enriching low-resource languages. Especially for closely related ones, it has been shown that the constraint-based approach is useful for inducing bilingual[…]