ICT for Preserving Indigenous Languages

Sarah Samson Juan

Senior Lecturer, Faculty of Computer Science and Information Technology &

Research Fellow, Institute of Social Informatics and Technological Innovations

Universiti Malaysia Sarawak, Malaysia

Pustaka Negeri Sarawak, Kuching



Research on Speech Technology

- Speech synthesis
- Speech recognition
- Speaker recognition/verification
- Keyword spotting
- Multimodal interaction (e.g., speech + image)
- Speech to speech



Automatic Speech Recognition (ASR)

ASR applications



Introduction

Current situation for languages in Malaysia

Languages in Malaysia

Population: 30 million
Official language: Malay
Second language: English

Living languages

Total: 138

Endangered languages

In Trouble: 101
Dying: 15

Extinct languages

Total: 2

Lewis, Simons, and Fennig, Ethnologue: Languages of the World, Seventeenth Edition, 2014



How can we help to preserve or maintain languages?

Language documentation:

- Speech in native language
- Problem: transcribing speech manually is a tedious task
- An automatic speech recognition system can speed up the process

Similar projects: BULB, Aikuma
Local research group: Sarawak Language Technology (SaLT), Unimas


Challenges in building ASR for under-resourced languages

Automatic speech recognition system (ASR)

Figure: Block diagram of an ASR system. Model training: speech and text data (speech transcripts) are used for acoustic modelling, pronunciation modelling and language modelling, producing the acoustic model, the pronunciation model (pronunciation lexicon) and the language model. Speech recognizer: an acoustic signal analyzer and a decoder use these models to turn speech into text.
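As a rough illustration of how the decoder in the diagram combines the models, the sketch below picks the hypothesis with the best sum of acoustic and language model log scores. The function name `decode`, the `lm_weight` parameter and all scores are made up for illustration; a real decoder searches a far larger hypothesis space and also uses the pronunciation model.

```python
def decode(acoustic_logp, lm_logp, lm_weight=1.0):
    """Return the hypothesis with the highest combined log score.

    acoustic_logp: hypothesis -> log P(speech | words)  (toy values)
    lm_logp:       hypothesis -> log P(words)           (toy values)
    """
    return max(
        acoustic_logp,
        key=lambda hyp: acoustic_logp[hyp] + lm_weight * lm_logp[hyp],
    )


# Two competing hypotheses with invented scores.
acoustic = {"enda lama agi": -12.0, "enda lama ari": -11.5}
lm = {"enda lama agi": -4.0, "enda lama ari": -7.0}

print(decode(acoustic, lm))                 # language model tips the balance
print(decode(acoustic, lm, lm_weight=0.0))  # acoustic score only
```

With these toy numbers the acoustic model alone slightly prefers the second hypothesis, and adding the language model score reverses the choice.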


ASR for under-resourced languages

Challenges in dealing with under-resourced languages:

- Poor linguistic knowledge
- Unstable orthography
- Low speaker diversity in available speech databases
- Low amount of available data
- Low ASR performance


Recent advances in ASR for under-resourced languages

Scientific methods in ASR for under-resourced languages

- Bootstrapping a pronunciation dictionary ([Maskey, Black, and Tomokiyo, 2004], [Juan and Besacier, 2013])
- Merging acoustic models ([Tan, Besacier, and Lecouteux, 2014], [Juan et al., 2015a])
- Cross-lingual and multilingual acoustic models ([Lu, Ghoshal, and Renals, 2014], [Imseng et al., 2014], [Juan et al., 2015b])
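The bootstrapping idea in the first item can be sketched as a loop: train a grapheme-to-phoneme (G2P) model on a small seed lexicon, let it propose pronunciations for new words, have a person correct them, then retrain. This is a toy sketch and not the method of the cited papers: it assumes one phone per letter, and the words, phone symbols and the `correct` callback (standing in for a human annotator) are all hypothetical.

```python
from collections import Counter, defaultdict


def train_g2p(lexicon):
    """Learn the most frequent phone for each letter (toy 1:1 alignment)."""
    counts = defaultdict(Counter)
    for word, phones in lexicon.items():
        if len(word) != len(phones):
            continue  # skip entries that break the one-phone-per-letter assumption
        for letter, phone in zip(word, phones):
            counts[letter][phone] += 1
    return {letter: c.most_common(1)[0][0] for letter, c in counts.items()}


def predict(model, word):
    """Guess a pronunciation; unseen letters fall back to the letter itself."""
    return [model.get(letter, letter) for letter in word]


def bootstrap(seed_lexicon, new_words, correct, batch_size=2):
    """Grow the lexicon batch by batch, retraining before each batch."""
    lexicon = dict(seed_lexicon)
    n_fixed = 0
    for start in range(0, len(new_words), batch_size):
        model = train_g2p(lexicon)  # retrain on everything collected so far
        for word in new_words[start:start + batch_size]:
            guess = predict(model, word)
            final = correct(word, guess)  # human post-editing step
            n_fixed += final != guess
            lexicon[word] = final
    return lexicon, n_fixed


# Hypothetical toy data: two seed entries and three new words.
seed = {"ab": ["A", "B"], "ba": ["B", "A"]}
reference = {"aba": ["A", "B", "A"], "cab": ["K", "A", "B"], "cc": ["K", "K"]}
lexicon, n_fixed = bootstrap(seed, list(reference), lambda w, g: reference[w])
print(len(lexicon), n_fixed)
```

In this toy run only "cab" needs a manual correction; once it is in the lexicon, the retrained model gets "cc" right without help, which is the effect bootstrapping relies on.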



Iban ASR: From collecting data to developing system

Iban data collection

Iban data - collected for PhD study

Speech data:

- 8 hours of news data
- Collaborative workshop for collecting speech transcripts
- Hire 8 native transcribers
- Use Transcriber software [Barras et al., 2000]





Speech transcripts

ibf 002 003 iya madah ka pengawa tuk deka berengkah dikereja enda lama agi

ibf 002 004 sebengkah kompeni minyak ke nyulut royal dutch shell deka begempung enggau petrolium nasional berhad petronas leboh ti bejalai ke dua bengkah projek ngali minyak ba kandang tasik sarawak enggau sabah

ibf 002 005 tuai bagi pekara lng royal dutch shell delareventer madah ka projek tiga puluh
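A transcript file in this shape is easy to load for later processing. The sketch below assumes each line starts with a three-part utterance ID exactly as displayed on the slide, followed by the words; the function name `parse_transcripts`, the `id_fields` parameter and the choice to join the ID parts with underscores are illustrative, not taken from the project.

```python
def parse_transcripts(lines, id_fields=3):
    """Map utterance ID -> list of words for lines of the form 'id id id word ...'."""
    utterances = {}
    for line in lines:
        tokens = line.split()
        if len(tokens) <= id_fields:
            continue  # skip empty lines and lines with an ID but no words
        utt_id = "_".join(tokens[:id_fields])  # joining with "_" is an assumption
        utterances[utt_id] = tokens[id_fields:]
    return utterances


lines = [
    "ibf 002 003 iya madah ka pengawa tuk deka berengkah dikereja enda lama agi",
    "",
    "ibf 002 005 tuai bagi pekara lng royal dutch shell delareventer madah ka projek tiga puluh",
]
utts = parse_transcripts(lines)
for utt_id, words in utts.items():
    print(utt_id, len(words))
```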






Iban data - collected for PhD study

Data for creating language model and pronunciation dictionary:

- Online news articles
- Obtain 7 thousand articles from 2009-2012
- 2 million words

Figure: Iban pronunciation dictionary for ASR




Iban corpora for ASR


- Speech: 7 hours for training acoustic models, 1 hour for system evaluation
- Language model: 2 million words
- Pronunciation dictionary: 36 thousand pronunciations
- Open source toolkits for development: Kaldi [1], SRILM [2], Phonetisaurus [3]

[1] http://kaldi.sourceforge.net/
[2] http://www.speech.sri.com/projects/srilm/
[3] https://github.com/AdolfVonKleist/Phonetisaurus
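To make the language model item concrete: toolkits such as SRILM estimate n-gram probabilities from text. The sketch below is not SRILM and uses none of its options; it is a minimal, unsmoothed bigram estimate in plain Python, with hypothetical function names and two short word sequences borrowed from the transcript slide as stand-ins for the 2 million word corpus.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams, with sentence boundary markers."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])               # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return histories, bigrams


def bigram_prob(histories, bigrams, prev, word):
    """Maximum-likelihood P(word | prev); 0.0 for an unseen history."""
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / histories[prev]


corpus = [
    "iya madah ka pengawa tuk",
    "delareventer madah ka projek tiga puluh",
]
histories, bigrams = train_bigram(corpus)
print(bigram_prob(histories, bigrams, "madah", "ka"))
print(bigram_prob(histories, bigrams, "ka", "pengawa"))
```

A practical model also needs smoothing so that unseen word pairs do not receive zero probability, which matters even more when the text corpus is small.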




Iban ASR system evaluation


Tested on Iban ASR

pehin sri taib madahka perintah besai udah mega ngemendarka duit dua poin tiga biliun ringgit kena ngereja sekeda projek di serata menua sarawak rambau menteri besai ti bejalai kin kitu di menua sarawak dalam kandang tiga taun tu

Play file: ibf 001 014






Tested on Iban ASR

nyadi berikan tadi ditusun ramli haji junaidi ari berita rtm kuching lalu disalin raban jawah

Play file: ibm 005 171






Summary of Iban ASR results

System          Accuracy (%)
Monolingual     81.25
Cross-lingual   84.85

Table: Evaluation on 1 hour of data (473 sentences)

- More information in the conference paper [Juan et al., 2015b]
- ASR accuracy is still quite low
- Domain-specific system
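For readers who want to see how figures like those in the table are typically obtained, here is a minimal sketch. It assumes the usual word-level definition, accuracy = 100 * (1 - edit distance / reference length), which the slides do not spell out; the function names and the example hypothesis are illustrative. The last function restates the two table values as a relative error reduction under the same assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Word accuracy in percent for one reference/hypothesis pair."""
    ref, hyp = reference.split(), hypothesis.split()
    return 100.0 * (1 - edit_distance(ref, hyp) / len(ref))


def relative_error_reduction(acc_before, acc_after):
    """Relative drop in error rate (percent), taking error = 100 - accuracy."""
    err_before, err_after = 100 - acc_before, 100 - acc_after
    return 100.0 * (err_before - err_after) / err_before


print(round(word_accuracy("enda lama agi", "enda lama ari"), 2))
print(round(relative_error_reduction(81.25, 84.85), 1))
```

Under this assumption, the step from 81.25% to 84.85% is a gain of 3.6 points in accuracy, or roughly a fifth of the monolingual system's errors removed.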



Future Directions

Long term goal

Future Directions - Long term

Borneo Speech Corpus & Technologies

Partners:


Current research work

Ongoing projects

Target language    Project
Melanau, Iban      Corpus building for multilingual ASR
Iban, Kelabit      ASR prototypes, including for mobile devices
Melanau            Pronunciation dictionary for ASR
Iban               Language modelling for a low-resource language



Corpus building for Multilingual ASR




KelaS: Kelabit Speech Project


References

Barras, C. et al. (2000). “Transcriber: development and use of a tool for assisting speech corpora production”. In: Proceedings of Speech Communication special issue on Speech Annotation and Corpus Tools. Vol. 33. Available at: trans.sourceforge.net/en/publi.php.

Imseng, David et al. (2014). “Using out-of-language data to improve under-resourced speech recognizer”. In: Speech Communication 56, pp. 142–151.

Juan, Sarah Samson and Laurent Besacier (2013). “Fast Bootstrapping of Grapheme to Phoneme System for Under-resourced Languages - Application to the Iban Language”. In: Proceedings of 4th Workshop on South and Southeast Asian Natural Language Processing 2013. Nagoya, Japan.

Juan, Sarah Samson et al. (2015a). “Merging of Native and Non-native Speech for Low-resource Accented ASR”. In: ed. by Klára Vicsi, Adrian-Horia Dediu, and Carlos Martin-Vide. Springer International Publishing. Chap. Statistical Language and Speech Processing, pp. 255–266.

Juan, Sarah Samson et al. (2015b). “Using Resources from a Closely-related Language to Develop ASR for a Very Under-resourced Language: A Case Study for Iban”. In: Proceedings of INTERSPEECH. To appear. Dresden, Germany.

Lewis, M. Paul, Gary F. Simons, and Charles D. Fennig (2014). Ethnologue: Languages of the World, Seventeenth Edition. SIL International. URL: http://www.ethnologue.com (visited on 2013).

Lu, Liang, Arnab Ghoshal, and Steve Renals (2014). “Cross-lingual Subspace Gaussian Mixture Models for Low-resource Speech Recognition”. In: IEEE/ACM Transactions on Audio, Speech and Language Processing. Vol. 22, pp. 17–27.

Maskey, Sameer R., Alan W Black, and Laura M. Tomokiyo (2004). “Bootstrapping Phonetic Lexicons for Language”. In: Proceedings of INTERSPEECH, pp. 69–72.

Tan, Tien-Ping, Laurent Besacier, and Benjamin Lecouteux (2014). “Acoustic model Merging using Acoustic Models from Multilingual Speakers for Automatic Speech Recognition”. In: Proceedings of International Conference on Asian Language Processing (IALP).
