Indonesian machine translation papers

2010
 * I/ETS: Indonesian-English Machine Translation System using Collaborative P2P Corpus (Hammam Riza, Budiono, Adiansya Prasetya and Henky Mulyadi)


 * Indonesian ⇔ English
 * Hybrid symbolic-statistical
 * currently fully statistical
 * future symbolic module
 * morphological analyzer
 * phrase reordering system
 * buku biru → blue book
 * generation system
 * SRILM
 * translation and language model
 * PHARAOH
 * beam search decoder
 * Evaluation
 * 240,000 training sentences
 * 10,000 test sentences
 * 2 reference translations
 * BLEU score 0.649
 * "No large corpus available"
 * bilingual Indonesian-English corpus construction
 * collaborative
 * contributors
 * news agencies
 * journalists
 * reporters
 * bloggers
 * media publishers
 * classification
 * domain
 * national
 * international
 * sport
 * economy
 * science technology
 * genre
 * news
 * article
 * target
 * 1 million sentence pairs
 * currently 282,000
 * 20 million words
 * currently English 4.67 million, Indonesian 4.32 million
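
The phrase reordering example in the notes above (buku biru → blue book) can be illustrated with a minimal sketch. It assumes the input is already tagged with parts of speech and that the only rule is swapping a noun followed by an adjective. The toy lexicon, the tag names, and the functions `reorder_noun_adj` and `translate` are hypothetical and are not taken from the I/ETS paper, whose symbolic module is listed only as future work.

```python
# Toy Indonesian-to-English reordering: Indonesian places the adjective after
# the noun ("buku biru"), English places it before ("blue book").
# The lexicon and tag set below are illustrative assumptions only.

TOY_LEXICON = {
    "buku": ("book", "NOUN"),
    "biru": ("blue", "ADJ"),
    "rumah": ("house", "NOUN"),
    "besar": ("big", "ADJ"),
}


def reorder_noun_adj(tagged):
    """Swap every NOUN ADJ pair into ADJ NOUN order."""
    out = []
    i = 0
    while i < len(tagged):
        if (i + 1 < len(tagged)
                and tagged[i][1] == "NOUN"
                and tagged[i + 1][1] == "ADJ"):
            out.extend([tagged[i + 1], tagged[i]])
            i += 2
        else:
            out.append(tagged[i])
            i += 1
    return out


def translate(sentence):
    """Word-for-word lookup followed by noun-adjective reordering."""
    tagged = [TOY_LEXICON.get(w, (w, "UNK")) for w in sentence.lower().split()]
    return " ".join(word for word, _ in reorder_noun_adj(tagged))


print(translate("buku biru"))
print(translate("rumah besar"))
```

A statistical phrase-based decoder learns such swaps implicitly from phrase pairs; an explicit rule like this one is what a symbolic reordering component would add.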

2009


 * Resource Report: Building Parallel Text Corpora for Multi-Domain Translation System (Budiono, Hammam Riza, Chairil Hakim)
 * ANTARA News
 * International news agency
 * News articles in Indonesian and English
 * comparable corpus
 * but articles are unaligned in raw data
 * 2000-2007
 * 250,000 sentence pairs
 * 2.5 million words
 * SMT system test
 * MOSES
 * 1 million words training set
 * BLEU score 0.76
 * BPPT-PANL
 * 500,000 words
 * steps
 * Indonesian corpus gathering
 * Creative Commons online source
 * translation of Indonesian corpus
 * syntactic transfer enforced
 * alignment
 * sentence level
 * 2 aligners
 * resolving issues
 * tagging of corpus using XML
 * Text Encoding Initiative
 * BTEC-ATR (basic travel expression corpus)
 * source: monolingual English (owned by NICT-ATR Japan)
 * 153,000 sentences
 * translation into Indonesian
 * tagging
 * POS tagging
 * syllabification
 * word stress
 * INC-IX
 * 100,000 sentences
 * parliament report (BPPT)
 * Steps
 * Collection of Indonesian corpus
 * Cleaning
 * Translation
 * Alignment
 * manually while translating
 * XML tagging
 * Domain
 * national
 * international
 * business/economy
 * politics
 * science
 * technology
 * sport
 * source
 * news agency
 * online publisher
 * international institution

2008

 * Development of Indonesian Large Vocabulary Continuous Speech Recognition System within A-STAR Project (Sakriani Sakti, Eka Kelana, Hammam Riza, Shinsuke Sakai, Konstantin Markov, Satoshi Nakamura)
 * large vocabulary continuous speech recognition (LVCSR)
 * resource issues
 * types
 * daily news
 * R&D Telkom with ATR
 * raw data cleaning
 * Raw data: Tala, 2003 (student)
 * Kompas, Tempo
 * > 3,160 articles
 * about 600,000 sentences
 * Voice recording
 * 400 people (each 110 sentences, totaling 44,000 utterances)
 * sentence criteria: phonetically balanced sentences
 * greedy search
 * selected 3,168 sentences
 * place: Bandung
 * clean and telephone version
 * speaker criteria
 * accent
 * Batak
 * Javanese
 * Sundanese
 * no accent (standard Indonesian)
 * age and gender distributed
 * telephone application
 * R&D Telkom with ATR
 * about 2,500 sentences
 * dialogs from telephone services
 * tele-home security
 * billing information services
 * reservation services
 * status tracking of e-Government services
 * hearing impaired telecommunications services
 * recording: same as above
 * BTEC (Basic Travel Expression Corpus) tasks
 * ATR with BPPT
 * original: Japanese/English sentence pairs
 * Indonesian translation
 * training set
 * 160,000 sentences
 * 20,000 unique words
 * test set
 * 510 sentences
 * 16 references/sentence
 * recording
 * 510 sentences
 * 42 speakers, each recording all sentences
 * 21,420 utterances
 * form
 * text
 * speech
 * speech recognizer
 * ATR speech recognition engine
 * training data: daily news and telephone application tasks
 * evaluation data: BTEC task
 * acoustic modeling
 * language modeling
 * lexicon
 * accuracy
 * 92.47% on BTEC
 * Asian speech translation (A-STAR)
 * facilitates
 * multilingual speech translation
 * multilingual speech transcription
 * multilingual information retrieval
 * coordinator
 * Advanced Telecommunications Research Institute International (ATR)
 * National Institute of Information and Communications Technology (NICT) Japan
 * National Laboratory of Pattern Recognition (NLPR) China
 * Electronics and Telecommunications Research Institute (ETRI) Korea
 * Agency for the Assessment and Application of Technology (BPPT) Indonesia
 * National Electronics and Computer Technology Center (NECTEC) Thailand
 * Center for Development of Advanced Computing (CDAC) India
 * National Taiwan University (NTU) Taiwan
 * Indonesian language
 * "a member of the agglutinative language family, meaning that it has a complex range of prefixes and suffixes, which are attached to base words" ← really?!?
 * phonology
 * 33 phoneme symbols
 * 10 vowels (including diphthongs like pantai)
 * /a/, /i/, /u/, /e/ (merah), /e2/ (bekas), /o/
 * /ay/ (pantai)
 * /aw/ (kalau)
 * /oy/ (asoy)
 * /ey/ (ambeien)
 * 22 consonants
 * 1 silent symbol
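
The notes for the daily news recordings mention a greedy search that picked 3,168 phonetically balanced sentences from the cleaned text. The paper's exact criterion is not recorded here, so the following is only a sketch of one common approach, greedy coverage maximization. It assumes character bigrams as a stand-in for phonetic units; `units` and `greedy_select` are hypothetical names, and the four-sentence corpus is toy data.

```python
# Greedy selection of a small sentence set with broad unit coverage.
# Assumption: character bigrams stand in for phonetic units (a real system
# would use a pronunciation lexicon and phone sequences).


def units(sentence):
    """Return the set of character bigrams in a sentence (spaces removed)."""
    s = sentence.lower().replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}


def greedy_select(sentences, max_sentences):
    """Repeatedly pick the sentence that adds the most unseen units."""
    covered = set()
    chosen = []
    remaining = list(sentences)
    while remaining and len(chosen) < max_sentences:
        best = max(remaining, key=lambda s: len(units(s) - covered))
        gain = units(best) - covered
        if not gain:  # no remaining sentence adds anything new
            break
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen, covered


corpus = [
    "saya makan nasi",
    "saya makan",
    "dia pergi ke pasar",
    "nasi",
]
selected, covered = greedy_select(corpus, max_sentences=2)
print(selected)
print(len(covered))
```

Greedy selection does not guarantee the smallest possible set, but it is simple and scales to a pool of several hundred thousand sentences.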


 * Machine Translation for Indonesian and Tagalog (Brianna Laugher and Ben MacLeod): http://www.amtaweb.org/papers/4.20_LaugherMacLeod2008.pdf
 * "no tagged or parallel corpora (Indonesian-English)"

2007

Developing Cross Language Systems for Language Pair with Limited Resource -Indonesian-Japanese CLIR and CLQA- (Ayu Purwarianti)
 * Indonesian-Japanese CLIR
 * transitive translation/direct-transitive translation
 * source: Indonesian (minimum data resource)
 * target: Japanese
 * pivot: English
 * in the direct-transitive method, direct translation (Indonesian to Japanese, English to Japanese) is used for words that are not in the dictionary of the transitive method
 * steps
 * keyword translations
 * select best available translations
 * mutual information
 * tf-idf
 * result (NTCIR 3 web retrieval)
 * transitive: 38% of monolingual
 * hybrid: 49% of monolingual
 * higher than Kataku (Indonesian-English) + Babelfish/Excite (English-Japanese)
 * comparable to English-Japanese IR task
 * Indonesian monolingual QA
 * machine learning
 * components
 * question classifier
 * answer finder
 * developed tools
 * POS tagger
 * shallow parser for question
 * Training data
 * 18 Indonesians
 * 3000 questions
 * 6 answer types
 * evaluation
 * 71,109 Indonesian news articles
 * question classification: 96% accuracy
 * answer finder: MRR (mean reciprocal rank) 0.52 on first answer
 * features for SVM
 * bi-gram frequency between the intended word and some defined words
 * to cope with the resource poorness
 * Indonesian-English CLQA
 * question: Indonesian
 * documents: English
 * built from Indonesian QA system
 * Indonesian-English using bilingual dictionary
 * English passage retriever
 * English answer finder
 * evaluation
 * in-house test data
 * NTCIR 2005 CLQA task (translated into Indonesian)
 * outperformed only by the top results, which use high-quality dictionaries
 * training data
 * questions written by Indonesian students based on English articles
 * Indonesian-Japanese CLQA
 * transitive translation
 * transitive passage retrieval
 * English as pivot
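
The transitive translation with English as pivot, and the direct fallback used in the direct-transitive variant, can be sketched as follows. This is a minimal illustration under stated assumptions: every dictionary entry is toy data, and `transitive` and `direct_transitive` are hypothetical names, not functions from the thesis. The step that ranks candidates by mutual information and tf-idf is not shown.

```python
# Transitive (pivot) keyword translation: Indonesian -> English -> Japanese,
# falling back to a direct Indonesian-Japanese dictionary when the pivot path
# yields nothing. All dictionary entries are toy data for illustration.

ID_EN = {"buku": ["book"], "air": ["water"]}
EN_JA = {"book": ["本", "書籍"], "water": ["水"]}
ID_JA_DIRECT = {"nasi": ["ご飯"]}


def transitive(word):
    """Collect every Japanese candidate reachable through English."""
    candidates = []
    for en in ID_EN.get(word, []):
        for ja in EN_JA.get(en, []):
            if ja not in candidates:
                candidates.append(ja)
    return candidates


def direct_transitive(word):
    """Use the pivot path first; use the direct dictionary for missing words."""
    return transitive(word) or ID_JA_DIRECT.get(word, [])


for keyword in ["buku", "nasi", "xyz"]:
    print(keyword, direct_transitive(keyword))
```

Because each pivot word can fan out into several target words, the candidate list grows quickly with ambiguity, which is why a selection step is needed after translation.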

1993

Multi-Lingual Machine Translation (MMT) Project (Susumu Funaki)
 * Center of the International Cooperation for Computerization (CICC)
 * June 1, 1983
 * MITI and Japanese computer industry
 * Word processor research
 * Machine translation for Asian languages
 * http://www.cicc.or.jp/japanese/kyoudou/mt.html
 * 1987: The Research and Development Cooperation Project on a Machine Translation System for Japan and its Neighbouring Countries
 * Machine Translation System Laboratory
 * first 2 years: basic study and research
 * 3rd year: full scale development
 * last 2 years: system improvement
 * Interlingua
 * Languages
 * Chinese
 * Indonesian
 * Malaysian
 * Thai
 * Japanese
 * Target
 * 5,000 words per hour
 * 80 to 90 percent accurate
 * condition
 * original sentences are grammatically correct
 * all the words are in the dictionary
 * Subsystem
 * Input system
 * word processor
 * OCR
 * operator can pre-edit sentences so that the machine can translate them easily
 * Sentence analysis
 * convert to interlingua
 * morphological analysis
 * syntax analysis
 * semantic analysis
 * resource
 * electronic dictionary
 * basic dictionary: 50,000 terms
 * information processing technical term dictionary: 25,000 terms
 * sentence analysis rules
 * Sentence generation system
 * sentence style generation
 * syntax generation
 * morphological generation
 * Output system and translation support system
 * final edit by operator
 * Integrated system integrating all of the above

1990

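Structure of the CICC Indonesian Translation System
 * dictionary structure
 * derived words are grouped under their stem
 * satu (stem) → satu, menyatukan, satuan, persatuan, ...
 * takes broad applicability into account
 * use in actual translation
 * one record per derived word
 * information attached to each word
 * part of speech
 * semantic information (concept number): link to the interlingua dictionary
 * when a concept cannot be expressed by a single word, it is described as a compound word
 * concepts specific to Indonesian are newly registered in the concept dictionary
 * syntactic information
 * detailed part-of-speech information: more detailed grammatical information for each part of speech
 * sentence-pattern information: syntactic patterns that a predicate takes
 * morphological information
 * inflection information: active and passive forms
 * reduplication: whether the plural is formed by repeating the word
 * procedure
 * select 50,000 words as the basic Indonesian vocabulary, chosen from an Indonesian-specific viewpoint
 * describe part of speech, syntactic information, and morphological information
 * map entries to the concept dictionary
 * sentence analysis
 * conversion to interlingua
 * morphological analysis
 * syntactic analysis
 * semantic analysis
 * sentence generation
 * from interlingua to Indonesian
 * translation word selection
 * co-occurrence dictionary
 * syntactic generation
 * morphological generation
 * evaluation
 * corpus of about 300 sentences
 * not yet!!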
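
The dictionary organization in these notes (derived words grouped under a stem, one record per derived word, each record carrying part of speech, a concept number, and morphological flags) can be pictured with a small data-structure sketch. The field names, tag labels, and concept numbers below are hypothetical placeholders; the notes state only which kinds of information are stored, not how the CICC system encodes them.

```python
# Sketch of a dictionary with one record per derived word, grouped by stem.
# Field names, tags, and concept numbers are hypothetical placeholders.
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    surface: str        # derived word as it appears in text
    stem: str           # stem that groups related derived words
    pos: str            # part of speech
    concept_id: int     # link into the interlingua concept dictionary
    reduplicable: bool  # whether the plural is formed by repeating the word


ENTRIES = [
    Entry("satu", "satu", "NUM", 1001, False),
    Entry("menyatukan", "satu", "VERB", 1002, False),
    Entry("satuan", "satu", "NOUN", 1003, True),
    Entry("persatuan", "satu", "NOUN", 1004, False),
]

# Lookup by surface form, as needed during translation.
by_surface = {e.surface: e for e in ENTRIES}

# Grouping by stem, as needed when maintaining the dictionary.
by_stem = defaultdict(list)
for e in ENTRIES:
    by_stem[e.stem].append(e.surface)

print(by_surface["persatuan"].concept_id)
print(by_stem["satu"])
```

Keeping both views reflects the two goals listed above: the stem grouping supports broad applicability, while the per-word records are what a translation run actually looks up.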

Unknown