Machine Transliteration on South Asian Languages

Abhishek Babu, Arkaprabha Bhattacharya

Project Repository: https://github.com/arkabhat/cse490g1-final-project

Project Video




Problem Statement

Transliteration is the process of taking words of one language and writing them phonetically in the script of another.

This phenomenon is especially visible when proper nouns are carried from one language to another, say from Chinese to English. Transliteration is also a communication tool for individuals who are fluent in speaking and comprehending a language but not in reading and writing its script: such individuals can phonetically transliterate their thoughts into another script in order to communicate. However, we identify the following problems with this practice:

  • If words or phrases are not transliterated properly, individuals may be misunderstood
  • If transliterated text is typed, autocorrect may repeatedly try to change it, causing significant frustration

A possible remedy is a transliteration model. Given transliterated text in some input script (say, initially written in the Latin script of English), such a model would convert that text into the native script of the language the individual intends to use. Integrated into larger applications, a transliteration model could address the issues many speakers face. We therefore decided to focus on developing and better understanding how transliteration models work, and more specifically on what implications we can draw from them about the similarities between languages.

In our review of the literature, we identified a relatively small body of work on machine transliteration overall. Many of the models developed (see related works for additional information) address transliteration between English and languages across Asia and Europe, but relatively few academic works examine South Asian languages and the implications of transliteration for them. Consequently, we devised the following thesis to explore in this project:

Due to the inherent similarities among South Asian languages, we set out to create transliteration models between English and a set of these languages, ultimately hoping to better understand the nuances of transliteration and how such models could be improved in the future and potentially integrated into products.

Methodology

We first analyzed other models developed for various languages (as noted in our literature review). A large number of transliteration models are based on recurrent neural networks, mainly GRUs and LSTMs, so we chose to develop an initial architecture mirroring that of the paper by Ameur et al. (which we discuss in depth in the following section). We then looked for robust datasets that suited our purposes, ideally covering multiple South Asian languages, and came upon Google's Dakshina dataset, which provides transliterated word data for multiple South Asian languages. Each data point in this dataset includes a lexical transliteration along with a score representing the number of annotators who transliterated the word that way. For our training and testing data, we used the transliterations with the highest frequency score, keeping multiple transliterations of the same word only when the highest frequency occurred more than once.
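The frequency-based selection described above can be sketched as follows. This is a minimal sketch, assuming each Dakshina entry has already been parsed into a (native word, romanization, attestation count) triple; the helper name is our own:

```python
from collections import defaultdict

def select_top_romanizations(entries):
    """For each native word, keep the romanization(s) with the highest
    attestation count, retaining multiple romanizations only on ties.

    entries: iterable of (native_word, romanization, count) triples.
    Returns a dict mapping native_word -> list of chosen romanizations.
    """
    by_word = defaultdict(list)
    for native, roman, count in entries:
        by_word[native].append((roman, count))

    chosen = {}
    for native, candidates in by_word.items():
        best = max(count for _, count in candidates)
        chosen[native] = [roman for roman, count in candidates if count == best]
    return chosen

# Example: the tie for "घर" keeps both romanizations.
pairs = select_top_romanizations([
    ("नमस्ते", "namaste", 3),
    ("नमस्ते", "namastey", 1),
    ("घर", "ghar", 2),
    ("घर", "gar", 2),
])
# -> {"नमस्ते": ["namaste"], "घर": ["ghar", "gar"]}
```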

Model Architectures and Hyperparameters Used:

We devised two architectures to test, one GRU-based and one LSTM-based, following what we saw in our review of the literature. The first encoding layer mapped from the input language's vocabulary size to 128 features. A middle fully connected layer mapped from 128 to 128 features. The decoder layer then mapped from 128 to 128 features, before a final fully connected layer projected down to the output feature space. We included this intermediate encoding layer based on our hypothesis that it would help the model handle the intricacies that differ between languages, producing an architecture well suited for use across similar languages; it also helps the model learn the beginnings and endings of words as we transform the encodings of the input language and decode their transliterations. We preprocessed our data by max-length padding each sequence to the length of the longest input and output data points. We trained for 20 epochs with a learning rate of 0.005, a weight decay of 0.000005, and cross-entropy loss.
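The layer dimensions and hyperparameters above can be sketched in PyTorch as follows. This is one plausible reading of the layer description, not the authors' exact implementation: the class and layer names are our own, and the optimizer choice (Adam) is an assumption, since only the learning rate, weight decay, epoch count, and loss are stated in the write-up.

```python
import torch
import torch.nn as nn

class TransliterationModel(nn.Module):
    """Sketch of the GRU variant: embedding -> encoder GRU -> middle
    fully connected layer -> decoder GRU -> output projection.
    All hidden sizes are 128, per the write-up; the LSTM variant
    would swap nn.GRU for nn.LSTM."""

    def __init__(self, in_vocab, out_vocab, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(in_vocab, hidden)        # input vocab -> 128
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.middle = nn.Linear(hidden, hidden)            # 128 -> 128
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, out_vocab)            # 128 -> output vocab

    def forward(self, x):
        # x: (batch, max_len) character indices, max-length padded
        h, _ = self.encoder(self.embed(x))
        h = torch.relu(self.middle(h))
        h, _ = self.decoder(h)
        return self.out(h)                                 # (batch, max_len, out_vocab)

# Vocabulary sizes here are placeholders for illustration.
model = TransliterationModel(in_vocab=30, out_vocab=60)
# lr and weight decay are from the write-up; Adam is our assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=0.000005)
criterion = nn.CrossEntropyLoss()  # trained for 20 epochs in our experiments
```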

Experiments Run

We tested both of these models on four languages: Bengali, Hindi, Tamil, and Malayalam. We chose these languages specifically because of the inherent similarities in phonetic and lexical structure between Bengali and Hindi, and between Tamil and Malayalam.

Evaluation and Results

The following are the results of our experiments.

Demo

To access instructions for our demo of the Tamil model, visit the following link: https://github.com/arkabhat/cse490g1-final-project/blob/main/README.md

Conclusions

Altogether, our GRU-based architecture worked better than our LSTM-based architecture for all four languages. The architectures also generally performed better on the Tamil and Malayalam data, and when we had the models infer on our own 'new' data points, they picked up the phonetic nuances of Tamil and Malayalam better than those of Hindi and Bengali. This is a curious effect that could be attributed to a number of factors, including the size of the training data and the network itself. We also wonder whether Hindi and Bengali, coming from different linguistic roots than Tamil and Malayalam, involve more rule-based orthography and linguistic structure, which would make a model with data preprocessing similar to what we observed for the Armenian data more suitable for them; Tamil and Malayalam may rely on more phonetic connections, or may simply have more straightforward mappings to English.

Despite this, we find that these types of architectures are well suited to South Asian languages. We are also curious how this architecture would perform when transliterating between South Asian languages; our hypothesis is that the accuracy would be much higher due to the similarities among some of these languages. Finally, we would love to take a deeper dive into how such a model could be augmented into a general transliteration model, where just a few epochs of training on another language mapping would make it an acceptable model for that language. This extension could have significant implications for integration into keyboards and autocorrection, as well as for larger machine translation networks that need a mechanism for handling words outside their vocabularies.

References