English to Assyrian/Syriac Translation Model
The demo is hosted on the HuggingFace Hub; simply enter your sentence and click “Compute”: https://huggingface.co/mt-empty/english-assyrian
MOTIVATION
Natural language processing is an exciting area that has seen many advances in recent times, such as GPT-3 and BERT. These models are widely used in industry to power all kinds of applications, such as speech recognition, machine translation, algorithmic trading and even code generation.
Monash DeepNeuron was interested in adding Natural Language Processing (NLP) to its bank of skills. Matti Haddad proposed building an English-to-Assyrian translation model; he was the only person on the team who spoke the language. We were inspired because this was something that had never been done before, and there are very limited resources on the Assyrian language. Taking on this project would be an excellent gateway to incorporating more NLP concepts into Monash DeepNeuron, which would facilitate more NLP-related research in the future. Furthermore, it would allow us to expand the online deep learning resources for a lesser-known language such as Assyrian.
MACHINE TRANSLATION
Machine translation is the process of using artificial intelligence to automatically translate content from one language (the source) to another (the target) without any human intervention. It is widely used in everyday life: dictionaries, live translation, audio interpretation, and, an example you have probably used, Google Translate.
The source and the target are not constrained to different languages; the pair could be Standard English and Twitter English. The model will always learn to associate the source language with the target language.
Machine translation is a difficult problem because nuances in language are hard for machines to capture: words do not map one-to-one between languages, and there can be multiple ways to say the same thing. For example, the classic saying "The apple doesn't fall far from the tree" would not make sense translated literally into some other languages, which use different phrases to describe the same idea.
Thus many factors need to be taken into consideration during translation, such as context and gender (certain words in some languages are grammatically masculine or feminine).
ASSYRIAN LANGUAGE
Modern Assyrian is also known as Assyrian Neo-Aramaic, Suret, Sureth, Eastern Syriac and Chaldean. It is heavily influenced by Syriac (which itself has roots in Aramaic), but some of its vocabulary comes from the ancient Assyrian language. The language was mainly spoken in northern Iraq, south-east Turkey and north-west Iran, but because of mass migration due to persecution, the number of native speakers has dropped significantly. Assyrian is in the same family as other Semitic languages such as Hebrew and Arabic.
Today, Assyrian is spoken by some groups in their native homeland, but mostly in the diaspora: the USA, Australia, Germany, Sweden and other countries.
In the following sections we use the term Syriac to refer to Assyrian, because our dataset mostly comes from Classical Syriac.
HUGGINGFACE
HuggingFace is a platform that provides open-source NLP technologies, including datasets, models and well-documented libraries. It also provides an easy-to-follow course that covers many of the topics needed to get started with NLP applications[1].
We were able to easily load models and datasets using their libraries and focus on our goal without having to worry about downloading, preprocessing or formatting.
We did, however, face some limitations with HuggingFace resources, especially when it came to training language models from scratch, because many of the tutorials and examples relied on pre-trained models. This became a challenge as our task required training a tokenizer from scratch for the chosen architecture, which HuggingFace did not provide natively.
[1] https://huggingface.co/course/chapter1/1
Implementation of English to Syriac model
In this section we go through our journey of implementing the model. This includes the dataset, tokenizer, training and evaluation.
MARIANMT ARCHITECTURE
The pre-trained models that we used are based on the MarianMT[2] machine translation model, which uses a combination of transformer and deep RNN architectures. An RNN is a neural network that differs from a traditional feed-forward[3] neural network in that it takes time-series or sequential data as input, such as text paragraphs or graphs over time. RNNs have a form of “memory” and take into account the states of previous inputs when they generate the next output of a sequence.
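As a rough illustration of this “memory” (a minimal sketch of the basic recurrence, not MarianMT's actual implementation; all sizes and weights here are arbitrary), a vanilla RNN threads a hidden state through the sequence, so each output depends on every earlier input:

```python
import numpy as np

# Toy sizes for illustration only.
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3

W_xh = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden ("memory") weights
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: the new state depends on the current input AND the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a sequence of 5 input vectors, carrying the state forward each step.
h = np.zeros(hidden_size)
sequence = rng.normal(size=(5, input_size))
for x_t in sequence:
    h = rnn_step(x_t, h)

print(h.shape)  # (3,) -- the final state summarises the whole sequence
```

Because `h` is fed back in at every step, the final state is a compressed summary of the entire input sequence, which is what lets RNN-based translators condition each output word on everything seen so far.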
Transformers are a type of NLP architecture that also solves sequence-to-sequence problems; they were introduced in the “Attention Is All You Need” paper by Vaswani et al., which marked a significant breakthrough in NLP[4]. The transformer is made up of an encoder and a decoder. The encoder maps the input sequence into representations that the decoder can then use to generate a sequence of outputs. The encoder and decoder contain sub-layers which conduct further processing of the inputs and outputs, such as normal and masked multi-head self-attention mechanisms, normalisation and fully connected feed-forward networks.
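The core operation inside those attention sub-layers is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, from the Vaswani et al. paper. A minimal single-head sketch with made-up shapes (not the multi-head, batched version used in real transformers):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights      # weighted mix of values, plus the weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))   # 2 query positions, key dimension d_k = 8
K = rng.normal(size=(5, 8))   # 5 key positions
V = rng.normal(size=(5, 16))  # a value vector for each key position

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)  # (2, 16): one context vector per query position
```

Each row of `weights` sums to 1, so every output is a convex combination of the value vectors: this is how a decoder position can “look back” at the most relevant encoder positions.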
[2] https://marian-nmt.github.io/
[3] https://en.wikipedia.org/wiki/Feedforward_neural_network
DATASET
Before any work could be done with the model, we first had to find a dataset large enough for our model to be properly trained on. Unfortunately, this was much easier said than done for our project. Initially, we were trying to find a dataset for Assyrian. However, resources for this language were extremely scarce due to its small number of native speakers. Therefore, we chose to broaden the translation target to any Syriac text. We managed to find a very small dataset comprising roughly 3,200 pairs of Eastern Syriac sentences with their corresponding English translations. To put this in perspective, a typical English-German dataset contains more than 9 million sentence pairs.
TOKENIZER - SENTENCEPIECE
For our model to read and learn from our dataset, we had to train a tokenizer to convert all our sentences into a series of tokens. Tokens act as numerical representations of the sentences provided by the dataset. Tokenizers work by splitting the sentences from the dataset into smaller pieces. The way sentences are split varies depending on the algorithm used; however, one of the more common and easy-to-understand approaches is to split a sentence on whitespace (with punctuation separated out). For example, the sentence “The cat sat on the mat.” would be split into the list of tokens [“The”, “cat”, “sat”, “on”, “the”, “mat”, “.”]. The algorithm we used is called SentencePiece, which was developed by Google. We chose it because it is a very commonly used tokenization algorithm and it is language agnostic, meaning it supports any language. From here, the tokenizer assigns each unique token a specific token id, which is just a number representing the token. For example, our tokenizer would encode the phrase “hello, world” to [3855, 285, 250], with each of those numbers being the token id of a unique token. The tokenizer can also decode tokenized messages to produce a human-readable sentence, so inputting [3855, 285, 250] into the tokenizer would return “hello, world” back to us.
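The encode/decode round trip above can be sketched with a toy whitespace tokenizer. This is far simpler than SentencePiece (which learns subword pieces from data), and the ids it produces are not the ones our trained tokenizer assigns; it only illustrates the token-to-id mapping idea:

```python
# Toy whitespace tokenizer with an id vocabulary -- illustration only,
# not the SentencePiece algorithm the project actually used.
def tokenize(sentence):
    # Separate punctuation from words, then split on whitespace.
    for p in ".,!?":
        sentence = sentence.replace(p, f" {p}")
    return sentence.split()

corpus = ["The cat sat on the mat.", "hello, world"]

# Build the vocabulary: one unique id per unique token.
vocab = {}
for sentence in corpus:
    for token in tokenize(sentence):
        vocab.setdefault(token, len(vocab))
inverse_vocab = {i: t for t, i in vocab.items()}

def encode(sentence):
    return [vocab[t] for t in tokenize(sentence)]

def decode(ids):
    return " ".join(inverse_vocab[i] for i in ids)

print(tokenize("The cat sat on the mat."))  # ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(decode(encode("hello, world")))       # 'hello , world'
```

Note that naive whitespace decoding leaves a space before punctuation; real tokenizers such as SentencePiece track the original spacing so the round trip is lossless.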
TRAINING
After training our tokenizer to recognise and tokenize both English and Syriac text, we could begin training the model. For this project, we chose to fine-tune a pre-trained model that had been trained on English-to-Arabic translation. We chose this for several reasons.
The most important reason is that training a model completely from scratch takes significantly longer than fine-tuning a pre-trained model. While training from scratch may produce marginally better results, the extra time required far outweighs the gain, especially since our dataset was extremely small and unlikely to produce perfect translations in the first place.
Additionally, using a model pre-trained on another Semitic language, even one with a different script, allows for transfer learning, because these languages share common features[5].
We initially tried to train our model using this small dataset (3,500 pairs); however, this resulted in our model returning only a few characters, since it was simply unable to recognise any text that wasn't directly pulled from the dataset. To circumvent this issue, we resolved to expand our dataset by including a Classical Syriac Bible dataset, which was much bigger, at just under 16,000 sentence pairs. This was still too little, but it was big enough to provide a more accurate translation. However, it should be noted that this dataset is not representative: it comes from a religious domain, so our model favours translations which use religious lexicon.
We were able to directly observe the effects that our tokenizer and training had on the model. We did this by translating several sentences before training and comparing the output to post-training. Before training, the model was, as expected, outputting Arabic text when prompted with English words to be translated. However, after training, the model returned Syriac text which served as a loose translation of the English prompt. For example, before training, the model returned “محل المسكن” when prompted with the word “home”; this is the Arabic translation and uses Arabic script. After training on the Syriac dataset we got “ܒܲܝܬܵܐ”, which is Syriac and uses Syriac script.
Therefore, our tokenizer and training process had clearly had an impact on how the model processed its inputs, and the model had been fine-tuned to translate English to Syriac.
[5] https://arxiv.org/pdf/1906.01502.pdf
EVALUATION
In order to evaluate the model, a metric is needed to score the output. A widely used metric called Bi-Lingual Evaluation Understudy (BLEU) is a language-independent metric which compares the word sequences (called n-grams) of the candidate translation with those of the reference translation and counts the number of matches. These matches are independent of the positions where they occur. The more matches between the candidate and reference translations, the better the machine translation.
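The n-gram matching idea can be sketched as follows. This is a simplified, single-reference version (real BLEU implementations also clip counts per reference, combine n = 1..4, and are computed corpus-wide), shown on made-up example sentences:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous runs of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each reference n-gram can only be
    matched as many times as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matches = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    return matches / max(sum(cand_counts.values()), 1)

def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of n-gram precisions times a brevity penalty --
    a simplified, single-reference BLEU."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Penalise candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat is on the mat".split()
cand = "the cat sat on the mat".split()
print(round(simple_bleu(cand, ref), 2))  # 0.71 -- close but not exact
```

Note that the match counting ignores where in the sentence the n-grams occur, which is why a high BLEU score indicates overlap with the reference rather than a genuinely fluent translation.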
Our evaluation score was 33/100, which is consistent with other models of the same architecture, but there is still room for improvement, such as hyperparameter optimization or expanding the dataset, which could be explored in the future.
It's worth noting that the score is not indicative of a good translator that can be used in real-world applications. Rather, it is a measurement of the similarity between the predicted translations and the target translations in our dataset.
CONCLUSION
This has been an extremely exciting project: despite many challenges, we were able to create our own dataset and train our own model. While we believe our model has plenty of room for improvement, this project was nonetheless a valuable learning experience and will help us navigate the obstacles faced in future projects. We are proud to have contributed to the growing Assyrian/Syriac NLP resources and community, which we hope will continue to expand from here.
Demo: https://huggingface.co/mt-empty/english-assyrian
PROJECT TEAM
Project Manager: Matti Haddad
Project Members: Allister Lim, Wei-Yee Hall, Cameron Sturgeon