Summer Placement Diaries
Ali's Summer Placement at Unbabel
Unbabel’s goal is to combine human expertise with Artificial Intelligence (AI) technology to overcome language barriers in communication. Unbabel is a private company founded in 2013, with headquarters in Lisbon and San Francisco. Its main product is an AI-powered human translation platform that facilitates the translation of customer service communications. Unbabel supports many well-known companies, including easyJet, Booking.com, Rovio Entertainment, Under Armour, Pinterest, and Facebook. It helps global companies give customers great service in their native language, reducing costs and response times while boosting customer satisfaction. Unbabel currently supports 29 languages and about 70 language pairs, such as English to Spanish, French, Portuguese, Italian, and German. I chose Unbabel for my placement in 2020 because its professional researchers and programmers work on the latest AI technologies to improve the quality of Neural Machine Translation (Neural MT).
During the one-month placement at Unbabel, I joined the AI-Tech NLP group. The group’s activities focus on the Build and Rebuild stages of Unbabel’s MT pipeline. The Build stage involves the pre-processing steps that come before the Neural MT engine, such as Split sentences, Tokenize and Annotate. The Rebuild stage includes the post-processing steps applied to the MT output, such as Transform annotations, Detokenize and Merge sentences. Each of these steps plays a crucial role in improving the performance of the Neural MT engine. Named-Entity Recognition (NER) is one of these steps, and it plays a vital role in the MT pipeline by improving translation quality and protecting customer data. NER refers to identifying special terms in a given sentence, such as the names of people, locations or organisations, that provide useful information for the translation process. One of the important challenges for an NER model is obtaining labelled data for training, especially for low-resource languages.
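To make the Build and Rebuild idea concrete, here is a minimal Python sketch of how pre-processing and post-processing steps can mirror each other around an MT engine. This is not Unbabel's pipeline: every function name is hypothetical, the splitting and tokenizing rules are deliberately naive, the Annotate and Transform annotations steps are left out, and `fake_mt` is a placeholder that returns its input unchanged.

```python
import re


def split_sentences(text):
    # Build step: naive split after sentence-final punctuation.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Build step: separate words from punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def fake_mt(tokens):
    # Placeholder for the Neural MT engine (assumption: identity "translation").
    return list(tokens)


def detokenize(tokens):
    # Rebuild step: join tokens and re-attach punctuation to the previous word.
    return re.sub(r"\s+([.,!?;:])", r"\1", " ".join(tokens))


def merge_sentences(sentences):
    # Rebuild step: put the translated sentences back into one text.
    return " ".join(sentences)


def run_pipeline(text):
    built = [tokenize(s) for s in split_sentences(text)]   # Build
    translated = [fake_mt(tokens) for tokens in built]     # MT engine
    rebuilt = [detokenize(tokens) for tokens in translated]  # Rebuild
    return merge_sentences(rebuilt)


pipeline_output = run_pipeline("Hello, world! How are you?")
```

Because the placeholder engine changes nothing, the Rebuild steps should restore the original text here, which is a quick way to check that each post-processing step undoes its pre-processing counterpart.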
The main focus of this placement was to address this problem using aligner algorithms. Word alignment is a Natural Language Processing (NLP) task that is applied to a parallel text corpus to find the word-to-word correspondences in a sentence pair. It helps the NER system to determine named entities in the target text based on information from the source text. The project matched the alignments with the annotations on the source language (English) to determine their projection onto the target language (Brazilian Portuguese). We used two different symmetrisation heuristics to obtain alignments: grow-diag-final-and and intersect. After tokenizing the target sentences with the Unbabel Tokenizer, we applied the alignment algorithms to the test dataset to extract the annotation projection between the source and target languages. We also trained an NER model based on the FLAIR framework on a Brazilian Portuguese corpus to extract NER annotations from our test dataset. The performance of both approaches was compared against gold-standard annotations that were manually labelled by a linguistic expert.
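The annotation projection idea can be illustrated with a short Python sketch. It is a simplified illustration under my own assumptions, not the code used in the project: the function names, the toy sentence pair and the alignment links are all made up for the example, entities are given as token spans with an exclusive end index, and only the intersect heuristic is shown (grow-diag-final-and starts from the same intersection and then adds further links, which is not implemented here).

```python
def intersect(forward, reverse):
    """Keep only the links found in both alignment directions.

    forward: set of (src_idx, tgt_idx) links from source-to-target alignment.
    reverse: set of (tgt_idx, src_idx) links from target-to-source alignment.
    """
    return forward & {(s, t) for (t, s) in reverse}


def project_entities(entities, alignment, tgt_tokens):
    """Project source entity spans onto the target sentence.

    entities: list of (start, end, label), end exclusive, over source tokens.
    Returns a list of (start, end, label, text) over target tokens.
    """
    projected = []
    for start, end, label in entities:
        # Target positions linked to any source token inside the entity span.
        tgt_idx = sorted({t for s, t in alignment if start <= s < end})
        if not tgt_idx:
            continue  # no aligned target token, so nothing to project
        lo, hi = tgt_idx[0], tgt_idx[-1] + 1
        projected.append((lo, hi, label, " ".join(tgt_tokens[lo:hi])))
    return projected


# Toy English / Brazilian Portuguese pair (illustrative data only).
src_tokens = ["Maria", "lives", "in", "Rio", "de", "Janeiro", "."]
tgt_tokens = ["Maria", "mora", "no", "Rio", "de", "Janeiro", "."]

# Hypothetical aligner output; the forward direction has one noisy link (2, 1).
forward = {(0, 0), (1, 1), (2, 2), (2, 1), (3, 3), (4, 4), (5, 5), (6, 6)}
reverse = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)}

alignment = intersect(forward, reverse)
src_entities = [(0, 1, "PER"), (3, 6, "LOC")]
projected = project_entities(src_entities, alignment, tgt_tokens)
```

Taking the span from the first to the last aligned target token is one simple design choice; it keeps projected entities contiguous, but it can also swallow unrelated tokens when the alignment is noisy, which is one reason the choice of symmetrisation heuristic matters.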
A detailed project proposal drawn up by my placement supervisors, Vera and Pedro, helped me to follow the project step by step. Different tasks were assigned for each week of the one-month placement. In the first week, I became familiar with the structure of Unbabel and the services it provides to its customers. I also focused on the theoretical aspects of the project by studying selected papers. In the second and third weeks, I programmed in Python to prepare our datasets and run the aligner algorithms and the NER model. In the last week, I focused on analysing and comparing the results of the experiments and writing the placement report. At the beginning of each week, I met with my supervisors to determine the tasks for that week. At the end of each week, we met with the AI-Tech NLP group to discuss the results and challenges of the tasks done during the week.
This one-month placement can be considered a starting point for my Master’s dissertation, whose topic is related to domain-adapted Neural MT. The placement was a good chance for me to develop my programming skills and to gain practical experience of working with a professional team.
Post Written By Ali Hatami