Summer Placement Diaries
Nikola's Summer Placement at Unbabel
During June and July, I completed a two-month placement at Unbabel, one of the world’s most cutting-edge companies in Machine Translation.
Unbabel works hand in hand with some of the world’s largest BPOs and outsourcing companies, as well as with brands such as easyJet, Booking.com, Rovio Entertainment, Under Armour and Pinterest, to name a few. It translates up to 2 billion words per year for its customers. Achieving this requires a smooth network of engineers, linguists, translators and others working together to ensure that clients receive each message flawlessly in their native language. Unbabel is a rare example of a company that combines the efficiency of MT with the experience and wisdom of human translators.
During the onboarding week, I gained an insight into the company’s tools and processes. It was an opportunity to see in action all of the concepts I had learnt in the Machine Translation module (how an MT model is trained, MT evaluation, Quality Estimation, etc.). Additionally, I was shown the company’s latest developments, such as COMET, a framework for evaluating MT models, as well as the tools Unbabel uses, for example, to detect, annotate and anonymize named entities (NEs) and to create and maintain translation memories. I was also introduced to the platform where translations in progress are tracked, the different pipelines (chat, tickets, FAQs), and all the processes that occur on them: sentence segmentation, tokenization, annotation, post-editing (PE), etc.
The overall goals and tasks were discussed, and a detailed plan for the internship was drawn up with the help of my supervisor, Vera Cabarrão, from the AI Tech NLP team. Given my linguistics/translation background and my knowledge of several languages, we agreed that I would work on the localization of named entities (names, cities and countries) from English into Czech and Spanish. Localization, especially of named entities, is also a very hot topic in the translation industry and a crucial use case for Unbabel.
The following tasks were performed:
- Checking the existing internal language guidelines for Czech and Spanish;
- Updating the guidelines with the missing information;
- Annotating the data;
- Creating a Gold Standard (GS);
- Evaluating and comparing Unbabel’s and Google’s MT engines.
Once the tasks were carried out, the following questions were answered:
- What is the overall quality of Unbabel’s MT engine?
- Were NEs captured correctly?
- What were the main issues with the NEs?
- Did Google perform better in terms of capturing the NEs?
The first task was to check and extend the language guidelines for Czech and Spanish. Some of the information needed was not available, so I had to rely on external sources, namely the orthography rules of Czech and Spanish respectively, the Corpus of Czech proper names, and several articles and dictionaries available online. These rules were applied to a dataset in order to create the Gold Standard (GS). The dataset contained around 3600 sentences translated from English into Czech and Spanish, from 8 different clients. After the annotation was complete, Pedro Mota, from the AI side of the AI Tech NLP team, ran the data through Google’s MT engine, which made it possible to compare the performance of the two MT engines.
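To make the comparison step more concrete, here is a minimal sketch of how the share of correctly localised entities could be computed against a Gold Standard. It is not the evaluation script used at Unbabel. The function name `entity_accuracy`, the data layout and the gold entities are my own assumptions for illustration, and the single toy sample reuses the Spanish example shown later in this post.

```python
def entity_accuracy(samples, engine):
    """Return the share of gold-standard entities found verbatim in an engine's output."""
    total = 0
    correct = 0
    for sample in samples:
        output = sample["outputs"][engine]
        for entity in sample["gold_entities"]:
            total += 1
            # Simple exact substring match; real data would need case and inflection handling.
            if entity in output:
                correct += 1
    return correct / total if total else 0.0


# Toy data: one sentence with assumed gold entities, not the real dataset.
samples = [
    {
        "source": "Niko is travelling from Macedonia to Spain.",
        "gold_entities": ["Macedonia", "España"],
        "outputs": {
            "unbabel": "Niko viaja desde Florencia a España.",
            "google": "Niko viaja desde Macedonia a España.",
        },
    },
]

for engine in ("unbabel", "google"):
    print(engine, entity_accuracy(samples, engine))
```

An exact string match like this is deliberately strict: it would count a correctly inflected Czech form as wrong unless that exact form is listed in the gold entities, which is one reason the annotation had to follow the language guidelines closely.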
Different results were obtained for different NEs. In terms of Czech proper names, the main issue was the inability of the system to differentiate between the nominative and the vocative case.
Source text: Hello Nikola
MT output: Nikola
Annotated: Ahoj, Nikolo
Unbabel and Google performed equally, each correctly translating around 60% of the names. When it comes to accurately localising proper names, creating simple guidelines and/or custom rules could therefore (eventually) lead to even better results.
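As an illustration of what such a custom rule might look like, the sketch below turns a Czech first name into its vocative form. It is a heavily simplified assumption of mine, not a rule taken from Unbabel’s guidelines or pipeline: it covers only names ending in -a and names ending in a few soft consonants, and it leaves every other name unchanged. The names `to_vocative` and `greet` are hypothetical.

```python
# Soft consonant endings that take -i in the vocative (e.g. Tomáš -> Tomáši).
SOFT_ENDINGS = ("š", "ž", "č", "ř", "j")


def to_vocative(name: str) -> str:
    """Return a simplified Czech vocative form of a first name."""
    lower = name.lower()
    if lower.endswith("a"):
        # Names ending in -a take -o: Nikola -> Nikolo, Jana -> Jano.
        return name[:-1] + "o"
    if lower.endswith(SOFT_ENDINGS):
        return name + "i"
    # All other patterns and exceptions are not handled in this sketch.
    return name


def greet(name: str) -> str:
    """Build an informal Czech greeting with the name in the vocative."""
    return f"Ahoj, {to_vocative(name)}"


print(greet("Nikola"))
```

A production rule set would need many more patterns and exceptions, which is why the corpus of Czech proper names and the orthography rules were so useful during annotation.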
With regard to countries and cities, both in Czech and Spanish, Google outperformed Unbabel.
The main problems for Unbabel were hallucinations and/or entities left untranslated, whereas Google never hallucinated and only produced minor mistranslations.
Source text: Niko is travelling from Macedonia to the Czech Republic.
Unbabel output: Niko cestuje z Velké Británie do České republiky.
Google output: Niko cestuje z Makedonii do České republiky.
Source text: Niko is travelling from Macedonia to Spain.
Unbabel output: Niko viaja desde Florencia a España.
Google output: Niko viaja desde Macedonia a España.
Consequently, the localisation of countries and cities could be improved by updating the pipeline of Unbabel’s engine.
The only drawback of my placement was that it lasted only two months, so I won’t be able to see the further actions taken on the basis of my work.
Post Written by Nikola Spasovski