Summer Placement Diaries

Kateryna Poltorak's Summer Placement with Sketch Engine and the University of Johannesburg.

I spent the first part of my EMTTI placement with Sketch Engine (SE). SE is a web-based service that offers a wide range of pre-loaded, ready-to-use corpora in more than 90 languages and also includes features for building, annotating and analysing specialised textual corpora. The SE platform is widely used by language researchers, language teachers, translators, interpreters, and students. SE algorithms examine text corpora containing billions of words to determine what is common in language and what is uncommon, unique, or new. The tool is also suitable for text mining and text analysis applications.


I started my placement by completing the Boot Camp Online (BCO) course. It was originally a two-day event, which the SE Team moved online during the pandemic. The course covers all of SE’s features for corpus searches, corpus analysis, and corpus creation. The BCO course helped me get to grips with SE terminology, refresh some concepts that I had learned during this academic year at the University of Wolverhampton, and prepare for my dissertation project. The course consists of several sections with embedded links to external websites, the Sketch Engine interface, YouTube videos explaining how to use the tools, images, books, practical exercises and quizzes.

The Introduction part of the course taught the basics, such as corpus size, balance and representativeness, and terminology including token, lemma, tag, lowercase and lempos. The Tools, part 1 section covered how to use the word sketch, the difference between a word sketch and a wordlist, a recap of regular expressions and the NLP principles behind the tools. The Tools, part 2 section taught the notion of n-grams and how to identify words and phrases which are typical or representative of the content of a piece of text, i.e. keywords and terms. The Tools, part 3 section explained concordance searching and the processing of concordance results. A special section was dedicated to the Corpus Query Language (CQL), a special code that allows users to search the corpus using any piece of data, or a combination of data.
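
To make the idea of keywords more concrete, here is a minimal Python sketch of one common approach: comparing a word's relative frequency in a focus corpus with its relative frequency in a reference corpus, using a smoothed ratio. The function names, the toy sentences and the smoothing value are my own illustrative assumptions; this is not Sketch Engine's actual implementation.

```python
from collections import Counter


def per_million(count, total):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / total


def keyness(focus_tokens, reference_tokens, smoothing=1.0):
    """Rank focus-corpus words by a smoothed ratio of relative frequencies.

    The smoothing constant (an assumed value) avoids division by zero for
    words that never occur in the reference corpus.
    """
    focus = Counter(focus_tokens)
    reference = Counter(reference_tokens)
    n_focus, n_ref = len(focus_tokens), len(reference_tokens)
    scores = {}
    for word, count in focus.items():
        f = per_million(count, n_focus)
        r = per_million(reference[word], n_ref)  # Counter returns 0 if absent
        scores[word] = (f + smoothing) / (r + smoothing)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpora, made up for illustration only.
focus = "the subtitle editor aligns each subtitle with the audio".split()
reference = "the cat sat on the mat and the dog sat on the rug".split()

ranked = keyness(focus, reference)
print(ranked[:3])
```

In this toy example, "subtitle" ranks highest because it is frequent in the focus text and absent from the reference text, while "the" ranks lowest because it is common in both.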

The Annotating a corpus section presented SE features for enriching a corpus with metadata (POS tags, lemmas and other attributes or text types). The Multilingual features section explained concordance search in parallel or translated texts, parallel corpus building, the multilingual word sketch and bilingual term extraction. I was particularly interested to learn that the SE Team is developing tools for semantic annotation of corpora, which means enriching a corpus with information about how humans understand the meanings of the words in a text. Semantically annotated corpora are a very useful resource for language research, especially in natural language processing and machine translation. At the end of the course, I completed a final assignment that consisted of:

  1. building a thematic corpus using the ‘corpus from the web’ feature
  2. checking the corpus quality by looking through the keywords, terms and wordlist for any suspicious items
  3. finding out which websites contributed the most tokens to the corpus
  4. finding out which countries contributed text to the corpus
  5. picking the 10th most frequent lemma in the corpus
  6. generating the most frequent n-grams containing the 10th most frequent lemma
  7. picking one of the words (or lemmas) in the corpus and providing a list of websites where it is found
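
Some of these steps can be sketched outside the SE interface. The Python snippet below is a minimal illustration of steps 3, 5 and 6 on a tiny made-up corpus: counting tokens per website, picking a lemma by frequency rank, and listing the most frequent n-grams that contain it. The data structure, the website names and the function names are hypothetical, and because the toy corpus is so small the example uses the 3rd most frequent lemma rather than the 10th.

```python
from collections import Counter

# Hypothetical toy corpus: each document records its source website and
# an already lemmatised token sequence.
docs = [
    {"site": "example.org", "lemmas": "green tea be good and green tea be cheap".split()},
    {"site": "example.net", "lemmas": "black tea be strong".split()},
    {"site": "example.org", "lemmas": "iced tea".split()},
]


def tokens_per_site(docs):
    """Step 3: total number of tokens contributed by each website."""
    counts = Counter()
    for doc in docs:
        counts[doc["site"]] += len(doc["lemmas"])
    return counts.most_common()


def nth_most_frequent_lemma(docs, n):
    """Step 5: the lemma at frequency rank n (1-based)."""
    freq = Counter(lemma for doc in docs for lemma in doc["lemmas"])
    return freq.most_common()[n - 1][0]


def ngrams_containing(docs, lemma, size):
    """Step 6: n-grams of the given size that contain the lemma, by frequency."""
    grams = Counter()
    for doc in docs:
        seq = doc["lemmas"]
        for i in range(len(seq) - size + 1):
            gram = tuple(seq[i:i + size])
            if lemma in gram:
                grams[gram] += 1
    return grams.most_common()


site_ranking = tokens_per_site(docs)
lemma = nth_most_frequent_lemma(docs, 3)
bigrams = ngrams_containing(docs, lemma, 2)
print(site_ranking, lemma, bigrams)
```

N-grams are counted within each document only, so a phrase never spans two different web pages.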


During the second part of the placement, I practised audiovisual translation (AVT) under the supervision of Prof. Eleanor Cornelius from the University of Johannesburg (UJ) in South Africa. Prior to the start of the placement, Prof. Cornelius organised a virtual meeting during which she asked me about the practical skills I would like to acquire during my placement. As I had always been intrigued by AVT, I suggested translating a one-hour episode of a Ukrainian travel vlog, which had gone viral over the past couple of years, from Russian into Spanish. Here is the link to the vlog. Prof. Cornelius welcomed my idea and we outlined a plan for the placement. Prof. Cornelius also identified and approached potential partners from other universities who are experts in the field of AVT.

During the first week I familiarised myself with the Aegisub subtitling software, one of the more popular applications used by subtitlers. Another advantage of Aegisub is that it is available free of charge. During this week Prof. Cornelius organised a number of webinars to help me get started with the application. Laurinda van Tonder, a lecturer at North West University (NWU) and a PhD student of Prof. Cornelius at UJ, who has researched fansubbing and creative subtitling extensively, introduced me to the theory and guidelines of subtitling and shared some “do’s and don’ts” of the visual representation of subtitles. During a two-hour session with Dr. Gordon Mathew, Research Technologist at NWU and an expert in subtitling for educational purposes, I learned the technicalities and best practices of working with Aegisub, including subtitling conventions such as the maximum number of characters per line, the maximum duration of a subtitle, the location and style of the text, the time frames and the most frequently used shortcuts.
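
Conventions such as the maximum number of characters per line and the maximum duration of a subtitle are easy to check automatically. The following Python sketch shows one way to do so. The limits used here (42 characters per line, two lines, 7 seconds) are illustrative assumptions of mine rather than the values taught in the sessions, and the `Subtitle` class is a hypothetical helper, not part of Aegisub.

```python
from dataclasses import dataclass

# Assumed limits for illustration; real guidelines vary by client and platform.
MAX_CHARS_PER_LINE = 42
MAX_LINES = 2
MAX_DURATION_S = 7.0


@dataclass
class Subtitle:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # subtitle text, lines separated by "\n"


def check(sub):
    """Return a list of human-readable problems found in one subtitle."""
    problems = []
    lines = sub.text.split("\n")
    if len(lines) > MAX_LINES:
        problems.append(f"too many lines: {len(lines)}")
    for line in lines:
        if len(line) > MAX_CHARS_PER_LINE:
            problems.append(f"line too long ({len(line)} chars): {line!r}")
    duration = sub.end - sub.start
    if duration > MAX_DURATION_S:
        problems.append(f"on screen too long: {duration:.1f} s")
    return problems


ok = Subtitle(1.0, 3.5, "Hola a todos.")
bad = Subtitle(4.0, 12.5, "x" * 50)
print(check(ok))
print(check(bad))
```

A check like this only flags candidates for revision; deciding how to shorten or split a subtitle remains the translator's job.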

In the second week I concentrated on the linguistic part of the project, which posed some translation challenges. One of my concerns was whether a subtitler should preserve all the punctuation marks or omit them to make the text easier to read. Another thing that puzzled me was whether I should translate everything in the second person plural, as in the original text, or make it shorter (more condensed) and easier for the reader to process. Still, the biggest challenge was to preserve the humour. In one of the scenes, the author of the vlog says “This is my recomendación!”, intentionally using the Spanish word to sound cooler. This is a “lost-in-translation” case, given that I was translating into Spanish. We all remember the iconic “Hasta la vista, baby” from Terminator 2. Interestingly, the Spanish translators made Arnie speak Japanese: in the Peninsular Spanish version of the film, the protagonist says “Sayonara, baby”. I am still considering some options for a substitution similar to this one :)

I spent the third week adjusting the timing in Aegisub while also revising my translations. Some of the sentences had to be split across subtitles. Sometimes I needed to find a substitute for a long word or phrase while preserving the meaning. This is the subtle art of subtitling, together with rendering the humour so as not to exclude the target audience. I also experimented with the file formats used for subtitling and text extraction. At the end of the placement, we had a virtual event to which Prof. Cornelius invited academics with extensive expertise in AVT and some of the students of UJ. During this event, I presented my project and participated in a fruitful discussion on subtitling.
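
As an illustration of text extraction from a subtitle file, here is a minimal Python sketch that pulls the plain text out of a file in the widely used SubRip (.srt) layout. The choice of SRT is my assumption for this example, and the sample cues and the function name are made up for illustration.

```python
import re

# Made-up sample in SubRip layout: index line, timing line, then text lines.
SRT_SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Hola a todos.

2
00:00:04,000 --> 00:00:06,000
Esta es mi
recomendación.
"""

# Matches a timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def extract_text(srt):
    """Return the text of each cue, with multi-line cues joined by a space."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if TIMING.match(line.strip()):
                # Everything after the timing line is subtitle text.
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                if text:
                    cues.append(text)
                break
    return cues


texts = extract_text(SRT_SAMPLE)
print(texts)
```

The extracted lines could then be reviewed as running text, or loaded into a corpus tool for further analysis.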

I am grateful to Miloš Jakubíček and Ondřej Matuška for introducing me to CQL and for giving me the opportunity to learn and practise corpus query techniques on the Sketch Engine platform. I would like to thank Prof. Cornelius for broadening my understanding of the technical side of subtitling during this placement and for always being very accommodating and helpful. I am particularly grateful to the EMTTI Coordinators and all the staff for providing students with such remarkable placement options. I would highly recommend Sketch Engine and the University of Johannesburg as placement options to any future student in the EMTTI programme.

Post Written by Kateryna Poltorak