Data Grooming for Machine Translation

Preparing training data for Machine Translation (MT) by means of data grooming can seem like a mammoth task, and there is some truth to this: in practice, the huge variety of “contaminations” in training data, combined with the sheer volume of that data, can pose real challenges. This blog discusses why grooming is still a worthwhile investment and how automation can save a lot of time and effort.

Why should you groom your training data before usage?

You want to employ Machine Translation that is adapted to your individual translation process and uses your company’s own terminology? Then you can make use of customized translation engines from one of the many providers and train them with your own data. Training data in this context usually consists of aligned translation units: pairs of sentences consisting of a source-language sentence and its corresponding target-language translation.

Training requires a sufficiently large amount of training data whose quality is as close as possible to the desired result. If this is not the case and the data contains numerous or even systematic errors, the positive training effect suffers significantly. The MT could even systematically learn erroneous patterns. This is where data grooming comes into play. It can help find many problematic patterns and either remove or correct them before the engine training begins.

What kind of problems can negatively impact data quality?

The list of possible “contaminations” of MT training data is long. To give you an impression, several common problems found in extracted training data are listed below, followed by a small sketch of how the most elementary checks can be automated:

  • Incomplete data, e.g. the source segment exists but the target segment is empty. The target can also be only a partial translation of the source (explanatory additions, for instance, are often not carried over into the translation)
  • Personal data: data that is critical for safety and data protection, e.g. addresses, phone numbers or bank account data
  • Unwanted or outdated terminology: the MT could learn the wrong terminology, e.g. outdated product or department names
  • Damaged or difficult-to-read translation units, e.g. due to defective encoding, missing white space or line breaks
  • Inconsistent data, e.g. multiple sentences in one translation unit or inconsistent formatting for currency or date units
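A minimal Python sketch of such elementary checks might look as follows. The patterns and thresholds here are illustrative assumptions, not production rules:

```python
import re

def find_basic_issues(source: str, target: str) -> list:
    """Flag a translation unit for some of the contamination types listed above."""
    issues = []
    if not target.strip():
        issues.append("empty target")
    # A target far shorter than its source often points to a partial translation.
    elif len(target) < 0.3 * len(source):
        issues.append("possible partial translation")
    combined = source + " " + target
    # Very naive personal-data check: long digit runs that look like phone numbers.
    if re.search(r"\+?\d[\d\s/()-]{7,}\d", combined):
        issues.append("possible phone number")
    # Control characters usually indicate defective encoding.
    if re.search(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", combined):
        issues.append("control characters / damaged encoding")
    # More than one sentence-final punctuation mark may mean multiple sentences.
    if len(re.findall(r"[.!?](?:\s|$)", source)) > 1:
        issues.append("possibly multiple sentences in one unit")
    return issues
```

A real grooming pipeline would run dozens of such checks and record which rule fired for each unit, so that repair or deletion decisions can be made in a later step.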

An individual translation unit might be unproblematic in itself, yet become problematic within the overall context. For example:

  • Multiple occurrences of the same segment: a high frequency can lead to these segments being weighted disproportionately
  • Non-representative data: overrepresentation or underrepresentation of certain domains due to data availability can lead to an unbalanced engine training
  • Incorrect language mapping, e.g. target-language content within the source part of the unit or vice versa, resulting in model errors; in the case of domain-specific training, the same applies to incorrect domain mapping (see the detection sketch below)
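Incorrect language mapping in particular can be screened for with an off-the-shelf language identifier. The sketch below uses the langdetect package and assumes an English→German language pair for illustration; the minimum segment length is our own assumption:

```python
# pip install langdetect
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect deterministic across runs

def has_language_mismatch(source: str, target: str,
                          src_lang: str = "en", tgt_lang: str = "de") -> bool:
    """Return True if either side does not look like its declared language.

    Very short segments are skipped because detection on them is unreliable.
    """
    try:
        if len(source.split()) >= 4 and detect(source) != src_lang:
            return True
        if len(target.split()) >= 4 and detect(target) != tgt_lang:
            return True
    except LangDetectException:
        # Detection fails on e.g. numbers-only segments; leave those to other checks.
        return False
    return False
```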

What can data grooming do about such problems, and how does it work?

In a first step, potentially problematic segments have to be detected and marked. Here it is practically unavoidable to use software that automates this process: in practice, MT training data often contains hundreds of thousands or even millions of translation units, with basically no upper limit. The automated recognition of problematic translation units can range from very elementary processes to highly complex ones. For example, detecting segments that occur multiple times in the training data can be done in a few lines of code, as the sketch below shows. Identifying personal and organization names, in contrast, normally requires advanced methods as well as the support of external software packages.
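To illustrate the elementary end of the spectrum, counting and removing exact duplicates really does fit into a few lines of Python. The corpus is assumed here to be a list of (source, target) tuples:

```python
from collections import Counter

def duplicate_report(units):
    """Count how often each (source, target) pair occurs in the corpus
    and return only the pairs that occur more than once."""
    counts = Counter(units)
    return {unit: n for unit, n in counts.items() if n > 1}

def deduplicate(units):
    """Keep the first occurrence of each pair, preserving corpus order."""
    return list(dict.fromkeys(units))
```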

For the further handling of problematic segments, essentially two approaches can be distinguished: deleting problematic translation units or repairing them. Since keeping the training volume as large as possible is necessary for maximum training effect, repairing these units should always be preferred. However, it should be checked whether this is practicable and whether the repaired translation units remain usable for training.

Examples of repair processes for problematic segments

Code fragments mistakenly left in the text can be automatically recognized and deleted. Personal names, recognized by means of Named Entity Recognition (NER), can be replaced with anonymous placeholders, as the sketch below illustrates. If concrete specifications of the desired terminology exist (e.g. in the form of a terminology database) and common errors in its usage are known, these cases can be identified and fixed manually. While automatically replacing terminology is feasible, it should be done with caution and always be followed by manual inspection.

[Figure: Examples of repairing translation units. Placeholders for named entities can be replaced by pseudonyms during anonymisation.]
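As a sketch of the anonymisation step, the snippet below uses spaCy’s NER (one of several suitable toolkits); the `<PERSON>` placeholder token and the model choice are our own assumptions. In practice, the same replacement has to be applied consistently to both the source and the target side of a unit:

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # the model must match the segment's language

def anonymise_persons(text: str) -> str:
    """Replace person names found by spaCy's NER with a neutral placeholder."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            out.append(text[last:ent.start_char])
            out.append("<PERSON>")
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)

print(anonymise_persons("Please forward the report to John Smith by Friday."))
# expected (model-dependent): "Please forward the report to <PERSON> by Friday."
```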


Other problems – e.g. source or target segments left empty after other grooming steps – can’t be repaired. In this case the translation unit should be deleted. Thoughtfully designed, automated decision algorithms can help strike the right balance between ensuring a certain quality and preserving as much training volume as possible.
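A minimal sketch of such a decision step might look as follows; the policy shown here (repair cheap formatting damage, delete only when one side is empty) is an illustrative assumption:

```python
def groom_unit(source: str, target: str):
    """Return a possibly repaired (source, target) pair, or None to delete it."""
    source, target = source.strip(), target.strip()
    # Unrepairable: one side is empty, possibly after earlier grooming steps.
    if not source or not target:
        return None
    # Repairable: collapse whitespace runs and stray line breaks left by conversion.
    return " ".join(source.split()), " ".join(target.split())

# Applying the policy to a corpus keeps as much training volume as possible:
# groomed = [u for u in (groom_unit(s, t) for s, t in units) if u is not None]
```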

Conclusion

Data grooming is a part of the effort required to train MT engines that should not be underestimated, and it often calls for creative and innovative automation methods. Although grooming usually reduces the training volume, it provides more control over the quality of the training input and over the patterns the engine will replicate. The time it takes to establish requirements, check the data for problems and groom it is therefore always a worthwhile investment: it allows the MT to learn clear patterns that are as error-free as possible, and the resulting increase in output quality significantly reduces the considerable additional work otherwise required later on during post-editing. Furthermore, the analysis stage during data grooming can provide important information on data quality and highlight optimization potential within the translation process.

Striking the right balance between quality requirements and a high training volume for the individual training scenario is central to the data grooming process. This is similar to human learning, where complex patterns have to be repeated with sufficiently varied examples to be understood correctly; at the same time, those examples should be largely consistent in order to permanently consolidate the correct patterns.

