Is it a match? Duplicates in terminology databases

“We have duplicates in our terminology, can you help us remove them?” We often hear this question from customers and prospective customers. The short answer: yes, of course. The long answer begins with a counter-question: what kind of duplicates are we actually talking about?

Duplicate does not equal duplicate

There are different types of duplicates, especially in the context of terminology databases (termbases for short):

  • Duplicates at string level: Only the pure text of the terms is identical.
  • Duplicates at term level: Not only the text, but also all metadata fields of the terms and their contents (e.g. status, source or definition) are identical.
  • Duplicates at concept level: All terms of a concept, including all metadata at the term and concept level, are identical.
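
To make the three levels concrete, here is a minimal Python sketch. The `Term` and `Concept` classes, their fields and the helper functions are assumptions made purely for illustration; they do not reflect the data model of any particular termbase system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    # Hypothetical term record: the text plus a few metadata fields.
    text: str
    status: str = ""
    source: str = ""
    definition: str = ""


@dataclass(frozen=True)
class Concept:
    # Hypothetical concept entry: an ID, its terms, one concept-level field.
    concept_id: str
    terms: frozenset
    subject_field: str = ""


def term_duplicates(a, b):
    """Return sorted (text, level) pairs for duplicates between two concepts."""
    hits = []
    for term_a in a.terms:
        for term_b in b.terms:
            if term_a == term_b:
                # Text and all term metadata are identical.
                hits.append((term_a.text, "term"))
            elif term_a.text == term_b.text:
                # Only the pure text is identical.
                hits.append((term_a.text, "string"))
    return sorted(hits)


def is_concept_duplicate(a, b):
    """True if all terms and the concept-level metadata are identical."""
    return a.terms == b.terms and a.subject_field == b.subject_field
```

With these definitions, two entries that both contain “camera” as a preferred term would be reported at term level, while “camera” as preferred term in one entry and as forbidden term in another would only be reported at string level.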

Get rid of it – right?

Well, we don’t want duplicate entries in the termbase; that much is clear. Duplicates at concept level can therefore be removed with a clear conscience. The other types are not that easy. For duplicates at term and string level, the entire concepts must always be compared before deciding what should happen to them. Purely textual duplicates in particular are common in termbases, and they are often fully intentional: a term can be forbidden for several concepts, or it can be the preferred term in one context and a forbidden term in another. It is precisely this distinction that makes a concept-oriented terminology database valuable. Deleting such duplicates would therefore be counterproductive.

But if not delete, then what?

Comparing concepts/entries can result in two possible scenarios:

  1. The entries do not reference the same concept.
  2. The entries reference the same concept, are not identical at concept level, but contain duplicates at term or string level.

The 1st case has already been mentioned above: everything is as it should be and nothing needs to be done. It is the 2nd case that “disturbs” a termbase the most. When looking up a term, it is not clear which entry can be trusted, and for that lookup the termbase no longer serves its purpose.
So in the 2nd case there may, under certain circumstances, be potential for merging the concepts into one. I say ‘under certain circumstances’ because this undertaking is not trivial either.

The essence lies in the metadata

Before merging, we must always look at the metadata of the concepts. Comparing the metadata again leads to one of two scenarios:

  1. The different metadata in the concepts do not contradict each other, i.e. they either complement each other (e.g. one concept has a definition and the other doesn’t), or they can be combined (e.g. sources).
  2. The different metadata in the concepts contradict each other (e.g. status).

In the 1st case, the concepts can be merged automatically. The 2nd case always requires subject-matter expertise: it must be agreed which metadata is the “correct” one before the concepts are merged.
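
The distinction between the two cases can be sketched in a few lines of Python. The field names (`definition`, `sources`, `status`), the dictionary layout and the function `merge_metadata` are assumptions for illustration only; a real termbase will have its own fields and its own rules for which of them may be combined.

```python
def merge_metadata(a, b, additive=("sources",)):
    """Merge two metadata dicts and report contradictions.

    Returns (merged, conflicts). Fields listed in `additive` hold lists that
    are combined; all other fields must agree or complement each other.
    """
    merged, conflicts = {}, {}
    for key in sorted(set(a) | set(b)):
        value_a, value_b = a.get(key), b.get(key)
        if key in additive:
            # Values that can be combined, e.g. lists of sources.
            merged[key] = sorted(set(value_a or []) | set(value_b or []))
        elif value_a in (None, ""):
            # Only the second entry provides a value: it complements the first.
            merged[key] = value_b
        elif value_b in (None, "") or value_a == value_b:
            # Only the first entry provides a value, or both agree.
            merged[key] = value_a
        else:
            # Contradiction: leave the decision to a subject-matter expert.
            conflicts[key] = (value_a, value_b)
    return merged, conflicts
```

If `conflicts` comes back empty, the two entries fall into the 1st case and could be merged automatically. Any key in `conflicts` (for example two different status values) marks the 2nd case and needs a human decision first.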

Not the same but similar…

And while we are on the subject of merging, I do not want to ignore another borderline case: in addition to duplicates, a termbase often contains terms (usually at string level) that are not identical, but similar. Three types of similarity are relevant here:

  • Morphological similarity: Spelling variants or hyphenation (e.g. “programme” vs. “program”).
  • String subsets: One term is part of another (“camera” in “rear view camera”) or has one part in common with another (“rear view camera” and “front view camera”).
  • Semantic similarity: Synonyms whose strings can be completely different (e.g. “tiredness recognizer” and “break recommendation”).

The goal of identifying morphological and semantic similarity candidates is usually to find forbidden terms. These candidates can potentially be merged, and their concepts can be treated in the same way as duplicates. String subsets, which usually refer to different concepts, can support the creation of taxonomies and concept maps that show relationships between related concepts.
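
The two string-based similarity types can be approximated with Python’s standard library alone. The following is a rough heuristic sketch, not a description of how any specific tool works: the function names, the labels and the threshold of 0.85 are assumptions chosen for this example. Semantic similarity cannot be detected this way, because the strings may have nothing in common.

```python
from difflib import SequenceMatcher


def normalize(term):
    """Lowercase a term, treat hyphens as spaces and split it into tokens."""
    return term.lower().replace("-", " ").split()


def similarity_type(a, b, threshold=0.85):
    """Classify a pair of terms by string-based similarity (heuristic)."""
    if a == b:
        return "duplicate"
    tokens_a, tokens_b = normalize(a), normalize(b)
    ratio = SequenceMatcher(None, " ".join(tokens_a), " ".join(tokens_b)).ratio()
    if tokens_a == tokens_b or ratio >= threshold:
        # Differ only in case, hyphenation or a few characters.
        return "morphological"
    set_a, set_b = set(tokens_a), set(tokens_b)
    if set_a < set_b or set_b < set_a:
        # All words of one term also occur in the other.
        return "subset"
    if set_a & set_b:
        # The terms have at least one word in common.
        return "shared part"
    return None  # no string-based similarity found


for pair in [
    ("programme", "program"),
    ("camera", "rear view camera"),
    ("rear view camera", "front view camera"),
    ("tiredness recognizer", "break recommendation"),
]:
    print(pair, similarity_type(*pair))
```

The threshold is a trade-off: set too low, terms that merely share a word are flagged as spelling variants; set too high, genuine variants are missed. Candidates found this way still need a human review before anything is merged.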

And this is how we do it

Using the blc Data Toolkit, we can remove duplicates at the concept level and identify and mark the other duplicate types, subsets and morphological similarities. The blc Data Toolkit can also determine semantic similarity using AI; however, this requires a large amount of continuous text. In addition, conflicts in potentially mergeable concepts can be identified and highlighted automatically. We then go through the merge candidates manually and process them until only the truly critical conflict cases remain for our customer. The procedure always requires close cooperation between our terminologists and computational linguists, as the data has to go through several manual and machine loops.

Conclusion

There are different types of duplicates (and similarity candidates). How best to handle them to get the best out of the termbase depends on their nature. Sounds complicated? Not with our help! 

Missed my presentation at the tcworld conference 2023? From 27 November, you can watch the recording in the ‘Tagungstool’: “Vom Sprachen-Stau zur Terminologie-Autobahn – Wie Porsche das Wörter-Wirrwarr angeht”. I will of course be happy to answer any questions you may have.

For anyone who wants to find out more about how to master their own terminology: We have something for you.

On 14/15 March 2024, Terminologie³ 2024 will take place once again at the Novotel in Karlsruhe. An event where terminology enthusiasts, whether beginners or full professionals, will get their terminological money’s worth. Early bird prices are also available until 10 January 2024 😉
