Different methods of term extraction

After explaining what term extraction actually is in a previous blog post, we now turn our attention to the different term extraction methods. Before a company tackles term extraction to build its terminology, it is important to understand its needs and the available resources. This is the only way to set up a sensible procedure. Since our team at blc is committed to implementing efficient processes for optimal output, we will start with the prerequisites: What methods and tools are available for extracting terminology from source texts?

Manual vs. automatic term extraction

In manual term extraction, a terminologist reads through the source text and picks out term candidates by eye. The advantage is that the terminologist sees each term in its immediate context and can use their terminological expertise to assess whether a word is a term candidate or not. The disadvantage is that manual extraction becomes very time-consuming as the document volume grows. Moreover, the results depend greatly on individual judgment.

The alternative to manual term extraction is automatic term extraction. In this process, a list of term candidates is generated automatically from selected source documents. The generated list must then be checked manually, because a machine cannot assess whether the extracted words or word groups are actually terminology. Nevertheless, a major advantage of automatic term extraction is the considerable amount of time saved compared to manual term extraction. Instead of having to read all source documents, one only has to check the term candidates generated by the machine.

Monolingual vs. multilingual term extraction

In monolingual term extraction, terms are extracted only in the source language. They can be transferred into the other corporate languages later, once they have been added to the terminology database.

An alternative is bilingual or multilingual term extraction. Here, terms from the source language are immediately assigned to their target language equivalents. Translation memories or aligned source and target texts serve as the starting point.
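To give a rough idea of how aligned segments can yield equivalents, here is a minimal sketch that pairs each source word with the target word it co-occurs with most consistently, using the Dice coefficient. The function name `dice_alignment` and the three German-English segments are our own illustrative assumptions, not the method of any particular tool.

```python
from collections import Counter
from itertools import product

def dice_alignment(pairs):
    """For each source word, find the target word with the highest Dice score."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        # Count each word once per segment.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    best = {}
    for (s, t), both in joint.items():
        # Dice coefficient: 2 * joint frequency / sum of individual frequencies.
        score = 2 * both / (src_count[s] + tgt_count[t])
        if score > best.get(s, ("", 0.0))[1]:
            best[s] = (t, score)
    return best

# Hypothetical aligned segments, e.g. exported from a translation memory.
segments = [
    ("die Pumpe", "the pump"),
    ("das Ventil", "the valve"),
    ("die Pumpe und das Ventil", "the pump and the valve"),
]
equivalents = dice_alignment(segments)
```

With so little data, function words such as articles tend to receive unreliable matches, so the resulting pairs are only candidates and still need to be checked manually.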

Statistical vs. linguistic term extraction

Statistical term extraction evaluates how often individual words or word combinations occur in different documents. Certain parameters can usually be configured directly within the tool, for example how often a word has to occur in order to be extracted as a term candidate, or the maximum number of words a term candidate may consist of.
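The two settings mentioned above can be illustrated with a minimal sketch. The names `extract_candidates`, `min_freq` and `max_words` are our own assumptions for illustration and do not correspond to the options of any specific tool.

```python
import re
from collections import Counter

def extract_candidates(text, min_freq=2, max_words=3):
    """Count all word sequences of up to max_words words and keep the frequent ones."""
    # Simple tokenization: lowercase words, hyphenated compounds stay together.
    # Sentence boundaries are ignored in this sketch.
    tokens = re.findall(r"\w+(?:-\w+)*", text.lower())
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    # Keep only candidates that reach the minimum frequency.
    return {term: freq for term, freq in counts.items() if freq >= min_freq}

sample = (
    "The term extraction tool creates a term list. "
    "The term list is checked after term extraction."
)
candidates = extract_candidates(sample, min_freq=2, max_words=2)
```

In this small sample, "term extraction" and "term list" reach the threshold, but so do general words such as "the", which shows why a purely frequency-based list needs further filtering.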

Furthermore, methods for calculating statistical correlations, such as co-occurrence measures, are used. These measures are calculated from the occurrence frequencies of words and indicate whether two or more words appear together by chance or not.
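One widely used co-occurrence measure is pointwise mutual information (PMI), which compares how often two words actually appear next to each other with how often they would be expected to if they were independent. The following sketch is only one possible choice of measure, and the function name `pmi_scores` and the sample tokens are assumptions made for illustration.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Pointwise mutual information for each pair of adjacent words."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), freq in bigrams.items():
        p_pair = freq / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI > 0 means the pair occurs more often than chance would predict.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

tokens = "the hydraulic pump and the hydraulic pump or the valve".split()
scores = pmi_scores(tokens)
```

Here the pair "hydraulic pump" scores higher than "the hydraulic". PMI tends to overrate pairs that occur only once, which is why it is usually combined with a minimum frequency.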

In a purely statistical extraction, the individual words are not analyzed, so there is no filtering according to word classes. As a result, the extracted material contains a high proportion of general-language words. This can be remedied in part by stop word lists, which exclude certain words (e.g. common nouns, prepositions and conjunctions) from the extraction. Nevertheless, the list of candidates created with statistical term extraction usually requires intensive post-processing.
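A stop word filter can be sketched in a few lines. The stop word list below is a deliberately tiny, made-up example, and the rule of dropping candidates that start or end with a stop word is one common convention we assume here, not a fixed standard.

```python
# Hypothetical, deliberately short stop word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "is", "for", "to", "with"}

def filter_candidates(candidates, stop_words=STOP_WORDS):
    """Drop candidates that are stop words or that start or end with one."""
    kept = []
    for candidate in candidates:
        words = candidate.lower().split()
        if not words:
            continue
        if words[0] in stop_words or words[-1] in stop_words:
            continue
        kept.append(candidate)
    return kept

raw = ["the", "term", "the term", "term extraction", "extraction of", "list of terms"]
filtered = filter_candidates(raw)
```

Candidates such as "list of terms" survive because the stop word sits inside the phrase, which is often the desired behavior for multi-word terms.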

Linguistic term extraction, on the other hand, is based on the analysis of the morphology and syntax of documents. This makes it possible to determine the word classes (“tagging”) and to trace the term candidates back to their base forms (“stemming”). For the analysis, morphological rules can be stored for the individual languages. General language dictionaries can optimize the results further. Often, classifiers trained on texts annotated with word classes are also used to determine the word class of each word. Because it depends on language-specific rules and classifiers, linguistic term extraction is only available for a limited number of languages.
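The idea can be illustrated with a toy example. The mini lexicon below stands in for a real tagger, and the plural stripping stands in for proper morphological rules. Both are simplifying assumptions, as is the pattern "any number of adjectives followed by one or more nouns".

```python
# Hypothetical mini lexicon; a real system would use a trained tagger.
LEXICON = {
    "the": "DET", "a": "DET",
    "hydraulic": "ADJ",
    "pump": "NOUN", "pumps": "NOUN", "valve": "NOUN", "valves": "NOUN",
    "pressure": "NOUN",
    "control": "VERB", "controls": "VERB",
}

def tag(tokens):
    """Assign a word class to each token; unknown words get the tag 'X'."""
    return [(t, LEXICON.get(t.lower(), "X")) for t in tokens]

def base_form(word):
    """Very naive normalization: strip a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def noun_phrases(tagged):
    """Collect word sequences matching the pattern ADJ* NOUN+."""
    phrases = []
    i = 0
    while i < len(tagged):
        start = i
        while i < len(tagged) and tagged[i][1] == "ADJ":
            i += 1
        noun_start = i
        while i < len(tagged) and tagged[i][1] == "NOUN":
            i += 1
        if i > noun_start:
            words = [w.lower() for w, _ in tagged[start:i]]
            words[-1] = base_form(words[-1])  # normalize the head noun
            phrases.append(" ".join(words))
        elif i == start:
            i += 1  # token fits neither part of the pattern; skip it
    return phrases

sentence = "The hydraulic pumps control the pressure valves".split()
terms = noun_phrases(tag(sentence))
```

Because the lexicon and the rules are specific to English, the sketch also shows why this approach has to be set up separately for each language.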

Can you do both?

Due to the comprehensive analysis, linguistic extraction usually provides higher quality results than statistical extraction. However, linguistic term extraction does not take into account the frequency of words, which often allows valuable conclusions about the relevance of technical terms. Hybrid term extraction is a way to meet both requirements. This method combines the determination of frequencies with a linguistic analysis of the candidates, including their word classes and their base forms. Good systems therefore usually combine both approaches.
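A hybrid approach can be sketched by counting only those word sequences whose word classes match an allowed pattern, after reducing them to a base form. As before, the small lexicon, the allowed patterns and the names `hybrid_candidates` and `normalize` are illustrative assumptions rather than the behavior of a specific system.

```python
from collections import Counter

# Hypothetical mini lexicon standing in for a real tagger.
POS = {
    "the": "DET", "a": "DET", "is": "VERB", "checks": "VERB",
    "hydraulic": "ADJ", "pump": "NOUN", "pumps": "NOUN",
    "valve": "NOUN", "technician": "NOUN",
}
# Word class patterns that may form a term candidate.
ALLOWED = {("NOUN",), ("ADJ", "NOUN"), ("NOUN", "NOUN")}

def normalize(word):
    """Very naive base form: strip a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def hybrid_candidates(tokens, min_freq=2, max_words=2):
    """Linguistic filter first, then a frequency threshold on base forms."""
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            tags = tuple(POS.get(w, "X") for w in gram)
            if tags in ALLOWED:
                words = gram[:-1] + [normalize(gram[-1])]
                counts[" ".join(words)] += 1
    return {term: freq for term, freq in counts.items() if freq >= min_freq}

text = "the hydraulic pump is new the technician checks the hydraulic pumps"
result = hybrid_candidates(text.split())
```

In this sample, "pump" and "pumps" are counted together, so "hydraulic pump" reaches the threshold, while general words like "the" never enter the count at all.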

Conclusion: The end determines the means

When it comes to efficiency, automatic extraction “is victorious” over manual extraction. In addition, it is advisable to encourage authors to submit term suggestions directly during text creation. This way, terminology is identified and included at an early stage. The choice between monolingual and multilingual term extraction, and between statistical and linguistic term extraction, depends on various factors. These include the following:

  • volume of text per year
  • available capacity
  • expertise of the people involved
  • and of course the systems available.

Interested?

Would you like to learn more about the topic? Or would you like to find out which method fits best into your processes? Contact us! We will be happy to advise you and work with you to develop the optimal terminology process for your needs.

 

Image by 🇸🇮 Janko Ferlič on unsplash
