Different methods for term extraction

In our last blog post we explained what term extraction actually is. This time we turn to the methods of term extraction. Before a company starts extracting terms to build a termbase, it should look at its needs and the resources available; this is the best way to arrive at a sensible procedure. As we at blc are committed to designing efficient processes for optimum output, we are starting today with the prerequisites: What methods and tools are available for extracting terminology from source texts?

Manual vs. automatic term extraction

In manual term extraction, a terminologist reads through the source text and picks out term candidates by eye. The advantage is that the terminologist sees the technical terms in their immediate context and can use their terminology expertise to assess whether the candidates are actually terms. The drawbacks are that manual term extraction can be very time-consuming, depending on the number of documents, and that the results depend on the individual’s assessment. The alternative to manual term extraction is automatic term extraction.

In automatic term extraction, a list of term candidates is generated from selected source documents with machine support. Manual checking of the output by a terminologist is essential: A machine cannot assess whether the extracted words or phrases are actually terminology. Nevertheless, a major advantage of automatic term extraction is the considerable amount of time saved compared to manual term extraction: Instead of the complete source documents, only the automatically generated term candidate lists have to be checked.

Monolingual vs. multilingual term extraction

In monolingual term extraction, only source-language terms are extracted. Equivalents in the other corporate languages can be added later, once the terms have been included in the termbase.

An alternative is bilingual or multilingual term extraction. Here, the target-language equivalents are immediately assigned to the terms from the source language. Translation memories or aligned source and target documents are used as a starting point.

Statistical vs. linguistic term extraction

Statistical term extraction evaluates how often individual words or word combinations occur in documents. As a rule, the tool can be configured to specify how often a term candidate must occur in order to be extracted and how many words a term candidate may consist of. In addition, methods for calculating statistical correlations, such as co-occurrence measures, are used. Co-occurrence measures are calculated from the frequencies with which words occur. They indicate whether two or more words appear together more often than would be expected by chance.
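The frequency-based idea can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular extraction tool: the function names, the `min_freq` and `max_len` parameters, and the choice of pointwise mutual information (PMI) as the co-occurrence measure are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def frequent_candidates(tokens, min_freq=2, max_len=3):
    """Count all word sequences of 1..max_len words and keep the frequent ones."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}

def pmi(tokens, w1, w2):
    """Pointwise mutual information of the adjacent word pair (w1, w2).

    Values above 0 mean the pair occurs together more often than expected
    if the two words were independent (one possible co-occurrence measure).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair = bigrams[(w1, w2)]
    if pair == 0:
        return float("-inf")  # the pair never occurs side by side
    p_pair = pair / (len(tokens) - 1)
    p1 = unigrams[w1] / len(tokens)
    p2 = unigrams[w2] / len(tokens)
    return math.log2(p_pair / (p1 * p2))

if __name__ == "__main__":
    sample = ("The term base stores each term. The term base is checked "
              "before the term base is released.")
    tokens = tokenize(sample)
    print(frequent_candidates(tokens, min_freq=3))
    print(pmi(tokens, "term", "base"))
```

Note that the candidate list still contains general-language items such as "the", which is exactly the weakness of a purely statistical approach.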

In a purely statistical extraction, the words themselves are not analyzed, so there is no filtering by part of speech. As a result, the extracted material contains a high proportion of general-language words. Stop word lists that exclude certain words (e.g. general terms, prepositions and conjunctions) from extraction can help here. Even so, the term candidate list from a statistical term extraction usually requires intensive post-processing.
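A stop word filter of this kind might look as follows. The stop word list and the rule used here (discard a candidate if its first or last word is a stop word) are assumptions chosen for illustration; real tools ship much longer, language-specific lists and may apply different rules.

```python
# Hypothetical, deliberately short stop word list for English.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "on",
              "is", "are", "to", "for", "with", "before", "each"}

def filter_candidates(candidates, stop_words=STOP_WORDS):
    """Drop candidates that start or end with a stop word."""
    kept = []
    for candidate in candidates:
        words = candidate.lower().split()
        if not words:
            continue  # skip empty strings
        if words[0] in stop_words or words[-1] in stop_words:
            continue
        kept.append(candidate)
    return kept

if __name__ == "__main__":
    raw = ["the term base", "term base", "base is", "is", "extraction of terms"]
    print(filter_candidates(raw))
```

Stop words inside a candidate are kept on purpose, so that multi-word terms such as "extraction of terms" survive the filter.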

Linguistic term extraction, on the other hand, is based on an analysis of the morphology and syntax of the documents. This allows the parts of speech to be determined (so-called “tagging”) and the term candidates to be reduced to their base forms (“stemming”). Morphological rules for the individual languages can be stored for the analysis.

General-language dictionaries can improve the results. Parts of speech are often determined by classifiers that have been trained on texts annotated with part-of-speech information. Because it depends on language-specific rules and classifiers, linguistic term extraction is only available for a limited number of languages.
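To make the two steps more concrete, here is a toy sketch of the linguistic approach. Everything in it is an assumption for illustration: the tiny lookup table stands in for a trained tagger, the plural stripping stands in for real morphological rules, and the pattern "optional adjectives followed by one or more nouns" is just one common candidate pattern.

```python
# Toy lexicon standing in for a trained part-of-speech tagger (hypothetical).
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "is": "VERB", "are": "VERB", "stores": "VERB", "checks": "VERB",
    "automatic": "ADJ", "manual": "ADJ", "statistical": "ADJ",
    "term": "NOUN", "terms": "NOUN", "extraction": "NOUN",
    "candidate": "NOUN", "candidates": "NOUN",
    "list": "NOUN", "lists": "NOUN", "tool": "NOUN",
}

def tag(tokens):
    """Assign a part of speech to each token; unknown words get 'X'."""
    return [(t, TOY_LEXICON.get(t.lower(), "X")) for t in tokens]

def base_form(word):
    """Naive English plural stripping; fails on words such as 'analysis'."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w

def noun_phrases(tagged):
    """Collect runs of the form ADJ* NOUN+ and normalize the final noun."""
    phrases, current = [], []

    def flush():
        # Drop trailing adjectives so that every phrase ends in a noun.
        while current and current[-1][1] != "NOUN":
            current.pop()
        if current:
            words = [w.lower() for w, _ in current]
            words[-1] = base_form(words[-1])
            phrases.append(" ".join(words))
        current.clear()

    for word, pos in tagged:
        if pos == "NOUN":
            current.append((word, pos))
        elif pos == "ADJ":
            if current and current[-1][1] == "NOUN":
                flush()  # an adjective after a noun starts a new phrase
            current.append((word, pos))
        else:
            flush()
    flush()
    return phrases

if __name__ == "__main__":
    sentence = "The tool checks automatic term candidates".split()
    print(noun_phrases(tag(sentence)))
```

The dependency on the lexicon and on English-specific suffix rules shows in miniature why linguistic extraction has to be set up separately for each language.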

Can both be combined?

Thanks to its more thorough analysis, linguistic extraction typically delivers higher-quality results than statistical extraction. However, linguistic term extraction does not take into account the frequency of words, which often allows valuable conclusions to be drawn about the relevance of technical terms.

Hybrid term extraction takes both approaches into account. It determines frequencies and parts of speech, analyzes the term candidates linguistically and traces them back to their base forms. Sophisticated systems generally combine both approaches.
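A hybrid pipeline can be sketched by applying a linguistic filter first and a frequency threshold second. The sketch below assumes the sentences have already been tagged by some part-of-speech tagger; the tag names, the function names and the `min_freq` parameter are hypothetical, and base-form reduction is left out to keep the example short.

```python
from collections import Counter

def candidate_runs(tagged_sentence):
    """Return runs of adjectives and nouns that end in a noun."""
    runs, current = [], []
    # The sentinel entry forces the last run to be closed.
    for word, pos in list(tagged_sentence) + [("", "END")]:
        if pos in ("ADJ", "NOUN"):
            current.append((word.lower(), pos))
            continue
        while current and current[-1][1] != "NOUN":
            current.pop()  # trim trailing adjectives
        if current:
            runs.append(" ".join(w for w, _ in current))
        current = []
    return runs

def hybrid_candidates(tagged_sentences, min_freq=2):
    """Linguistic filter first, then keep only sufficiently frequent candidates."""
    counts = Counter()
    for sentence in tagged_sentences:
        counts.update(candidate_runs(sentence))
    return [(term, n) for term, n in counts.most_common() if n >= min_freq]

if __name__ == "__main__":
    tagged_sentences = [
        [("The", "DET"), ("term", "NOUN"), ("base", "NOUN"),
         ("is", "VERB"), ("new", "ADJ")],
        [("Open", "VERB"), ("the", "DET"), ("term", "NOUN"), ("base", "NOUN")],
        [("A", "DET"), ("new", "ADJ"), ("tool", "NOUN")],
    ]
    print(hybrid_candidates(tagged_sentences))
```

The linguistic step removes general-language noise such as articles and verbs, while the frequency step ranks what remains by how often it occurs.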

Source: Jelleke Vanooteghem

Conclusion: The purpose determines the means

When it comes to efficiency, automatic extraction “wins” over manual extraction. However, it is also advisable to encourage authors to submit term suggestions directly while they are writing, so that terminology is recognized and controlled at an early stage. The choice between monolingual and multilingual term extraction, and between statistical and linguistic term extraction, depends on various factors. These include the following:

  • volume of text per year
  • available capacity
  • expertise of the people involved
  • and of course also the systems available


Would you like to learn more about the topic and find out which method of term extraction fits best into your processes? Contact us! We will be happy to advise you and work with you to develop your optimal terminology process.

Image by Janko Ferlič on unsplash
