Everyone has been talking about Artificial Intelligence (AI) since the introduction of ChatGPT in autumn 2022. Developments in this area are rapid and exciting: many tasks that previously could only be done by humans can now also be handled by machines. Technological progress is driving the development of new AI applications at an enormous rate, not least in the language sector. With such a large selection of AI tools and system vendors for all kinds of application scenarios, however, the choice is not easy. A quality evaluation can help!
This article gives an overview of the options for evaluating AI in language processes and of future perspectives. Do you have further questions about Artificial Intelligence for your language processes? Contact us and together we will find the optimal solution for you!
Why evaluate quality?
Scenario number 1 is the selection of the appropriate AI application.
AI has many faces. What is the use case, and which task actually needs AI support? Once that question has been answered, you can decide which type of AI is suitable for the individual use case and which applications come into question. The requirements usually lead to a shortlist of methods or systems that can be tested against each other to decide which one is ultimately introduced.
Scenario number 2 is regular quality evaluation after an AI application has been chosen. It ensures that the AI is continuously improved, e.g. through (re)training, flexible RAG pipelines or by optimizing the data.
And how?
Different methods can be used to evaluate the results. Depending on the AI use case, different metrics are appropriate. Here are a few examples:
Metrics for text classification
Example use case: Sentiment analysis, hate speech detection, sorting of e-mails.
- Precision: Percentage of instances predicted as positive that are actually positive.
- Recall: Percentage of actual positive instances that are correctly detected as positive.
- F1 score: Harmonic mean of precision and recall.
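To make the definitions concrete, here is a minimal Python sketch that computes precision, recall and F1 for a binary classification task such as hate speech detection. The labels and predictions are made-up illustration data.

```python
# Minimal sketch: precision, recall and F1 for a binary classification task
# (e.g. hate speech detection, where 1 marks the positive class).

def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # gold labels from human annotators (invented)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # labels predicted by the classifier (invented)

p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"Precision: {p:.2f}  Recall: {r:.2f}  F1: {f1:.2f}")
# Precision: 0.75  Recall: 0.75  F1: 0.75
```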
Natural Language Processing (NLP) metrics
- BLEU: Evaluates how similar a generated text is to a reference text, based on overlapping word sequences (n-grams). Example use case: Machine translation.
- BERTScore: Assesses the semantic similarity between a generated text and a reference text. Example use case: Text summarization, context adaptation.
- Word Error Rate: Percentage of misrecognized words compared to the reference. Example use case: Speech recognition.
- Fluency: Fluid readability and grammatical correctness of a text. Example use case: Text generation.
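As an illustration of one of these metrics, here is a minimal sketch of the Word Error Rate, computed as a word-level edit distance between a reference transcript and a recognizer's output. The sentences are invented examples.

```python
# Minimal sketch: Word Error Rate (WER) as used in speech recognition.
# WER = (substitutions + deletions + insertions) / number of reference words,
# computed here with a standard word-level edit distance.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

reference = "please schedule the meeting for tuesday morning"
hypothesis = "please schedule a meeting for tuesday"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
# 2 errors (one substitution, one deletion) / 7 reference words ≈ 0.29
```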
Metrics for chatbots
- Diversity: Proportion of distinct, non-repetitive responses; a low value indicates stereotyped or repeated answers.
- Relevance: Relevance of a generated answer in relation to the question asked or the previous context.
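One simple way to approximate diversity automatically is the distinct-n measure: the share of unique n-grams among all n-grams in the generated responses. The sketch below uses invented responses purely for illustration.

```python
# Minimal sketch: a simple diversity measure for chatbot responses
# (distinct-n = share of unique n-grams among all generated n-grams).
# A low value signals stereotyped or repetitive answers.

def distinct_n(responses, n=2):
    ngrams = []
    for response in responses:
        tokens = response.lower().split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

responses = [
    "I am sorry, I cannot help with that.",
    "I am sorry, I cannot help with that.",
    "Your order was shipped yesterday and should arrive tomorrow.",
]
print(f"distinct-2: {distinct_n(responses, n=2):.2f}")
# ≈ 0.68 for this toy example; the lower the value, the more repetitive the bot
```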
Some of these metrics can be calculated automatically using algorithms; for others, human evaluators are needed. Human evaluation is more costly and time-consuming than automatic evaluation, but it provides detailed and valuable insights. Because every metric has strengths in some use cases and weaknesses in others, it is always advisable to combine several: different metrics then cover different aspects of language and yield a more holistic picture of result quality. Ideally, a mixture of human and automatic evaluation should be carried out.
Vision: Self-evaluating AI?
In view of the rapid progress in AI development, the question arises: will AI models be able to evaluate the performance of other models themselves in the future? This idea opens up exciting perspectives on how AI-based evaluation systems could be used in the development and optimization of AI.
In fact, the idea is not as novel as it sounds at first. In machine translation, automatic quality assessment is already in use: so-called quality estimation is an AI model that assesses machine-translated texts with regard to their quality.
Thanks to their analytical capabilities, LLMs could likewise be used to evaluate and improve generated text. The basis for a system in which one AI evaluates another already lies in today's training mechanisms and evaluation metrics. In the future, so-called meta models could be trained specifically to analyse not only the results, but also the architecture, training processes and learning abilities of other AI models.
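A pattern already in use today is the "LLM as judge" approach, in which one model rates another model's output along criteria such as relevance and fluency. The sketch below is purely illustrative: the model name, prompt wording and rating scale are assumptions, not a recommendation for a specific setup.

```python
# Hypothetical sketch of an "LLM as judge" setup: one model rates the
# relevance and fluency of another model's answer on a 1-5 scale.
# Model name, prompt wording and scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_response(question: str, answer: str) -> str:
    prompt = (
        "You are an evaluation assistant. Rate the following answer "
        "for relevance to the question and for fluency, each on a scale "
        "from 1 (poor) to 5 (excellent). Briefly justify both scores.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable LLM could be used
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # keep the judgement as deterministic as possible
    )
    return completion.choices[0].message.content

print(judge_response(
    "When does the return period end?",
    "You can return items within 30 days of delivery.",
))
```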
Conclusion and outlook
If AI systems can effectively evaluate and improve the performance of other AI systems, this could revolutionize how AI is developed and deployed. However, it should not be forgotten that such meta models also have to be evaluated, so that an AI error does not propagate through the entire pipeline. The solution: clean quality assurance processes and hybrid evaluation concepts, individually tailored to the respective AI, with humans as the final control instance.
Do you want to learn more about AI and AI evaluation? Then get in touch with us!