When an LLM evaluates other LLMs…

Modern language models can imitate human behaviour in impressive ways, generating coherent and linguistically clean text in many languages. However, this fluency makes such systems difficult to evaluate and compare, because the differences between them often lie in subtle details such as exact word choice or stylistic and textual properties. Traditional automated evaluation methods, such…