Evaluation of Large Language Models (LLMs) and Pairwise ranking

Feb 23, 2024

In a recent presentation, I discussed the evaluation of Large Language Model (LLM) systems. The talk prompted questions about applying game-style ranking methods, such as Elo-style ratings, to LLM system competitions, for example the LMSYS Chatbot Arena https://lnkd.in/gaMaVbnA and the Negotiation Arena paper https://lnkd.in/gfsDA8JU
An important family of ranking algorithms includes TrueSkill https://lnkd.in/gjw3Yqtb https://lnkd.in/gEpi8MbS and the Bradley-Terry model https://lnkd.in/gbgzZJrN, among others. These models make explicit assumptions about latent skill and use them to predict the outcomes of pairwise comparisons, which can be treated more formally as "games". Historically, such algorithms have been widely used in the gaming community, not only in chess, where the Elo rating system is well known, but also for forecasting the results of many other games. For a good introductory book, see https://lnkd.in/g9NpHgpA
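To make the "predicting games" framing concrete, here is a minimal Python sketch of the standard Elo expected-score formula and rating update, alongside the Bradley-Terry win probability. The function names and the default K-factor of 32 are illustrative assumptions, not taken from any of the linked systems.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo's predicted probability that A beats B (base-10 logistic, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one game.

    score_a is 1.0 if A won, 0.0 if A lost, and 0.5 for a draw.
    k is the step size; 32 is a common choice, assumed here for illustration.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def bradley_terry_prob(strength_a: float, strength_b: float) -> float:
    """Bradley-Terry probability that A beats B, given positive strengths."""
    return strength_a / (strength_a + strength_b)
```

With the substitution strength = 10 ** (rating / 400), the Elo prediction and the Bradley-Terry probability coincide, which is why the two are often discussed together.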
These methods have recently gained popularity beyond games and are now used to compare computational models. The growing interest is reflected in comprehensive reviews, such as the one by the Cohere team https://lnkd.in/gnwfkeJt.
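As an illustration of how model-versus-model results can be turned into a ranking, the following sketch fits Bradley-Terry strengths from a matrix of win counts using the classic iterative (minorization-maximization) update. The function name, the iteration count, and the input layout are assumptions made for this example. It also assumes that every model has at least one win and one loss, so that the estimates are well defined.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of times model i beat model j.
    Returns strengths normalized to sum to 1 (higher means stronger).
    """
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each opponent contributes (games played against j) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        total = sum(updated)
        strengths = [s / total for s in updated]
    return strengths
```

For two models where the first wins 3 of 4 comparisons, `fit_bradley_terry([[0, 3], [1, 0]])` returns strengths of 0.75 and 0.25, matching the observed win rate.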
Pairwise comparison methods have also proved useful across a broad range of ranking problems in search, recommendation, and personalization. They are noted for rapid convergence and for yielding results that are perceived as fairer than those of pointwise ranking methods. Fairness in ranking outcomes is a nuanced topic, however, and requires a detailed discussion of fairness metrics.
For instance, in local or e-commerce search, recommendation systems, and personalization, relying solely on pointwise metrics of popularity and relevance can perpetuate bias, especially when the underlying populations differ greatly in size. Pairwise comparison techniques can reduce such biases, leading to outcomes that are not only fairer but also potentially more effective at driving conversions. Of course, because pairwise data is sparse, one typically builds hybrid systems that use both pointwise and pairwise data.
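One simple way to picture such a hybrid, offered only as a sketch and not as a description of any production system, is to shrink an item's pairwise win rate toward a pointwise prior when few comparisons are available. The function name, the `prior_weight` parameter, and the assumption that the pointwise score is scaled to the range 0 to 1 are all hypothetical.

```python
def hybrid_score(pairwise_wins: float, pairwise_total: float,
                 pointwise_prior: float, prior_weight: float = 10.0) -> float:
    """Blend a pairwise win rate with a pointwise prior score.

    pointwise_prior is assumed to lie in [0, 1] (e.g. a normalized relevance score).
    prior_weight acts like a number of pseudo-comparisons backing the prior:
    with no pairwise data the result equals the prior, and as comparisons
    accumulate the observed win rate dominates.
    """
    return (pairwise_wins + prior_weight * pointwise_prior) / (pairwise_total + prior_weight)
```

An item with no comparisons falls back to its pointwise score, while an item with 90 wins in 100 comparisons and a prior of 0.3 scores about 0.85, close to its observed win rate.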

Written by Andrei Lopatenko