Who Validates the Validators: Evolution of LLM Evaluation

Andrei Lopatenko
Apr 22, 2024


A new LLM evaluation paper from UC Berkeley
Humans are good at evaluating LLMs, but human evaluation is hard to scale.
LLM evaluation of LLMs helps in some cases (see the LLM-as-a-Judge part of my compendium https://lnkd.in/gD2suRCz; many results have been reported in this area), but an LLM is far from being a perfect evaluator for many tasks.
The authors pursue a mixed approach: can an LLM *assist* (not replace) humans in LLM grading? Following their motto, “validate the validators”, their open-source system EvalGen asks humans to grade LLM evaluator outputs and selects the evaluations that best align with the human judges. There are many other interesting details in the paper. I like this approach: it can scale up evaluation while keeping it controlled and aligned with the evaluation needs of the business (under many assumptions that still need to be tested in the real world). (EvalGen is part of ChainForge https://chainforge.ai/, a useful open-source LLM engineering tool in its own right.)
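To make the selection step concrete, here is a minimal sketch (mine, not the authors’ EvalGen code) of what aligning LLM evaluators with human grades can look like: each candidate evaluator prompt is scored by how often its pass/fail judgments agree with a small set of human-graded outputs, and the best-aligned candidate is kept. The function names (`call_llm`, `pick_best_evaluator`) and the toy data are hypothetical placeholders.

```python
# A minimal sketch (not the authors' EvalGen implementation): score candidate
# LLM evaluator prompts by how often their pass/fail judgments agree with a
# small set of human grades, and keep the best-aligned candidate.
from typing import Dict, List, Tuple


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; swap in your provider's client.

    The toy placeholder below always answers "yes" so the sketch runs end to end.
    """
    return "yes"


def llm_grade(evaluator_prompt: str, output: str) -> bool:
    """Ask the LLM evaluator for a binary pass/fail judgment on one output."""
    answer = call_llm(
        f"{evaluator_prompt}\n\nOutput to grade:\n{output}\n\nAnswer yes or no."
    )
    return answer.strip().lower().startswith("yes")


def alignment(evaluator_prompt: str, graded: List[Tuple[str, bool]]) -> float:
    """Fraction of human-graded outputs where the LLM evaluator agrees with the human."""
    agreements = sum(
        llm_grade(evaluator_prompt, output) == human_grade
        for output, human_grade in graded
    )
    return agreements / len(graded)


def pick_best_evaluator(
    candidates: Dict[str, str], graded: List[Tuple[str, bool]]
) -> str:
    """Return the name of the candidate evaluator prompt best aligned with human grades."""
    scores = {name: alignment(prompt, graded) for name, prompt in candidates.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical example: two candidate criterion prompts, a few human-graded outputs.
    candidates = {
        "concise": "Grade whether the response is concise (roughly three sentences or fewer).",
        "grounded": "Grade whether the response uses only facts from the provided context.",
    }
    graded_outputs = [
        ("Some model output ...", True),      # human judged this output acceptable
        ("Another model output ...", False),  # human judged this output unacceptable
    ]
    print(pick_best_evaluator(candidates, graded_outputs))
```

As I understand the paper, the candidate criteria themselves are also proposed and refined interactively while humans grade, so an agreement score like the one above is only the selection piece of the workflow.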
Also, their description of the phenomenon they call “criteria drift” is very valuable.
I have observed this phenomenon frequently in many complex grading tasks: the full evaluation picture only becomes clear as you do the evaluation, and it is hard to predict all evaluation criteria in advance.
quote: “We observed a “catch-22” situation: to grade outputs, people need to externalize and define their evaluation criteria; however, the process of grading outputs helps them to define that very criteria. We dub this phenomenon criteria drift, and it implies that it is impossible to completely determine evaluation criteria prior to human judging of LLM outputs. Even when participants graded first, we observed that they still refined their criteria upon further grading, even going back to change previous grades. Thus, our findings suggest that users need evaluation assistants to support rapid iteration over criteria and implementations simultaneously. Since criteria are dependent upon LLM outputs (and not independent from them), this raises questions about how to contend with criteria drift in the context of other “drifts” — e.g., model drift [4], prompt edits, or upstream changes in a chain. Our findings also (i) underscore the necessity of mixed-initiative approaches to the alignment of LLM-assisted evaluations that also embrace messiness and iteration, and (ii) raise broader questions about what “alignment with user preferences” means for evaluation assistants.”

https://lnkd.in/gSh4MKGk

Written by Andrei Lopatenko

VP Engineering at Zillow. Leading Search, Conversational and Voice AI, and ML at Zillow, eBay, Walmart, Apple, Google, Recruit Holdings. Ph.D. in Computer Science.
