LLM Mar 5–10 2024
Pairwise comparison of LLM models, LLM Evaluation
An additional significant contribution comes from the team behind the LMSYS Chatbot Arena, who elaborate on their methodologies for the pairwise comparison of Large Language Models (LLMs). My opinion: pairwise comparison is invaluable not only for assessing distinct LLMs, as demonstrated in the Chatbot Arena, but also for evaluating successive iterations of an LLM as it undergoes development and enhancement. The importance of the statistical methods discussed in the article cannot be overstated, as they may well become integral tools for the many teams engaged in developing LLMs.
The manuscript elaborates on the application of statistical models, including the Bradley-Terry model, for conducting these comparisons. It is reported, albeit not surprisingly to those familiar with statistical theories of pairwise rankings, that the Bradley-Terry model exhibits superior performance compared to the traditional Elo scoring system in their specific application context.
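To make the idea concrete, here is a minimal sketch of fitting Bradley-Terry strengths from pairwise battle outcomes with off-the-shelf logistic regression. The model names and battles are made up, and this is only an illustration of the general technique, not the arena's production code.

```python
# Minimal Bradley-Terry fit from pairwise battle outcomes (a sketch, not the
# LMSYS implementation). Each battle is (model_a, model_b, winner); ties ignored.
import numpy as np
from sklearn.linear_model import LogisticRegression

battles = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-y", "model-y"),
    ("model-x", "model-y", "model-x"),
    ("model-y", "model-z", "model-y"),
    ("model-y", "model-z", "model-y"),
    ("model-x", "model-z", "model-x"),
    ("model-z", "model-x", "model-x"),
    ("model-z", "model-y", "model-z"),
]

models = sorted({m for a, b, _ in battles for m in (a, b)})
idx = {m: i for i, m in enumerate(models)}

X = np.zeros((len(battles), len(models)))
y = np.zeros(len(battles))
for row, (a, b, winner) in enumerate(battles):
    X[row, idx[a]] = 1.0   # +1 for the first model in the pair
    X[row, idx[b]] = -1.0  # -1 for the second
    y[row] = 1.0 if winner == a else 0.0

# Logistic regression on the +1/-1 design recovers Bradley-Terry log-strengths;
# the default L2 penalty also keeps the fit stable on small samples.
clf = LogisticRegression(fit_intercept=False, max_iter=1000)
clf.fit(X, y)

# Rescale to an Elo-like 400-point logistic scale anchored at 1000 for readability.
scale = 400.0 / np.log(10.0)
ratings = {m: 1000.0 + scale * clf.coef_[0][idx[m]] for m in models}
for m, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{m}: {r:.0f}")
```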
Furthermore, the authors introduce methodologies aimed at expediting the convergence of rankings and identifying anomalies, tailored specifically for the large-scale, real-world deployment of LLMs. These techniques are anticipated to be instrumental in the creation of bespoke LLM evaluators.
The adoption of pairwise comparison is emerging as a critically needed technology within the field of LLMs, promising further contributions in the future. Beyond merely generating a singular score for global ranking, pairwise comparisons excel in elucidating the nuanced spectrum of differences between two models, providing a more detailed understanding of their relative performances and characteristics.
https://lnkd.in/g9VMNx8v
> SaulLM-7B: A pioneering Large Language Model for Law
This article is highly informative and contributes significantly to the field. The authors have trained a specialized Large Language Model (LLM) for the legal domain, building on the Mistral architecture. Distinctively, they rely on large-scale continued pretraining on legal text rather than limiting their approach to mere fine-tuning. This endeavor involved curating a pretraining corpus of 30 billion tokens, reflecting an understanding that fine-tuning alone is inadequate for capturing domain-specific nuances comprehensively.
In addition, the researchers have developed a domain-specific benchmark for evaluating models, an essential tool for systematic assessment within this specialized field. The manuscript details the rigorous process involved in curating the pretraining corpus, including steps such as filtering, composition, deduplication, and a strategy for ongoing pretraining. The authors also provide insights into their instruction tuning process.
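The corpus-curation steps are described at a high level in the paper; as a toy illustration of just one of them, here is a minimal exact-deduplication pass based on content hashing. This is a generic sketch, not the authors' pipeline, which also covers filtering and corpus composition.

```python
# A minimal exact-deduplication pass over a text corpus. Generic illustration
# only; real pipelines typically add near-duplicate detection and quality filters.
import hashlib

def normalize(text: str) -> str:
    # Light normalization so trivially different copies hash to the same value.
    return " ".join(text.lower().split())

def deduplicate(documents):
    seen = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

corpus = [
    "The contract shall terminate upon breach.",
    "The  contract shall terminate upon breach.",   # whitespace-only duplicate
    "Force majeure clauses excuse non-performance.",
]
print(len(deduplicate(corpus)))  # -> 2
```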
However, it is noteworthy that Section 5.2, focusing on implementation details, is relatively brief and would benefit from a more thorough exposition to elucidate the methodologies employed.
The rapid development within the field of LLMs and Generative AI is particularly striking, with many tasks that were considered challenging six months ago now regarded as routine. Two years ago, the process of fine-tuning was perceived as complex and was undertaken by only a few teams outside of major technology companies. A year ago, fine-tuning became more commonplace, with numerous businesses beginning to fine-tune LLMs for their specific needs. Currently, fine-tuning has become a widespread practice. Similarly, six months ago, pretraining was seen as a considerable endeavor. Presently, many teams across various industries engage in pretraining, deriving significant benefits for their business-specific tasks from owning LLMs that are both pretrained and fine-tuned for their specific requirements, indicating that pretraining, too, is becoming a commoditized process.
Evaluation Infrastructures
Great reading from Mozilla AI on how they built an evaluation infrastructure and what they learned from massive multi-model, multi-task evaluation:
“LLM evaluation at scale with the NeurIPS Large Language Model Efficiency Challenge”
Yi: Open Foundation Models by 01.AI
This paper provides an exhaustive account of the methodologies the authors employed for training their Large Language Models (LLMs).
There is a noticeable trend of teams openly discussing their training methodologies, at both ends of the spectrum. At one end are high-level overviews: using Databricks to orchestrate data pipelines, preprocessing, and analytics; HuggingFace for its tokenizers, inference tools, and extensive dataset and open-source model libraries; and MosaicML (integrated with Databricks) for optimizing GPU infrastructure, refining training processes, and configuring LLMs. At the other end are granular explanations: detailed accounts of specific transformer architecture implementations, modifications to the AdamW optimizer, and other precise technical adjustments.
In this paper, the authors focus on elucidating the critical decisions made during training, such as the choice of tokenizer and the rationale for employing Grouped Query Attention even in their relatively small 6-billion-parameter model, which still required nuanced architectural consideration. They also delve into their use of Rotary Position Embedding to handle longer context spans, alongside other significant implementation details.
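Since Grouped Query Attention comes up as one of those decisions, here is a minimal, generic PyTorch sketch of GQA, where several query heads share each key/value head. This is not Yi's implementation, and it omits rotary embeddings, causal masking, and KV caching.

```python
# Minimal grouped-query attention (GQA): n_q query heads share n_kv key/value
# heads (n_kv < n_q). Generic illustration only; RoPE, masking and caching omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_q_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.q_proj = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.o_proj = nn.Linear(n_q_heads * self.d_head, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        # Each KV head serves n_q / n_kv query heads, shrinking the KV cache.
        repeat = self.n_q // self.n_kv
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v)  # (b, n_q, t, d_head)
        return self.o_proj(attn.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 16, 512)
print(GroupedQueryAttention(512, n_q_heads=8, n_kv_heads=2)(x).shape)  # (2, 16, 512)
```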
Furthermore, the paper provides insights into the infrastructure setup, into evaluation at both the model level and the application (chat) level, including human and automated evaluations, and into the methods used to broaden the model’s capabilities: long-context modeling, vision, and depth upscaling.
https://lnkd.in/guddGfzT
A paper from Microsoft presented at WSDM 2024
The paper focuses on evaluating how well LLMs extract information from tabular data, how well LLM internal knowledge represents data stored in tables, and techniques to improve LLM answers to prompts that can be answered from tabular data.
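As a generic illustration of the basic setup (not the paper's method), here is a sketch that serializes a small table into a markdown prompt for an LLM to answer from; `ask_llm` is a hypothetical stand-in for whatever LLM client you use.

```python
# Answering a question over a small table by serializing it as markdown inside
# the prompt. Generic sketch only; `ask_llm` is a hypothetical LLM call.
def table_to_markdown(header, rows):
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)

def build_prompt(header, rows, question):
    return (
        "Answer the question using only the table below.\n\n"
        f"{table_to_markdown(header, rows)}\n\n"
        f"Question: {question}\nAnswer:"
    )

header = ["region", "q1_revenue", "q2_revenue"]
rows = [["EMEA", 120, 135], ["APAC", 98, 110]]
prompt = build_prompt(header, rows, "Which region grew more from Q1 to Q2?")
print(prompt)
# answer = ask_llm(prompt)  # hypothetical LLM call
```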
There is another interesting issue here.
RAG-based approaches are trending in the industry, yet there are many examples where an LLM answers a prompt better from its internal knowledge than through a RAG system, because the relevant data is missing from the retrieval corpus or incomplete.
Whether, for a particular prompt, the better answer will come from the LLM itself or from the LLM-driven RAG system is an open question, and answering it would be valuable in many industrial cases.
I have not yet seen many publications comparing answers from internal knowledge against RAG answers.
I stumbled upon a fascinating paper that appears to have flown under the radar (just two citations on Google Scholar, despite being published in October 2023), yet it offers profound insights for anyone keen on grasping the nuances of RLHF and its interplay with other sciences and methodologies. The paper skillfully situates RLHF within a historical framework, drawing connections to Reinforcement Learning, control theory, economics, and decision theory, among others. One aspect I found particularly enlightening was the exploration of RLHF’s relationship to the Von Neumann–Morgenstern utility theorem, which sheds light on what sets RLHF apart in tackling language problems compared to its applications in other domains.
https://lnkd.in/gWyekAaR
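For readers who want the connection spelled out, here are the two standard results the paper relates, stated from memory rather than quoted from it: the vNM expected-utility representation, and the Bradley-Terry preference model that underlies typical RLHF reward modeling.

```latex
% Standard statements (a sketch, not quoted from the paper).
% vNM: if preferences over lotteries satisfy completeness, transitivity,
% continuity and independence, they admit an expected-utility representation:
\begin{align*}
  L \succeq M \;&\iff\; \mathbb{E}_{x \sim L}[u(x)] \;\ge\; \mathbb{E}_{x \sim M}[u(x)] \\
  \intertext{RLHF reward modeling typically assumes a Bradley-Terry preference
  model and fits the reward $r_\theta$ by maximum likelihood on human comparisons:}
  P(y_1 \succ y_2 \mid x) \;&=\; \sigma\big(r_\theta(x, y_1) - r_\theta(x, y_2)\big), \\
  \mathcal{L}(\theta) \;&=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big].
\end{align*}
```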
A noteworthy article from Pinterest details the enhancements made to their personalized advertisement recommendation system through the use of transformers and various other methodologies. The implementation of transformers has been decisively beneficial across numerous applications of sequential modeling. Applying them to improve predictions for user search, recommendations, personalization, and advertising represents a significant advancement for the technology, potentially unlocking multibillion-dollar value for businesses. https://lnkd.in/g6q5Djjz
The article captures attention by presenting a real-world system that employs transformers alongside other techniques to efficiently serve and prioritize advertisements. Furthermore, it elaborates on the efforts undertaken to address the challenges posed by the increased complexity of new models on the efficiency of service delivery and the implications for infrastructure costs.
In a similar vein, Pinterest has recently published another insightful article that explores the evolution of real-world advertisement systems. This piece charts the progression from traditional Gradient Boosted Decision Trees (GBDT) to Deep & Cross Network (DCN) models, and ultimately to transformers and multi-task learning, highlighting the innovative strides made in the field. https://lnkd.in/gzqTeSgw
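To give a rough feel for the general pattern described above (and only the pattern: this is not Pinterest's architecture), a transformer encoder over a user's recent action sequence can produce a user state that a small ranking head combines with a candidate ad to score.

```python
# Generic sketch of transformer-based sequential modeling for ranking: encode a
# user's recent actions and score a candidate item. Illustration only, not
# Pinterest's system; sharing one embedding table for actions and candidates is
# a simplification.
import torch
import torch.nn as nn

class SequenceRanker(nn.Module):
    def __init__(self, n_items: int, d_model: int = 64):
        super().__init__()
        self.emb = nn.Embedding(n_items, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Sequential(nn.Linear(2 * d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, action_seq: torch.Tensor, candidate: torch.Tensor) -> torch.Tensor:
        # action_seq: (batch, seq_len) ids of recent user actions
        # candidate:  (batch,) id of the candidate ad / item
        user_state = self.encoder(self.emb(action_seq)).mean(dim=1)
        cand = self.emb(candidate)
        return torch.sigmoid(self.head(torch.cat([user_state, cand], dim=-1))).squeeze(-1)

model = SequenceRanker(n_items=1000)
scores = model(torch.randint(0, 1000, (4, 20)), torch.randint(0, 1000, (4,)))
print(scores.shape)  # torch.Size([4])
```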
Within the dynamic ecosystem of artificial intelligence applications in industry, the ongoing evaluation of Natural Language Generation (NLG) capabilities within Large Language Models (LLMs) stands as a critical operational requirement. Companies leveraging these technologies face the necessity to continuously monitor and optimize LLM performance to keep pace with changing user demands, traffic loads, and operational variables. However, the standard approach of human-centric evaluation poses significant challenges, notably in terms of scalability and cost-effectiveness, prompting the question: Is it feasible to employ LLMs themselves in assessing the NLG outputs of similar models?
A recent academic publication ventures into this inquiry, presenting a focused study on the potential of LLMs to serve as evaluators for NLG tasks. The research highlights the effectiveness of pairwise comparison over pointwise comparison in this context, grounded in the observation that the comparative approach aligns more closely with the inherent nature of many NLG tasks. This methodological preference is not only more objective but also echoes the comparative analysis commonly utilized by human evaluators, providing a solid empirical foundation for the study’s approach.
Addressing the challenge of position bias in LLM-driven evaluations, the authors introduce a novel debiasing technique, rigorously tested across various NLG tasks including summarization, data-to-text generation, and dialogue evaluation. The findings indicate that for LLMs of up to 13 billion parameters, comparative assessment significantly outperforms absolute scoring methods, reaching near state-of-the-art performance while offering a scalable and efficient path to automatic LLM evaluation.
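As a minimal sketch of the underlying idea, pairwise LLM-as-judge evaluation can mitigate position bias simply by querying the judge with both orderings and keeping only consistent verdicts. The `judge` callable here is a hypothetical stand-in for an LLM call, and the paper's own debiasing method is more involved than this.

```python
# Pairwise LLM-as-judge with a simple position-bias mitigation: evaluate both
# orderings and keep the verdict only if it is consistent. `judge` is a
# hypothetical stand-in returning 'A', 'B', or 'tie'.
from typing import Callable

Judge = Callable[[str, str, str], str]  # (prompt, first_answer, second_answer) -> 'A' | 'B' | 'tie'

def compare(judge: Judge, prompt: str, answer_1: str, answer_2: str) -> str:
    forward = judge(prompt, answer_1, answer_2)   # answer_1 shown in position A
    backward = judge(prompt, answer_2, answer_1)  # positions swapped
    backward_flipped = {"A": "B", "B": "A", "tie": "tie"}[backward]
    if forward == backward_flipped:
        return {"A": "answer_1", "B": "answer_2", "tie": "tie"}[forward]
    return "tie"  # the verdict flipped with position, so treat it as undecided

# Aggregating compare(...) outcomes over a benchmark yields win/loss records that
# can be fed into a Bradley-Terry fit like the one sketched earlier in this post.
```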