Thursday, November 14, 2024

Claude 3.5 Sonnet comes out on top in Galileo’s Hallucination Index

The AI company Galileo has announced its latest Hallucination Index, a framework that evaluates 22 leading generative AI models.

Models are tested using a metric called context adherence, which measures “closed-domain hallucinations: cases where your model said things that were not provided in the context.”

According to the ranking, the best-performing model overall for retrieval-augmented generation (RAG) is Claude 3.5 Sonnet from Anthropic. Galileo said that this model and Anthropic’s other model, Claude 3 Opus, had near-perfect scores, beating out OpenAI’s models, which won last year.

From a cost perspective, the best-performing model was Google’s Gemini 1.5 Flash. Alibaba’s Qwen2-72B-Instruct was the best-performing open-source model overall, though in short-context RAG tests, Meta’s llama-3-70b-instruct came out ahead.

Broken down by context length, the best closed-source model was Claude 3.5 Sonnet for short-context RAG, Google’s Gemini-1.5-flash-001 for medium-context RAG (with cost serving as the tiebreaker among models that also achieved a perfect score), and Claude 3.5 Sonnet again for long-context RAG.

“In today’s rapidly evolving AI landscape, developers and enterprises face a critical challenge: how to harness the power of generative AI while balancing cost, accuracy, and reliability. Current benchmarks are often based on academic use-cases, rather than real-world applications. Our new Index seeks to address this by testing models in real-world use cases that require the LLMs to retrieve data, a common practice in enterprise AI implementations,” says Vikram Chatterji, CEO and co-founder of Galileo. “As hallucinations continue to be a major hurdle, our goal wasn’t to just rank models, but rather give AI teams and leaders the real-world data they need to adopt the right model, for the right task, at the right price.”

