Anthropic Claude 3.5 Sonnet vs GPT-4o vs Gemini 1.5 Pro: Benchmark

Anthropic has released Claude 3.5 Sonnet, its fastest large language model (LLM) yet, to compete against OpenAI's GPT-4o and Google's Gemini 1.5 Pro.

This is the first model in the Claude 3.5 family, and two more models will follow later this year. The benchmark comparison is detailed below.

Graduate Level Reasoning (GPQA, Diamond)

  • Claude 3.5 Sonnet scored 59.4% (0-shot CoT)
  • Claude 3 Opus scored 50.4% (0-shot CoT)
  • GPT-4o scored 53.6% (0-shot CoT)
  • Gemini 1.5 Pro scored 53.6% (0-shot CoT)
  • Llama-400b scored 53.6% (0-shot CoT)

Undergraduate Level Knowledge (MMLU)

  • Claude 3.5 Sonnet scored 88.7% (5-shot), 88.3% (0-shot CoT)
  • Claude 3 Opus scored 86.8% (5-shot CoT), 85.7% (0-shot CoT)
  • GPT-4o scored 86.8% (5-shot CoT), 87.7% (0-shot CoT)
  • Gemini 1.5 Pro scored 85.9% (5-shot CoT), 88.7% (0-shot CoT)
  • Llama-400b scored 86.1% (5-shot CoT), 88.7% (0-shot CoT)

Code (HumanEval)

  • Claude 3.5 Sonnet achieved 92.0% (0-shot CoT)
  • Claude 3 Opus achieved 84.9% (0-shot CoT)
  • GPT-4o achieved 90.2% (0-shot CoT)
  • Gemini 1.5 Pro achieved 84.1% (0-shot CoT)
  • Llama-400b achieved 84.1% (0-shot CoT)

Multilingual Math – MGSM

  • Claude 3.5 Sonnet scored 91.6% (0-shot CoT)
  • Claude 3 Opus scored 90.7% (0-shot CoT)
  • GPT-4o scored 90.5% (0-shot CoT)
  • Gemini 1.5 Pro scored 87.5% (8-shot)
  • Llama-400b scored 87.5% (8-shot)

Reasoning over text (DROP, F1 Score)

  • Claude 3.5 Sonnet scored 87.1% (3-shot)
  • Claude 3 Opus scored 83.1% (3-shot)
  • GPT-4o scored 83.4% (3-shot)
  • Gemini 1.5 Pro scored 74.9% (variable shots)
  • Llama-400b scored 83.5% (3-shot, Pre-trained models)

Mixed evaluations (BIG-Bench-Hard)

  • Claude 3.5 Sonnet scored 93.1% (3-shot CoT)
  • Claude 3 Opus scored 86.8% (3-shot CoT)
  • GPT-4o scored 86.8% (3-shot CoT)
  • Gemini 1.5 Pro scored 89.2% (3-shot CoT)
  • Llama-400b scored 85.3% (3-shot, Pre-trained models)

Math problem-solving (MATH)

  • Claude 3.5 Sonnet scored 71.1% (0-shot CoT)
  • Claude 3 Opus scored 60.1% (0-shot CoT)
  • GPT-4o scored 76.6% (0-shot CoT)
  • Gemini 1.5 Pro scored 67.7% (4-shot CoT)
  • Llama-400b scored 57.8% (4-shot CoT)

Grade school math (GSM8K)

  • Claude 3.5 Sonnet scored 96.4% (0-shot CoT)
  • Claude 3 Opus scored 95.0% (0-shot CoT)
  • GPT-4o scored 95.0% (0-shot CoT)
  • Gemini 1.5 Pro scored 98.8% (11-shot CoT)
  • Llama-400b scored 94.1% (8-shot CoT)

Across these benchmark tests reported by Anthropic, the new model leads its competitors on most evaluations.
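To see at a glance where each model leads, the figures above can be tabulated. Below is a minimal Python sketch using only the scores reported in this article (for MMLU, the 5-shot figures are used; the "MATH" key labels the math problem-solving results per Anthropic's benchmark table):

```python
# Benchmark scores (%) as reported in this article.
scores = {
    "GPQA (Diamond)": {"Claude 3.5 Sonnet": 59.4, "Claude 3 Opus": 50.4,
                       "GPT-4o": 53.6, "Gemini 1.5 Pro": 53.6, "Llama-400b": 53.6},
    "MMLU":           {"Claude 3.5 Sonnet": 88.7, "Claude 3 Opus": 86.8,
                       "GPT-4o": 86.8, "Gemini 1.5 Pro": 85.9, "Llama-400b": 86.1},
    "HumanEval":      {"Claude 3.5 Sonnet": 92.0, "Claude 3 Opus": 84.9,
                       "GPT-4o": 90.2, "Gemini 1.5 Pro": 84.1, "Llama-400b": 84.1},
    "MGSM":           {"Claude 3.5 Sonnet": 91.6, "Claude 3 Opus": 90.7,
                       "GPT-4o": 90.5, "Gemini 1.5 Pro": 87.5, "Llama-400b": 87.5},
    "DROP":           {"Claude 3.5 Sonnet": 87.1, "Claude 3 Opus": 83.1,
                       "GPT-4o": 83.4, "Gemini 1.5 Pro": 74.9, "Llama-400b": 83.5},
    "BIG-Bench-Hard": {"Claude 3.5 Sonnet": 93.1, "Claude 3 Opus": 86.8,
                       "GPT-4o": 86.8, "Gemini 1.5 Pro": 89.2, "Llama-400b": 85.3},
    "MATH":           {"Claude 3.5 Sonnet": 71.1, "Claude 3 Opus": 60.1,
                       "GPT-4o": 76.6, "Gemini 1.5 Pro": 67.7, "Llama-400b": 57.8},
    "GSM8K":          {"Claude 3.5 Sonnet": 96.4, "Claude 3 Opus": 95.0,
                       "GPT-4o": 95.0, "Gemini 1.5 Pro": 98.8, "Llama-400b": 94.1},
}

# Highest-scoring model on each benchmark (first model wins ties).
leaders = {bench: max(models, key=models.get) for bench, models in scores.items()}
for bench, model in leaders.items():
    print(f"{bench:15} -> {model} ({scores[bench][model]}%)")
```

On these numbers, Claude 3.5 Sonnet leads six of the eight benchmarks, with GPT-4o ahead on MATH and Gemini 1.5 Pro ahead on GSM8K.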

Claude 3.5 Sonnet is available for free on the Claude.ai web platform, and the Claude iOS app offers easy access on mobile. Claude Pro and Team plan subscribers get significantly higher rate limits.
