# Arena Rankings
Community-voted rankings of the world's leading LLMs across reasoning, coding, knowledge, and more. Scores are computed using the Bradley-Terry model from pairwise comparisons.
| # | Model | Provider · Model ID | Arena Score |
|---|---|---|---|
| 1 | GPT-4o (2025-05) | OpenAI · gpt-4o-2025-05 | – |
| 2 | Claude 3.7 Sonnet | Anthropic · claude-3-7-sonnet | – |
| 3 | Gemini 2.0 Ultra | Google DeepMind · gemini-2.0-ultra | – |
| 4 | Grok 3 | xAI · grok-3 | – |
| 5 | Llama 3.3 405B | Meta AI · llama-3.3-405b | – |
| 6 | Mistral Large 2 | Mistral AI · mistral-large-2 | – |
| 7 | Command R+ (Preliminary) | Cohere · command-r-plus | – |
| 8 | DeepSeek V3 | DeepSeek · deepseek-v3 | – |
## Methodology
Arena Scores are computed by fitting the Bradley-Terry model, a maximum-likelihood estimator for pairwise comparisons, to all recorded votes. Every community vote contributes to each model's estimated strength, and higher scores indicate responses that voters consistently prefer.
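For illustration, here is a minimal sketch of how Bradley-Terry strengths can be fitted from raw votes, assuming votes arrive as (winner, loser) pairs. The function name, the fictitious-game regularization, and the Elo-like 1000 + 400·log10 display scale are illustrative assumptions, not the site's exact pipeline.

```python
import math
from collections import defaultdict

def bradley_terry(votes, iters=500, tol=1e-9):
    """Fit Bradley-Terry strengths from (winner, loser) vote pairs using
    the classic MM (minorization-maximization) update."""
    wins = defaultdict(float)      # total wins per model
    pairs = defaultdict(float)     # comparison counts per unordered pair
    models = set()
    for winner, loser in votes:
        wins[winner] += 1.0
        pairs[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    p = {m: 1.0 for m in models}   # initial strength parameters
    for _ in range(iters):
        new_p = {}
        for i in models:
            # One fictitious win and loss against a strength-1 reference
            # player keeps zero-win models finite (a common regularization).
            num = wins[i] + 1.0
            den = 2.0 / (p[i] + 1.0)
            for j in models:
                key = frozenset((i, j))
                if i != j and key in pairs:
                    den += pairs[key] / (p[i] + p[j])
            new_p[i] = num / den
        # Strengths are identified only up to scale: fix the geometric mean at 1.
        log_mean = sum(math.log(v) for v in new_p.values()) / len(new_p)
        new_p = {m: v / math.exp(log_mean) for m, v in new_p.items()}
        converged = max(abs(new_p[m] - p[m]) for m in models) < tol
        p = new_p
        if converged:
            break
    # Map to a readable Elo-like scale (an assumed display convention).
    return {m: 1000.0 + 400.0 * math.log10(v) for m, v in p.items()}

# Example: three hypothetical votes between two models.
votes = [("gpt-4o-2025-05", "grok-3"), ("gpt-4o-2025-05", "grok-3"),
         ("grok-3", "gpt-4o-2025-05")]
print(bradley_terry(votes))
```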
## Confidence
The ± values are 95% confidence intervals computed from 1,000 bootstrap resamples of all recorded votes. A narrow interval means the ranking is stable; models with few votes have wider intervals and are marked Preliminary.
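A sketch of that percentile bootstrap, reusing the hypothetical `bradley_terry` function from the previous example; the 1,000 resamples and 95% level come from the description above, while everything else is an assumption.

```python
import random
from collections import defaultdict

def bootstrap_ci(votes, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI per model: resample the vote list with
    replacement, refit Bradley-Terry, and take the alpha/2 tails."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n_boot):
        resample = [rng.choice(votes) for _ in range(len(votes))]
        for model, score in bradley_terry(resample).items():
            samples[model].append(score)
    intervals = {}
    for model, scores in samples.items():
        scores.sort()
        lo = scores[int(len(scores) * alpha / 2)]
        hi = scores[min(len(scores) - 1, int(len(scores) * (1 - alpha / 2)))]
        intervals[model] = (lo, hi)  # a wide interval -> marked Preliminary
    return intervals
```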
## Fairness
During community puzzle battles, model identities are hidden until after a vote is cast. This prevents bias and ensures rankings reflect genuine capability differences rather than brand familiarity.
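A minimal sketch of what such a blind battle could look like; the class, function names, and data model are hypothetical, but the flow matches the description: identities stay hidden until the vote is recorded.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class BlindBattle:
    """The voter sees only 'Model A' / 'Model B'; true identities stay
    server-side until the vote is cast."""
    model_a: str
    model_b: str
    vote: Optional[str] = None  # "A" or "B" once cast

    def cast_vote(self, choice: str):
        """Record the vote, then reveal identities as (winner, loser)."""
        if choice not in ("A", "B"):
            raise ValueError("vote must be 'A' or 'B'")
        self.vote = choice
        if choice == "A":
            return self.model_a, self.model_b
        return self.model_b, self.model_a

def new_battle(models, rng=random):
    # Randomize both the pairing and the A/B ordering so position and
    # brand cues cannot influence the vote.
    a, b = rng.sample(models, 2)
    return BlindBattle(a, b)
```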