32 models ranked by Arena Score (Bradley-Terry model), from pairwise comparisons in community votes. Updated Feb 27, 2026.
| # | Model | Organization | Arena Score | 95% CI | Votes | Win Rate | Trend | License |
|---|-------|--------------|-------------|--------|-------|----------|-------|---------|
| 1 | gpt-4o-2025-05 | OpenAI | 1312 | +8 / −7 | 284k | 71.4% | ↑2 | Proprietary |
| 2 | claude-3-7-sonnet | Anthropic | 1298 | +6 / −6 | 201k | 68.9% | ↑1 | Proprietary |
| 3 | gemini-2.0-ultra | Google DeepMind | 1287 | +9 / −8 | 178k | 66.2% | ↓1 | Proprietary |
| 4 | grok-3 | xAI | 1271 | +11 / −10 | 142k | 64.7% | ↑3 | Proprietary |
| 5 | llama-3.3-405b | Meta AI | 1253 | +14 / −13 | 98k | 61.3% | | Open Weights |
| 6 | mistral-large-2 | Mistral AI | 1238 | +10 / −9 | 86k | 59.1% | ↓2 | Open Weights |
| 7 | command-r-plus (Preliminary) | Cohere | 1214 | +18 / −16 | 41k | 55.6% | ↑1 | Proprietary |
| 8 | deepseek-v3 | DeepSeek | 1198 | +12 / −11 | 73k | 53.2% | ↑4 | Open Source |

Methodology

Bradley-Terry Model

Arena Scores are computed with the Bradley-Terry model, a probability model for pairwise comparisons fit by maximum likelihood. Every community vote contributes to the fit, and higher scores indicate responses that are consistently preferred in head-to-head battles.
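For intuition, the fit can be sketched in a few lines. The following is a minimal illustration, not the leaderboard's actual pipeline: it assumes a square matrix of head-to-head win counts, uses the classic minorization-maximization update for the Bradley-Terry likelihood, and applies an Elo-style rescaling chosen purely for display.

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=1000, tol=1e-8):
    """Fit Bradley-Terry strengths by maximum likelihood.

    wins[i, j] = number of votes where model i beat model j.
    Assumes every model has at least one win and one loss;
    otherwise the MLE diverges.
    """
    n = wins.shape[0]
    p = np.ones(n)  # initial strengths
    for _ in range(n_iter):
        p_new = np.empty(n)
        for i in range(n):
            # MM update: total wins of i divided by
            # sum over opponents of games_ij / (p_i + p_j)
            games = wins[i] + wins[:, i]
            denom = np.sum(np.delete(games / (p + p[i]), i))
            p_new[i] = wins[i].sum() / denom
        p_new /= p_new.mean()  # fix the arbitrary scale
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    # Elo-like display scale (illustrative choice, not the site's anchoring).
    return 1000 + 400 * np.log10(p)
```

Under the model, the probability that model i beats model j is p_i / (p_i + p_j), which is what makes the fitted scores directly comparable across every pair of models.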

95% Confidence Interval

The ± values are 95% confidence intervals computed from 1,000 bootstrap resamples of all recorded votes. A narrow interval means the score, and with it the rank, is stable under resampling. Models with few votes show wider intervals and are marked as Preliminary.
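The ± columns can be reproduced in the same spirit with a percentile bootstrap: resample the vote log with replacement, refit the model each time, and read off the 2.5th and 97.5th percentiles. This sketch reuses fit_bradley_terry from above and assumes votes are stored as (winner, loser) index pairs; both are assumptions about the data layout, not documented internals.

```python
import numpy as np

def bootstrap_ci(votes, n_models, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap 95% CI for each model's Arena Score.

    votes: (V, 2) int array of (winner_idx, loser_idx) pairs.
    Returns (median score, minus offset, plus offset) per model,
    matching the "+x / -y" display in the table above.
    """
    rng = np.random.default_rng(seed)
    votes = np.asarray(votes)
    scores = np.empty((n_boot, n_models))
    for b in range(n_boot):
        # Resample the vote log with replacement and rebuild win counts.
        sample = votes[rng.integers(0, len(votes), size=len(votes))]
        wins = np.zeros((n_models, n_models))
        np.add.at(wins, (sample[:, 0], sample[:, 1]), 1)
        scores[b] = fit_bradley_terry(wins)
    point = np.median(scores, axis=0)
    lo = np.percentile(scores, 100 * alpha / 2, axis=0)
    hi = np.percentile(scores, 100 * (1 - alpha / 2), axis=0)
    return point, point - lo, hi - point
```

A production version would also guard against resamples in which a model loses (or wins) every game, since the maximum likelihood estimate diverges in that case.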

Blind Evaluation

During community puzzle battles, model identities are hidden until after a vote is cast. This guards against brand bias, so rankings reflect genuine capability differences rather than name recognition.

Challenge the Leaderboard

Contribute a puzzle dataset and watch how the top LLMs perform. The best datasets get featured and directly influence rankings.
