32 models ranked by Arena Score (Bradley-Terry model), from pairwise comparisons in community votes. Updated Feb 27, 2026.
| # | Model | Organization | Arena Score | 95% CI | Votes | Win Rate | Trend | License |
|---|-------|--------------|-------------|--------|-------|----------|-------|---------|
| 1 | gpt-4o-2025-05 | OpenAI | 1312 | +8 / −7 | 284k | 71.4% | ↑2 | Proprietary |
| 2 | claude-3-7-sonnet | Anthropic | 1298 | +6 / −6 | 201k | 68.9% | ↑1 | Proprietary |
| 3 | gemini-2.0-ultra | Google DeepMind | 1287 | +9 / −8 | 178k | 66.2% | ↓1 | Proprietary |
| 4 | grok-3 | xAI | 1271 | +11 / −10 | 142k | 64.7% | ↑3 | Proprietary |
| 5 | llama-3.3-405b | Meta AI | 1253 | +14 / −13 | 98k | 61.3% | | Open Weights |
| 6 | mistral-large-2 | Mistral AI | 1238 | +10 / −9 | 86k | 59.1% | ↓2 | Open Weights |
| 7 | command-r-plus (Preliminary) | Cohere | 1214 | +18 / −16 | 41k | 55.6% | ↑1 | Proprietary |
| 8 | deepseek-v3 | DeepSeek | 1198 | +12 / −11 | 73k | 53.2% | ↑4 | Open Source |

Methodology

Bradley-Terry Model

Arena Scores are computed with the Bradley-Terry model, a probability model for pairwise comparisons fit by maximum likelihood. Every community vote contributes to the fit, and higher scores indicate responses that are consistently preferred in head-to-head battles.
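For intuition, the fit can be sketched in a few lines. The following is a minimal illustration, not the leaderboard's actual pipeline: it assumes a square matrix of head-to-head win counts, uses the classic minorization-maximization update for the Bradley-Terry likelihood, and applies an Elo-style rescaling chosen purely for display.

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=1000, tol=1e-8):
    """Fit Bradley-Terry strengths by maximum likelihood.

    wins[i, j] = number of votes where model i beat model j.
    Assumes every model has at least one win and one loss;
    otherwise the MLE diverges.
    """
    n = wins.shape[0]
    p = np.ones(n)  # initial strengths
    for _ in range(n_iter):
        p_new = np.empty(n)
        for i in range(n):
            # MM update: total wins of i divided by
            # sum over opponents of games_ij / (p_i + p_j)
            games = wins[i] + wins[:, i]
            denom = np.sum(np.delete(games / (p + p[i]), i))
            p_new[i] = wins[i].sum() / denom
        p_new /= p_new.mean()  # fix the arbitrary scale
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    # Elo-like display scale (illustrative choice, not the site's anchoring).
    return 1000 + 400 * np.log10(p)
```

Under the model, the probability that model i beats model j is p_i / (p_i + p_j), which is what makes the fitted scores directly comparable across every pair of models.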

95% Confidence Interval

The ± values are 95% confidence intervals computed from 1,000 bootstrap resamples of all recorded votes. A narrow interval means the score, and with it the rank, is stable under resampling. Models with few votes show wider intervals and are marked as Preliminary.
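The ± columns can be reproduced in the same spirit with a percentile bootstrap: resample the vote log with replacement, refit the model each time, and read off the 2.5th and 97.5th percentiles. This sketch reuses fit_bradley_terry from above and assumes votes are stored as (winner, loser) index pairs; both are assumptions about the data layout, not documented internals.

```python
import numpy as np

def bootstrap_ci(votes, n_models, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap 95% CI for each model's Arena Score.

    votes: (V, 2) int array of (winner_idx, loser_idx) pairs.
    Returns (median score, minus offset, plus offset) per model,
    matching the "+x / -y" display in the table above.
    """
    rng = np.random.default_rng(seed)
    votes = np.asarray(votes)
    scores = np.empty((n_boot, n_models))
    for b in range(n_boot):
        # Resample the vote log with replacement and rebuild win counts.
        sample = votes[rng.integers(0, len(votes), size=len(votes))]
        wins = np.zeros((n_models, n_models))
        np.add.at(wins, (sample[:, 0], sample[:, 1]), 1)
        scores[b] = fit_bradley_terry(wins)
    point = np.median(scores, axis=0)
    lo = np.percentile(scores, 100 * alpha / 2, axis=0)
    hi = np.percentile(scores, 100 * (1 - alpha / 2), axis=0)
    return point, point - lo, hi - point
```

A production version would also guard against resamples in which a model loses (or wins) every game, since the maximum likelihood estimate diverges in that case.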

Blind Evaluation

During community puzzle battles, model identities are hidden until after a vote is cast. This guards against brand bias, so rankings reflect genuine capability differences rather than name recognition.

Challenge the Leaderboard

Contribute a puzzle dataset and watch how the top LLMs perform. The best datasets get featured and directly influence rankings.
