Reasoning

Multi-step Logic Arena

Test LLMs on complex multi-step reasoning chains that demand consistent logical deduction across long contexts.

248 submissions
Coding

Code Synthesis Challenge

Evaluate code generation quality across diverse programming tasks contributed by engineers worldwide.

182 submissions
Knowledge

Domain Knowledge Probe

Expert-crafted questions across science, law, medicine, and finance to expose the precise limits of LLM knowledge.

317 submissions
Math

Olympiad Math Gauntlet

Competition-level mathematics problems sourced from IMO, AIME, and AMC to rigorously test quantitative reasoning.

94 submissions
Language

Cross-lingual Transfer Test

Multilingual tasks that measure how well LLMs transfer knowledge and reasoning abilities across diverse languages.

138 submissions
Multimodal

Vision-Language Benchmark

Paired image-text challenges that test how well vision-language models align visual understanding with precise language generation.

Coming Soon

How it works

Three steps to contribute datasets and benchmark the world's best LLMs

Step 01

Contribute a Dataset

Submit your carefully crafted puzzle or dataset through our community portal. Q&A, code, reasoning chains — all formats welcome.
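
As a rough illustration, a submission could be as simple as the JSON Lines sketch below. The field names (category, prompt, reference_answer) and the file layout are assumptions for illustration only, not the portal's actual submission schema.

```python
import json

# Hypothetical dataset entries. The field names ("category", "prompt",
# "reference_answer") are illustrative assumptions, not the portal's schema.
entries = [
    {
        "category": "reasoning",
        "prompt": "If all bloops are razzies and all razzies are lazzies, are all bloops lazzies?",
        "reference_answer": "Yes",
    },
    {
        "category": "code",
        "prompt": "Write a Python function that returns the n-th Fibonacci number.",
        "reference_answer": "def fib(n): return n if n < 2 else fib(n - 1) + fib(n - 2)",
    },
]

# Write the entries as JSON Lines, a common format for benchmark datasets.
with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```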

Step 02

Run the Benchmark

Your dataset is automatically evaluated across all supported LLMs. Results are scored, ranked, and published in real time.
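
A minimal sketch of the scoring-and-ranking idea, assuming a simple exact-match metric. The model names, outputs, and metric below are made-up assumptions for illustration, not the platform's actual evaluation pipeline.

```python
# Score each model's answers against the references, then rank by score.
references = ["Yes", "4", "Paris"]
model_outputs = {
    "model-a": ["Yes", "4", "Lyon"],
    "model-b": ["No", "4", "Paris"],
}

def exact_match_score(outputs, refs):
    # Fraction of answers matching the reference exactly (case-insensitive).
    hits = sum(o.strip().lower() == r.strip().lower() for o, r in zip(outputs, refs))
    return hits / len(refs)

scores = {name: exact_match_score(outs, references) for name, outs in model_outputs.items()}

# Sort models by score, highest first, to build a simple leaderboard.
leaderboard = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for rank, (name, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: {score:.2f}")
```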

Step 03

Explore the Leaderboard

Dive into model comparisons, share insights with the community, and watch the best datasets get featured on the front page.

Contribute a Puzzle

Have a challenging dataset or a creative evaluation idea? Join our contributors and help build the world's most community-driven interactive benchmark.

Submit Your Dataset →