Benchmarks
How we compare to frontier AI models
- **Win Rate:** 0%
- **Record:** 0-0-5 (W/L/T), from 5 comparisons across 1 query
- **Last Updated:** Dec 26, 2025
- **Coverage:** 1 of 100 queries evaluated
Head-to-Head Results
| Competitor | Win Rate | Record |
|---|---|---|
| Claude Sonnet 4.5 | 0% | 0W / 0L / 1T |
| Claude Opus 4.5 | 0% | 0W / 0L / 1T |
| GPT-5.2 Pro | 0% | 0W / 0L / 1T |
| Gemini 3 Pro | 0% | 0W / 0L / 1T |
| Grok 4.1 | 0% | 0W / 0L / 1T |
Performance by Category

No category breakdown is available yet; category results will fill in as more of the 100 queries are evaluated.
Methodology
We evaluate Carmenta against frontier models with an LLM-as-judge protocol in the style of Arena-Hard. Each query is sent to Carmenta and to each competitor model; an independent judge model then performs blind pairwise comparisons of the responses, without knowing which model produced which answer. Queries span everyday questions, research, coding, and other real-world tasks, and the outcomes are aggregated into the win rate and per-competitor records above.
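For illustration, the sketch below shows one way such a pairwise judging loop can be structured, in Python. Everything here is an assumption for the example's sake: the `get_response` and `judge` helpers are hypothetical stubs, and the random position swap is a common blinding technique, not a detail this page confirms about Carmenta's harness.

```python
import random
from dataclasses import dataclass

@dataclass
class Record:
    """Per-competitor tally of blind pairwise outcomes."""
    wins: int = 0
    losses: int = 0
    ties: int = 0

def get_response(model: str, query: str) -> str:
    """Hypothetical helper: send `query` to `model` and return its answer."""
    return f"{model}'s answer to: {query}"

def judge(query: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical judge call: prompt an independent judge model with both
    anonymized answers and return 'A', 'B', or 'tie'. Stubbed here."""
    return "tie"

def compare(query: str, ours: str, theirs: str) -> str:
    """One blind pairwise comparison. Randomly swapping answer order hides
    which side is which and averages out the judge's position bias."""
    a = get_response(ours, query)
    b = get_response(theirs, query)
    swapped = random.random() < 0.5
    verdict = judge(query, b, a) if swapped else judge(query, a, b)
    if verdict == "tie":
        return "tie"
    # Map the judge's positional verdict back to the actual models.
    we_won = (verdict == "B") if swapped else (verdict == "A")
    return "win" if we_won else "loss"

def run_benchmark(queries, competitors, ours="carmenta"):
    """Compare `ours` against every competitor on every query."""
    records = {c: Record() for c in competitors}
    for q in queries:
        for c in competitors:
            outcome = compare(q, ours, c)
            if outcome == "win":
                records[c].wins += 1
            elif outcome == "loss":
                records[c].losses += 1
            else:
                records[c].ties += 1
    return records
```

The position swap is the step that keeps the comparison blind in practice: LLM judges are known to favor whichever answer appears first, so randomizing A/B order per comparison prevents that bias from systematically helping either side.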