Benchmarks
How we compare to frontier AI models
- **Win Rate:** 0%
- **Record:** 0-0-5 (W/L/T), from 5 comparisons across 1 query
- **Last Updated:** Dec 26, 2025
- **Coverage:** 1 of 100 queries evaluated
Head-to-Head Results
| Competitor | Win Rate | Record |
|---|---|---|
| Claude Sonnet 4.5 | 0% | 0W / 0L / 1T |
| Claude Opus 4.5 | 0% | 0W / 0L / 1T |
| GPT-5.2 Pro | 0% | 0W / 0L / 1T |
| Gemini 3 Pro | 0% | 0W / 0L / 1T |
| Grok 4.1 | 0% | 0W / 0L / 1T |
Performance by Category

No category breakdown is available yet; category results will fill in as more of the 100 queries are evaluated.
Methodology
We evaluate Carmenta against frontier models with an LLM-as-judge protocol in the style of Arena-Hard. Each query is sent to Carmenta and to each competitor model; an independent judge model then performs blind pairwise comparisons of the responses, without knowing which model produced which answer. Queries span everyday questions, research, coding, and other real-world tasks, and the outcomes are aggregated into the win rate and per-competitor records above.
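For illustration, the sketch below shows one way such a pairwise judging loop can be structured, in Python. Everything here is an assumption for the example's sake: the `get_response` and `judge` helpers are hypothetical stubs, and the random position swap is a common blinding technique, not a detail this page confirms about Carmenta's harness.

```python
import random
from dataclasses import dataclass

@dataclass
class Record:
    """Per-competitor tally of blind pairwise outcomes."""
    wins: int = 0
    losses: int = 0
    ties: int = 0

def get_response(model: str, query: str) -> str:
    """Hypothetical helper: send `query` to `model` and return its answer."""
    return f"{model}'s answer to: {query}"

def judge(query: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical judge call: prompt an independent judge model with both
    anonymized answers and return 'A', 'B', or 'tie'. Stubbed here."""
    return "tie"

def compare(query: str, ours: str, theirs: str) -> str:
    """One blind pairwise comparison. Randomly swapping answer order hides
    which side is which and averages out the judge's position bias."""
    a = get_response(ours, query)
    b = get_response(theirs, query)
    swapped = random.random() < 0.5
    verdict = judge(query, b, a) if swapped else judge(query, a, b)
    if verdict == "tie":
        return "tie"
    # Map the judge's positional verdict back to the actual models.
    we_won = (verdict == "B") if swapped else (verdict == "A")
    return "win" if we_won else "loss"

def run_benchmark(queries, competitors, ours="carmenta"):
    """Compare `ours` against every competitor on every query."""
    records = {c: Record() for c in competitors}
    for q in queries:
        for c in competitors:
            outcome = compare(q, ours, c)
            if outcome == "win":
                records[c].wins += 1
            elif outcome == "loss":
                records[c].losses += 1
            else:
                records[c].ties += 1
    return records
```

The position swap is the step that keeps the comparison blind in practice: LLM judges are known to favor whichever answer appears first, so randomizing A/B order per comparison prevents that bias from systematically helping either side.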