Apex V1.0
Metrics: Mean score
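The leaderboard reports a single mean score per model. As a rough illustration of how such an aggregate could be computed from per-task scores, here is a minimal Python sketch; the task names and values are placeholders, not results from this benchmark.

```python
# Minimal sketch: aggregating per-task scores into a single mean score.
# Task names and values are illustrative placeholders only, not actual
# Apex V1.0 results.
per_task_scores = {
    "task_a": 0.71,
    "task_b": 0.55,
    "task_c": 0.66,
}

def mean_score(scores: dict[str, float]) -> float:
    """Unweighted arithmetic mean over all task scores."""
    return sum(scores.values()) / len(scores)

print(f"Mean score: {mean_score(per_task_scores):.1%}")
```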
Results
Performance of various models on this benchmark, reported as mean score.
| Model | Mean score | Paper Title |
|---|---|---|
| GPT-5 (High) | 64.2% | - |
| Grok-4 | 61.3% | - |
| Gemini-2.5-Flash (On) | 60.4% | - |
| Gemini-2.5-Pro (On) | 60.1% | - |
| o3-Pro (High) | 60.0% | - |
| o3 (High) | 59.9% | - |
| Qwen-3-235B | 59.8% | - |
| Grok-3 | 59.3% | - |
| DeepSeek-R1 | 57.6% | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |
| GPT-OSS-120B (Medium) | 57.1% | - |
| o4-mini (High) | 56.3% | - |
| Opus-4.1 (On) | 55.3% | - |
| GLM-4.5 | 55.1% | - |
| Sonnet-4 (On) | 54.4% | - |
| Opus-4 (On) | 53.6% | - |
| Kimi-K2-Instruct | 51.1% | - |
| Llama-4-Maverick | 44.7% | - |
| Mistral-Medium-3 | 43.0% | - |
| Gemma-3-27B | 36.6% | - |
| Nova-Pro (CoT) | 36.3% | - |