Apex V1.0
Metrics: Mean score
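The leaderboard reports a single mean score per model. As a rough illustration of how such an aggregate could be computed from per-task scores, here is a minimal Python sketch; the task names and values are placeholders, not results from this benchmark.

```python
# Minimal sketch: aggregating per-task scores into a single mean score.
# Task names and values are illustrative placeholders only, not actual
# Apex V1.0 results.
per_task_scores = {
    "task_a": 0.71,
    "task_b": 0.55,
    "task_c": 0.66,
}

def mean_score(scores: dict[str, float]) -> float:
    """Unweighted arithmetic mean over all task scores."""
    return sum(scores.values()) / len(scores)

print(f"Mean score: {mean_score(per_task_scores):.1%}")
```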
Results
Performance of various models on this benchmark, reported as mean score.
| Model | Mean score | Paper Title |
|---|---|---|
| GPT-5 (High) | 64.2% | - |
| Grok-4 | 61.3% | - |
| Gemini-2.5-Flash (On) | 60.4% | - |
| Gemini-2.5-Pro (On) | 60.1% | - |
| o3-Pro (High) | 60.0% | - |
| o3 (High) | 59.9% | - |
| Qwen-3-235B | 59.8% | - |
| Grok-3 | 59.3% | - |
| DeepSeek-R1 | 57.6% | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |
| GPT-OSS-120B (Medium) | 57.1% | - |
| o4-mini (High) | 56.3% | - |
| Opus-4.1 (On) | 55.3% | - |
| GLM-4.5 | 55.1% | - |
| Sonnet-4 (On) | 54.4% | - |
| Opus-4 (On) | 53.6% | - |
| Kimi-K2-Instruct | 51.1% | - |
| Llama-4-Maverick | 44.7% | - |
| Mistral-Medium-3 | 43.0% | - |
| Gemma-3-27B | 36.6% | - |
| Nova-Pro (CoT) | 36.3% | - |