ProfBench: Over 7,000 brand-new expert-authored response–criterion pairs spanning 80 professional tasks in PhD STEM (Chemistry, Physics) and MBA Services (Finance, Consulting) domains.
ProfBench is a high-quality, text-only dataset that represents the complex reasoning tasks professionals face in fields like finance and chemistry. We're not talking about simple Q&A or retrieval-based tasks. We're talking about multi-page assignments that require deep domain knowledge and reasoning. Can AI generate comprehensive reports by applying the nuanced reasoning that a PhD-level physicist or chemist, or an MBA-level consultant or financier, would bring?
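Each report is judged against expert-written rubric criteria rather than a single gold answer. A minimal sketch of the idea (the weighting scheme and data layout here are illustrative assumptions, not the actual ProfBench schema — see the Data and Code links for the real format):

```python
# Illustrative sketch: rubric-based report scoring.
# Each criterion carries a weight; the report score is the weighted
# percentage of criteria the judge marks as fulfilled.
def report_score(judgements):
    """judgements: list of (fulfilled: bool, weight: float) pairs."""
    total = sum(weight for _, weight in judgements)
    earned = sum(weight for fulfilled, weight in judgements if fulfilled)
    return 100.0 * earned / total

# Three hypothetical criteria, one unfulfilled:
print(report_score([(True, 3.0), (False, 1.0), (True, 2.0)]))
```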
Blog | Paper | Data | Code | Nemo Evaluator SDK
Want to see your favorite models added? Run them with the Nemo Evaluator SDK for scalable evaluation or the ProfBench code for quick evaluation and send us the scores, or ping zhilinw/viviennez [at] nvidia.com and we'll run them for you!
Report Generation Leaderboard: LLMs generate reports from the prompt alone; the reports are then evaluated by the gpt-oss-120b (mixed) judge on the lite dataset (160 samples). Evaluation and cost estimation last performed on 12 Nov 2025.
10 | DeepSeek-AI/DeepSeek-V3.2-Exp (Thinking) | Open-weight Reasoning | 52.4 | 38.6 | 57.2 | 44.1 | 69.8 | 45.9 | 54.1 | 59.2 | 18559 | 1126 | 16123 | 0.30 |
LLM Judge Leaderboard: LLM judges are evaluated on whether they can accurately predict the human-labelled criterion fulfilment across 3 different models (o3, Grok4, R1-0528). We consider not only macro-F1 across 3486 samples but also whether an LLM judge displays bias toward or against any model, captured by a Bias Index. The Overall score is calculated as Overall F1 - Bias Index. Evaluation and cost estimation last performed on 20 Sep 2025.
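The Overall-score computation can be sketched as follows, assuming the Bias Index is the spread (max - min) of the per-model bias values — a definition consistent with the rows in the table below (e.g. gpt-oss-120b (mixed): biases -0.5/-0.9/-1.0 give a Bias Index of 0.5, and 78.7 - 0.5 = 78.2):

```python
# Sketch of the judge-leaderboard scoring, assuming Bias Index = spread
# of the per-model bias values (an assumption that matches the table).
def bias_index(per_model_bias):
    """Spread between the most- and least-favoured judged model."""
    return max(per_model_bias) - min(per_model_bias)

def overall_score(overall_f1, per_model_bias):
    return overall_f1 - bias_index(per_model_bias)

# gpt-oss-120b (mixed): Overall F1 = 78.7, biases vs o3/Grok4/R1-0528
print(round(overall_score(78.7, [-0.5, -0.9, -1.0]), 1))  # 78.2
```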
1 | OpenAI/gpt-oss-120b (mixed) | Open-weight Reasoning | 78.2 | 89.5 | 68.9 | 72.2 | 79.7 | 79.7 | 76.9 | 80.8 | 78.7 | -0.5 | -0.9 | -1.0 | 0.5 | 1683 | 282 | 0.70 |
2 | Google/Gemini-2.5-Pro | Closed-source Reasoning | 78.2 | 87.3 | 70.2 | 71.9 | 82.6 | 81.3 | 77.4 | 76.8 | 79.2 | 3.1 | 2.8 | 2.1 | 1.0 | 1779 | 967 | 41.46 |
3 | Google/Gemini-2.5-Flash (Thinking) | Closed-source Reasoning | 78.1 | 87.0 | 68.7 | 71.6 | 81.2 | 80.1 | 76.7 | 74.6 | 78.4 | 2.3 | 2.5 | 2.2 | 0.3 | 1779 | 695 | 7.92 |
4 | OpenAI/o4-mini (low) | Closed-source Reasoning | 76.8 | 88.6 | 70.1 | 70.1 | 81.0 | 78.8 | 76.8 | 74.1 | 78.5 | 3.4 | 3.3 | 1.7 | 1.7 | 1618 | 104 | 7.80 |
5 | OpenAI/GPT-5 (med) | Closed-source Reasoning | 76.7 | 89.2 | 67.9 | 69.0 | 80.9 | 78.1 | 76.3 | 77.3 | 77.9 | 0.0 | -0.9 | -1.2 | 1.2 | 1619 | 287 | 17.06 |
6 | OpenAI/gpt-oss-120b (low) | Open-weight Reasoning | 76.7 | 86.0 | 67.2 | 72.1 | 79.0 | 79.2 | 75.7 | 72.4 | 77.3 | -1.0 | -1.6 | -1.5 | 0.6 | 1683 | 84 | 0.50 |
7 | DeepSeek-AI/DeepSeek-V3.1 (Thinking) | Open-weight Reasoning | 76.6 | 84.3 | 69.3 | 70.8 | 80.3 | 78.9 | 75.6 | 72.0 | 77.3 | 3.2 | 3.3 | 2.6 | 0.7 | 1587 | 657 | 2.94 |
8 | Qwen/Qwen3-235B-A22B-Thinking-2507 | Open-weight Reasoning | 76.5 | 87.2 | 67.9 | 69.0 | 80.4 | 79.3 | 75.6 | 74.3 | 77.3 | -1.0 | -1.8 | -1.5 | 0.8 | 1782 | 1245 | 1.84 |
9 | OpenAI/o3 (high) | Closed-source Reasoning | 76.4 | 88.3 | 68.2 | 69.3 | 81.1 | 79.1 | 76.1 | 75.3 | 77.9 | 2.0 | 0.5 | 0.8 | 1.5 | 1618 | 350 | 21.04 |
10 | OpenAI/o3 (low) | Closed-source Reasoning | 76.4 | 88.9 | 69.3 | 70.3 | 81.9 | 79.7 | 76.8 | 76.7 | 78.7 | 3.8 | 1.5 | 2.6 | 2.3 | 1618 | 98 | 14.01 |
11 | OpenAI/GPT-5 (low) | Closed-source Reasoning | 76.3 | 88.6 | 69.3 | 69.0 | 80.9 | 78.1 | 76.6 | 79.4 | 78.1 | 0.3 | -1.5 | -1.4 | 1.8 | 1618 | 130 | 11.58 |
12 | OpenAI/o3 (med) | Closed-source Reasoning | 76.0 | 89.3 | 69.1 | 68.9 | 81.0 | 79.3 | 76.4 | 76.9 | 78.2 | 3.0 | 0.8 | 1.5 | 2.2 | 1618 | 207 | 17.05 |
13 | OpenAI/GPT-5 (high) | Closed-source Reasoning | 76.0 | 90.2 | 68.2 | 69.4 | 80.9 | 78.3 | 76.7 | 79.1 | 78.3 | 1.0 | -0.8 | -1.3 | 2.3 | 1618 | 668 | 30.34 |
14 | xAI/grok-4 | Closed-source Reasoning | 75.9 | 86.1 | 68.5 | 70.7 | 80.8 | 78.5 | 76.3 | 75.2 | 77.7 | 0.7 | 2.5 | 1.8 | 1.8 | 1549 | 812 | 58.70 |
15 | OpenAI/gpt-oss-120b (med) | Open-weight Reasoning | 75.8 | 88.1 | 67.4 | 70.5 | 79.9 | 79.6 | 76.0 | 75.3 | 77.7 | 0.6 | -1.3 | -0.9 | 1.9 | 1683 | 196 | 0.63 |
16 | OpenAI/o4-mini (med) | Closed-source Reasoning | 75.8 | 88.1 | 69.6 | 70.8 | 81.6 | 78.9 | 76.8 | 74.1 | 78.6 | 4.0 | 2.8 | 1.2 | 2.8 | 1618 | 228 | 9.70 |
17 | OpenAI/o4-mini (high) | Closed-source Reasoning | 75.8 | 88.5 | 68.9 | 70.5 | 81.5 | 78.7 | 76.8 | 76.5 | 78.4 | 4.5 | 2.7 | 1.9 | 2.6 | 1618 | 308 | 10.93 |
18 | OpenAI/gpt-oss-20b (low) | Open-weight Reasoning | 75.6 | 85.4 | 69.3 | 70.8 | 79.2 | 77.6 | 76.3 | 71.1 | 77.5 | 0.4 | -0.3 | 1.6 | 1.9 | 1677 | 85 | 0.28 |
19 | OpenAI/gpt-oss-120b (high) | Open-weight Reasoning | 75.4 | 89.5 | 68.9 | 69.7 | 80.8 | 78.9 | 76.7 | 80.8 | 78.4 | 1.6 | -1.4 | 0.3 | 3.0 | 1683 | 439 | 0.88 |
20 | OpenAI/GPT-4.1 | Closed-source Instruct | 75.4 | 80.9 | 69.2 | 71.0 | 80.0 | 79.8 | 74.4 | 65.8 | 76.3 | 5.5 | 4.6 | 5.0 | 0.9 | 1619 | 1 | 11.31 |
21 | OpenAI/GPT-5-mini (high) | Closed-source Reasoning | 75.3 | 84.5 | 69.2 | 70.4 | 82.8 | 78.4 | 75.9 | 74.1 | 77.7 | 6.6 | 4.2 | 4.6 | 2.4 | 1619 | 497 | 4.88 |
22 | MoonshotAI/Kimi-K2-Instruct-0711 | Open-weight Instruct | 75.2 | 85.3 | 69.5 | 68.3 | 82.3 | 80.3 | 76.1 | 66.4 | 77.6 | 7.1 | 6.1 | 4.7 | 2.4 | 1636 | 1 | 0.81 |
23 | Qwen/Qwen3-235B-A22B-Instruct-2507 | Open-weight Instruct | 75.1 | 86.5 | 69.3 | 69.3 | 79.6 | 79.2 | 76.0 | 64.6 | 77.3 | 3.8 | 2.2 | 1.6 | 2.2 | 1779 | 1 | 0.48 |
24 | xAI/grok-3-mini | Closed-source Reasoning | 75.1 | 85.8 | 66.9 | 69.4 | 82.0 | 78.1 | 75.3 | 75.2 | 77.2 | 4.5 | 2.4 | 2.9 | 2.1 | 1549 | 633 | 2.72 |
25 | OpenAI/GPT-4.1-mini | Closed-source Instruct | 74.9 | 83.9 | 67.3 | 69.1 | 80.6 | 79.2 | 74.7 | 69.8 | 76.4 | -0.2 | 1.2 | -0.3 | 1.5 | 1619 | 1 | 2.26 |
26 | OpenAI/gpt-oss-20b (med) | Open-weight Reasoning | 74.8 | 87.7 | 68.3 | 69.7 | 80.9 | 78.5 | 76.3 | 76.2 | 77.8 | 3.6 | 1.1 | 0.6 | 3.0 | 1683 | 216 | 0.35 |
27 | MoonshotAI/Kimi-K2-Instruct-0905 | Open-weight Instruct | 74.7 | 84.5 | 69.9 | 67.5 | 81.9 | 80.2 | 75.5 | 65.9 | 77.0 | 7.5 | 6.1 | 5.2 | 2.3 | 1623 | 1 | 0.81 |
28 | OpenAI/GPT-5-mini (low) | Closed-source Reasoning | 74.7 | 82.9 | 68.5 | 70.3 | 81.7 | 77.4 | 74.6 | 78.0 | 76.8 | 5.9 | 3.8 | 4.6 | 2.1 | 1618 | 92 | 2.05 |
29 | Google/Gemini-2.5-Flash-Lite (Thinking) | Closed-source Reasoning | 74.7 | 83.7 | 67.0 | 72.2 | 81.9 | 78.7 | 75.9 | 79.1 | 77.5 | -1.1 | 0.2 | -2.6 | 2.8 | 1779 | 1670 | 2.95 |
30 | OpenAI/GPT-5-mini (med) | Closed-source Reasoning | 74.4 | 83.3 | 68.2 | 69.9 | 81.5 | 78.1 | 74.6 | 72.8 | 76.7 | 6.3 | 4.0 | 4.3 | 2.3 | 1618 | 228 | 3.00 |
31 | OpenAI/gpt-oss-20b (high) | Open-weight Reasoning | 74.4 | 89.3 | 68.7 | 68.5 | 80.7 | 77.8 | 76.5 | 77.7 | 77.9 | 3.3 | -0.2 | 0.9 | 3.5 | 1679 | 465 | 0.46 |
32 | meta/llama-3.3-70b-instruct | Open-weight Instruct | 74.1 | 84.6 | 66.5 | 71.6 | 79.1 | 78.1 | 75.4 | 64.6 | 76.7 | -3.1 | -0.8 | -3.4 | 2.6 | 1628 | 1 | 0.22 |
33 | OpenAI/GPT-5-nano (low) | Closed-source Reasoning | 73.6 | 83.5 | 67.6 | 68.6 | 77.7 | 76.9 | 73.5 | 70.9 | 75.4 | 2.4 | 0.6 | 1.9 | 1.8 | 1619 | 141 | 0.48 |
34 | Google/Gemini-2.5-Flash | Closed-source Instruct | 73.4 | 82.9 | 67.3 | 70.8 | 79.6 | 79.2 | 74.5 | 67.7 | 76.3 | -4.2 | -6.6 | -7.1 | 2.9 | 1779 | 1 | 1.87 |
35 | Google/Gemini-2.5-Flash-Lite | Closed-source Instruct | 73.3 | 83.6 | 68.2 | 68.2 | 80.6 | 77.9 | 75.0 | 71.0 | 76.4 | -1.1 | 2.0 | 0.6 | 3.1 | 1779 | 1 | 0.62 |
36 | Qwen/Qwen3-30B-A3B-Instruct-2507 | Open-weight Instruct | 73.1 | 82.0 | 68.3 | 67.3 | 79.7 | 76.5 | 74.5 | 64.7 | 75.5 | 4.7 | 7.1 | 5.3 | 2.4 | 1778 | 1 | 0.32 |
37 | DeepSeek-AI/DeepSeek-V3.1 | Open-weight Instruct | 72.8 | 79.6 | 68.2 | 68.3 | 78.7 | 77.4 | 73.9 | 65.8 | 75.2 | 0.2 | -1.5 | -2.2 | 2.4 | 1586 | 1 | 1.11 |
38 | OpenAI/GPT-5-nano (med) | Closed-source Reasoning | 72.7 | 85.6 | 67.0 | 68.7 | 79.7 | 77.1 | 74.3 | 78.3 | 76.4 | 3.4 | -0.3 | 1.7 | 3.7 | 1618 | 479 | 0.95 |
39 | DeepSeek-AI/DeepSeek-V3-0324 | Open-weight Instruct | 72.6 | 84.5 | 68.0 | 67.0 | 78.3 | 77.7 | 74.6 | 63.5 | 75.7 | 1.5 | 2.4 | -0.7 | 3.1 | 1585 | 1 | 1.11 |
40 | anthropic/claude-3.5-haiku | Closed-source Instruct | 72.5 | 78.9 | 67.2 | 71.2 | 76.7 | 76.9 | 73.3 | 65.4 | 74.9 | -1.7 | 0.7 | -1.4 | 2.4 | 1913 | 1 | 5.35 |
41 | OpenAI/GPT-5 (minimal) | Closed-source Reasoning | 71.9 | 86.8 | 68.6 | 71.2 | 77.5 | 78.9 | 75.2 | 64.8 | 77.0 | -0.5 | -5.6 | -5.0 | 5.1 | 1618 | 7 | 7.29 |
42 | OpenAI/GPT-5-nano (high) | Closed-source Reasoning | 71.9 | 86.8 | 67.6 | 68.7 | 79.8 | 77.6 | 75.1 | 74.0 | 76.9 | 5.3 | 0.3 | 3.1 | 5.0 | 1618 | 1309 | 2.11 |
43 | meta/llama-3.1-405b-instruct | Open-weight Instruct | 71.6 | 85.1 | 69.1 | 67.6 | 81.7 | 77.7 | 75.5 | 65.5 | 77.0 | 11.5 | 6.1 | 9.4 | 5.4 | 1628 | 1 | 4.54 |
44 | Anthropic/claude-sonnet-4-20250514 | Closed-source Reasoning | 70.9 | 75.7 | 66.3 | 69.9 | 77.8 | 77.5 | 72.3 | 66.0 | 74.0 | -11.2 | -8.1 | -10.7 | 3.1 | 1940 | 810 | 62.64 |
45 | meta/llama-3.1-70b-instruct | Open-weight Instruct | 70.7 | 82.1 | 66.7 | 72.6 | 76.0 | 77.5 | 73.9 | 64.7 | 75.4 | -6.2 | -1.5 | -4.1 | 4.7 | 1628 | 1 | 0.22 |
46 | Anthropic/claude-sonnet-4 | Closed-source Instruct | 70.2 | 85.0 | 66.9 | 68.1 | 76.3 | 77.6 | 73.3 | 64.1 | 75.2 | -6.5 | -5.2 | -10.2 | 5.0 | 1913 | 1 | 20.06 |
47 | DeepSeek-AI/DeepSeek-R1-0528 | Open-weight Reasoning | 69.4 | 79.6 | 65.1 | 68.5 | 71.6 | 74.7 | 70.9 | 64.1 | 72.2 | -11.6 | -9.3 | -8.8 | 2.8 | 1601 | 693 | 3.05 |
48 | nvidia/llama-3.3-nemotron-super-49b-v1 | Open-weight Instruct | 68.8 | 77.2 | 65.1 | 70.2 | 72.1 | 74.1 | 70.7 | 64.1 | 72.3 | -15.7 | -12.2 | -13.0 | 3.5 | 1637 | 1 | 0.74 |
49 | meta/llama-4-maverick-17b-128e-instruct | Open-weight Instruct | 67.9 | 64.9 | 66.7 | 73.4 | 76.4 | 76.5 | 70.4 | 67.9 | 72.4 | -14.3 | -10.5 | -9.8 | 4.5 | 1566 | 1 | 0.82 |
50 | nvidia/llama-3.1-nemotron-ultra-253b-v1 | Open-weight Instruct | 67.4 | 84.8 | 63.6 | 66.6 | 61.8 | 72.6 | 67.8 | 57.8 | 69.6 | -10.0 | -11.4 | -9.2 | 2.2 | 1637 | 1 | 3.43 |
51 | OpenAI/GPT-5-mini (minimal) | Closed-source Reasoning | 66.7 | 81.7 | 64.0 | 69.1 | 76.0 | 75.9 | 72.5 | 58.8 | 73.8 | -4.0 | -6.2 | -11.1 | 7.1 | 1618 | 7 | 1.46 |
52 | meta/llama-4-scout-17b-16e-instruct | Open-weight Instruct | 65.9 | 60.4 | 69.4 | 71.3 | 75.6 | 76.2 | 69.9 | 62.0 | 71.8 | -14.5 | -10.2 | -8.6 | 5.9 | 1565 | 1 | 0.44 |
53 | meta/llama-3.1-8b-instruct | Open-weight Instruct | 63.1 | 76.2 | 69.3 | 70.2 | 71.0 | 76.6 | 71.5 | 61.7 | 73.2 | -4.0 | 6.1 | -1.5 | 10.1 | 1628 | 1 | 0.09 |
54 | meta/llama-3.2-3b-instruct | Open-weight Instruct | 58.3 | 67.6 | 63.8 | 59.7 | 66.1 | 68.8 | 64.6 | 54.6 | 66.2 | 8.8 | 16.7 | 13.1 | 7.9 | 1628 | 1 | 0.02 |
55 | nvidia/llama-3.1-nemotron-nano-8b-v1 | Open-weight Instruct | 55.8 | 56.5 | 59.5 | 57.3 | 56.7 | 61.3 | 58.6 | 59.1 | 59.3 | -28.5 | -26.5 | -30.0 | 3.5 | 1633 | 1 | 0.09 |
56 | OpenAI/GPT-5-nano (minimal) | Closed-source Reasoning | 55.0 | 68.8 | 55.3 | 60.9 | 63.0 | 65.8 | 62.1 | 54.3 | 63.2 | -18.7 | -19.6 | -26.9 | 8.2 | 1618 | 7 | 0.29 |
57 | OpenAI/GPT-4.1-nano | Closed-source Instruct | 54.1 | 69.8 | 62.9 | 66.7 | 68.4 | 71.0 | 65.6 | 63.5 | 67.9 | -14.5 | -2.1 | -0.7 | 13.8 | 1619 | 1 | 0.56 |
58 | Qwen/Qwen3-30B-A3B-Thinking-2507 | Open-weight Reasoning | 39.8 | 46.7 | 35.9 | 45.4 | 35.8 | 42.1 | 41.2 | 35.3 | 41.5 | -0.2 | -1.3 | 0.4 | 1.7 | 1780 | 742 | 1.10 |
59 | meta/llama-3.2-1b-instruct | Open-weight Instruct | 39.5 | 31.9 | 48.4 | 44.9 | 55.8 | 47.8 | 43.2 | 46.2 | 45.7 | 31.0 | 33.1 | 37.2 | 6.2 | 1628 | 1 | 0.02 |
Report Generation Leaderboard with Grounding Documents: LLMs generate reports with the human-curated reference documents provided as context. Results below are based on the full dataset with gpt-oss-120b (mixed) as judge. Evaluation and cost estimation last performed on 20 Sep 2025.
1 | OpenAI/GPT-5 (high) | Closed-source Reasoning | 65.9 | 49.3 | 70.6 | 63.7 | 80.0 | 64.4 | 66.2 | 65.3 | 5451 | 23758 | 14583 | 112.34 |
2 | OpenAI/o3 | Closed-source Reasoning | 61.4 | 46.1 | 61.8 | 60.9 | 76.8 | 60.4 | 61.8 | 63.0 | 4158 | 18445 | 4709 | 47.72 |
3 | OpenAI/GPT-5-mini (high) | Closed-source Reasoning | 60.3 | 50.8 | 63.6 | 51.6 | 75.4 | 56.7 | 60.1 | 68.2 | 9018 | 26859 | 18038 | 27.39 |
4 | Google/Gemini-2.5-Pro | Closed-source Reasoning | 60.3 | 46.8 | 66.3 | 54.0 | 74.2 | 61.4 | 59.3 | 66.8 | 7449 | 6086 | 7950 | 55.75 |
5 | OpenAI/o4-mini | Closed-source Reasoning | 58.2 | 45.5 | 58.5 | 54.7 | 74.4 | 55.8 | 58.3 | 61.0 | 3886 | 31679 | 4763 | 35.71 |
6 | Google/Gemini-2.5-Flash (Thinking) | Closed-source Reasoning | 57.6 | 45.0 | 61.8 | 53.5 | 69.9 | 58.0 | 57.6 | 61.1 | 12047 | 6086 | 12030 | 20.42 |
7 | Google/Gemini-2.5-Flash | Closed-source Instruct | 56.8 | 44.6 | 59.4 | 54.3 | 68.8 | 57.1 | 56.1 | 53.2 | 21612 | 6086 | 5936 | 10.67 |
8 | OpenAI/GPT-4.1 | Closed-source Instruct | 56.8 | 44.7 | 55.2 | 54.0 | 73.2 | 56.7 | 56.7 | 58.4 | 6451 | 18427 | 2152 | 34.60 |
9 | Anthropic/claude-sonnet-4 (Thinking) | Closed-source Reasoning | 55.8 | 43.9 | 57.1 | 50.8 | 71.4 | 53.8 | 54.0 | 61.8 | 3866 | 51044 | 6916 | 164.39 |
10 | OpenAI/gpt-oss-120b (high) | Open-weight Reasoning | 54.9 | 49.1 | 55.3 | 45.5 | 69.4 | 48.7 | 55.5 | 59.0 | 7442 | 11606 | 4572 | 1.35 |
11 | Qwen/Qwen3-235B-A22B-Instruct-2507 | Open-weight Instruct | 54.2 | 45.6 | 55.8 | 45.7 | 69.6 | 51.0 | 52.9 | 66.2 | 11400 | 12450 | 4244 | 1.47 |
12 | Qwen/Qwen3-235B-A22B-Thinking-2507 | Open-weight Reasoning | 54.0 | 45.1 | 61.4 | 42.3 | 67.3 | 51.4 | 51.6 | 61.9 | 6046 | 12442 | 9256 | 2.47 |
13 | DeepSeek-AI/DeepSeek-V3.1 (Thinking) | Open-weight Reasoning | 53.8 | 44.8 | 59.8 | 43.3 | 67.4 | 51.1 | 53.0 | 60.5 | 5239 | 11258 | 7486 | 5.27 |
14 | OpenAI/GPT-4.1-mini | Closed-source Instruct | 53.7 | 45.1 | 53.0 | 49.1 | 67.5 | 50.3 | 53.2 | 52.8 | 6921 | 29469 | 2218 | 9.82 |
15 | Anthropic/claude-sonnet-4 | Closed-source Instruct | 53.5 | 40.7 | 54.2 | 49.5 | 69.6 | 55.3 | 51.1 | 54.2 | 4068 | 51016 | 1398 | 111.37 |
16 | DeepSeek-AI/DeepSeek-V3.1 | Open-weight Instruct | 53.5 | 45.8 | 55.9 | 45.2 | 67.1 | 50.8 | 52.7 | 59.1 | 7792 | 11231 | 2407 | 2.67 |
17 | xAI/grok-4-0709 | Closed-source Reasoning | 53.4 | 33.6 | 62.2 | 44.3 | 73.4 | 51.9 | 51.6 | 64.1 | 5380 | 13481 | 9885 | 122.78 |
18 | MoonshotAI/Kimi-K2-Instruct-0905 | Open-weight Instruct | 51.3 | 40.4 | 50.2 | 48.8 | 65.9 | 51.2 | 50.0 | 63.4 | 4817 | 11462 | 1562 | 3.36 |
19 | OpenAI/GPT-5-nano (high) | Closed-source Reasoning | 50.1 | 42.2 | 44.6 | 44.6 | 69.0 | 46.6 | 48.3 | 58.9 | 9796 | 28549 | 25189 | 7.36 |
20 | Google/Gemini-2.5-Flash-Lite (Thinking) | Closed-source Reasoning | 49.4 | 31.7 | 53.1 | 44.6 | 68.0 | 48.3 | 48.8 | 54.0 | 10058 | 6086 | 18584 | 5.15 |
21 | Qwen/Qwen3-30B-A3B-Instruct-2507 | Open-weight Instruct | 49.3 | 41.6 | 47.9 | 42.3 | 65.5 | 44.5 | 48.0 | 59.1 | 11167 | 12490 | 4021 | 0.95 |
22 | OpenAI/gpt-oss-20b (high) | Open-weight Reasoning | 48.4 | 41.4 | 46.5 | 39.8 | 66.0 | 40.9 | 48.2 | 56.2 | 5331 | 11600 | 4705 | 0.75 |
23 | Google/Gemini-2.5-Flash-Lite | Closed-source Instruct | 46.6 | 29.8 | 49.0 | 44.0 | 63.7 | 47.4 | 45.0 | 48.6 | 24167 | 6086 | 7787 | 2.33 |
24 | Qwen/Qwen3-30B-A3B-Thinking-2507 | Open-weight Reasoning | 44.6 | 34.4 | 45.4 | 36.8 | 61.8 | 40.4 | 42.3 | 63.9 | 4757 | 12339 | 9027 | 2.16 |
25 | meta/llama-4-maverick | Open-weight Instruct | 39.4 | 35.2 | 35.8 | 34.2 | 52.5 | 39.3 | 36.5 | 46.2 | 4223 | 14604 | 1191 | 1.86 |
26 | OpenAI/GPT-4.1-nano | Closed-source Instruct | 39.3 | 24.8 | 40.8 | 33.4 | 58.2 | 34.9 | 38.4 | 53.5 | 6359 | 35561 | 1966 | 2.78 |
27 | meta/llama-4-scout | Open-weight Instruct | 35.4 | 23.4 | 34.6 | 33.4 | 50.3 | 35.1 | 33.3 | 42.3 | 3612 | 16675 | 1039 | 1.05 |
28 | Anthropic/claude-3.5-haiku | Closed-source Instruct | 27.6 | 12.0 | 24.7 | 27.7 | 46.3 | 31.2 | 24.7 | 49.4 | 1784 | 34475 | 576 | 19.13 |