ProfBench: Over 7,000 brand-new expert-authored response–criterion pairs spanning 80 professional tasks in PhD STEM (Chemistry, Physics) and MBA Services (Finance, Consulting) domains.
ProfBench is a high-quality, text-only dataset that represents the complex reasoning tasks professionals face in fields like finance and chemistry. We're not talking about simple Q&A or retrieval-based tasks. We're talking about multi-page assignments that require deep domain knowledge and reasoning. Can AI generate comprehensive reports by applying the nuanced reasoning that a PhD-level physicist or chemist, or an MBA-level consultant or financier, would bring?
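Each report is judged against expert-written rubric criteria rather than a single gold answer. A minimal sketch of the idea (the weighting scheme and data layout here are illustrative assumptions, not the actual ProfBench schema — see the Data and Code links for the real format):

```python
# Illustrative sketch: rubric-based report scoring.
# Each criterion carries a weight; the report score is the weighted
# percentage of criteria the judge marks as fulfilled.
def report_score(judgements):
    """judgements: list of (fulfilled: bool, weight: float) pairs."""
    total = sum(weight for _, weight in judgements)
    earned = sum(weight for fulfilled, weight in judgements if fulfilled)
    return 100.0 * earned / total

# Three hypothetical criteria, one unfulfilled:
print(report_score([(True, 3.0), (False, 1.0), (True, 2.0)]))
```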
Blog | Paper | Data | Code | Nemo Evaluator SDK
Want to see your favorite models added? Run them with the Nemo Evaluator SDK for scalable evaluation or the ProfBench code for quick evaluation and send us the scores, or ping zhilinw/viviennez [at] nvidia.com and we'll run them for you!
Report Generation Leaderboard: LLMs generate reports from the prompt alone; the reports are then evaluated by the gpt-oss-120b (mixed) judge on the lite dataset (160 samples). Evaluation and cost estimation last performed on 12 Nov 2025.
10 | DeepSeek-AI/DeepSeek-V3.2-Exp (Thinking) | Open-weight Reasoning | 52.4 | 38.6 | 57.2 | 44.1 | 69.8 | 45.9 | 54.1 | 59.2 | 18559 | 1126 | 16123 | 0.30 |
LLM Judge Leaderboard: LLM judges are evaluated on whether they can accurately predict the human-labelled criterion fulfilment across 3 different models (o3, Grok4, R1-0528). We consider not only macro-F1 across 3486 samples but also whether an LLM judge displays bias toward or against any model, captured by a Bias Index. The Overall score is calculated as Overall F1 - Bias Index. Evaluation and cost estimation last performed on 20 Sep 2025.
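The Overall-score computation can be sketched as follows, assuming the Bias Index is the spread (max - min) of the per-model bias values — a definition consistent with the rows in the table below (e.g. gpt-oss-120b (mixed): biases -0.5/-0.9/-1.0 give a Bias Index of 0.5, and 78.7 - 0.5 = 78.2):

```python
# Sketch of the judge-leaderboard scoring, assuming Bias Index = spread
# of the per-model bias values (an assumption that matches the table).
def bias_index(per_model_bias):
    """Spread between the most- and least-favoured judged model."""
    return max(per_model_bias) - min(per_model_bias)

def overall_score(overall_f1, per_model_bias):
    return overall_f1 - bias_index(per_model_bias)

# gpt-oss-120b (mixed): Overall F1 = 78.7, biases vs o3/Grok4/R1-0528
print(round(overall_score(78.7, [-0.5, -0.9, -1.0]), 1))  # 78.2
```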
1 | OpenAI/gpt-oss-120b (mixed) | Open-weight Reasoning | 78.2 | 89.5 | 68.9 | 72.2 | 79.7 | 79.7 | 76.9 | 80.8 | 78.7 | -0.5 | -0.9 | -1.0 | 0.5 | 1683 | 282 | 0.70 |
2 | Google/Gemini-2.5-Pro | Closed-source Reasoning | 78.2 | 87.3 | 70.2 | 71.9 | 82.6 | 81.3 | 77.4 | 76.8 | 79.2 | 3.1 | 2.8 | 2.1 | 1.0 | 1779 | 967 | 41.46 |
3 | Google/Gemini-2.5-Flash (Thinking) | Closed-source Reasoning | 78.1 | 87.0 | 68.7 | 71.6 | 81.2 | 80.1 | 76.7 | 74.6 | 78.4 | 2.3 | 2.5 | 2.2 | 0.3 | 1779 | 695 | 7.92 |
4 | OpenAI/o4-mini (low) | Closed-source Reasoning | 76.8 | 88.6 | 70.1 | 70.1 | 81.0 | 78.8 | 76.8 | 74.1 | 78.5 | 3.4 | 3.3 | 1.7 | 1.7 | 1618 | 104 | 7.80 |
5 | OpenAI/GPT-5 (med) | Closed-source Reasoning | 76.7 | 89.2 | 67.9 | 69.0 | 80.9 | 78.1 | 76.3 | 77.3 | 77.9 | 0.0 | -0.9 | -1.2 | 1.2 | 1619 | 287 | 17.06 |
6 | OpenAI/gpt-oss-120b (low) | Open-weight Reasoning | 76.7 | 86.0 | 67.2 | 72.1 | 79.0 | 79.2 | 75.7 | 72.4 | 77.3 | -1.0 | -1.6 | -1.5 | 0.6 | 1683 | 84 | 0.50 |
7 | DeepSeek-AI/DeepSeek-V3.1 (Thinking) | Open-weight Reasoning | 76.6 | 84.3 | 69.3 | 70.8 | 80.3 | 78.9 | 75.6 | 72.0 | 77.3 | 3.2 | 3.3 | 2.6 | 0.7 | 1587 | 657 | 2.94 |
8 | Qwen/Qwen3-235B-A22B-Thinking-2507 | Open-weight Reasoning | 76.5 | 87.2 | 67.9 | 69.0 | 80.4 | 79.3 | 75.6 | 74.3 | 77.3 | -1.0 | -1.8 | -1.5 | 0.8 | 1782 | 1245 | 1.84 |
9 | OpenAI/o3 (high) | Closed-source Reasoning | 76.4 | 88.3 | 68.2 | 69.3 | 81.1 | 79.1 | 76.1 | 75.3 | 77.9 | 2.0 | 0.5 | 0.8 | 1.5 | 1618 | 350 | 21.04 |
10 | OpenAI/o3 (low) | Closed-source Reasoning | 76.4 | 88.9 | 69.3 | 70.3 | 81.9 | 79.7 | 76.8 | 76.7 | 78.7 | 3.8 | 1.5 | 2.6 | 2.3 | 1618 | 98 | 14.01 |
11 | OpenAI/GPT-5 (low) | Closed-source Reasoning | 76.3 | 88.6 | 69.3 | 69.0 | 80.9 | 78.1 | 76.6 | 79.4 | 78.1 | 0.3 | -1.5 | -1.4 | 1.8 | 1618 | 130 | 11.58 |
12 | OpenAI/o3 (med) | Closed-source Reasoning | 76.0 | 89.3 | 69.1 | 68.9 | 81.0 | 79.3 | 76.4 | 76.9 | 78.2 | 3.0 | 0.8 | 1.5 | 2.2 | 1618 | 207 | 17.05 |
13 | OpenAI/GPT-5 (high) | Closed-source Reasoning | 76.0 | 90.2 | 68.2 | 69.4 | 80.9 | 78.3 | 76.7 | 79.1 | 78.3 | 1.0 | -0.8 | -1.3 | 2.3 | 1618 | 668 | 30.34 |
14 | xAI/grok-4 | Closed-source Reasoning | 75.9 | 86.1 | 68.5 | 70.7 | 80.8 | 78.5 | 76.3 | 75.2 | 77.7 | 0.7 | 2.5 | 1.8 | 1.8 | 1549 | 812 | 58.70 |
15 | OpenAI/gpt-oss-120b (med) | Open-weight Reasoning | 75.8 | 88.1 | 67.4 | 70.5 | 79.9 | 79.6 | 76.0 | 75.3 | 77.7 | 0.6 | -1.3 | -0.9 | 1.9 | 1683 | 196 | 0.63 |
16 | OpenAI/o4-mini (med) | Closed-source Reasoning | 75.8 | 88.1 | 69.6 | 70.8 | 81.6 | 78.9 | 76.8 | 74.1 | 78.6 | 4.0 | 2.8 | 1.2 | 2.8 | 1618 | 228 | 9.70 |
17 | OpenAI/o4-mini (high) | Closed-source Reasoning | 75.8 | 88.5 | 68.9 | 70.5 | 81.5 | 78.7 | 76.8 | 76.5 | 78.4 | 4.5 | 2.7 | 1.9 | 2.6 | 1618 | 308 | 10.93 |
18 | OpenAI/gpt-oss-20b (low) | Open-weight Reasoning | 75.6 | 85.4 | 69.3 | 70.8 | 79.2 | 77.6 | 76.3 | 71.1 | 77.5 | 0.4 | -0.3 | 1.6 | 1.9 | 1677 | 85 | 0.28 |
19 | OpenAI/gpt-oss-120b (high) | Open-weight Reasoning | 75.4 | 89.5 | 68.9 | 69.7 | 80.8 | 78.9 | 76.7 | 80.8 | 78.4 | 1.6 | -1.4 | 0.3 | 3.0 | 1683 | 439 | 0.88 |
20 | OpenAI/GPT-4.1 | Closed-source Instruct | 75.4 | 80.9 | 69.2 | 71.0 | 80.0 | 79.8 | 74.4 | 65.8 | 76.3 | 5.5 | 4.6 | 5.0 | 0.9 | 1619 | 1 | 11.31 |
21 | OpenAI/GPT-5-mini (high) | Closed-source Reasoning | 75.3 | 84.5 | 69.2 | 70.4 | 82.8 | 78.4 | 75.9 | 74.1 | 77.7 | 6.6 | 4.2 | 4.6 | 2.4 | 1619 | 497 | 4.88 |
22 | MoonshotAI/Kimi-K2-Instruct-0711 | Open-weight Instruct | 75.2 | 85.3 | 69.5 | 68.3 | 82.3 | 80.3 | 76.1 | 66.4 | 77.6 | 7.1 | 6.1 | 4.7 | 2.4 | 1636 | 1 | 0.81 |
23 | Qwen/Qwen3-235B-A22B-Instruct-2507 | Open-weight Instruct | 75.1 | 86.5 | 69.3 | 69.3 | 79.6 | 79.2 | 76.0 | 64.6 | 77.3 | 3.8 | 2.2 | 1.6 | 2.2 | 1779 | 1 | 0.48 |
24 | xAI/grok-3-mini | Closed-source Reasoning | 75.1 | 85.8 | 66.9 | 69.4 | 82.0 | 78.1 | 75.3 | 75.2 | 77.2 | 4.5 | 2.4 | 2.9 | 2.1 | 1549 | 633 | 2.72 |
25 | OpenAI/GPT-4.1-mini | Closed-source Instruct | 74.9 | 83.9 | 67.3 | 69.1 | 80.6 | 79.2 | 74.7 | 69.8 | 76.4 | -0.2 | 1.2 | -0.3 | 1.5 | 1619 | 1 | 2.26 |
26 | OpenAI/gpt-oss-20b (med) | Open-weight Reasoning | 74.8 | 87.7 | 68.3 | 69.7 | 80.9 | 78.5 | 76.3 | 76.2 | 77.8 | 3.6 | 1.1 | 0.6 | 3.0 | 1683 | 216 | 0.35 |
27 | MoonshotAI/Kimi-K2-Instruct-0905 | Open-weight Instruct | 74.7 | 84.5 | 69.9 | 67.5 | 81.9 | 80.2 | 75.5 | 65.9 | 77.0 | 7.5 | 6.1 | 5.2 | 2.3 | 1623 | 1 | 0.81 |
28 | OpenAI/GPT-5-mini (low) | Closed-source Reasoning | 74.7 | 82.9 | 68.5 | 70.3 | 81.7 | 77.4 | 74.6 | 78.0 | 76.8 | 5.9 | 3.8 | 4.6 | 2.1 | 1618 | 92 | 2.05 |
29 | Google/Gemini-2.5-Flash-Lite (Thinking) | Closed-source Reasoning | 74.7 | 83.7 | 67.0 | 72.2 | 81.9 | 78.7 | 75.9 | 79.1 | 77.5 | -1.1 | 0.2 | -2.6 | 2.8 | 1779 | 1670 | 2.95 |
30 | OpenAI/GPT-5-mini (med) | Closed-source Reasoning | 74.4 | 83.3 | 68.2 | 69.9 | 81.5 | 78.1 | 74.6 | 72.8 | 76.7 | 6.3 | 4.0 | 4.3 | 2.3 | 1618 | 228 | 3.00 |
31 | OpenAI/gpt-oss-20b (high) | Open-weight Reasoning | 74.4 | 89.3 | 68.7 | 68.5 | 80.7 | 77.8 | 76.5 | 77.7 | 77.9 | 3.3 | -0.2 | 0.9 | 3.5 | 1679 | 465 | 0.46 |
32 | meta/llama-3.3-70b-instruct | Open-weight Instruct | 74.1 | 84.6 | 66.5 | 71.6 | 79.1 | 78.1 | 75.4 | 64.6 | 76.7 | -3.1 | -0.8 | -3.4 | 2.6 | 1628 | 1 | 0.22 |
33 | OpenAI/GPT-5-nano (low) | Closed-source Reasoning | 73.6 | 83.5 | 67.6 | 68.6 | 77.7 | 76.9 | 73.5 | 70.9 | 75.4 | 2.4 | 0.6 | 1.9 | 1.8 | 1619 | 141 | 0.48 |
34 | Google/Gemini-2.5-Flash | Closed-source Instruct | 73.4 | 82.9 | 67.3 | 70.8 | 79.6 | 79.2 | 74.5 | 67.7 | 76.3 | -4.2 | -6.6 | -7.1 | 2.9 | 1779 | 1 | 1.87 |
35 | Google/Gemini-2.5-Flash-Lite | Closed-source Instruct | 73.3 | 83.6 | 68.2 | 68.2 | 80.6 | 77.9 | 75.0 | 71.0 | 76.4 | -1.1 | 2.0 | 0.6 | 3.1 | 1779 | 1 | 0.62 |
36 | Qwen/Qwen3-30B-A3B-Instruct-2507 | Open-weight Instruct | 73.1 | 82.0 | 68.3 | 67.3 | 79.7 | 76.5 | 74.5 | 64.7 | 75.5 | 4.7 | 7.1 | 5.3 | 2.4 | 1778 | 1 | 0.32 |
37 | DeepSeek-AI/DeepSeek-V3.1 | Open-weight Instruct | 72.8 | 79.6 | 68.2 | 68.3 | 78.7 | 77.4 | 73.9 | 65.8 | 75.2 | 0.2 | -1.5 | -2.2 | 2.4 | 1586 | 1 | 1.11 |
38 | OpenAI/GPT-5-nano (med) | Closed-source Reasoning | 72.7 | 85.6 | 67.0 | 68.7 | 79.7 | 77.1 | 74.3 | 78.3 | 76.4 | 3.4 | -0.3 | 1.7 | 3.7 | 1618 | 479 | 0.95 |
39 | DeepSeek-AI/DeepSeek-V3-0324 | Open-weight Instruct | 72.6 | 84.5 | 68.0 | 67.0 | 78.3 | 77.7 | 74.6 | 63.5 | 75.7 | 1.5 | 2.4 | -0.7 | 3.1 | 1585 | 1 | 1.11 |
40 | anthropic/claude-3.5-haiku | Closed-source Instruct | 72.5 | 78.9 | 67.2 | 71.2 | 76.7 | 76.9 | 73.3 | 65.4 | 74.9 | -1.7 | 0.7 | -1.4 | 2.4 | 1913 | 1 | 5.35 |
41 | OpenAI/GPT-5 (minimal) | Closed-source Reasoning | 71.9 | 86.8 | 68.6 | 71.2 | 77.5 | 78.9 | 75.2 | 64.8 | 77.0 | -0.5 | -5.6 | -5.0 | 5.1 | 1618 | 7 | 7.29 |
42 | OpenAI/GPT-5-nano (high) | Closed-source Reasoning | 71.9 | 86.8 | 67.6 | 68.7 | 79.8 | 77.6 | 75.1 | 74.0 | 76.9 | 5.3 | 0.3 | 3.1 | 5.0 | 1618 | 1309 | 2.11 |
43 | meta/llama-3.1-405b-instruct | Open-weight Instruct | 71.6 | 85.1 | 69.1 | 67.6 | 81.7 | 77.7 | 75.5 | 65.5 | 77.0 | 11.5 | 6.1 | 9.4 | 5.4 | 1628 | 1 | 4.54 |
44 | Anthropic/claude-sonnet-4-20250514 | Closed-source Reasoning | 70.9 | 75.7 | 66.3 | 69.9 | 77.8 | 77.5 | 72.3 | 66.0 | 74.0 | -11.2 | -8.1 | -10.7 | 3.1 | 1940 | 810 | 62.64 |
45 | meta/llama-3.1-70b-instruct | Open-weight Instruct | 70.7 | 82.1 | 66.7 | 72.6 | 76.0 | 77.5 | 73.9 | 64.7 | 75.4 | -6.2 | -1.5 | -4.1 | 4.7 | 1628 | 1 | 0.22 |
46 | Anthropic/claude-sonnet-4 | Closed-source Instruct | 70.2 | 85.0 | 66.9 | 68.1 | 76.3 | 77.6 | 73.3 | 64.1 | 75.2 | -6.5 | -5.2 | -10.2 | 5.0 | 1913 | 1 | 20.06 |
47 | DeepSeek-AI/DeepSeek-R1-0528 | Open-weight Reasoning | 69.4 | 79.6 | 65.1 | 68.5 | 71.6 | 74.7 | 70.9 | 64.1 | 72.2 | -11.6 | -9.3 | -8.8 | 2.8 | 1601 | 693 | 3.05 |
48 | nvidia/llama-3.3-nemotron-super-49b-v1 | Open-weight Instruct | 68.8 | 77.2 | 65.1 | 70.2 | 72.1 | 74.1 | 70.7 | 64.1 | 72.3 | -15.7 | -12.2 | -13.0 | 3.5 | 1637 | 1 | 0.74 |
49 | meta/llama-4-maverick-17b-128e-instruct | Open-weight Instruct | 67.9 | 64.9 | 66.7 | 73.4 | 76.4 | 76.5 | 70.4 | 67.9 | 72.4 | -14.3 | -10.5 | -9.8 | 4.5 | 1566 | 1 | 0.82 |
50 | nvidia/llama-3.1-nemotron-ultra-253b-v1 | Open-weight Instruct | 67.4 | 84.8 | 63.6 | 66.6 | 61.8 | 72.6 | 67.8 | 57.8 | 69.6 | -10.0 | -11.4 | -9.2 | 2.2 | 1637 | 1 | 3.43 |
51 | OpenAI/GPT-5-mini (minimal) | Closed-source Reasoning | 66.7 | 81.7 | 64.0 | 69.1 | 76.0 | 75.9 | 72.5 | 58.8 | 73.8 | -4.0 | -6.2 | -11.1 | 7.1 | 1618 | 7 | 1.46 |
52 | meta/llama-4-scout-17b-16e-instruct | Open-weight Instruct | 65.9 | 60.4 | 69.4 | 71.3 | 75.6 | 76.2 | 69.9 | 62.0 | 71.8 | -14.5 | -10.2 | -8.6 | 5.9 | 1565 | 1 | 0.44 |
53 | meta/llama-3.1-8b-instruct | Open-weight Instruct | 63.1 | 76.2 | 69.3 | 70.2 | 71.0 | 76.6 | 71.5 | 61.7 | 73.2 | -4.0 | 6.1 | -1.5 | 10.1 | 1628 | 1 | 0.09 |
54 | meta/llama-3.2-3b-instruct | Open-weight Instruct | 58.3 | 67.6 | 63.8 | 59.7 | 66.1 | 68.8 | 64.6 | 54.6 | 66.2 | 8.8 | 16.7 | 13.1 | 7.9 | 1628 | 1 | 0.02 |
55 | nvidia/llama-3.1-nemotron-nano-8b-v1 | Open-weight Instruct | 55.8 | 56.5 | 59.5 | 57.3 | 56.7 | 61.3 | 58.6 | 59.1 | 59.3 | -28.5 | -26.5 | -30.0 | 3.5 | 1633 | 1 | 0.09 |
56 | OpenAI/GPT-5-nano (minimal) | Closed-source Reasoning | 55.0 | 68.8 | 55.3 | 60.9 | 63.0 | 65.8 | 62.1 | 54.3 | 63.2 | -18.7 | -19.6 | -26.9 | 8.2 | 1618 | 7 | 0.29 |
57 | OpenAI/GPT-4.1-nano | Closed-source Instruct | 54.1 | 69.8 | 62.9 | 66.7 | 68.4 | 71.0 | 65.6 | 63.5 | 67.9 | -14.5 | -2.1 | -0.7 | 13.8 | 1619 | 1 | 0.56 |
58 | Qwen/Qwen3-30B-A3B-Thinking-2507 | Open-weight Reasoning | 39.8 | 46.7 | 35.9 | 45.4 | 35.8 | 42.1 | 41.2 | 35.3 | 41.5 | -0.2 | -1.3 | 0.4 | 1.7 | 1780 | 742 | 1.10 |
59 | meta/llama-3.2-1b-instruct | Open-weight Instruct | 39.5 | 31.9 | 48.4 | 44.9 | 55.8 | 47.8 | 43.2 | 46.2 | 45.7 | 31.0 | 33.1 | 37.2 | 6.2 | 1628 | 1 | 0.02 |
Report Generation Leaderboard with Grounding Documents: LLMs generate reports with the human-curated reference documents provided as context. Results below are based on the full dataset with gpt-oss-120b (mixed) as judge. Evaluation and cost estimation last performed on 20 Sep 2025.
1 | OpenAI/GPT-5 (high) | Closed-source Reasoning | 65.9 | 49.3 | 70.6 | 63.7 | 80.0 | 64.4 | 66.2 | 65.3 | 5451 | 23758 | 14583 | 112.34 |
2 | OpenAI/o3 | Closed-source Reasoning | 61.4 | 46.1 | 61.8 | 60.9 | 76.8 | 60.4 | 61.8 | 63.0 | 4158 | 18445 | 4709 | 47.72 |
3 | OpenAI/GPT-5-mini (high) | Closed-source Reasoning | 60.3 | 50.8 | 63.6 | 51.6 | 75.4 | 56.7 | 60.1 | 68.2 | 9018 | 26859 | 18038 | 27.39 |
4 | Google/Gemini-2.5-Pro | Closed-source Reasoning | 60.3 | 46.8 | 66.3 | 54.0 | 74.2 | 61.4 | 59.3 | 66.8 | 7449 | 6086 | 7950 | 55.75 |
5 | OpenAI/o4-mini | Closed-source Reasoning | 58.2 | 45.5 | 58.5 | 54.7 | 74.4 | 55.8 | 58.3 | 61.0 | 3886 | 31679 | 4763 | 35.71 |
6 | Google/Gemini-2.5-Flash (Thinking) | Closed-source Reasoning | 57.6 | 45.0 | 61.8 | 53.5 | 69.9 | 58.0 | 57.6 | 61.1 | 12047 | 6086 | 12030 | 20.42 |
7 | Google/Gemini-2.5-Flash | Closed-source Instruct | 56.8 | 44.6 | 59.4 | 54.3 | 68.8 | 57.1 | 56.1 | 53.2 | 21612 | 6086 | 5936 | 10.67 |
8 | OpenAI/GPT-4.1 | Closed-source Instruct | 56.8 | 44.7 | 55.2 | 54.0 | 73.2 | 56.7 | 56.7 | 58.4 | 6451 | 18427 | 2152 | 34.60 |
9 | Anthropic/claude-sonnet-4 (Thinking) | Closed-source Reasoning | 55.8 | 43.9 | 57.1 | 50.8 | 71.4 | 53.8 | 54.0 | 61.8 | 3866 | 51044 | 6916 | 164.39 |
10 | OpenAI/gpt-oss-120b (high) | Open-weight Reasoning | 54.9 | 49.1 | 55.3 | 45.5 | 69.4 | 48.7 | 55.5 | 59.0 | 7442 | 11606 | 4572 | 1.35 |
11 | Qwen/Qwen3-235B-A22B-Instruct-2507 | Open-weight Instruct | 54.2 | 45.6 | 55.8 | 45.7 | 69.6 | 51.0 | 52.9 | 66.2 | 11400 | 12450 | 4244 | 1.47 |
12 | Qwen/Qwen3-235B-A22B-Thinking-2507 | Open-weight Reasoning | 54.0 | 45.1 | 61.4 | 42.3 | 67.3 | 51.4 | 51.6 | 61.9 | 6046 | 12442 | 9256 | 2.47 |
13 | DeepSeek-AI/DeepSeek-V3.1 (Thinking) | Open-weight Reasoning | 53.8 | 44.8 | 59.8 | 43.3 | 67.4 | 51.1 | 53.0 | 60.5 | 5239 | 11258 | 7486 | 5.27 |
14 | OpenAI/GPT-4.1-mini | Closed-source Instruct | 53.7 | 45.1 | 53.0 | 49.1 | 67.5 | 50.3 | 53.2 | 52.8 | 6921 | 29469 | 2218 | 9.82 |
15 | Anthropic/claude-sonnet-4 | Closed-source Instruct | 53.5 | 40.7 | 54.2 | 49.5 | 69.6 | 55.3 | 51.1 | 54.2 | 4068 | 51016 | 1398 | 111.37 |
16 | DeepSeek-AI/DeepSeek-V3.1 | Open-weight Instruct | 53.5 | 45.8 | 55.9 | 45.2 | 67.1 | 50.8 | 52.7 | 59.1 | 7792 | 11231 | 2407 | 2.67 |
17 | xAI/grok-4-0709 | Closed-source Reasoning | 53.4 | 33.6 | 62.2 | 44.3 | 73.4 | 51.9 | 51.6 | 64.1 | 5380 | 13481 | 9885 | 122.78 |
18 | MoonshotAI/Kimi-K2-Instruct-0905 | Open-weight Instruct | 51.3 | 40.4 | 50.2 | 48.8 | 65.9 | 51.2 | 50.0 | 63.4 | 4817 | 11462 | 1562 | 3.36 |
19 | OpenAI/GPT-5-nano (high) | Closed-source Reasoning | 50.1 | 42.2 | 44.6 | 44.6 | 69.0 | 46.6 | 48.3 | 58.9 | 9796 | 28549 | 25189 | 7.36 |
20 | Google/Gemini-2.5-Flash-Lite (Thinking) | Closed-source Reasoning | 49.4 | 31.7 | 53.1 | 44.6 | 68.0 | 48.3 | 48.8 | 54.0 | 10058 | 6086 | 18584 | 5.15 |
21 | Qwen/Qwen3-30B-A3B-Instruct-2507 | Open-weight Instruct | 49.3 | 41.6 | 47.9 | 42.3 | 65.5 | 44.5 | 48.0 | 59.1 | 11167 | 12490 | 4021 | 0.95 |
22 | OpenAI/gpt-oss-20b (high) | Open-weight Reasoning | 48.4 | 41.4 | 46.5 | 39.8 | 66.0 | 40.9 | 48.2 | 56.2 | 5331 | 11600 | 4705 | 0.75 |
23 | Google/Gemini-2.5-Flash-Lite | Closed-source Instruct | 46.6 | 29.8 | 49.0 | 44.0 | 63.7 | 47.4 | 45.0 | 48.6 | 24167 | 6086 | 7787 | 2.33 |
24 | Qwen/Qwen3-30B-A3B-Thinking-2507 | Open-weight Reasoning | 44.6 | 34.4 | 45.4 | 36.8 | 61.8 | 40.4 | 42.3 | 63.9 | 4757 | 12339 | 9027 | 2.16 |
25 | meta/llama-4-maverick | Open-weight Instruct | 39.4 | 35.2 | 35.8 | 34.2 | 52.5 | 39.3 | 36.5 | 46.2 | 4223 | 14604 | 1191 | 1.86 |
26 | OpenAI/GPT-4.1-nano | Closed-source Instruct | 39.3 | 24.8 | 40.8 | 33.4 | 58.2 | 34.9 | 38.4 | 53.5 | 6359 | 35561 | 1966 | 2.78 |
27 | meta/llama-4-scout | Open-weight Instruct | 35.4 | 23.4 | 34.6 | 33.4 | 50.3 | 35.1 | 33.3 | 42.3 | 3612 | 16675 | 1039 | 1.05 |
28 | Anthropic/claude-3.5-haiku | Closed-source Instruct | 27.6 | 12.0 | 24.7 | 27.7 | 46.3 | 31.2 | 24.7 | 49.4 | 1784 | 34475 | 576 | 19.13 |