ProfBench: Over 7,000 brand-new expert-authored response–criterion pairs across 80 professional tasks in PhD STEM (Chemistry, Physics) and MBA Services (Finance, Consulting) domains.

ProfBench is a high-quality, text-only dataset that represents the complex reasoning tasks faced by professionals in fields like finance and chemistry. We're not talking about simple Q&A or retrieval-based tasks. We're talking about multi-page assignments that require deep domain knowledge and reasoning. Can AI generate comprehensive reports by applying the nuanced reasoning that a PhD-level physicist/chemist or an MBA-level consultant/financier would have?

Blog | Paper | Data | Code | Nemo Evaluator SDK

Want to see your favorite models added? Run the benchmark with the Nemo Evaluator SDK for scalable evaluation or the ProfBench code for quick evaluation and send us the scores, or ping zhilinw/viviennez [at] nvidia.com and we will run it for you!

Report Generation Leaderboard: LLMs generate reports from the prompt alone, and the reports are then evaluated by a gpt-oss-120b (mixed) judge on the lite dataset (160 samples). Evaluation and cost estimation were last performed on 12 Nov 2025.