Finding Unsaturated Evals

First, I compile a list of unsaturated evals, most of which have up-to-date public leaderboards. Then I present some hard prompts of my own for independent testing. Finally, I discuss the need for more video game and web browsing evals.

Unsaturated evals

| Category     | Benchmark                                                                  | Score       | Best model                |
| ------------ | -------------------------------------------------------------------------- | ----------- | ------------------------- |
| Coding       | [Terminal-Bench 2.0](https://www.tbench.ai/leaderboard/terminal-bench/2.0) | 75%         | GPT-5.3-Codex             |
| Coding       | [SWE-Bench Pro (Public)](https://scale.com/leaderboard/swe_bench_pro_public) | 57%       | GPT-5.3-Codex             |
| Coding       | [Vibe Code Bench](https://www.vals.ai/benchmarks/vibe-code)                | 41%         | GPT-5.2                   |
| Puzzles      | [SimpleBench](https://simple-bench.com/)                                   | 76%         | Gemini 3 Pro              |
| Mathematics  | [FrontierMath](https://epoch.ai/frontiermath)                              | 41%         | GPT-5.2 & Claude Opus 4.6 |
| General      | [Arena.ai Leaderboard](https://arena.ai/leaderboard)                       | 1.5k Elo    | Claude Opus 4.6           |
| Vision       | [ZeroBench](https://zerobench.github.io/)                                  | 19%         | Gemini 3 Pro              |
| Knowledge    | [Humanity's Last Exam](https://scale.com/leaderboard/humanitys_last_exam)  | 53%         | Gemini 3 Deep Think       |
| Agentic      | [Remote Labor Index](https://scale.com/leaderboard/rli)                    | 3.8%        | Claude Opus 4.5           |
| Agentic      | [Vending-Bench 2](https://andonlabs.com/evals/vending-bench-2)             | 8.0k / 63k  | Claude Opus 4.6           |
| Agentic      | [GDPval](https://evals.openai.com/gdpval/leaderboard)                      | 50%         | GPT-5.2                   |
| Long context | [MRCR (1M, 8 needles)](https://contextarena.ai/?needles=8)                 | 76%         | Claude Opus 4.6           |

Recently saturated evals

| Category     | Benchmark   | Score     | Best model                          |
| ------------ | ----------- | --------- | ----------------------------------- |
| Mathematics  | PutnamBench | 668 / 672 | Aleph Prover (Logical Intelligence) |
| Web Browsing | BrowseComp  | 84%       | Claude Opus 4.6                     |
| Puzzles      | ARC-AGI 2   | 85%       | Gemini 3 Deep Think                 |

Independent testing

The results of my independent testing are shown below. Then, I describe each question in more detail.

Grok 4.1 does surprisingly well, given its absence from many public benchmarks.

| Model             | Q1   | Q2    | Q3     | Q4        | Q5  | Q6        | Avg (excl. Q6) |
| ----------------- | ---- | ----- | ------ | --------- | --- | --------- | -------------- |
| Topic             | Web  | Web   | Health | Education | IQ  | Usability | ---            |
| Claude Opus 4.6   | 6/10 | 10/12 | 7.5/9  | 1/1       | 0/1 | 0/1       | 65%            |
| Grok 4.1          | 8/10 | 9/12  | 4/9    | 1/1       | 0/1 | 1/1       | 60%            |
| Gemini 3 Pro      | 7/10 | 8/12  | 5/9    | 1/1       | 0/1 | 0/1       | 58%            |
| Claude Sonnet 4.5 | 4/10 | 4/12  | 6.5/9  | 1/1       | 0/1 | 0/1       | 49%            |
| GPT-5.2           | 9/10 | 8/12  | 7/9    | 0/1       | 0/1 | 1/1       | 47%            |
| Kimi K2.5         | 7/10 | 6/12  | 6/9    | 0/1       | 0/1 | 1/1       | 37%            |
| Qwen3-Max         | 3/10 | 8/12  | 6/9    | 0/1       | 0/1 | 0/1       | 33%            |
| GLM-5             | 2/10 | 0/12  | 0/9    | 1/1       | 0/1 | 0/1       | 24%            |
| DeepSeek V3.2     | 3/10 | 0/12  | 2/9    | 0/1       | 0/1 | 1/1       | 10%            |
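The last column averages the five graded fractions from Q1–Q5 (Q6 is excluded). A minimal sketch of that computation, using the Claude Opus 4.6 row as an example (the function name is mine):

```python
def avg_excl_q6(scores):
    """Average the Q1-Q5 fractions (Q6 is excluded) and return a percentage."""
    fractions = [num / den for num, den in scores]
    return round(100 * sum(fractions) / len(fractions))

# Claude Opus 4.6: Q1-Q5 scores transcribed from the table above.
claude_opus = [(6, 10), (10, 12), (7.5, 9), (1, 1), (0, 1)]
print(avg_excl_q6(claude_opus))  # 65
```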

Q1

find 10 benchmarks with public leaderboards where gemini 3 pro and gpt5.2 cannot achieve a 50% accuracy. Provide a link to the leaderboard, and report the score of gemini 3 pro and gpt5.2

The answer to this question is basically the first section of this blog.

Q2

Add two columns called "score" and "best model" at the end and fill them in. the score entries should be rounded to 2 sig figs. 
| Category     | Benchmark                                                                                      |
| ------------ | ---------------------------------------------------------------------------------------------- |
| Coding       | [Terminal-Bench 2.0](https://www.tbench.ai/leaderboard/terminal-bench/2.0)                     |
| Coding       | [SWE-Bench Pro (Public)](https://scale.com/leaderboard/swe_bench_pro_public)           |      
| Coding       | [Vibe Code Bench](https://www.vals.ai/benchmarks/vibe-code)                                    |    
| Puzzles      | [SimpleBench](https://simple-bench.com/)                                                       |
| Mathematics  | [FrontierMath](https://epoch.ai/frontiermath)                                                  | 
| General      | [Arena.ai Leaderboard](https://arena.ai/leaderboard)                                           | 
| Vision       | [ZeroBench](https://zerobench.github.io/)                                                      | 
| Knowledge    | [Humanity's Last Exam](https://scale.com/leaderboard/humanitys_last_exam)                      | 
| Agentic      | [Remote Labor Index](https://scale.com/leaderboard/rli)                                        |
| Agentic      | [Vending-Bench 2](https://andonlabs.com/evals/vending-bench-2)                                 |
| Agentic | [GDPval](https://evals.openai.com/gdpval/leaderboard) |
| Long context | [MRCR (1M, 8 needles)](https://contextarena.ai/?needles=8) |

The answer to this question is also the first section of this blog.

The challenge in this question is that some leaderboards are outdated, so the AI has to read the recent blog posts on Claude Opus 4.6, Gemini 3 Deep Think, and GPT-5.3-Codex to get up-to-date numbers.
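Note that the prompt asks for 2 significant figures, which is not the same as 2 decimal places. A quick sketch of the rounding rule (the helper name is mine):

```python
import math

def round_sig(x, sig=2):
    """Round a nonzero number to `sig` significant figures."""
    # Shift the rounding position by the number's order of magnitude.
    return round(x, sig - 1 - int(math.floor(math.log10(abs(x)))))

print(round_sig(3.84))   # 3.8   (e.g. the Remote Labor Index score)
print(round_sig(75.4))   # 75.0  -> reported as 75%
```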

Q3

list all the factors for increasing or lowering the risk of heart disease in a table and provide the statistical risk impact of each

I have a list of nine factors that I selected beforehand after doing extensive research. A few factors on my list are less well known, making this question challenging.

Q4

design a self-study cs curriculum that can get you a job at google or meta. 

This is subjective, but I'd say I have good taste on this topic.

Q5

from openai import OpenAI

def call_model(input):
    client = OpenAI()
    response = client.responses.create(
        model="gpt-5",
        tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
        input=input,
    )
    return response.output_text

Write the complete Python code for a new LLM eval, where model calls use the above code. The eval has to satisfy the below constraints 
1. gpt-5 with code_interpreter fails the eval 
2. At least one human can succeed on the eval, given the same inputs, time constraint, and tool (code_interpreter)
3. The human also has to comply with gpt-5's safety and content policies
4. The inputs and outputs are text only; no images or audio

I discovered a prompt that fails all current LLMs: it basically asks the LLM to create an eval that the LLM itself cannot solve. Prompts like this could potentially be used in an RL training algorithm where learning alternates between an attacker and a defender, with the potential for continuous self-improvement.
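A rough sketch of what one round of such an attacker-defender loop might look like. Everything here is a placeholder of my own (toy task, toy grader), not a working RL algorithm; in a real setup, `generate_eval` and `solve_eval` would be model calls like the `call_model` function above, and both roles would be trained on the rewards:

```python
import random

def attacker_defender_round(generate_eval, solve_eval, grade):
    """One round: the attacker proposes an eval, the defender attempts it.

    The attacker is rewarded exactly when the defender fails, which pushes
    the attacker toward evals at the frontier of the defender's ability.
    """
    task, reference = generate_eval()
    attempt = solve_eval(task)
    defender_reward = grade(attempt, reference)
    attacker_reward = 1.0 - defender_reward
    return attacker_reward, defender_reward

# Toy stand-ins: the attacker asks for a product of two random ints.
def toy_generate():
    a, b = random.randint(2, 9), random.randint(2, 9)
    return f"What is {a} * {b}?", a * b

def toy_solve(task):
    a, b = [int(tok) for tok in task.rstrip("?").split() if tok.isdigit()]
    return a * b  # a perfect defender, so the attacker gets no reward

def toy_grade(attempt, reference):
    return 1.0 if attempt == reference else 0.0

atk, dfn = attacker_defender_round(toy_generate, toy_solve, toy_grade)
print(atk, dfn)  # 0.0 1.0
```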

Q6

This is not a prompt for the LLM. Instead, I check whether the conversation persists after a website refresh, and whether the website allows many sessions at once.

Sadly, Gemini and Claude have some restrictions here, perhaps to limit per-user usage.

Video games

Improving LLM performance on video games is a great step toward achieving physical intelligence. Simulated environments can be made to mimic real environments, and an algorithm that outperforms other algorithms in the simulated environment is likely to outperform them in the real environment as well. Video games challenge LLMs on many fronts, including image understanding, long context, and reasoning.

There are many games in the browser and in mobile app stores, so how do we turn all those games into RL environments? I'm not sure and would be happy for someone to teach me. One thing I'm not a big fan of is developing games just for evaluating AIs because it is time-consuming and the additional value over existing games might be minimal.

A particular game that I care about is Go. How do we develop an LLM learning algorithm that can simultaneously teach the model Go, coding, and math? AlphaZero hard-coded the Monte Carlo tree search algorithm. To make the algorithm more general, the model should decide, through reasoning, which nodes to explore further and what data to learn from. We know it's possible to achieve superhuman-level strength in Go. Now we just have to do it one more time, with a more general algorithm that also works for coding and math. The impact of developing such an algorithm will extend far beyond just Go.
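To make the idea concrete, here is a toy sketch of a depth-limited game search in which node selection is delegated to a pluggable `choose` function. In AlphaZero, selection is hard-coded MCTS; the point of the sketch is that `choose` could instead be a reasoning model deciding what to explore. All names are mine, and the example game is one-heap Nim rather than Go:

```python
def guided_search(state, moves, apply, is_terminal, value, choose, depth):
    """Negamax search where `choose` picks which child moves to expand."""
    if depth == 0 or is_terminal(state):
        return value(state)
    candidates = choose(state, moves(state))  # model-guided pruning goes here
    # Negamax: the best move maximizes the negation of the opponent's value.
    return max(-guided_search(apply(state, m), moves, apply, is_terminal,
                              value, choose, depth - 1)
               for m in candidates)

# Toy game: one-heap Nim. A move removes 1-3 stones; whoever takes the
# last stone wins, so reaching an empty heap means the player to move lost.
moves = lambda n: [m for m in (1, 2, 3) if m <= n]
apply_move = lambda n, m: n - m
is_terminal = lambda n: n == 0
value = lambda n: -1 if n == 0 else 0  # -1: the player to move has lost
choose = lambda state, ms: ms  # no pruning in this sketch

# A heap of 4 is a loss for the player to move under optimal play.
print(guided_search(4, moves, apply_move, is_terminal, value, choose, depth=4))
```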

Web browsing

Web browsing is my main use case for LLMs. If one Google search request can give me the answer, I will use Google Search. Otherwise, I will use an LLM. Many of my queries require synthesizing information across 10+ articles, so an LLM can save me a lot of time.

The problem with web browsing is that the only popular benchmark, BrowseComp, is getting saturated by Claude Opus 4.6, and that most other web browsing benchmarks have no up-to-date leaderboards. Worse, I'm not sure whether all APIs support web search. This is one case where the app experience might be more advanced than the API experience.