Finding Unsaturated Evals

I compiled a list of unsaturated evals, most of which have up-to-date public leaderboards, then created some hard prompts of my own for independent testing. Finally, I discuss the need for more video game and web browsing evals.

Unsaturated evals

| Category | Benchmark | Score | Best model |
|---|---|---|---|
| Coding | SWE-Bench Pro (Public) | 58% | GPT-5.4 |
| Coding | Vibe Code Bench | 67% | GPT-5.4 |
| Coding | SlopCodeBench | 17% | Claude Opus 4.6 |
| Coding | Codebase QnA | 41% | GPT-5.4 |
| Puzzles | SimpleBench | 80% | Gemini 3.1 Pro |
| Chess | Chess | 1834 Elo | Gemini 3.0 Pro |
| Mathematics | FrontierMath | 50% | GPT-5.4 Pro |
| General | Arena.ai Leaderboard | 1.5k Elo | Claude Opus 4.6 |
| Vision | ZeroBench | 23% | GPT-5.4 |
| Knowledge | Humanity's Last Exam | 53% | Gemini 3 Deep Think |
| Agentic | Remote Labor Index | 4.2% | Claude Opus 4.6 |
| Agentic | Vending-Bench 2 | 8.0k / 63k | Claude Opus 4.6 |

Recently saturated evals

| Category | Benchmark | Score | Best model |
|---|---|---|---|
| Mathematics | PutnamBench | 668 / 672 | Aleph Prover (Logical Intelligence) |
| Web Browsing | BrowseComp | 84% | Claude Opus 4.6 |
| Puzzles | ARC-AGI 2 | 85% | Gemini 3 Deep Think |
| Long context | MRCR (1M, 8 needles) | 76% | Claude Opus 4.6 |
| Coding | Terminal-Bench 2.0 | 82% | GPT-5.4 |
| Agentic | GDPval | 83% | GPT-5.4 |

Independent testing

| Metric | Claude Opus 4.6 Extended Thinking | Grok 4.1 | Gemini 3.1 Pro | GPT-5.4 Extended Thinking | Meta Muse |
|---|---|---|---|---|---|
| find AI benchmarks | 6/10 | 8/10 | 8/10 | 7/10 | 8/10 |
| find benchmark numbers | 10/12 | 9/12 | 8.5/12 | 9/12 | 8/12 |
| Health | 7.5/9 | 4/9 | 4/9 | 6.5/9 | 6/9 |
| Education | 1/1 | 1/1 | 1/1 | 0/1 | 1/1 |
| Make LLM Eval | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
| find uoft cs phds | 86/200 | 58/200 | 18/200 | 50/200 | 23/200 |
| find frontier ai labs | 31/100 | 31/100 | 19/100 | 50/100 | 28/100 |
| Tax 1 | 0.5/1 | 0/1 | 1/1 | 0.5/1 | 0.5/1 |
| Immigration 1 | 1/3 | 0/3 | 0/3 | 0/3 | 3/3 |
| Immigration 2 | 1/3 | 0/3 | 3/3 | 1/3 | 3/3 |
| find companies with hard interviews | 0/1 | 1/1 | 0/1 | 1/1 | 0/1 |
| optimizer | 0.5/1 | 0/1 | 1/1 | 1/1 | |
| affiliation | 0.6/1 | 0/1 | 1/1 | 0.6/1 | |
| Avg | 47.0% | 43.8% | 40.3% | 52.0% | 58.7% |

Whenever a frontier model fails one of my prompts, I add that prompt to my independent testing suite. The following prompt is the only one that has withstood every frontier LLM, without exception. I call it "Make LLM Eval".

The prompt essentially asks the LLM to create an eval that it cannot solve itself. Prompts like this could potentially be used in an RL training algorithm where learning alternates between an attacker and a defender, opening the door to continuous self-improvement.
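As a caricature of that idea, the loop below alternates an attacker that proposes candidate evals with a defender that tries to solve them; the prompts the defender fails become the attacker's reward and the defender's training data. The `attacker` and `defender` functions here are stubs standing in for real model calls, and the scoring rule is my own assumption, not a published algorithm:

```python
def attacker(seed):
    # Stub: a real attacker model would generate a candidate eval prompt.
    return f"eval-prompt-{seed}"

def defender(prompt):
    # Stub: a real defender model would attempt to solve the prompt.
    # Here the stub defender only solves even-numbered prompts.
    return int(prompt.split("-")[-1]) % 2 == 0

def self_play_round(num_candidates=10):
    """One iteration: the attacker proposes evals; prompts the defender
    fails become reward for the attacker and training data for the defender."""
    hard_prompts = []
    for seed in range(num_candidates):
        prompt = attacker(seed)
        if not defender(prompt):
            hard_prompts.append(prompt)
    return hard_prompts

print(self_play_round())  # the odd-seeded prompts survive the stub defender
```

In a real system, both stubs would be model calls, and each round would update the defender on the surviving hard prompts.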

Recently, I tried Codex 5.4 with almost an hour of thinking time, and it still could not solve this question reliably.

from openai import OpenAI

def call_model(input):
    client = OpenAI()
    response = client.responses.create(
        model="gpt-5",
        tools=[
            {"type": "code_interpreter", "container": {"type": "auto"}}
        ],
        input=input,
    )
    return response.output_text

Write the complete Python code for a new LLM eval, where model calls use the above code. The eval has to satisfy the following constraints:
1. gpt-5 with code_interpreter fails the eval 
2. Many human experts can succeed on the eval, given the same inputs, time constraint, and tool (code_interpreter)
3. The human also has to comply with gpt-5's safety and content policies
4. The inputs and outputs are text only; no images or audio
5. The eval cannot exploit the fact that the AI does not have a human's identity or physical attributes.
6. The eval has to be about showing gaps between human intelligence and machine intelligence
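To check constraint 1 automatically, a minimal harness could take any model-calling function (such as `call_model` above) plus a generated eval and compute a score. The `run_eval` signature and exact-match grading below are my own assumptions, shown here with a stub model so the sketch runs offline:

```python
def run_eval(model_fn, questions, grader):
    """Run each question through the model and grade the answers.

    model_fn: callable str -> str (e.g. call_model above)
    questions: list of (prompt, reference_answer) pairs
    grader: callable (model_answer, reference_answer) -> bool
    """
    correct = sum(grader(model_fn(q), ref) for q, ref in questions)
    return correct / len(questions)

# Offline check with a stub model that always answers "42":
stub_model = lambda prompt: "42"
questions = [("What is 6*7?", "42"), ("What is 6*9?", "54")]
exact_match = lambda ans, ref: ans.strip() == ref
score = run_eval(stub_model, questions, exact_match)
print(score)  # 0.5
```

Constraint 1 would then amount to asserting that the real model's score is (near) zero, while human experts score well on the same questions.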

Q6

Not a prompt for the LLM.
Check whether the conversation persists after a website refresh,
and whether the website allows many sessions at once.

Sadly, Gemini and Claude impose some restrictions here, perhaps to limit per-user usage.

Video games

Improving LLM performance on video games is a great step toward achieving physical intelligence. Simulated environments can be made to mimic real environments, and an algorithm that outperforms others in a simulated environment is likely to outperform them in the real environment as well. Video games challenge LLMs on many fronts, including image understanding, long context, and reasoning.

There are many games in the browser and in mobile app stores, so how do we turn all those games into RL environments? I'm not sure and would be happy for someone to teach me. One thing I'm not a big fan of is developing games just for evaluating AIs because it is time-consuming and the additional value over existing games might be minimal.
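Whatever the ingestion pipeline looks like, the usual target shape is a Gym-style `reset`/`step` interface, so that one RL loop can drive any wrapped game. The toy guessing game below is only a stand-in for a real browser or mobile game, and the class is written without the `gymnasium` dependency to stay self-contained:

```python
class GameEnv:
    """Gym-style wrapper: any game that exposes observations, actions,
    and a reward can be driven by the same RL loop."""

    def __init__(self, target=7, max_steps=10):
        self.target = target        # hidden state of the toy game
        self.max_steps = max_steps  # episode length cap

    def reset(self):
        self.steps = 0
        # A text observation, since LLM agents consume text (or screenshots).
        return "guess a number between 0 and 9"

    def step(self, action):
        self.steps += 1
        done = action == self.target or self.steps >= self.max_steps
        reward = 1.0 if action == self.target else 0.0
        if action < self.target:
            obs = "too low"
        elif action > self.target:
            obs = "too high"
        else:
            obs = "correct"
        return obs, reward, done

env = GameEnv()
env.reset()
obs, reward, done = env.step(7)
print(obs, reward, done)  # correct 1.0 True
```

For an existing browser game, `step` would translate the agent's action into clicks or keystrokes and read the observation back from the page; the interface above stays the same.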

A particular game that I care about is Go. How do we develop an LLM learning algorithm that can simultaneously teach the model Go, coding, and math? AlphaZero hard-coded the Monte Carlo tree search algorithm. To make the algorithm more general, the model should decide, through reasoning, which nodes to explore further and what data to learn from. We know it's possible to achieve superhuman-level strength in Go. Now we just have to do it one more time, with a more general algorithm that also works for coding and math. The impact of developing such an algorithm will extend far beyond just Go.
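The generalization described above can be caricatured as best-first search in which the expansion policy is a model call rather than a hard-coded rule. In the sketch below, a stub `model_score` plays the role of the LLM deciding which nodes deserve further exploration; everything here is an illustrative assumption, not AlphaZero's actual MCTS:

```python
import heapq

def model_score(state):
    # Stub: a real system would ask an LLM (or a value network) how
    # promising this state looks. Here: prefer states with larger sums.
    return sum(state)

def llm_guided_search(start, goal_sum, moves=(1, 2, 3), budget=50):
    """Best-first search: repeatedly expand whichever node the 'model'
    rates most promising, instead of following a fixed MCTS rule."""
    frontier = [(-model_score(start), start)]  # max-heap via negated scores
    for _ in range(budget):
        if not frontier:
            break
        _, state = heapq.heappop(frontier)
        if sum(state) == goal_sum:
            return state
        for m in moves:
            child = state + (m,)
            if sum(child) <= goal_sum:
                heapq.heappush(frontier, (-model_score(child), child))
    return None

path = llm_guided_search(start=(), goal_sum=5)
print(path)  # a sequence of moves summing to goal_sum
```

Swapping the toy sum game for Go positions, and the stub scorer for a reasoning model, is exactly the open problem: the same search driver would then apply unchanged to coding and math search trees.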

Web browsing

Web browsing is my main use case for LLMs. If one Google search request can give me the answer, I will use Google Search. Otherwise, I will use an LLM. Many of my queries require synthesizing information across 10+ articles, so an LLM can save me a lot of time.

The problem with web browsing is that the only popular benchmark, BrowseComp, is getting saturated by Claude Opus 4.6, and that most other web browsing benchmarks have no up-to-date leaderboards. Worse, I'm not sure whether all APIs support web search. This is one case where the app experience might be more advanced than the API experience.