Finding Unsaturated Evals

I compiled a list of unsaturated evals, most of which have up-to-date public leaderboards, then created some hard prompts of my own for independent testing. Finally, I discuss the need for more video game and web browsing evals.

Unsaturated evals

| Category | Benchmark | Score | Best model |
|---|---|---|---|
| Coding | SWE-Bench Pro (Public) | 58% | GPT-5.4 |
| Coding | Vibe Code Bench | 67% | GPT-5.4 |
| Coding | SlopCodeBench | 17% | Claude Opus 4.6 |
| Coding | Codebase QnA | 41% | GPT-5.4 |
| Puzzles | SimpleBench | 80% | Gemini 3.1 Pro |
| Chess | Chess | 1834 Elo | Gemini 3.0 Pro |
| Mathematics | FrontierMath | 50% | GPT-5.4 Pro |
| General | Arena.ai Leaderboard | 1.5k Elo | Claude Opus 4.6 |
| Vision | ZeroBench | 23% | GPT-5.4 |
| Knowledge | Humanity's Last Exam | 53% | Gemini 3 Deep Think |
| Agentic | Remote Labor Index | 4.2% | Claude Opus 4.6 |
| Agentic | Vending-Bench 2 | 8.0k / 63k | Claude Opus 4.6 |

Recently saturated evals

| Category | Benchmark | Score | Best model |
|---|---|---|---|
| Mathematics | PutnamBench | 668 / 672 | Aleph Prover (Logical Intelligence) |
| Web Browsing | BrowseComp | 84% | Claude Opus 4.6 |
| Puzzles | ARC-AGI 2 | 85% | Gemini 3 Deep Think |
| Long context | MRCR (1M, 8 needles) | 76% | Claude Opus 4.6 |
| Coding | Terminal-Bench 2.0 | 82% | GPT-5.4 |
| Agentic | GDPval | 83% | GPT-5.4 |

Independent testing

| Metric | Claude Opus 4.6 Extended Thinking | Grok 4.1 | Gemini 3.1 Pro | GPT-5.4 Extended Thinking | Meta Muse |
|---|---|---|---|---|---|
| find AI benchmarks | 6/10 | 8/10 | 8/10 | 7/10 | 8/10 |
| find benchmark numbers | 10/12 | 9/12 | 8.5/12 | 9/12 | 8/12 |
| Health | 7.5/9 | 4/9 | 4/9 | 6.5/9 | 6/9 |
| Education | 1/1 | 1/1 | 1/1 | 0/1 | 1/1 |
| Make LLM Eval | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
| find uoft cs phds | 86/200 | 58/200 | 18/200 | 50/200 | 23/200 |
| find frontier ai labs | 31/100 | 31/100 | 19/100 | 50/100 | 28/100 |
| Tax 1 | 0.5/1 | 0/1 | 1/1 | 0.5/1 | 0.5/1 |
| Immigration 1 | 1/3 | 0/3 | 0/3 | 0/3 | 3/3 |
| Immigration 2 | 1/3 | 0/3 | 3/3 | 1/3 | 3/3 |
| find companies with hard interviews | 0/1 | 1/1 | 0/1 | 1/1 | 0/1 |
| optimizer | 0.5/1 | 0/1 | 1/1 | 1/1 | |
| affiliation | 0.6/1 | 0/1 | 1/1 | 0.6/1 | |
| Avg | 47.0% | 43.8% | 40.3% | 52.0% | 58.7% |

Whenever a frontier model fails one of my prompts, I add that prompt to my independent testing suite. The following prompt is the only one that has withstood every frontier LLM, without exception. I call it "Make LLM Eval".

The prompt essentially asks the LLM to create an eval that it cannot solve itself. Prompts like this could potentially be used in an RL training algorithm where learning alternates between an attacker and a defender, opening the door to continuous self-improvement.
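As a caricature of that idea, the loop below alternates an attacker that proposes candidate evals with a defender that tries to solve them; the prompts the defender fails become the attacker's reward and the defender's training data. The `attacker` and `defender` functions here are stubs standing in for real model calls, and the scoring rule is my own assumption, not a published algorithm:

```python
def attacker(seed):
    # Stub: a real attacker model would generate a candidate eval prompt.
    return f"eval-prompt-{seed}"

def defender(prompt):
    # Stub: a real defender model would attempt to solve the prompt.
    # Here the stub defender only solves even-numbered prompts.
    return int(prompt.split("-")[-1]) % 2 == 0

def self_play_round(num_candidates=10):
    """One iteration: the attacker proposes evals; prompts the defender
    fails become reward for the attacker and training data for the defender."""
    hard_prompts = []
    for seed in range(num_candidates):
        prompt = attacker(seed)
        if not defender(prompt):
            hard_prompts.append(prompt)
    return hard_prompts

print(self_play_round())  # the odd-seeded prompts survive the stub defender
```

In a real system, both stubs would be model calls, and each round would update the defender on the surviving hard prompts.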

Recently, I tried Codex 5.4 with almost an hour of thinking time, and it still could not solve this question reliably.

from openai import OpenAI

def call_model(input):
    client = OpenAI()
    response = client.responses.create(
        model="gpt-5",
        tools=[
            {"type": "code_interpreter", "container": {"type": "auto"}}
        ],
        input=input,
    )
    return response.output_text

Write the complete Python code for a new LLM eval, where model calls use the above code. The eval has to satisfy the following constraints:
1. gpt-5 with code_interpreter fails the eval 
2. Many human experts can succeed on the eval, given the same inputs, time constraint, and tool (code_interpreter)
3. The human also has to comply with gpt-5's safety and content policies
4. The inputs and outputs are text only; no images or audio
5. The eval cannot exploit the fact that the AI does not have a human's identity or physical attributes.
6. The eval has to be about showing gaps between human intelligence and machine intelligence
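To check constraint 1 automatically, a minimal harness could take any model-calling function (such as `call_model` above) plus a generated eval and compute a score. The `run_eval` signature and exact-match grading below are my own assumptions, shown here with a stub model so the sketch runs offline:

```python
def run_eval(model_fn, questions, grader):
    """Run each question through the model and grade the answers.

    model_fn: callable str -> str (e.g. call_model above)
    questions: list of (prompt, reference_answer) pairs
    grader: callable (model_answer, reference_answer) -> bool
    """
    correct = sum(grader(model_fn(q), ref) for q, ref in questions)
    return correct / len(questions)

# Offline check with a stub model that always answers "42":
stub_model = lambda prompt: "42"
questions = [("What is 6*7?", "42"), ("What is 6*9?", "54")]
exact_match = lambda ans, ref: ans.strip() == ref
score = run_eval(stub_model, questions, exact_match)
print(score)  # 0.5
```

Constraint 1 would then amount to asserting that the real model's score is (near) zero, while human experts score well on the same questions.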

Q6

Not a prompt for the LLM.
Check whether the conversation persists after a website refresh,
and whether the website allows many sessions at once.

Sadly, Gemini and Claude impose some restrictions here, perhaps to limit per-user usage.

Video games

Improving LLM performance on video games is a great step toward achieving physical intelligence. Simulated environments can be made to mimic real environments, and an algorithm that outperforms others in a simulated environment is likely to outperform them in the real environment as well. Video games challenge LLMs on many fronts, including image understanding, long context, and reasoning.

There are many games in the browser and in mobile app stores, so how do we turn all those games into RL environments? I'm not sure and would be happy for someone to teach me. One thing I'm not a big fan of is developing games just for evaluating AIs because it is time-consuming and the additional value over existing games might be minimal.
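Whatever the ingestion pipeline looks like, the usual target shape is a Gym-style `reset`/`step` interface, so that one RL loop can drive any wrapped game. The toy guessing game below is only a stand-in for a real browser or mobile game, and the class is written without the `gymnasium` dependency to stay self-contained:

```python
class GameEnv:
    """Gym-style wrapper: any game that exposes observations, actions,
    and a reward can be driven by the same RL loop."""

    def __init__(self, target=7, max_steps=10):
        self.target = target        # hidden state of the toy game
        self.max_steps = max_steps  # episode length cap

    def reset(self):
        self.steps = 0
        # A text observation, since LLM agents consume text (or screenshots).
        return "guess a number between 0 and 9"

    def step(self, action):
        self.steps += 1
        done = action == self.target or self.steps >= self.max_steps
        reward = 1.0 if action == self.target else 0.0
        if action < self.target:
            obs = "too low"
        elif action > self.target:
            obs = "too high"
        else:
            obs = "correct"
        return obs, reward, done

env = GameEnv()
env.reset()
obs, reward, done = env.step(7)
print(obs, reward, done)  # correct 1.0 True
```

For an existing browser game, `step` would translate the agent's action into clicks or keystrokes and read the observation back from the page; the interface above stays the same.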

A particular game that I care about is Go. How do we develop an LLM learning algorithm that can simultaneously teach the model Go, coding, and math? AlphaZero hard-coded the Monte Carlo tree search algorithm. To make the algorithm more general, the model should decide, through reasoning, which nodes to explore further and what data to learn from. We know it's possible to achieve superhuman-level strength in Go. Now we just have to do it one more time, with a more general algorithm that also works for coding and math. The impact of developing such an algorithm will extend far beyond just Go.
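The generalization described above can be caricatured as best-first search in which the expansion policy is a model call rather than a hard-coded rule. In the sketch below, a stub `model_score` plays the role of the LLM deciding which nodes deserve further exploration; everything here is an illustrative assumption, not AlphaZero's actual MCTS:

```python
import heapq

def model_score(state):
    # Stub: a real system would ask an LLM (or a value network) how
    # promising this state looks. Here: prefer states with larger sums.
    return sum(state)

def llm_guided_search(start, goal_sum, moves=(1, 2, 3), budget=50):
    """Best-first search: repeatedly expand whichever node the 'model'
    rates most promising, instead of following a fixed MCTS rule."""
    frontier = [(-model_score(start), start)]  # max-heap via negated scores
    for _ in range(budget):
        if not frontier:
            break
        _, state = heapq.heappop(frontier)
        if sum(state) == goal_sum:
            return state
        for m in moves:
            child = state + (m,)
            if sum(child) <= goal_sum:
                heapq.heappush(frontier, (-model_score(child), child))
    return None

path = llm_guided_search(start=(), goal_sum=5)
print(path)  # a sequence of moves summing to goal_sum
```

Swapping the toy sum game for Go positions, and the stub scorer for a reasoning model, is exactly the open problem: the same search driver would then apply unchanged to coding and math search trees.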

Web browsing

Web browsing is my main use case for LLMs. If one Google search request can give me the answer, I will use Google Search. Otherwise, I will use an LLM. Many of my queries require synthesizing information across 10+ articles, so an LLM can save me a lot of time.

The problem with web browsing is that the only popular benchmark, BrowseComp, is getting saturated by Claude Opus 4.6, and that most other web browsing benchmarks have no up-to-date leaderboards. Worse, I'm not sure whether all APIs support web search. This is one case where the app experience might be more advanced than the API experience.