LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?
Juliusz Ziomek*, William Bankes*, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic
The Challenge: In LLM-Wikirace, models must navigate Wikipedia hyperlinks step-by-step to reach a target page from a source page. This requires strategic look-ahead planning and the ability to reason about real-world concepts.
Key Findings:
- Superhuman on Easy: Frontier models (Gemini 3, Claude Opus 4.5, and GPT-5.2) dominate simple paths.
- Planning Gap: Models with similar world knowledge exhibit a significant performance gap.
- The Loop Problem: Models struggle to replan after failure, frequently becoming stuck in navigation loops.
Methodology Overview
How the Game Works
- Reach the Target: The objective is to reach a target page, starting from a given source page. At each step, the agent receives the list of outgoing hyperlinks from the current page and must select one link to follow, which moves it to a new page.
- Step Limit: If the agent fails to reach the target within 30 steps, the game ends and the agent loses.
- Partial Observability: At each step, only the outgoing links of the current page are revealed to the agent. The agent must infer the rest of the graph structure from its own internal world knowledge and make each decision under uncertainty.
- Varying Difficulties: The benchmark features three difficulties: Easy, Medium, and Hard, defined by the shortest-path distance between the source and target pages. Whilst models perform well at the Easy difficulty, the Hard difficulty reveals flaws in their long-horizon planning.
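The rules above can be sketched as a simple episode loop. This is a minimal illustration, not the benchmark's actual harness: `get_links(page)` (returning a page's outgoing hyperlinks) and `agent.choose(...)` (the LLM's link selection) are hypothetical stand-ins.

```python
MAX_STEPS = 30  # step limit from the benchmark rules

def play_episode(agent, source, target, get_links):
    """Run one game; return (success, visited path of page titles)."""
    page, path = source, [source]
    for _ in range(MAX_STEPS):
        if page == target:
            return True, path
        # Partial observability: only the current page's outgoing links are shown.
        links = get_links(page)
        # The agent must pick exactly one link to follow.
        page = agent.choose(target, page, links, path)
        path.append(page)
    return page == target, path
```

On a real Wikipedia graph, `get_links` would query the live hyperlink structure; here any directed graph works, which also makes it easy to replay trajectories for analysis.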
Key Insights
Planning Gap
While world knowledge correlates with success, it alone does not explain performance. We observe a transition from a knowledge-limited to a planning-limited regime, where models with comparable knowledge diverge due to failures in replanning and adaptive control.
How do models plan and adapt?
While models adopt sensible high-level strategies, we find that those that enter navigation loops more frequently perform significantly worse. This highlights a core weakness: once an agent takes a wrong turn, it struggles to replan and break out of repetitive cycles.
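One simple way to surface this failure mode, assuming an episode is recorded as the list of visited page titles, is to count how often the agent revisits a page. This is an illustrative diagnostic, not the paper's own loop metric.

```python
from collections import Counter

def loop_stats(path):
    """Return (revisit_count, looping) for a trajectory of visited pages.

    revisit_count is the number of steps that landed on an already-seen
    page; any revisit at all marks the episode as looping.
    """
    counts = Counter(path)
    revisits = sum(c - 1 for c in counts.values())
    return revisits, revisits > 0
```

For example, the trajectory A → B → A → B → C revisits two pages and would be flagged, whereas A → B → C would not.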
Leaderboard
| Model | Easy Success | Medium Success | Hard Success | Tokens / Step |
|---|---|---|---|---|
| GPT-5 Nano | 71.5% | 24.7% | 4.0% | 2170 |
| GPT-5 Mini | 85.5% | 46.0% | 11.0% | 874 |
| GPT-5 | 92.5% | 60.0% | 15.0% | 1826 |
| Gemini 2.0 Flash | 88.0% | 41.3% | 6.0% | 501 |
| Gemini 2.5 Flash | 91.0% | 53.0% | 10.2% | 547 |
| Gemini 2.5 | 91.0% | 56.7% | 15.2% | 527 |
| Gemini 3 | 95.0% | 66.0% | 23.0% | 1848 |
| Claude Sonnet 4.5 | 88.5% | 43.3% | 10.1% | 1242 |
| Claude Opus 4.5 | 91.5% | 56.0% | 18.0% | 1906 |
| Grok 4.1-Fast | 90.0% | 44.7% | 5.0% | 4458 |
| DeepSeek R1 | 91.0% | 54.7% | 17.0% | 2598 |
| Kimi K2 | 87.5% | 45.3% | 7.0% | 8105 |
| LLaMA 3 1B | 16.5% | 0.0% | 0.0% | 511 |
| LLaMA 3 3B | 47.0% | 3.3% | 0.0% | 783 |
| LLaMA 3 8B | 64.5% | 9.3% | 0.0% | 641 |
| LLaMA 3 70B | 84.5% | 39.3% | 7.0% | 651 |
| Gemma 3 4B | 48.0% | 2.7% | 0.0% | 555 |
| Gemma 3 12B | 72.5% | 22.7% | 1.0% | 651 |
| Gemma 3 27B | 80.0% | 30.0% | 0.0% | 684 |
| Apertus 8B | 42.0% | 4.0% | 0.0% | 805 |
| Apertus 70B | 65.0% | 10.7% | 0.0% | 1832 |
| Mistral 7B | 59.0% | 10.0% | 1.0% | 664 |
| Ministral 8B | 65.5% | 8.7% | 0.0% | 624 |
| Qwen 2.5-7B | 22.5% | 1.3% | 0.0% | 2597 |
| Dream-v0-Inst. 7B | 53.0% | 3.3% | 1.0% | 1549 |
| LLaDA-Inst. 8B | 40.5% | 4.7% | 0.0% | 1669 |
Citation
@article{ziomek2026llmwikirace,
title={LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs},
author={Juliusz Ziomek and William Bankes and Lorenz Wolf and Shyam Sundhar Ramesh and Xiaohang Tang and Ilija Bogunovic},
year={2026},
eprint={2602.16902},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2602.16902},
}