LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?
Juliusz Ziomek*, William Bankes*, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic
The Challenge: In LLM-Wikirace, models must navigate Wikipedia hyperlinks step-by-step to reach a target page from a source page. This requires strategic look-ahead planning and the ability to reason about real-world concepts.
Key Findings:
- Superhuman on Easy: Frontier models (Gemini 3, Claude Opus 4.5, and GPT-5.2) dominate simple paths.
- Planning Gap: Models with similar world knowledge exhibit a significant performance gap.
- The Loop Problem: Models struggle to replan after failure, frequently becoming stuck in navigation loops.
Methodology Overview
How the Game Works
- Reach the Target: The objective is to reach a target page, starting from a given source page. At each step, the agent receives the list of outgoing hyperlinks from the current page and must select one link to follow, which moves it to a new page.
- Step Limit: If the agent fails to reach the target within 30 steps, the game ends and the agent loses.
- Partial Observability: At each step, only the outgoing links of the current page are revealed to the agent. The agent must infer the rest of the graph structure from its own internal world knowledge and make each decision under uncertainty.
- Varying Difficulties: The benchmark features three difficulties: Easy, Medium, and Hard, defined by the shortest-path distance between the source and target pages. Whilst models perform well at the Easy difficulty, the Hard difficulty reveals flaws in their long-horizon planning.
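The rules above can be sketched as a simple episode loop. This is a minimal illustration, not the benchmark's actual harness: `get_links(page)` (returning a page's outgoing hyperlinks) and `agent.choose(...)` (the LLM's link selection) are hypothetical stand-ins.

```python
MAX_STEPS = 30  # step limit from the benchmark rules

def play_episode(agent, source, target, get_links):
    """Run one game; return (success, visited path of page titles)."""
    page, path = source, [source]
    for _ in range(MAX_STEPS):
        if page == target:
            return True, path
        # Partial observability: only the current page's outgoing links are shown.
        links = get_links(page)
        # The agent must pick exactly one link to follow.
        page = agent.choose(target, page, links, path)
        path.append(page)
    return page == target, path
```

On a real Wikipedia graph, `get_links` would query the live hyperlink structure; here any directed graph works, which also makes it easy to replay trajectories for analysis.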
Key Insights
Planning Gap
While world knowledge correlates with success, it alone does not explain performance. We observe a transition from a knowledge-limited to a planning-limited regime, where models with comparable knowledge diverge due to failures in replanning and adaptive control.
How do models plan and adapt?
While models adopt sensible high-level strategies, we find that those that enter navigation loops more frequently perform significantly worse. This highlights a core weakness: once an agent takes a wrong turn, it struggles to replan and break out of repetitive cycles.
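One simple way to surface this failure mode, assuming an episode is recorded as the list of visited page titles, is to count how often the agent revisits a page. This is an illustrative diagnostic, not the paper's own loop metric.

```python
from collections import Counter

def loop_stats(path):
    """Return (revisit_count, looping) for a trajectory of visited pages.

    revisit_count is the number of steps that landed on an already-seen
    page; any revisit at all marks the episode as looping.
    """
    counts = Counter(path)
    revisits = sum(c - 1 for c in counts.values())
    return revisits, revisits > 0
```

For example, the trajectory A → B → A → B → C revisits two pages and would be flagged, whereas A → B → C would not.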
Leaderboard
| Model | Easy Success | Medium Success | Hard Success | Tokens / Step |
|---|---|---|---|---|
| GPT-5 Nano | 71.5% | 24.7% | 4.0% | 2170 |
| GPT-5 Mini | 85.5% | 46.0% | 11.0% | 874 |
| GPT-5 | 92.5% | 60.0% | 15.0% | 1826 |
| Gemini 2.0 Flash | 88.0% | 41.3% | 6.0% | 501 |
| Gemini 2.5 Flash | 91.0% | 53.0% | 10.2% | 547 |
| Gemini 2.5 | 91.0% | 56.7% | 15.2% | 527 |
| Gemini 3 | 95.0% | 66.0% | 23.0% | 1848 |
| Claude Sonnet 4.5 | 88.5% | 43.3% | 10.1% | 1242 |
| Claude Opus 4.5 | 91.5% | 56.0% | 18.0% | 1906 |
| Grok 4.1-Fast | 90.0% | 44.7% | 5.0% | 4458 |
| DeepSeek R1 | 91.0% | 54.7% | 17.0% | 2598 |
| Kimi K2 | 87.5% | 45.3% | 7.0% | 8105 |
| LLaMA 3 1B | 16.5% | 0.0% | 0.0% | 511 |
| LLaMA 3 3B | 47.0% | 3.3% | 0.0% | 783 |
| LLaMA 3 8B | 64.5% | 9.3% | 0.0% | 641 |
| LLaMA 3 70B | 84.5% | 39.3% | 7.0% | 651 |
| Gemma 3 4B | 48.0% | 2.7% | 0.0% | 555 |
| Gemma 3 12B | 72.5% | 22.7% | 1.0% | 651 |
| Gemma 3 27B | 80.0% | 30.0% | 0.0% | 684 |
| Apertus 8B | 42.0% | 4.0% | 0.0% | 805 |
| Apertus 70B | 65.0% | 10.7% | 0.0% | 1832 |
| Mistral 7B | 59.0% | 10.0% | 1.0% | 664 |
| Ministral 8B | 65.5% | 8.7% | 0.0% | 624 |
| Qwen 2.5-7B | 22.5% | 1.3% | 0.0% | 2597 |
| Dream-v0-Inst. 7B | 53.0% | 3.3% | 1.0% | 1549 |
| LLaDA-Inst. 8B | 40.5% | 4.7% | 0.0% | 1669 |
Citation
@article{ziomek2026llmwikirace,
title={LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs},
author={Juliusz Ziomek and William Bankes and Lorenz Wolf and Shyam Sundhar Ramesh and Xiaohang Tang and Ilija Bogunovic},
year={2026},
eprint={2602.16902},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2602.16902},
}