LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

Juliusz Ziomek*, William Bankes*, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic


Headline plots: Medium-level and Hard-level results

The Challenge: In LLM-WikiRace, models must navigate Wikipedia hyperlinks step by step to reach a target page from a source page. This requires strategic look-ahead planning and the ability to reason about real-world concepts.

Key Findings:

  • Superhuman on Easy: Frontier models (Gemini 3, Claude Opus 4.5, and GPT-5.2) dominate simple paths.
  • Planning Gap: Models with similar world knowledge exhibit a significant performance gap.
  • The Loop Problem: Models struggle to replan after failure, frequently becoming stuck in navigation loops.

Methodology Overview

LLM-Wikirace Methodology Illustration

How the Game Works

  • Reach the Target: The objective is to reach a target page starting from a given source page. At each step, the agent receives the list of outgoing hyperlinks from the current page and must select one link to follow, which moves it to a new page.
  • Step Limit: If the agent fails to reach the target within 30 steps, the game ends and the agent loses.
  • Partial Observability: At each step, only the outgoing links of the current page are revealed to the agent. The agent must infer the rest of the graph structure from its own internal world knowledge and make each decision under uncertainty.
  • Varying Difficulties: The benchmark features three difficulty levels, Easy, Medium, and Hard, defined by the shortest-path distance between the source and target pages. While models perform well at the Easy difficulty, the Hard difficulty reveals flaws in their long-horizon planning.
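The rules above amount to a simple control loop. The sketch below is illustrative rather than the benchmark's actual harness: `get_links` and `choose_link` are hypothetical stand-ins for the Wikipedia link extractor and the LLM's link-selection policy.

```python
def wikirace_episode(get_links, source, target, choose_link, max_steps=30):
    """Play one illustrative WikiRace-style episode.

    get_links(page)   -> list of outgoing hyperlink titles; only the current
                         page's links are revealed (partial observability).
    choose_link(current, target, links, path) -> the link the agent follows.
    Returns (success, path) where path is the sequence of visited pages.
    """
    current, path = source, [source]
    for _ in range(max_steps):
        if current == target:
            return True, path
        links = get_links(current)
        if not links:                 # dead end: no outgoing links
            return False, path
        current = choose_link(current, target, links, path)
        path.append(current)
    return current == target, path    # lose if the step budget runs out
```

A toy agent that always picks the first link can be plugged in as `choose_link`; the real benchmark instead prompts an LLM with the current page, target, and link list at each step.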

Key Insights

Planning Gap

While world knowledge correlates with success, it alone does not explain performance. We observe a transition from a knowledge-limited to a planning-limited regime, where models with comparable knowledge diverge due to failures in replanning and adaptive control.

World Knowledge vs Success Rate — planning capability is the key differentiator

How do models plan and adapt?

While models adopt sensible high-level strategies, we find that those that enter navigation loops more frequently perform significantly worse. This highlights a core weakness: once an agent takes a wrong turn, it struggles to replan and break out of repetitive cycles.

Loop Frequency vs Success Rate — models that loop more perform worse
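One simple way to quantify looping, used here purely as an illustration (the paper's exact metric may differ), is the fraction of moves that revisit a page already on the agent's path:

```python
def loop_rate(path):
    """Fraction of moves that revisit an already-seen page.

    path is the sequence of visited pages, e.g. ["A", "B", "A", ...].
    Returns 0.0 for paths with no moves; higher values indicate more looping.
    """
    if len(path) <= 1:
        return 0.0
    seen, revisits = set(), 0
    for page in path:
        if page in seen:
            revisits += 1
        seen.add(page)
    return revisits / (len(path) - 1)
```

Under this proxy, an agent that ping-pongs between two pages scores close to 1, while a direct path scores 0; the finding above is that success rate falls as this quantity rises.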

Leaderboard

Model | Easy success | Medium success | Hard success | Tokens/step
GPT-5 Nano | 71.5% | 24.7% | 4.0% | 2170
GPT-5 Mini | 85.5% | 46.0% | 11.0% | 874
GPT-5 | 92.5% | 60.0% | 15.0% | 1826
Gemini 2.0 Flash | 88.0% | 41.3% | 6.0% | 501
Gemini 2.5 Flash | 91.0% | 53.0% | 10.2% | 547
Gemini 2.5 | 91.0% | 56.7% | 15.2% | 527
Gemini 3 | 95.0% | 66.0% | 23.0% | 1848
Claude Sonnet 4.5 | 88.5% | 43.3% | 10.1% | 1242
Claude Opus 4.5 | 91.5% | 56.0% | 18.0% | 1906
Grok 4.1-Fast | 90.0% | 44.7% | 5.0% | 4458
DeepSeek R1 | 91.0% | 54.7% | 17.0% | 2598
Kimi K2 | 87.5% | 45.3% | 7.0% | 8105
LLaMA 3 1B | 16.5% | 0.0% | 0.0% | 511
LLaMA 3 3B | 47.0% | 3.3% | 0.0% | 783
LLaMA 3 8B | 64.5% | 9.3% | 0.0% | 641
LLaMA 3 70B | 84.5% | 39.3% | 7.0% | 651
Gemma 3 4B | 48.0% | 2.7% | 0.0% | 555
Gemma 3 12B | 72.5% | 22.7% | 1.0% | 651
Gemma 3 27B | 80.0% | 30.0% | 0.0% | 684
Apertus 8B | 42.0% | 4.0% | 0.0% | 805
Apertus 70B | 65.0% | 10.7% | 0.0% | 1832
Mistral 7B | 59.0% | 10.0% | 1.0% | 664
Ministral 8B | 65.5% | 8.7% | 0.0% | 624
Qwen 2.5-7B | 22.5% | 1.3% | 0.0% | 2597
Dream-v0-Inst. 7B | 53.0% | 3.3% | 1.0% | 1549
LLaDA-Inst. 8B | 40.5% | 4.7% | 0.0% | 1669

Citation

@article{ziomek2026llmwikirace,
      title={LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs},
      author={Juliusz Ziomek and William Bankes and Lorenz Wolf and Shyam Sundhar Ramesh and Xiaohang Tang and Ilija Bogunovic},
      year={2026},
      eprint={2602.16902},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2602.16902},
}