🕷️ How the Crawler Works

Understand how Hugo discovers and analyzes your website pages through sitemap parsing, robots.txt reading, and intelligent URL discovery.

Hugo Team · February 26, 2026
crawler · sitemap · robots.txt · discovery · subpages

When you submit a URL, Hugo doesn't just analyze that single page — it discovers and scans related subpages too. This gives you a holistic view of your site's SEO health. Here's exactly how the discovery process works.

URL Discovery Strategy

URL discovery pipeline: 🤖 robots.txt (Sitemap directives) → 🗺️ sitemap.xml (standard locations) → 📂 sub-sitemaps (up to 10) → 🔗 extract URLs (up to 20 pages) → 🔄 fallback: parse &lt;a&gt; links

Hugo uses a multi-step strategy to find pages on your site:

  1. Parse your robots.txt file for Sitemap: directives.[1]
  2. Try standard sitemap locations: /sitemap.xml and /sitemap_index.xml.[2]
  3. If a sitemap index is found, parse up to 10 sub-sitemaps.
  4. Extract up to 20 subpage URLs from the sitemap entries.
  5. Fallback: If no sitemap exists, extract internal links from the homepage HTML.
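The steps above can be sketched in Python. These helper functions are illustrative, not Hugo's actual API; they show the parsing at each stage without any network I/O:

```python
import re
from html.parser import HTMLParser

def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Step 1: pull Sitemap: directives out of robots.txt."""
    return [m.group(1).strip()
            for m in re.finditer(r"(?im)^sitemap:\s*(\S+)", robots_txt)]

def default_sitemap_candidates(origin: str) -> list[str]:
    """Step 2: standard locations tried when robots.txt names none."""
    return [origin + "/sitemap.xml", origin + "/sitemap_index.xml"]

class LinkExtractor(HTMLParser):
    """Step 5 fallback: collect internal <a href> links from homepage HTML."""
    def __init__(self, origin: str):
        super().__init__()
        self.origin = origin
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            # Keep only internal links: relative paths or same-origin URLs.
            if href.startswith("/") or href.startswith(self.origin):
                self.links.append(href)
```

In the fallback case, feeding the homepage HTML into `LinkExtractor` and truncating the result to 20 entries mirrors the cap described in step 4.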

Analysis Depth

The main page (the URL you enter) receives a deep analysis with all check categories including Performance. Subpages receive a lighter analysis — they skip performance checks and premium modules, focusing on metadata, content, technical, links, structured data, and accessibility.
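One way to picture the two analysis depths is as two category sets. The category names come from the lists above; representing them as Python sets is purely an assumption for illustration:

```python
# Hypothetical category sets; names mirror those listed in the article.
MAIN_PAGE_CATEGORIES = {
    "metadata", "content", "technical", "links",
    "structured_data", "accessibility", "performance", "premium",
}
# Subpages skip performance checks and premium modules.
SUBPAGE_CATEGORIES = MAIN_PAGE_CATEGORIES - {"performance", "premium"}

def categories_for(url: str, main_url: str) -> set[str]:
    """Deep analysis for the submitted URL, lighter checks for subpages."""
    return MAIN_PAGE_CATEGORIES if url == main_url else SUBPAGE_CATEGORIES
```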

ℹ️Concurrency

Subpages are analyzed concurrently in batches of 5 for faster results. The real-time dashboard shows progress as each page completes.
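Batch-of-5 concurrency can be sketched with `asyncio.gather`; here `analyze` is a stand-in for the real per-page analysis:

```python
import asyncio

BATCH_SIZE = 5  # matches the batch size described above

async def analyze(url: str) -> dict:
    # Placeholder for the real per-page analysis work.
    await asyncio.sleep(0)
    return {"url": url, "status": "done"}

async def analyze_in_batches(urls: list[str]) -> list[dict]:
    """Run subpage analyses concurrently, five at a time."""
    results: list[dict] = []
    for i in range(0, len(urls), BATCH_SIZE):
        batch = urls[i:i + BATCH_SIZE]
        # Each batch runs concurrently; batches run one after another,
        # which is when a dashboard could report incremental progress.
        results += await asyncio.gather(*(analyze(u) for u in batch))
    return results
```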

Robots.txt Compliance

Hugo checks your robots.txt to ensure your site is accessible to crawlers. If robots.txt contains a Disallow: / directive, Hugo will flag this as a warning since it blocks all crawlers from indexing your site.[1]

Security & Safety

Hugo includes built-in SSRF (Server-Side Request Forgery) protection.[3] It blocks requests to private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), localhost, and Docker internal addresses.[4] Only public URLs on ports 80 and 443 are analyzed.
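The kind of guard described can be sketched with Python's `ipaddress` module; the function name and exact policy here are illustrative, not Hugo's implementation:

```python
import ipaddress
import socket
from urllib.parse import urlparse

ALLOWED_PORTS = {80, 443}

def is_safe_target(url: str) -> bool:
    """Reject private, loopback, and link-local targets before fetching."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    if port not in ALLOWED_PORTS:
        return False
    try:
        # Resolve first, then validate the address, so a hostname
        # can't hide a private IP behind DNS.
        addr = ipaddress.ip_address(socket.gethostbyname(parsed.hostname))
    except (socket.gaierror, ValueError, TypeError):
        return False
    # Covers 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, and 127.0.0.0/8.
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)
```

Note that a production guard would also re-check the address on redirects, since a public URL can redirect to an internal one.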

References

  1. Google Search Central — Introduction to robots.txt — developers.google.com
  2. Sitemaps.org — XML Sitemap Protocol — sitemaps.org
  3. OWASP — Server-Side Request Forgery Prevention Cheat Sheet — cheatsheetseries.owasp.org
  4. IETF RFC 1918 — Address Allocation for Private Internets — rfc-editor.org
