AI crawler readiness · Reference

llms.txt vs robots.txt — two files, two jobs.

A surprising number of teams still ship one of the two files and assume the other is redundant. They are not interchangeable. robots.txt is a negative gate that controls whether crawlers fetch your URLs. llms.txt is a positive index that tells AI engines which Markdown-rendered content represents your site. AI-ready sites ship both, and they ship them with different intent.

Companion to AI crawler readiness. Updated 2026-05-07.

At a glance

| | robots.txt | llms.txt |
| --- | --- | --- |
| Origin | Robots Exclusion Protocol, IETF RFC 9309 (2022); de facto since 1994. | llms.txt proposal by Jeremy Howard (Answer.AI), 2024. Community draft; not standardised. |
| Polarity | Negative. Default behaviour for compliant bots is "allow"; the file removes URLs from that allow set. | Positive. The file names the URLs and Markdown alternates an engine should read. |
| Audience | Any crawler that respects RFC 9309: search bots, AI bots, archive bots, marketing scrapers. | AI engines that look for it; currently a partial set, expanding. Not a search-engine signal. |
| Path | /robots.txt at the host root. | /llms.txt at the host root. |
| Format | Plain-text directives keyed by User-agent:; supports Allow:, Disallow:, Sitemap:, comments. | Structured Markdown: H1 site name, blockquote summary, sectioned link lists. |
| Enforcement | Honour system. Compliant bots respect it; malicious bots ignore it. | Honour system. There is no enforcement; engines fetch it if they look for it. |
| If absent | Default-allow. Bots crawl whatever they can find. | Engines fall back to rendered HTML and have to parse layout, navigation, and ads to extract content. |
The simple rule. If you want a URL kept out of a crawler's reach, that is a robots.txt job. If you want an AI engine to find your preferred content quickly, that is an llms.txt job. Neither file does the other's work.

What robots.txt actually does

The Robots Exclusion Protocol was formalised in RFC 9309. The semantics are narrow: a compliant crawler reads the file before fetching anything else on the host, matches its User-agent against the listed groups, and treats any URL covered by Disallow: as off-limits. Allow: lifts a more specific path out of a broader Disallow:.

A minimal AI-aware robots.txt looks like this:

User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /private/

User-agent: anthropic-ai
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
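
The Allow: override described above works by path specificity: the longer, more specific rule wins. A sketch of the pattern, with hypothetical /docs/ paths:

User-agent: GPTBot
Disallow: /docs/
Allow: /docs/public/

Under RFC 9309 matching, GPTBot may fetch anything under /docs/public/ while the rest of /docs/ stays off-limits, because the more specific Allow: outranks the broader Disallow:.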

Three things that robots.txt cannot do — and that teams routinely ask of it:

  1. It cannot remove content that has already been indexed or cited. Disallowing a URL stops future fetches; it does not retract what an engine has already stored, and a blocked URL can still be indexed from external links.
  2. It cannot keep content private. The file itself is public, it advertises the very paths it disallows, and non-compliant bots ignore it entirely.
  3. It cannot tell an engine which content to prefer or how to read it. That is the positive, editorial job that llms.txt exists to do.

What llms.txt actually does

llms.txt was proposed in 2024 by Jeremy Howard (Answer.AI) as an AI-specific analogue to sitemap.xml with two important differences. First, it is Markdown rather than XML, which matches how LLMs prefer to ingest content. Second, it is editorial — it lists the content the site wants the AI to read, not every URL on the host.

The shape is a Markdown document with three load-bearing sections:

# Eastbound

> Eastbound is a Hong Kong-based China AI visibility consultancy.

## Links

- [Homepage](https://www.eastbound.ai/)
- [China AI visibility (pillar)](https://www.eastbound.ai/china-ai-visibility/)
- [AI visibility audit](https://www.eastbound.ai/ai-visibility-audit/)

## Markdown alternates

- [Homepage (Markdown)](https://www.eastbound.ai/index.md)
- [China AI visibility (Markdown)](https://www.eastbound.ai/china-ai-visibility/index.md)
- [AI visibility audit (Markdown)](https://www.eastbound.ai/ai-visibility-audit/index.md)

The format is simple on purpose. It is human-readable, AI-readable without parsing infrastructure, and lets the site editorially curate which content matters. Engines that look for /llms.txt can then prefer the listed Markdown alternates over rendered HTML — which means cleaner extraction with no navigation noise, no boilerplate, no advertising clutter.
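
To see why the flat format is easy to consume, here is a minimal Python sketch that fetches an llms.txt and pulls out the listed Markdown alternate URLs. It is an illustration under assumed conventions (a hypothetical host, and .md suffixes marking alternates), not how any particular engine is implemented:

import re
from urllib.request import urlopen

# Fetch the editorial index from the host root (hypothetical host).
text = urlopen("https://www.example.com/llms.txt").read().decode("utf-8")

# Collect every Markdown link target, then keep the .md alternates.
links = re.findall(r"\[[^\]]*\]\((\S+?)\)", text)
markdown_alternates = [url for url in links if url.endswith(".md")]

for url in markdown_alternates:
    print(url)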

What llms.txt cannot do:

  1. It cannot block or permit crawling. It has no enforcement layer; a crawler blocked in robots.txt stays blocked regardless of what llms.txt lists.
  2. It cannot compel an engine to fetch it. The format is a community draft, not a standard, and only a partial (if expanding) set of AI engines currently looks for it.
  3. It is not a search-engine signal. It does not replace sitemap.xml, structured data, or conventional indexing.

Where the two files do not overlap (and where teams confuse them)

A common confusion: a team blocks GPTBot in robots.txt to opt out of training, then ships an llms.txt assuming it overrides the block for citation purposes. It does not. llms.txt has no enforcement layer; it is a hint. robots.txt is the gate. If GPTBot is disallowed, OpenAI's training crawler will not fetch anything, regardless of what llms.txt says. And because OpenAI uses separate user agents for training (GPTBot) and search (OAI-SearchBot), the only way to opt out of training while still appearing in ChatGPT search results is to allow OAI-SearchBot in robots.txt and disallow GPTBot.
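
Expressed as directives, that opted-out-of-training-but-still-citable position is a sketch like this:

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

The citation crawler is allowed everywhere; the training crawler is shut out entirely.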

Another confusion: assuming the robots.txt Sitemap: directive and llms.txt are equivalent because both list URLs. They are not. The Sitemap: directive points at sitemap.xml, an exhaustive index designed for search-engine crawl scheduling. llms.txt is editorial: the curated subset of content you want AI engines to prefer, with Markdown alternates listed alongside. The two coexist; they don't replace each other.

How to think about precedence. When both files apply, robots.txt always wins. A URL listed in llms.txt but disallowed in robots.txt will not be fetched. The reverse is fine: a URL that robots.txt allows but llms.txt does not list can still be crawled; llms.txt just doesn't recommend it as a preferred entry point.
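
One way to sanity-check that precedence before publishing is to run every URL you intend to list in llms.txt through a robots.txt parser. A minimal Python sketch using the standard library's urllib.robotparser, with a hypothetical host and URLs:

from urllib.robotparser import RobotFileParser

# Load the live robots.txt for the canonical host (hypothetical host).
rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()

# URLs you intend to list in llms.txt (hypothetical).
candidates = [
    "https://www.example.com/",
    "https://www.example.com/private/report.md",
]

for url in candidates:
    # robots.txt wins: a URL the bot may not fetch gains nothing from llms.txt.
    if not rp.can_fetch("GPTBot", url):
        print("blocked for GPTBot, listing it in llms.txt has no effect:", url)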

The minimum AI-ready configuration

For a site that wants to be cited by AI engines without being trained on, the typical baseline is:

  1. /robots.txt — allow general crawling at User-agent: *; allow named search-and-citation bots (Googlebot, Bingbot, OAI-SearchBot, PerplexityBot, ClaudeBot); allow Google-Extended if you are comfortable with your content being used for Gemini training and grounding; disallow named training bots (GPTBot, anthropic-ai, CCBot) if your editorial position is opt-out-of-training; include a Sitemap: directive.
  2. /sitemap.xml — exhaustive, machine-readable URL list with <lastmod> and <priority> (a minimal entry is sketched after this list). Submit this URL to Google Search Console and Bing Webmaster Tools.
  3. /llms.txt — editorial index of the pages you want AI engines to prefer, with a parallel "Markdown alternates" section listing the .md equivalents.
  4. Per-page Markdown alternates — every page in llms.txt ships a corresponding .md file at the canonical path (e.g. /foo/index.html alongside /foo/index.md) and references it in HTML head with <link rel="alternate" type="text/markdown" href="/foo/index.md" />.
  5. /llms-full.txt (optional) — a single-file long-form Markdown bundle of the site's most valuable content, useful for engines that prefer one fetch over many. Eastbound ships one alongside /llms.txt.
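
For item 2, a minimal sitemap.xml entry carrying the fields mentioned above looks like this; the URL and date are placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/china-ai-visibility/</loc>
    <lastmod>2026-05-07</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>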

Each of these is a baseline-level intervention — measured improvement in crawler fetch behaviour is observable, not speculative. None of them are conversion levers in isolation: shipping these signals is necessary infrastructure, not a result. See AI crawler readiness for the full configuration guide and AI crawler blocking mistakes for the common failure modes.

Format gotchas worth knowing

Case sensitivity

File paths are case-sensitive on most web servers. The file must be robots.txt at the root, not Robots.txt, not robot.txt. The same applies to llms.txt. Bots fetch the exact path; a typo means the file is silently absent.

Path placement

Both files belong at the host root, not in subdirectories. /blog/llms.txt is invisible to engines; only /llms.txt is checked. Per-host means per-subdomain too: blog.example.com needs its own /llms.txt and /robots.txt independent of www.example.com.

Content-Type

Serve robots.txt with Content-Type: text/plain. Serve llms.txt as text/plain or text/markdown; engines accept either. Some bots refuse files served as application/octet-stream. If you deploy from S3 or another object store, set the content type explicitly on upload.
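
A quick way to verify what is actually served is to inspect the response headers. A minimal Python sketch, with a hypothetical host:

from urllib.request import Request, urlopen

# Print the Content-Type each file is served with (hypothetical host).
for path in ("/robots.txt", "/llms.txt"):
    req = Request("https://www.example.com" + path, method="HEAD")
    with urlopen(req) as resp:
        print(path, resp.headers.get("Content-Type"))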

Canonicalisation

If your site lives at www.example.com with apex example.com 301-redirecting, place robots.txt and llms.txt on the canonical host (www.example.com). Bots that follow the redirect when fetching example.com/robots.txt generally respect the redirected file, but some implementations treat a non-200 response at the apex as a missing file. The safe pattern is to serve the file from both apex and www.

Use this with the rest of the readiness layer

Configuring robots.txt and llms.txt is necessary but not sufficient. The full AI crawler readiness checklist covers structured data, Markdown alternates per page, sitemap discipline, and IndexNow setup for Bing/Copilot. Read the parent reference, then check your site against common blocking mistakes. Once readiness is clean, the next layer of work is on-site content — see the China AI visibility pillar.

Audit my AI crawler setup →