# AI Crawler Readiness — robots.txt, llms.txt, Markdown Alternates

> The infrastructure layer of generative engine optimization. Configure robots.txt for AI crawlers, ship llms.txt, and serve Markdown alternates so ChatGPT, Claude, Gemini, Perplexity, DeepSeek, Qwen and Doubao can actually read your pages.

Companion to [China AI visibility for global brands](https://www.eastbound.ai/china-ai-visibility/). Updated 2026-05-06.

## Why crawler readiness is upstream of every other GEO lever

Generative answer engines do not invent citations. They pull from a source pool — partly indexed, partly retrieved live, partly cached from earlier crawls. If your site is not in the pool, content quality, structured data, and length sweet spots are irrelevant. You are not in the running.

The infrastructure layer covers five things that are individually small and collectively decisive: granular `robots.txt` across crawler buckets, `llms.txt` as an AI-readable index, Markdown alternates so engines can parse pages without HTML noise, structured data (with the engine-specific caveat below), and discoverability files — sitemap, canonical, IndexNow.

**What this page is not.** This is a configuration page, not a results page. Shipping these signals is necessary; it is not sufficient. Brands with perfect crawler readiness still fail to surface if they have no third-party citation footprint. We label every recommendation here as *measured baseline* (we and our clients ship this and crawlers fetch correctly) versus *intervention hypothesis* (we expect this to help; before/after measurement required). The infrastructure-layer items below are baseline, not hypothesis.

## robots.txt — five buckets, not one

Most sites still treat `robots.txt` as a single switch — "block bots / allow bots." That collapses five very different bot categories into a coarse policy that almost always blocks something you wanted to allow, or allows something you wanted to block. The five buckets:

| Bucket | Examples | Default policy |
|---|---|---|
| Search / retrieval | Googlebot, Bingbot, OAI-SearchBot, PerplexityBot, ClaudeBot, Google-Extended for AIO | Allow |
| User-triggered | ChatGPT-User, Perplexity-User, Claude-User | Allow |
| Training | GPTBot, anthropic-ai, Google-Extended (training-only mode), Bytespider | Brand decision — opt-out is legitimate |
| Common-crawl / corpora | CCBot | Brand decision — affects future-model training pools |
| Undeclared / unknown | Bots that don't identify or send mismatched UA | Default-allow at `User-agent: *` level |

A common failure: blocking `GPTBot` for training reasons with an overbroad rule — a catch-all `Disallow` or a CDN bot-block pattern — that also catches `OAI-SearchBot`. Result: ChatGPT search can no longer fetch your pages. The brand wanted to opt out of training and accidentally opted out of citation. Always target exact user-agent names; never rely on pattern matching.

A workable default for a brand that wants AI visibility while opting out of training:

```
# Search / retrieval — allow
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: ClaudeBot
Allow: /

# User-triggered fetches — allow
User-agent: ChatGPT-User
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: Claude-User
Allow: /

# Training — opt-out (brand choice; not required for visibility)
User-agent: GPTBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /

# Default — allow
User-agent: *
Allow: /

Sitemap: https://your-brand.com/sitemap.xml
```

For Mainland China engines specifically, the named user-agents to allow include `Baiduspider`, `Bytespider` (ByteDance — feeds Doubao), `Sogou web spider`, and `360Spider`. DeepSeek and Qwen do not currently publish stable named user-agents that all sites recognise; the sane default is to leave `User-agent: *` as `Allow: /` so that undeclared crawlers can still fetch.
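Translated into `robots.txt` lines, the China-engine additions look like this (append to the default file above; token spellings follow each engine's published documentation — note the trade-off that `Bytespider` also sits in the training bucket, so allowing it for Doubao visibility is a brand decision):

```
# Mainland China engines — allow
User-agent: Baiduspider
Allow: /
User-agent: Bytespider
Allow: /
User-agent: Sogou web spider
Allow: /
User-agent: 360Spider
Allow: /
```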

## llms.txt and llms-full.txt — the AI-readable index

`llms.txt` is an emerging convention, proposed in 2024 and increasingly observed by AI tools (notably Claude Code, Cursor, Continue, several research-grade scrapers, and at least one production search-engine ingest). It is a single Markdown file at the site root that gives an AI a curated list of canonical URLs for the site, a short description, and (optionally) Markdown alternates for each URL.

The format is opinionated and minimal:

```
# Your Brand

> One-paragraph description of what your brand does and who it serves.

## Links

- [Homepage](https://your-brand.com/)
- [Pillar / category page](https://your-brand.com/category/)
- [Pricing](https://your-brand.com/pricing/)
- [About](https://your-brand.com/about/)

## Markdown alternates

- [Homepage (Markdown)](https://your-brand.com/index.md)
- [Pillar (Markdown)](https://your-brand.com/category/index.md)
```

`llms-full.txt` is a heavier sibling — same idea, but contains the actual page text concatenated rather than only links. It's appropriate for brands that want to ship a complete LLM-readable corpus (a knowledge base, a research site, a docs portal). Both files belong at the site root: `/llms.txt` and `/llms-full.txt`.

**Hedge.** `llms.txt` adoption by major commercial engines (ChatGPT, Gemini, Copilot) is not officially confirmed. Treat shipping `llms.txt` as a low-cost baseline hygiene step that supports tools we know read it (Claude Code, Cursor, several research crawlers) and may support future engine ingests. We have not measured a citation-rate lift directly attributable to `llms.txt` presence; until we run a controlled before/after, treat it as *infrastructure baseline, not measured intervention*.

## Markdown alternates — let parsers skip the noise

Modern AI clients can request a Markdown version of a page by sending `Accept: text/markdown` in the request, or by following a `<link rel="alternate" type="text/markdown" href="...">` declaration. A clean Markdown alternate is dramatically easier for an LLM to ingest than HTML — there's no nav clutter, no analytics scripts, no cookie-consent overlay, no pixel-tracker noise.
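The alternate declaration is a single line in the HTML head (URL illustrative):

```
<link rel="alternate" type="text/markdown" href="https://your-brand.com/page/index.md">
```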

Two implementation patterns:

1. **Per-page `.md` file** alongside the HTML, with a `<link rel="alternate">` declaration in the HTML head pointing to it. Lowest friction. Good for static-site or hybrid stacks. Eastbound uses this pattern across all 24+ public pages.
2. **Content negotiation** on the same URL — server inspects the `Accept` header and returns Markdown or HTML accordingly. More elegant, more infrastructure work. Common in docs platforms.
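The negotiation logic in pattern 2 reduces to a few lines. A minimal sketch in Python — the `negotiate` helper is hypothetical, and a real server would call it from its request handler before choosing which file to serve:

```python
def negotiate(accept_header: str, offered=("text/html", "text/markdown")) -> str:
    # Pick the best offered media type from an Accept header.
    # Falls back to HTML when nothing offered is requested.
    best, best_q = "text/html", 0.0
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        mtype = fields[0].strip().lower()
        q = 1.0  # per RFC 9110, a missing q-value means q=1
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if mtype in offered and q > best_q:
            best, best_q = mtype, q
    return best
```

A browser sending `text/html,application/xhtml+xml` gets HTML; an AI client sending `Accept: text/markdown` gets the Markdown body from the same URL.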

The Markdown alternate should be the actual article body — headlines, prose, tables, citations — without the page chrome. Don't include navigation menus, footer links, or cookie banners. Don't auto-generate from HTML by stripping tags; the result is usually messy. Hand-author or use a clean Markdown source as the canonical and render HTML from it.

Verifying Markdown alternates work end-to-end:

```
# does the HTML page advertise a Markdown alternate?
curl -sI https://your-brand.com/page/ | grep -iE "link|content-type"
# does the alternate itself return clean Markdown?
curl -s -H "Accept: text/markdown" https://your-brand.com/page/index.md | head -40
```

## Structured data — Bing/Copilot signal, not universal

JSON-LD schema is widely sold as a "universal AI signal." That claim doesn't hold up in our experimental sample. Across 500+ multi-engine prompts in May 2026, JSON-LD presence was a measurable positive signal for Bing's AI surfaces (Copilot in particular) but not for ChatGPT, Claude, or Perplexity to a degree we could detect at our sample size. Independent work by SearchVIU reaches a consistent conclusion.

What this means in practice: ship JSON-LD because it helps Bing/Copilot, helps with Google rich results, and is cheap to produce. Do not justify the work as a universal AI lift, and do not deprioritise off-site source-graph work in favour of more JSON-LD.

The minimum useful schema set for a brand site:

- **Organization** on the homepage — name, URL, logo, sameAs links to LinkedIn / X / Wikidata if you have a Q-item.
- **WebSite** on the homepage — name, URL, search action.
- **WebPage** on every page — at least `@id`, `url`, `name`, `description`, `isPartOf` the WebSite.
- **BreadcrumbList** on category / spoke pages — explicit breadcrumb trail.
- **Article** for blog/research posts; **Service** for tools; **Product** for catalogue pages where applicable.
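As a reference point, a minimal `Organization` block for the homepage might look like this, served inside a `<script type="application/ld+json">` tag (names and URLs illustrative):

```
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "@id": "https://your-brand.com/#organization",
  "name": "Your Brand",
  "url": "https://your-brand.com/",
  "logo": "https://your-brand.com/logo.png",
  "sameAs": [
    "https://www.linkedin.com/company/your-brand",
    "https://x.com/yourbrand"
  ]
}
```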

What to skip: **FAQPage** JSON-LD on AI-citation-priority pages. The format invites pages to bolt on a "questions" section that adds no new information, lowering signal density. We have observed FAQPage-heavy pages underperform pages with the same source content reorganised as encyclopedic prose. This is a recommendation against form, not against substance — if your users actually have questions, answer them, but answer in prose.

## Sitemap, canonical, and IndexNow — the boring discoverability layer

Three small things that quietly determine how fast new content reaches an answer engine's index:

### Sitemap.xml

Ship one at `/sitemap.xml`. Reference it from `robots.txt`. Update `<lastmod>` when content changes — engines weight recently-modified URLs higher in their re-crawl queues. Don't list every internal asset; list the canonical pages you want crawled and indexed. For a site with 20–50 important pages, the file should be 20–50 entries. Anything 10× larger is usually accidental — staging URLs, duplicate query strings, paginated archives.
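For reference, a well-scoped entry set looks like this (URLs and dates illustrative):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://your-brand.com/</loc>
    <lastmod>2026-05-06</lastmod>
  </url>
  <url>
    <loc>https://your-brand.com/pricing/</loc>
    <lastmod>2026-04-28</lastmod>
  </url>
</urlset>
```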

### Canonical tags

Every page needs a `<link rel="canonical" href="...">` pointing at the URL you want indexed. Common failures: canonical pointing at HTTP when site is HTTPS, canonical pointing at staging domain, canonical pointing at trailing-slash variant when sitemap uses no-slash variant. Engines will not aggregate signal across variants automatically; pick one form and use it consistently in canonical, sitemap, and internal links.
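Concretely, the head of each page should carry a matching pair (URL illustrative):

```
<link rel="canonical" href="https://your-brand.com/pricing/">
<meta property="og:url" content="https://your-brand.com/pricing/">
```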

### IndexNow

IndexNow is a Microsoft-led ping protocol that lets you proactively notify Bing (and any IndexNow-participating engine) when a URL is created, updated, or deleted. Drop a key file at `/<your-key>.txt`, then `POST` the URL list to `https://api.indexnow.org/indexnow` on every publish. Bing surfaces IndexNow-pinged URLs noticeably faster — typically same-day vs the standard crawl-budget cycle. Worth wiring into your build pipeline.
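A minimal publish-hook sketch in Python (stdlib only; the host, key, and URLs are placeholders — the JSON field names follow the IndexNow protocol):

```python
import json
import urllib.request

def indexnow_request(host: str, key: str, urls: list[str]) -> urllib.request.Request:
    # Build the IndexNow batch submission. The key must match the file
    # served at https://<host>/<key>.txt, or the ping is rejected.
    payload = {
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": urls,
    }
    return urllib.request.Request(
        "https://api.indexnow.org/indexnow",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )

# In the build pipeline, fire it on every publish:
#   urllib.request.urlopen(indexnow_request("your-brand.com", "your-key", changed_urls))
```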

## Common failure modes we see in audits

1. **`Disallow: /_next/` blocking JavaScript bundles.** Modern frameworks fetch JS chunks from `/_next/` or `/_app/` — blocking these breaks rendered content for crawlers that execute JS (Googlebot, Bingbot Smart Crawler, recent versions of GPTBot). Result: bots see an empty shell.
2. **Conflicting canonical and og:url.** Canonical says one URL; `og:url` says another. Some engines treat `og:url` as the citable URL. Always make them match.
3. **Markdown alternate that's auto-generated tag-stripped HTML.** Looks like Markdown, parses badly. Use clean source-of-truth Markdown.
4. **Multiple sitemaps without a sitemap-index file.** Engines may crawl only the first one. Ship a sitemap index.
5. **Crawl-delay directives intended for one bot applied to all.** A 30-second crawl-delay set under `User-agent: *` throttles every well-behaved AI crawler. Apply selectively or not at all.
6. **llms.txt that lists only the homepage.** The point is to surface the cluster — pillar, spokes, blog, about. A one-link `llms.txt` tells the AI "we don't have a curated index."
7. **Cloudflare "Block AI Bots" rule applied broadly.** Cloudflare's managed AI-bot block list includes named user-agents you may want to allow. Audit the rule before enabling. Cloudflare's "Manage robots.txt" feature can also overwrite hand-authored `robots.txt` if both are configured.

## How to verify your readiness end-to-end

A tight verification pass takes about 15 minutes:

1. **robots.txt fetch** — `curl -si https://your-brand.com/robots.txt` returns 200 and the body lists each AI bucket explicitly (`-i` keeps the status line alongside the body; `-I` alone would discard the body).
2. **llms.txt fetch** — `curl -sI https://your-brand.com/llms.txt` returns 200 with `Content-Type: text/plain` or `text/markdown`.
3. **Markdown alternate** — visit any page, view source, confirm `<link rel="alternate" type="text/markdown">` declares an existing URL that returns 200 with Markdown body.
4. **Canonical consistency** — visit homepage and any inner page, verify canonical, og:url, and sitemap entry all match exactly (HTTP/HTTPS, www/no-www, trailing-slash).
5. **JSON-LD parses** — paste page source into Google's Rich Results Test or Schema.org Markup Validator. Zero errors. Warnings are usually fine.
6. **Live AI test** — open ChatGPT, Claude, Perplexity, and DeepSeek's chat interface. Ask "what is `<brand-name>`?" If the engine cites your domain in the answer or in the sources, the readiness layer is working. If it doesn't, the next bottleneck is content or third-party citations, not infrastructure.

The Eastbound free audit runs the above checks automatically and grades your site against the citation pyramid — onsite readiness, AI-parseability, and selection-worthiness — for ChatGPT, DeepSeek, Qwen and Doubao. [Run the audit on your URL](https://www.eastbound.ai/ai-visibility-audit/).

## What comes after readiness

Crawler readiness gets your pages into the source pool. Two further layers determine whether they get cited:

- **Selection-worthiness** — does the page have specifics, named entities, recent dates, real numbers? Vague pages get pulled less often. [The pillar page](https://www.eastbound.ai/china-ai-visibility/) covers this in detail.
- **Off-site substrate** — third-party citations from Reddit, Zhihu, Wikipedia, vertical media. This is the highest-leverage and slowest layer. Our [off-site substrate study](https://www.eastbound.ai/blog/off-site-substrate.html) walks through the source-graph evidence.

For brands targeting Mainland China specifically, see also the per-engine playbooks: [DeepSeek SEO](https://www.eastbound.ai/deepseek-seo/), [Qwen optimization](https://www.eastbound.ai/qwen-optimization/), [Doubao optimization](https://www.eastbound.ai/doubao-optimization/).

## Where to go from here

If you want a confidential read of where your brand sits across all five infrastructure-layer signals plus the upstream citation-pyramid layers, [run the free audit](https://www.eastbound.ai/ai-visibility-audit/) or [book a 30-minute fit check](https://www.eastbound.ai/book-consultation/).

Or browse the [China AI visibility pillar](https://www.eastbound.ai/china-ai-visibility/) for the full topic map.
