AI crawler readiness — the infrastructure layer.
Before any answer engine can recommend your brand, three things have to be true: the engine's bot can fetch your pages, the bot can parse what it gets back, and the bot is allowed to. Most "AI visibility" wins start here, not at content. This page is a working configuration guide — the same one we ship for our own audits.
Companion to China AI visibility for global brands. Updated 2026-05-06.
Why crawler readiness is upstream of every other GEO lever
Generative answer engines do not invent citations. They pull from a source pool — partly indexed, partly retrieved live, partly cached from earlier crawls. If your site is not in the pool, content quality, structured data, and length sweet spots are irrelevant. You are not in the running.
The infrastructure layer covers five things that are individually small and collectively decisive: granular robots.txt across crawler buckets, llms.txt as an AI-readable index, Markdown alternates so engines can parse pages without HTML noise, structured data (with the engine-specific caveat below), and discoverability files — sitemap, canonical, IndexNow. None of this is glamorous. All of it is necessary.
robots.txt — five buckets, not one
Most sites still treat robots.txt as a single switch — "block bots / allow bots." That collapses five very different bot categories into a coarse policy that almost always blocks something you wanted to allow, or allows something you wanted to block. The five buckets:
| Bucket | Examples | Default policy |
|---|---|---|
| Search / retrieval | Googlebot, Bingbot, OAI-SearchBot, PerplexityBot, ClaudeBot (Anthropic) | Allow |
| User-triggered | ChatGPT-User, Perplexity-User, Claude-User | Allow |
| Training | GPTBot, anthropic-ai, Google-Extended (Gemini training and grounding), Bytespider | Brand decision — opt-out is legitimate |
| Common-crawl / corpora | CCBot | Brand decision — affects future-model training pools |
| Undeclared / unknown | Bots that don't identify or send mismatched UA | Default-allow at User-agent: * level |
A common failure: blocking GPTBot for training reasons with an overbroad rule — a `User-agent: *` disallow, or a CDN/WAF block matching on a user-agent substring — that also catches OAI-SearchBot. Result: ChatGPT search can no longer fetch your pages. The brand wanted to opt out of training and accidentally opted out of citation. Always scope rules to exact user-agent tokens; never block by pattern.
A workable default for a brand that wants AI visibility while opting out of training:
# Search / retrieval — allow
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: ClaudeBot
Allow: /
# User-triggered fetches — allow
User-agent: ChatGPT-User
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: Claude-User
Allow: /
# Training — opt-out (brand choice; not required for visibility)
User-agent: GPTBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
# Google-Extended also governs Gemini grounding, not only training —
# blocking it is a visibility trade-off
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
# Default — allow
User-agent: *
Allow: /
Sitemap: https://your-brand.com/sitemap.xml
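Before deploying, a policy like this can be sanity-checked offline with Python's stdlib robots.txt parser. A sketch against a trimmed version of the file above — the user-agent names and domain are the ones from the example:

```python
from urllib.robotparser import RobotFileParser

# Trimmed version of the robots.txt example above.
ROBOTS = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

# The retrieval bot stays allowed even though the training bot is blocked.
print(parser.can_fetch("OAI-SearchBot", "https://your-brand.com/pricing/"))  # True
print(parser.can_fetch("GPTBot", "https://your-brand.com/pricing/"))         # False
```

Run this against your real file before every robots.txt change; it catches the "opted out of citation by accident" failure in seconds.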
For Mainland China engines specifically, the named user-agents to allow include Baiduspider, Bytespider (ByteDance — feeds Doubao), Sogou web spider, and 360Spider. DeepSeek and Qwen do not currently publish stable named user-agents that all sites recognise; the sane default is to leave User-agent: * as Allow: / so that undeclared crawlers can still fetch.
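The Mainland China stanza, in the same style as the file above (Bytespider is deliberately omitted here — per the training bucket in the table, allowing it is a brand decision, since it feeds both Doubao retrieval and ByteDance training):

```
# Mainland China engines — allow
User-agent: Baiduspider
Allow: /
User-agent: Sogou web spider
Allow: /
User-agent: 360Spider
Allow: /
```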
llms.txt and llms-full.txt — the AI-readable index
llms.txt is an emerging convention, proposed in 2024 and increasingly observed by AI tools (notably Claude Code, Cursor, Continue, several research-grade scrapers, and at least one production search-engine ingest). It is a single Markdown file at the site root that gives an AI a curated list of canonical URLs for the site, a short description, and (optionally) Markdown alternates for each URL.
The format is opinionated and minimal:
# Your Brand
> One-paragraph description of what your brand does and who it serves.
## Links
- [Homepage](https://your-brand.com/)
- [Pillar / category page](https://your-brand.com/category/)
- [Pricing](https://your-brand.com/pricing/)
- [About](https://your-brand.com/about/)
## Markdown alternates
- [Homepage (Markdown)](https://your-brand.com/index.md)
- [Pillar (Markdown)](https://your-brand.com/category/index.md)
llms-full.txt is a heavier sibling — same idea, but contains the actual page text concatenated rather than only links. It's appropriate for brands that want to ship a complete LLM-readable corpus (a knowledge base, a research site, a docs portal). Both files belong at the site root: /llms.txt and /llms-full.txt.
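Because the format is just Markdown, llms.txt is trivial to emit at build time. A minimal sketch — brand name, description, and page list are placeholders to substitute with your own:

```python
# Placeholder page list — swap in your real pillar/spoke URLs at build time.
pages = [
    ("Homepage", "https://your-brand.com/"),
    ("Pricing", "https://your-brand.com/pricing/"),
    ("About", "https://your-brand.com/about/"),
]

def render_llms_txt(brand: str, description: str, pages) -> str:
    """Render an llms.txt body: H1, blockquote description, link list."""
    lines = [f"# {brand}", "", f"> {description}", "", "## Links"]
    lines += [f"- [{title}]({url})" for title, url in pages]
    return "\n".join(lines) + "\n"

print(render_llms_txt("Your Brand", "One-paragraph description.", pages))
```

Wire the same page list into your sitemap generator so the two files never drift apart.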
llms.txt adoption by major commercial engines (ChatGPT, Gemini, Copilot) is not officially confirmed. Treat shipping llms.txt as a low-cost baseline hygiene step that supports tools we know read it (Claude Code, Cursor, several research crawlers) and may support future engine ingests. We have not measured a citation-rate lift directly attributable to llms.txt presence; until we run a controlled before/after, treat it as infrastructure baseline, not measured intervention.
Markdown alternates — let parsers skip the noise
Modern AI clients can request a Markdown version of a page by sending Accept: text/markdown in the request, or by following a <link rel="alternate" type="text/markdown" href="..."> declaration. A clean Markdown alternate is dramatically easier for an LLM to ingest than HTML — there's no nav clutter, no analytics scripts, no cookie-consent overlay, no pixel-tracker noise.
Two implementation patterns:
- Per-page `.md` file alongside the HTML, with a `<link rel="alternate">` declaration in the HTML head pointing to it. Lowest friction. Good for static-site or hybrid stacks. Eastbound uses this pattern across all 24+ public pages.
- Content negotiation on the same URL — server inspects the `Accept` header and returns Markdown or HTML accordingly. More elegant, more infrastructure work. Common in docs platforms.
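The negotiation logic in the second pattern is small. A framework-agnostic sketch — the helper name is ours, and q-value parsing is deliberately simplified (a production server should honour quality weights):

```python
def pick_representation(accept_header: str) -> str:
    """Return 'text/markdown' or 'text/html' for a request's Accept header.

    Simplified: ignores q-values; any explicit text/markdown wins.
    """
    accepted = [part.split(";")[0].strip().lower()
                for part in (accept_header or "").split(",")]
    if "text/markdown" in accepted:
        return "text/markdown"
    return "text/html"  # default for browsers and generic crawlers

print(pick_representation("text/markdown"))                      # text/markdown
print(pick_representation("text/html,application/xhtml+xml"))    # text/html
```

Hang this off your route handler and serve the pre-rendered `.md` body when it returns `text/markdown`.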
The Markdown alternate should be the actual article body — headlines, prose, tables, citations — without the page chrome. Don't include navigation menus, footer links, or cookie banners. Don't auto-generate from HTML by stripping tags; the result is usually messy. Hand-author or use a clean Markdown source as the canonical and render HTML from it.
Verifying Markdown alternates work end-to-end:
curl -sI https://your-brand.com/page/ | grep -iE "link|content-type"
curl -s -H "Accept: text/markdown" https://your-brand.com/page/index.md | head -40
Structured data — Bing/Copilot signal, not universal
JSON-LD schema is widely sold as a "universal AI signal." That claim doesn't hold up in our experimental sample. Across 500+ multi-engine prompts in May 2026, JSON-LD presence was a measurable positive signal for Bing's AI surfaces (Copilot in particular) but not for ChatGPT, Claude, or Perplexity to a degree we could detect at our sample size. Independent work by SearchVIU reaches a consistent conclusion.
What this means in practice: ship JSON-LD because it helps Bing/Copilot, helps with Google rich results, and is cheap to produce. Do not justify the work as a universal AI lift, and do not deprioritise off-site source-graph work in favour of more JSON-LD.
The minimum useful schema set for a brand site:
- Organization on the homepage — name, URL, logo, sameAs links to LinkedIn / X / Wikidata if you have a Q-item.
- WebSite on the homepage — name, URL, search action.
- WebPage on every page — at least `@id`, `url`, `name`, `description`, and `isPartOf` pointing at the WebSite.
- BreadcrumbList on category / spoke pages — explicit breadcrumb trail.
- Article for blog/research posts; Service for tools; Product for catalogue pages where applicable.
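The homepage pair from that list looks like the following. Every value is a placeholder — substitute your brand's details, then paste the output into a `<script type="application/ld+json">` tag:

```python
import json

# Placeholder Organization + WebSite graph for the homepage.
org = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@type": "Organization",
            "@id": "https://your-brand.com/#org",
            "name": "Your Brand",
            "url": "https://your-brand.com/",
            "logo": "https://your-brand.com/logo.png",
            "sameAs": [
                "https://www.linkedin.com/company/your-brand",
                "https://x.com/yourbrand",
            ],
        },
        {
            "@type": "WebSite",
            "@id": "https://your-brand.com/#website",
            "name": "Your Brand",
            "url": "https://your-brand.com/",
            "publisher": {"@id": "https://your-brand.com/#org"},
        },
    ],
}

print(json.dumps(org, indent=2))
```

Using `@id` references (the `#org` / `#website` fragments) lets WebPage nodes elsewhere on the site point back at these entities without repeating them.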
What to skip: FAQPage JSON-LD on AI-citation-priority pages. The format invites pages to bolt on a "questions" section that adds no new information, lowering signal density. We have observed FAQPage-heavy pages underperform pages with the same source content reorganised as encyclopedic prose. This is a recommendation against form, not against substance — if your users actually have questions, answer them, but answer in prose.
Sitemap, canonical, and IndexNow — the boring discoverability layer
Three small things that quietly determine how fast new content reaches an answer engine's index:
Sitemap.xml
Ship one at /sitemap.xml. Reference it from robots.txt. Update <lastmod> when content changes — engines weight recently-modified URLs higher in their re-crawl queues. Don't list every internal asset; list the canonical pages you want crawled and indexed. For a site with 20–50 important pages, the file should be 20–50 entries. Anything 10× larger is usually accidental — staging URLs, duplicate query strings, paginated archives.
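Generating the file at build time keeps `<lastmod>` honest. A minimal sketch with stdlib XML tooling — URLs and dates are placeholders:

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Placeholder canonical pages with their last-modified dates.
pages = [
    ("https://your-brand.com/", date(2026, 5, 6)),
    ("https://your-brand.com/pricing/", date(2026, 4, 30)),
]

urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
for loc, modified in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = modified.isoformat()  # YYYY-MM-DD

xml = '<?xml version="1.0" encoding="UTF-8"?>\n' + ET.tostring(urlset, encoding="unicode")
print(xml)
```

Feed `pages` from the same source of truth as llms.txt and your canonical tags, so the three never disagree.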
Canonical tags
Every page needs a <link rel="canonical" href="..."> pointing at the URL you want indexed. Common failures: canonical pointing at HTTP when site is HTTPS, canonical pointing at staging domain, canonical pointing at trailing-slash variant when sitemap uses no-slash variant. Engines will not aggregate signal across variants automatically; pick one form and use it consistently in canonical, sitemap, and internal links.
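The variant mismatches above are easy to catch mechanically. A sketch — the helper name is ours, and the URLs are illustrative:

```python
from urllib.parse import urlsplit

def url_variant_key(url: str) -> tuple:
    """Normalise only what engines ignore (scheme/host case); keep what
    they treat as distinct (trailing slash, www, http vs https)."""
    parts = urlsplit(url.strip())
    return (parts.scheme.lower(), parts.netloc.lower(), parts.path)

canonical = "https://your-brand.com/pricing/"
og_url = "https://your-brand.com/pricing"  # missing trailing slash

# False — the two variants will not aggregate signal; fix one of them.
print(url_variant_key(canonical) == url_variant_key(og_url))
```

Run the comparison across canonical, og:url, and sitemap entry for every page in CI and fail the build on mismatch.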
IndexNow
IndexNow is a Microsoft-led ping protocol that lets you proactively notify Bing (and any IndexNow-participating engine) when a URL is created, updated, or deleted. Drop a key file at /<your-key>.txt, then POST the URL list to https://api.indexnow.org/indexnow on every publish. Bing surfaces IndexNow-pinged URLs noticeably faster — typically same-day vs the standard crawl-budget cycle. Worth wiring into your build pipeline.
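The publish-hook payload is a small JSON document. A sketch that only builds the body (no network call) — the key and URLs are placeholders, while the field names `host`, `key`, `keyLocation`, and `urlList` are the ones defined by the IndexNow protocol:

```python
import json

def indexnow_payload(host: str, key: str, urls: list) -> str:
    """Build the JSON body to POST to https://api.indexnow.org/indexnow
    with Content-Type: application/json on every publish."""
    return json.dumps({
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",  # must match the key file you host
        "urlList": urls,
    })

body = indexnow_payload(
    "your-brand.com",
    "your-indexnow-key",               # placeholder — your generated key
    ["https://your-brand.com/new-page/"],
)
print(body)
```

Call this from the post-deploy step of your build pipeline with the list of URLs that changed in the release.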
Common failure modes we see in audits
- Disallow: /_next/ blocking JavaScript bundles. Modern frameworks fetch JS chunks from `/_next/` or `/_app/` — blocking these breaks rendered content for crawlers that execute JS (Googlebot, Bingbot Smart Crawler, recent versions of GPTBot). Result: bots see an empty shell.
- Conflicting canonical and og:url. Canonical says one URL; `og:url` says another. Some engines treat `og:url` as the citable URL. Always make them match.
- Markdown alternate that's auto-generated tag-stripped HTML. Looks like Markdown, parses badly. Use clean source-of-truth Markdown.
- Multiple sitemaps without a sitemap-index file. Engines may crawl only the first one. Ship a sitemap index.
- Crawl-delay directives intended for one bot applied to all. A 30-second crawl-delay set under `User-agent: *` throttles every well-behaved AI crawler. Apply selectively or not at all.
- llms.txt that lists only the homepage. The point is to surface the cluster — pillar, spokes, blog, about. A one-link `llms.txt` tells the AI "we don't have a curated index."
- Cloudflare "Block AI Bots" rule applied broadly. Cloudflare's managed AI-bot block list includes named user-agents you may want to allow. Audit the rule before enabling. (Cloudflare's "Manage robots.txt" feature can also overwrite a hand-authored `robots.txt` if both are configured.)
How to verify your readiness end-to-end
A tight verification pass takes about 15 minutes:
- robots.txt fetch — `curl -si https://your-brand.com/robots.txt` returns 200, and the body lists each AI bucket explicitly.
- llms.txt fetch — `curl -sI https://your-brand.com/llms.txt` returns 200 with `Content-Type: text/plain` or `text/markdown`.
- Markdown alternate — visit any page, view source, confirm `<link rel="alternate" type="text/markdown">` declares an existing URL that returns 200 with a Markdown body.
- Canonical consistency — visit the homepage and any inner page; verify canonical, og:url, and sitemap entry all match exactly (HTTP/HTTPS, www/no-www, trailing slash).
- JSON-LD parses — paste page source into Google's Rich Results Test or the Schema.org Markup Validator. Zero errors. (Warnings are usually fine.)
- Live AI test — open ChatGPT, Claude, Perplexity, and (if accessible to you) DeepSeek's chat interface. Ask "what is <brand-name>?" If the engine cites your domain in the answer or in the sources, the readiness layer is working. If it doesn't, the next bottleneck is content or third-party citations, not infrastructure.
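The Markdown-alternate and canonical checks can be scripted with stdlib HTML parsing. A sketch — the class name is ours and the sample HTML is illustrative; in practice you would feed it a fetched page body:

```python
from html.parser import HTMLParser

class HeadLinks(HTMLParser):
    """Collect the canonical and Markdown-alternate <link> hrefs from a page."""
    def __init__(self):
        super().__init__()
        self.canonical = None
        self.markdown_alternate = None

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        if a.get("rel") == "canonical":
            self.canonical = a.get("href")
        elif a.get("rel") == "alternate" and a.get("type") == "text/markdown":
            self.markdown_alternate = a.get("href")

# Illustrative page head — replace with a fetched body in a real check.
html = """<head>
<link rel="canonical" href="https://your-brand.com/page/">
<link rel="alternate" type="text/markdown" href="https://your-brand.com/page/index.md">
</head>"""

p = HeadLinks()
p.feed(html)
print(p.canonical)           # https://your-brand.com/page/
print(p.markdown_alternate)  # https://your-brand.com/page/index.md
```

Either value coming back `None` means the corresponding verification step fails before you even need a crawler.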
Deeper reading on each readiness layer
Spoke pages with more depth on the individual signals covered above:
- llms.txt vs robots.txt — what each file does, where they overlap (rarely), and the precedence rule when both apply.
- Markdown alternates guide — how to ship per-page `.md` files alongside HTML across static, CMS, and edge-function stacks.
- AI crawler blocking mistakes — the seven failure modes that hide pages from engines, with diagnostic commands for each.
- IndexNow setup guide — push notifications for Bing, Copilot, and Yandex; five-step config, ~30 min total.
What comes after readiness
Crawler readiness gets your pages into the source pool. Two further layers determine whether they get cited:
- Selection-worthiness — does the page have specifics, named entities, recent dates, real numbers? Vague pages get pulled less often. The pillar page covers this in detail.
- Off-site substrate — third-party citations from Reddit, Zhihu, Wikipedia, vertical media. This is the highest-leverage and slowest layer. Our off-site substrate study walks through the source-graph evidence.
For brands targeting Mainland China specifically, see also the per-engine playbooks: DeepSeek SEO, Qwen optimization, Doubao optimization.
Where to go from here
If you want a confidential read of where your brand sits across all five infrastructure-layer signals plus the upstream citation-pyramid layers, run the free audit or book a 30-minute fit check.
Run the free audit · Book a 30-minute consultation
Or browse the China AI visibility pillar for the full topic map.