AI crawler readiness · Diagnostic

AI crawler blocking mistakes — the seven failure modes.

Most teams that "have AI visibility problems" turn out to have AI fetch problems. The crawler can't get to the page, can't parse what it gets, or can't tell what's there. Seven failure modes account for nearly all of it. This page is the diagnostic — what to look for, how to test, and how to fix.

Companion to AI crawler readiness. Updated 2026-05-07.

1. Overbroad robots.txt rules that catch search bots too

The most common failure. A team wants to opt out of AI training, so they add a blanket rule:

User-agent: *
Disallow: /

This blocks every compliant bot — Googlebot, Bingbot, OAI-SearchBot, ClaudeBot, PerplexityBot. The site disappears from search and AI assistants alike. The mistake is using User-agent: * when the intent was a specific bot category.

A subtler version: a rule written as Disallow: /api/ matches /api/foo as intended, but it is sometimes copied as Disallow: /api, which also matches /apiary and any other path starting with that prefix. robots.txt matching is prefix-based, not glob- or regex-based.
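
A quick illustration of the difference (paths are hypothetical):

Disallow: /api/   # matches /api/foo and /api/bar, but not /apiary
Disallow: /api    # matches /api, /api/foo, /apiary and /api-docs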

Diagnostic: fetch https://www.example.com/robots.txt and audit each User-agent: group. Verify with Google's robots.txt report (Search Console → Settings → robots.txt) and Bing's equivalent. Fix: name bots explicitly. Allow everything under User-agent: *, and disallow specific training bots (GPTBot, anthropic-ai, CCBot) only if your editorial position is to opt out of training.
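
One way to express that policy; the tokens below mirror the list above and should be aligned with your actual editorial position:

# Default: allow everything, including search and AI answer-engine bots.
User-agent: *
Allow: /

# Opt specific training crawlers out.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /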

2. Cloudflare "Bot Fight Mode" or strict managed challenges

Cloudflare's Bot Fight Mode (and the more aggressive Super Bot Fight Mode) issues challenges to clients that don't pass its heuristic for a real browser. AI crawlers that present a clean user-agent and respect robots.txt are categorised as "verified bots" by Cloudflare and pass, but a long tail of less-recognised AI fetchers, including some that ChatGPT and Claude use for ad-hoc URL fetches, gets challenged. A challenge returns HTML that says "Verifying you are human..." instead of your content. The engine cites the challenge page or, more often, silently drops the URL.

Diagnostic: run curl -s -A "GPTBot" https://www.example.com/ and curl -s -A "ChatGPT-User" https://www.example.com/. If either returns a Cloudflare challenge page rather than your HTML, you're blocking. Also test with custom user-agents that don't appear on the verified-bot list.
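
A sketch that sweeps several named user-agents in one pass (the UA list and temp-file path are illustrative; a 403 or 503 status, or a non-zero challenge-marker count, suggests the request was intercepted):

for ua in "GPTBot" "ChatGPT-User" "OAI-SearchBot" "ClaudeBot" "PerplexityBot"; do
  code=$(curl -s -o /tmp/ai-ua-check.html -w "%{http_code}" -A "$ua" https://www.example.com/)
  markers=$(grep -ci "challenge" /tmp/ai-ua-check.html)
  echo "$ua status=$code challenge-markers=$markers"
done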

Fix: in Cloudflare → Security → Bots, disable Bot Fight Mode, at the very least for your content paths. If you need bot mitigation, use the more granular Bot Management (paid tier) and explicitly allow named AI user-agents. The blanket challenge approach is too noisy for content sites.

3. JavaScript-only rendering with no SSR / no static fallback

A single-page application that renders content client-side via React, Vue, or Angular without server-side rendering presents a near-empty HTML body to any fetcher that doesn't execute JavaScript. Most AI crawlers do not execute JS — they fetch the HTML, parse it, and move on. They see your <div id="root"></div> with no content inside.

Google's modern crawler executes JS for indexing, but with delay; AI engines mostly do not. The result: the site indexes for traditional search but is invisible to AI search.

Diagnostic: curl -s https://www.example.com/foo/ | grep -c "<p>". If the count is near zero on a page that visually has body copy, the content is JS-rendered. Disable JavaScript in your browser and reload — what you see is what AI crawlers see.
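
A stricter variant strips tags and counts words in the raw HTML (the count includes script and head text, so treat it as an upper bound; a page with real body copy should still land in the hundreds):

curl -s https://www.example.com/foo/ | sed -e 's/<[^>]*>/ /g' | wc -w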

Fix: ship server-side rendering (Next.js, Nuxt, Remix, SvelteKit) or static site generation (Hugo, Astro, 11ty) that emits real HTML. For existing SPA sites, prerender critical pages (homepage, pillar, product, top blog posts) at build time using a tool like Prerender.io or react-snap.

4. Geo-blocking by country / IP range

Several AI engines crawl from regions teams routinely block by accident. OpenAI's training and search crawlers operate from major US cloud regions. Anthropic's ClaudeBot crawls from Cloudflare-fronted ranges. Mainland-Chinese AI engines like DeepSeek, Qwen and Doubao crawl from Mainland-CN IP space. A site that blocks Mainland-CN traffic for compliance reasons is invisible to those engines. A site that blocks all non-EU traffic to comply with a misread of GDPR is invisible to most US-hosted AI crawlers.

Diagnostic: for each engine you care about, check whether your edge / WAF / CDN blocks the engine's known IP ranges. Cloudflare's "verified bot" list covers most Western AI bots. For Mainland-CN engines, test from a Mainland-CN VPN: if your homepage doesn't load from there, it doesn't load for the engine either.

Fix: allowlist named AI bot user-agents and IP ranges at the WAF layer. For Mainland-CN AI visibility specifically, you generally need a Mainland-Chinese ICP-licensed origin or CDN edge in Mainland-CN to be reliably reachable — see the China AI visibility pillar for the full discussion.
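
If the geo-block is implemented as a Cloudflare WAF rule, one sketch is a higher-priority custom rule with the Skip action whose expression matches the bots you want to let through (verify field names and syntax against current Cloudflare documentation; user-agent strings can be spoofed, so pair this with the vendors' published IP ranges where available):

(http.user_agent contains "GPTBot")
or (http.user_agent contains "OAI-SearchBot")
or (http.user_agent contains "ClaudeBot")
or (http.user_agent contains "PerplexityBot")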

5. Login walls or aggressive paywalls without preview-text

A page that serves its content only after login is fetched as a near-empty HTML body containing nothing but the login UI. AI crawlers don't sign in. They cite the login page, if anything, or skip the URL.

Soft paywalls, those that show the first 100 words and then a "subscribe to read more" prompt, are partially compatible: engines can use the visible preview, but the citation will be limited to that preview. A hard paywall that returns a 401 or 403 is invisible.

Diagnostic: fetch the URL with curl (no cookies, no session), or load it in an incognito browser window. If you don't see body copy, neither does the engine.

Fix: for content you want cited, ship enough preview text in the HTML body before the paywall to be useful in a citation — at least 300–500 words of substantive content. Mark gated content with structured-data isAccessibleForFree: false so engines know what's gated.
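
A minimal sketch of that markup, assuming the gated section is wrapped in an element with class paywall (adjust the type and selector to your page):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "isAccessibleForFree": false,
  "hasPart": {
    "@type": "WebPageElement",
    "isAccessibleForFree": false,
    "cssSelector": ".paywall"
  }
}
</script>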

6. Slow render times that exhaust the crawler's fetch budget

AI crawlers operate with timeouts measured in seconds. A page that takes 8–15 seconds to render is fetched, but the crawler may abandon the response mid-stream and end up with truncated content. Pages that depend on cascading resource loads (synchronous third-party scripts, render-blocking webfonts, slow database queries) hit this often.

Diagnostic: time the page from cold cache. curl -w "%{time_total}\n" -o /dev/null -s https://www.example.com/foo/. Anything above 3 seconds is risky; above 6 seconds is likely to be truncated by some engines. Run Lighthouse and check the Largest Contentful Paint metric.
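
To see where the time goes, curl's built-in timing variables break the fetch down (same URL as above):

curl -s -o /dev/null -w "dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n" https://www.example.com/foo/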

Fix: serve HTML from the edge (Cloudflare, Vercel Edge, AWS CloudFront), pre-render static pages, defer non-critical scripts, eliminate render-blocking resources. The same investments that improve Core Web Vitals improve crawler success.

7. Content Security Policy and X-Robots-Tag misconfigurations

A Content-Security-Policy header that disallows inline scripts is a non-issue for fetching, but an X-Robots-Tag: noindex, nofollow header set globally at the CDN layer is catastrophic: it tells every crawler not to index the page. Teams sometimes set this on staging environments, and the rule is then accidentally carried over to production.
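
What the damaging case looks like in a response (illustrative output from curl -I; the header applies to every page served through the CDN):

HTTP/2 200
content-type: text/html; charset=utf-8
x-robots-tag: noindex, nofollow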

A related failure: <meta name="robots" content="noindex"> in the HTML head. This commonly appears on WordPress sites where "Discourage search engines from indexing this site" was enabled during development and never switched off after launch.

Diagnostic: curl -I https://www.example.com/foo/ and check the response headers for X-Robots-Tag. Fetch the HTML body and grep for name="robots". Verify in Google Search Console → URL Inspection: if the report says indexing is disallowed because of a "noindex" directive, this is the cause.

Fix: remove X-Robots-Tag: noindex from production responses. Remove <meta name="robots" content="noindex">. If you legitimately want a page noindexed (e.g. internal admin tooling), keep the rule but exclude the URL from sitemap.xml and llms.txt.

The standard diagnostic flow

When a team reports "we're not showing up in [engine]", run this sequence in order. Each step takes minutes and rules out one or more of the failure modes above; a scripted version of steps 1–6 follows the list:

  1. curl -sI https://www.example.com/foo/ — check status, content-type, X-Robots-Tag.
  2. curl -s -A "GPTBot" https://www.example.com/foo/ — check that you don't get a Cloudflare challenge for a named AI UA.
  3. curl -s https://www.example.com/foo/ | grep "<p>" — check that body copy is server-rendered, not JS-only.
  4. curl -s https://www.example.com/robots.txt — check for overbroad Disallow:.
  5. curl -s https://www.example.com/foo/ | grep -i "robots" — check for meta name="robots".
  6. curl -s -w "%{time_total}\n" -o /dev/null https://www.example.com/foo/ — check total fetch time.
  7. Test from a relevant geographic region (US east coast for Western AI; Mainland-CN VPN for Chinese engines) — check geo-blocking.
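
A sketch of steps 1–6 as a single script, assuming plain sh, curl, grep and sed (pass the page URL as the first argument, or it falls back to the placeholder; step 7 still needs a regional vantage point):

#!/bin/sh
# Sketch only: adjust the URL, bot names and thresholds to your own site and policy.
URL="${1:-https://www.example.com/foo/}"
ROBOTS=$(echo "$URL" | sed -E 's#^(https?://[^/]+).*#\1/robots.txt#')

echo "== 1. Status, content-type, X-Robots-Tag =="
curl -sI "$URL" | grep -iE "^HTTP|^content-type|^x-robots-tag"

echo "== 2. Named AI UA vs challenge page (non-zero marker count is suspicious) =="
curl -s -A "GPTBot" "$URL" | grep -ci "challenge"

echo "== 3. Server-rendered body copy (paragraph count) =="
curl -s "$URL" | grep -c "<p>"

echo "== 4. robots.txt disallow rules =="
curl -s "$ROBOTS" | grep -i "disallow"

echo "== 5. meta robots =="
curl -s "$URL" | grep -i 'name="robots"'

echo "== 6. Total fetch time =="
curl -s -o /dev/null -w "%{time_total}s\n" "$URL"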

If all seven pass, the failure isn't fetch — it's content quality, on-site signals, or off-site source-graph. That's a separate workstream. See the China AI visibility pillar and the methodology page.

Run the diagnostic against your own site

The AI visibility audit bundles these seven checks plus a category-relevant prompt-panel measurement against DeepSeek, Qwen and Doubao. The output identifies which failure mode is in play and what fixing it would unlock.

Run my audit →