# AI Crawler Blocking Mistakes — the seven failure modes that hide pages from engines

**Canonical:** https://www.eastbound.ai/ai-crawler-blocking-mistakes/
**Updated:** 2026-05-07

Most teams that "have AI visibility problems" turn out to have AI *fetch* problems. The crawler can't get to the page, can't parse what it gets, or can't tell what's there. Seven failure modes account for nearly all of it.

## 1. Overbroad robots.txt rules that catch search bots too

The most common failure. A team wants to opt out of AI training, so they add a blanket rule:

```
User-agent: *
Disallow: /
```

This blocks *every* compliant bot — Googlebot, Bingbot, OAI-SearchBot, ClaudeBot, PerplexityBot. The site disappears from search and AI assistants alike.

**Diagnostic:** fetch `https://www.example.com/robots.txt` and audit each `User-agent:` group. Verify with Google's robots tester (Search Console → Settings) and Bing's equivalent.

**Fix:** name bots explicitly. Allow everything at `User-agent: *`, and disallow only the specific training bots (`GPTBot`, `anthropic-ai`, `CCBot`) if your editorial position is to opt out of training.
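
A minimal sketch of that policy, assuming the goal is "search and AI answers in, AI training out" (the bot list is illustrative; align it with your actual editorial position):

```
# Default: every compliant crawler may fetch everything
User-agent: *
Allow: /

# Opt specific AI training crawlers out (illustrative list)
User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /
```

Each `User-agent` group stands alone: a crawler obeys the most specific group that names it, so `GPTBot` follows its own `Disallow: /` and ignores the wildcard group.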

## 2. Cloudflare Bot Fight Mode or strict managed challenges

Cloudflare's *Bot Fight Mode* (and *Super Bot Fight Mode*) issues challenges to clients that don't pass its heuristic for a real browser. AI crawlers with clean user-agents that respect `robots.txt` are categorised as "verified bots" and pass, but a long tail of less-recognised AI fetchers, including some that ChatGPT and Claude use for ad-hoc URL fetches, gets challenged. A challenge returns HTML that says "Verifying you are human..." instead of your content.

**Diagnostic:** `curl -s -A "GPTBot" https://www.example.com/` and `curl -s -A "ChatGPT-User" https://www.example.com/`. If either returns a Cloudflare challenge page, you're blocking.
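
A small shell sketch of that check, extended to a few common AI user-agents; the challenge-marker strings are assumptions based on typical Cloudflare interstitials, so adjust them to whatever your zone actually serves:

```
#!/usr/bin/env bash
# Fetch a URL as several AI user-agents and flag responses that look like a
# Cloudflare challenge page rather than real content.
URL="${1:-https://www.example.com/}"

for UA in "GPTBot" "ChatGPT-User" "ClaudeBot" "PerplexityBot"; do
  if curl -s -A "$UA" "$URL" \
     | grep -qiE "verifying you are human|just a moment|challenge-platform"; then
    echo "CHALLENGED  $UA"
  else
    echo "ok          $UA"
  fi
done
```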

**Fix:** in Cloudflare → Security → Bots, disable Bot Fight Mode for content paths. If you need bot mitigation, use *Bot Management* (paid tier) and explicitly allow named AI user-agents.

## 3. JavaScript-only rendering with no SSR / no static fallback

A single-page application that renders content client-side without server-side rendering presents a near-empty HTML body to any fetcher that doesn't execute JavaScript. Most AI crawlers do not execute JS — they fetch the HTML, parse it, and move on.

**Diagnostic:** `curl -s https://www.example.com/foo/ | grep -c "<p>"`. If the count is near zero on a page that visually has body copy, the content is JS-rendered. Or disable JavaScript in your browser and reload — what you see is what AI crawlers see.
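
The same check as a small script, with an arbitrary threshold of three `<p>` tags standing in for "the page has real body copy":

```
#!/usr/bin/env bash
# Count <p> tags in the raw HTML, i.e. what a non-JS-executing crawler receives.
URL="${1:-https://www.example.com/foo/}"

PARAS=$(curl -s "$URL" | grep -o '<p[ >]' | wc -l)
echo "$URL: $PARAS <p> tag(s) in the raw HTML"

if [ "$PARAS" -lt 3 ]; then
  echo "warning: almost no paragraph markup; body copy is probably JS-rendered"
fi
```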

**Fix:** ship server-side rendering (Next.js, Nuxt, Remix, SvelteKit) or static site generation (Hugo, Astro, 11ty). For an existing SPA, prerender critical pages at build time with a tool like react-snap, or put a prerendering service such as Prerender.io in front of the origin.

## 4. Geo-blocking by country / IP range

Several AI engines crawl from regions teams routinely block by accident. OpenAI's training and search crawlers operate from major US cloud regions. Anthropic's `ClaudeBot` crawls from Cloudflare-fronted ranges. Mainland-Chinese AI engines like DeepSeek, Qwen and Doubao crawl from Mainland-CN IP space.

**Diagnostic:** for each engine you care about, check whether your edge / WAF / CDN blocks the engine's known IP ranges. For Mainland-CN engines, test from a Mainland-CN VPN.
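
If you have an exit point in the region (a VPS, VPN, or SOCKS proxy), a quick comparison looks like this; the proxy address is hypothetical and stands in for whatever in-region exit you actually use:

```
#!/usr/bin/env bash
# Compare the status code a URL returns locally vs. through an in-region proxy.
# A 403, a timeout, or a reset on the proxied request points at geo-blocking.
URL="${1:-https://www.example.com/foo/}"
PROXY="socks5h://cn-exit.example.internal:1080"   # hypothetical in-region exit

echo "local:     $(curl -s -o /dev/null -w '%{http_code}' "$URL")"
echo "in-region: $(curl -s -o /dev/null -w '%{http_code}' -x "$PROXY" --max-time 20 "$URL")"
```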

**Fix:** allowlist named AI bot user-agents and IP ranges at the WAF layer. For Mainland-CN AI visibility, you generally need a Mainland-Chinese ICP-licensed origin or CDN edge in Mainland-CN — see [/china-ai-visibility/](https://www.eastbound.ai/china-ai-visibility/).

## 5. Login walls or aggressive paywalls without preview text

A page that serves its content only after login presents any non-authenticated fetcher with a near-empty HTML body containing little more than the login UI. AI crawlers don't sign in.

Soft paywalls — those showing the first 100 words and "subscribe to read more" — are partially compatible: engines use the visible preview, but the citation is limited to that. A hard paywall returning 401/403 is invisible.

**Diagnostic:** fetch the URL with `curl` (no cookies, no session), or load it in an incognito browser window. If you don't see body copy, neither does the engine.

**Fix:** ship enough preview text in the HTML body before the paywall to be useful — at least 300–500 words of substantive content. Mark gated content with structured-data `isAccessibleForFree: false`.
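
A minimal sketch of that markup, following the schema.org paywalled-content pattern; `.paywalled-section` is a placeholder for whichever element actually wraps the gated copy:

```
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example article headline",
  "isAccessibleForFree": false,
  "hasPart": {
    "@type": "WebPageElement",
    "isAccessibleForFree": false,
    "cssSelector": ".paywalled-section"
  }
}
</script>
```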

## 6. Slow render times that exhaust the crawler's fetch budget

AI crawlers operate with timeouts measured in seconds. A page that takes 8–15 seconds to return its HTML may be abandoned mid-fetch, leaving the crawler with truncated content or nothing at all.

**Diagnostic:** `curl -w "%{time_total}\n" -o /dev/null -s https://www.example.com/foo/`. Anything above 3 seconds is risky; above 6 seconds is likely to be truncated.
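
A single measurement is noisy, so here is a sketch that samples the fetch a few times and reports the slowest run; the run count and the thresholds echo the ones above rather than any crawler-published limit:

```
#!/usr/bin/env bash
# Time a URL three times and report the slowest total fetch time.
URL="${1:-https://www.example.com/foo/}"
WORST=0

for i in 1 2 3; do
  T=$(curl -s -o /dev/null -w '%{time_total}' "$URL")
  echo "run $i: ${T}s"
  WORST=$(printf '%s\n%s\n' "$WORST" "$T" | sort -g | tail -1)
done

echo "slowest: ${WORST}s (above ~3s is risky, above ~6s likely truncated)"
```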

**Fix:** serve HTML from the edge (Cloudflare, Vercel Edge, CloudFront), pre-render static pages, defer non-critical scripts, eliminate render-blocking resources.

## 7. X-Robots-Tag and meta-robots misconfigurations

`X-Robots-Tag: noindex, nofollow` set globally at the CDN layer is catastrophic: it tells every crawler not to index anything. Teams sometimes set it on staging and accidentally let it carry over to production. A related failure: `<meta name="robots" content="noindex">` in the HTML head, which WordPress emits when "Discourage search engines from indexing this site" is left switched on after launch.

**Diagnostic:** `curl -I https://www.example.com/foo/` and check `X-Robots-Tag`. Grep the HTML for `name="robots"`. Verify in Google Search Console → URL Inspection.

**Fix:** remove `X-Robots-Tag: noindex` from production. Remove the meta tag. If a page genuinely should stay noindexed, keep the rule for that page but also exclude the page from `sitemap.xml` and `llms.txt`.
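
A sketch that sweeps a flat `sitemap.xml` (not a sitemap index) and flags any URL carrying noindex in either the `X-Robots-Tag` header or a meta robots tag:

```
#!/usr/bin/env bash
# For every URL in a flat sitemap.xml, flag noindex in the X-Robots-Tag header
# or in a <meta name="robots"> tag.
SITEMAP="${1:-https://www.example.com/sitemap.xml}"

curl -s "$SITEMAP" | grep -o '<loc>[^<]*</loc>' \
  | sed -e 's/<loc>//' -e 's/<\/loc>//' \
  | while read -r URL; do
      HDR=$(curl -sI "$URL" | grep -ci '^x-robots-tag:.*noindex')
      META=$(curl -s "$URL" | grep -ci '<meta[^>]*name="robots"[^>]*noindex')
      if [ "$HDR" -gt 0 ] || [ "$META" -gt 0 ]; then
        echo "NOINDEX  $URL"
      else
        echo "ok       $URL"
      fi
    done
```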

## The standard diagnostic flow

When a team reports "we're not showing up in [engine]", run this sequence (a combined sketch of the checks follows the list):

1. `curl -sI https://www.example.com/foo/` — status, content-type, X-Robots-Tag
2. `curl -s -A "GPTBot" https://www.example.com/foo/` — check for Cloudflare challenge
3. `curl -s https://www.example.com/foo/ | grep "<p>"` — body copy server-rendered
4. `curl -s https://www.example.com/robots.txt` — overbroad Disallow
5. `curl -s https://www.example.com/foo/ | grep -i "robots"` — meta robots
6. `curl -s -w "%{time_total}\n" -o /dev/null https://www.example.com/foo/` — total fetch time
7. Test from relevant geographic region — geo-blocking
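
A combined sketch of checks 1–6 for a single URL; the challenge markers and thresholds are the same assumptions as above, and the geo check (7) still needs a real in-region exit:

```
#!/usr/bin/env bash
# One-pass fetch diagnostic for a single URL: headers, challenge page, SSR,
# robots.txt, meta robots, fetch time. Geo-blocking (check 7) is not covered.
URL="${1:-https://www.example.com/foo/}"
ORIGIN=$(echo "$URL" | sed -E 's#(https?://[^/]+).*#\1#')

echo "== 1. status / headers"
curl -sI "$URL" | grep -iE '^(HTTP|content-type|x-robots-tag)'

echo "== 2. challenge page as GPTBot (matches > 0 means challenged)"
curl -s -A "GPTBot" "$URL" | grep -ciE 'verifying you are human|just a moment'

echo "== 3. server-rendered body copy"
echo "   $(curl -s "$URL" | grep -o '<p[ >]' | wc -l) <p> tag(s) in raw HTML"

echo "== 4. robots.txt groups"
curl -s "$ORIGIN/robots.txt" | grep -iE '^(user-agent|disallow|allow)'

echo "== 5. meta robots"
curl -s "$URL" | grep -io '<meta[^>]*name="robots"[^>]*>'

echo "== 6. total fetch time"
curl -s -o /dev/null -w '   %{time_total}s\n' "$URL"
```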

If all seven pass, the failure isn't fetch — it's content quality, on-site signals, or off-site source-graph. See the [China AI visibility pillar](https://www.eastbound.ai/china-ai-visibility/) and [methodology](https://www.eastbound.ai/methodology/).

## Related reading

- [AI crawler readiness — the infrastructure layer](https://www.eastbound.ai/ai-crawler-readiness/)
- [llms.txt vs robots.txt](https://www.eastbound.ai/llms-txt-vs-robots-txt/)
- [Markdown alternates guide](https://www.eastbound.ai/markdown-alternates-guide/)
- [IndexNow setup guide](https://www.eastbound.ai/indexnow-setup-guide/)
- [Run an AI visibility audit](https://www.eastbound.ai/ai-visibility-audit/)
