China AI visibility · How-to playbook

How to improve brand visibility in AI search engines

Improving brand visibility in AI search engines breaks into three layers. Layer 1 (technical infrastructure) is a 1-hour job. Layer 2 (content design) is a multi-week job. Layer 3 (off-site source-graph) is a multi-quarter job that compounds. Each layer is measured by its own published evidence, and three of the most common tactics — FAQPage schema, "increase content length", and JSON-LD as a universal AI signal — do not appear in the validated set. Here is the working playbook.

Last reviewed 2026-05-10. Tactics drawn from Aggarwal et al. KDD 2024, Zhang & Yao 2026, SE Ranking 2025, Williams-Cook 2026, and our own 540-call panel.

The framework — three things to optimise, in order

Generative AI engines do not work like Google's blue-link results. The clearest published model — Zhang Kai & Yao Jingang 2026 (arXiv:2604.25707v1) — separates the process into citation selection (does the engine retrieve your page into its source pool?) and citation absorption (does the page's language, structure or facts actually shape the answer the user reads?). User-visible mention is a third stage downstream of both.

Tw93's 2026 instrumentation of ChatGPT made the gap concrete: the engine retrieves roughly 100 pages per query, but only ~15% surface in the answer. The other ~85% are selected but not absorbed. So there are three distinct metrics, each with its own optimisation tactics, and they need to be worked in the right order.

  1. Layer 1 — Technical infrastructure. Without this, your pages are not selected. Hours of work.
  2. Layer 2 — Content design. Without this, your selected pages are not absorbed. Weeks of work.
  3. Layer 3 — Off-site source-graph. Without this, even absorbed pages are rarely referenced. Quarters of work, compounding.

Layer 1 — Technical infrastructure (1 hour)

The selection floor. If your pages are not crawlable, indexable, and parseable by the AI bots, none of the rest matters. Vercel's 2025 crawler study confirmed GPTBot, ClaudeBot and PerplexityBot fetch raw HTML and do not execute JavaScript. The implications are concrete:

1. Granular robots.txt

Most sites have one User-agent: * block and call it done. Modern AI engines run multiple user agents — separate ones for training crawl, retrieval crawl, and user-triggered fetches. AI crawler readiness is the configuration reference: explicit Allow rules for OAI-SearchBot, Claude-SearchBot, PerplexityBot, ChatGPT-User, etc. Block training-only bots if you want; do not accidentally block retrieval bots that drive citation.
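
A quick way to sanity-check the rules is Python's built-in robots.txt parser. The sketch below is illustrative: the domain and page list are placeholders, and the user-agent tokens are the retrieval bots named above.

```python
# Minimal sketch: verify that retrieval-focused AI bots can fetch key pages.
# "example.com" and the URL list are placeholders; the user-agent strings
# are the retrieval bots named above.
from urllib import robotparser

RETRIEVAL_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "ChatGPT-User"]
PAGES_TO_CHECK = ["https://example.com/", "https://example.com/docs/pricing"]

parser = robotparser.RobotFileParser("https://example.com/robots.txt")
parser.read()  # fetches and parses the live robots.txt

for bot in RETRIEVAL_BOTS:
    for url in PAGES_TO_CHECK:
        if not parser.can_fetch(bot, url):
            print(f"BLOCKED: {bot} cannot fetch {url}")
```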

2. llms.txt and llms-full.txt

A positive index pointing AI engines at the content you want them to read. Different problem from robots.txt (negative gate). See llms.txt vs robots.txt for the disambiguation, and Markdown alternates guide for serving per-page Markdown alongside HTML.
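
As a rough sketch of what the positive index looks like in practice, the snippet below writes a minimal llms.txt from a hand-picked list of priority pages. The site name, summary line and URLs are placeholders, and the layout follows the common llms.txt convention of a Markdown title, a one-line summary, then link lists; check the current spec before shipping.

```python
# Minimal sketch: generate an llms.txt from a hand-curated list of priority pages.
# Titles, URLs and the summary line are placeholders.
PRIORITY_PAGES = [
    ("How to improve brand visibility in AI search engines", "https://example.com/ai-visibility-playbook.md"),
    ("AI crawler readiness checklist", "https://example.com/ai-crawler-readiness.md"),
]

lines = [
    "# Example Co",
    "",
    "> Guides on AI search visibility for brands entering Chinese AI engines.",
    "",
    "## Guides",
    "",
]
lines += [f"- [{title}]({url})" for title, url in PRIORITY_PAGES]

with open("llms.txt", "w", encoding="utf-8") as fh:
    fh.write("\n".join(lines) + "\n")
```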

3. JavaScript-only rendering kills you

GPTBot / ClaudeBot / PerplexityBot fetch raw HTML, no JS execution. SPA frameworks that hydrate client-side leave the bots with empty pages. Server-side render anything you want cited. Diagnostic: fetch your URL with curl and read the raw response; if the body copy isn't there, the engine can't see it either.
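
A minimal Python version of that diagnostic, assuming nothing beyond the standard library; the URL, user-agent string and marker phrase are placeholders:

```python
# Minimal sketch: fetch the raw HTML the way an AI bot does (no JavaScript
# execution) and check that real body copy is present in the response.
import urllib.request

URL = "https://example.com/ai-visibility-playbook"
MARKER = "three layers"  # a phrase you know appears in the rendered page

req = urllib.request.Request(URL, headers={"User-Agent": "GPTBot"})
with urllib.request.urlopen(req, timeout=10) as resp:
    raw_html = resp.read().decode("utf-8", errors="replace")

if MARKER in raw_html:
    print("OK: body copy is present in the raw HTML")
else:
    print("FAIL: body copy missing; the page likely relies on client-side rendering")
```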

4. IndexNow, sitemap.xml, canonical tags

IndexNow (Bing/Copilot) accelerates indexing on every content change. Sitemap covers Google AI Overview's underlying Google index. Canonical tags prevent duplicate-content splits. IndexNow setup guide covers the deployment.
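
A sketch of the publish-time ping, using the shared api.indexnow.org endpoint; the host, key, key file location and URL list are placeholders, and the payload shape should be checked against the current IndexNow documentation:

```python
# Minimal sketch: ping IndexNow after publishing or updating pages.
# Host, key, keyLocation and urlList are placeholders.
import json
import urllib.request

payload = {
    "host": "example.com",
    "key": "your-indexnow-key",
    "keyLocation": "https://example.com/your-indexnow-key.txt",
    "urlList": ["https://example.com/ai-visibility-playbook"],
}

req = urllib.request.Request(
    "https://api.indexnow.org/indexnow",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json; charset=utf-8"},
    method="POST",
)
with urllib.request.urlopen(req, timeout=10) as resp:
    print("IndexNow responded with HTTP", resp.status)
```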

5. The seven blocking mistakes

Most teams trip on overbroad robots.txt rules, Cloudflare bot-fight mode, JavaScript-only rendering, geo-blocking, login walls, slow render times, or CSP misconfigurations that block crawlers. AI crawler blocking mistakes walks through each.

Layer 1 deliverable. Pages return 200 to all major AI bot user-agents, render full HTML body without JavaScript, ship llms.txt + Markdown alternates, and ping IndexNow on every publish. If any of those is broken, fix it before doing anything else.

Layer 2 — Content design (multi-week)

The absorption layer. Your pages are now reachable; the question is whether their language and structure get extracted into answers. Aggarwal et al., KDD 2024 ran a 10,000-query benchmark across nine optimisation tactics and reported the headline numbers most of the field now cites. Three tactics produced statistically reliable lifts:

Tactic                                            Citation lift
Adding authoritative citations to your page       +115%
Adding direct quotes from credible sources        +43%
Adding relevant statistics with named sources     +33%

Notably absent from the validated set: FAQ format, FAQPage schema, and generic "increase content length" advice. The absence is itself a finding: these are among the most common heuristics in the field, and the published measurement work has not validated them.

1. Length — the 1,000–3,000-word sweet spot

Cross-study consensus places the sweet spot at 1,000–3,000 words per reference page with 10+ headings. Below 500 words, pages function as snippets that rarely match a substantive prompt. Above 3,000 words, marginal value falls and the editorial cost of keeping the page accurate compounds. Low-cited pages average 170 words in published samples; high-cited pages average ~2,000 — a more-than-10× gap.

2. Specificity beats fluency

The strongest single predictor across studies is semantic similarity between page content and user query. Pages with real numbers, dated comparisons, named entities and clear definitions are cited 50%+ more than vague pages making the same claim. Step-structured content (numbered procedures, decision trees) outperforms prose summaries.
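
To make the similarity idea concrete, here is an illustrative scoring sketch using sentence embeddings. It assumes the sentence-transformers package and a small open model; no engine publishes the exact similarity function it uses, so treat this as a rough proxy, not a replica.

```python
# Illustrative sketch: score page passages against a target user query by
# embedding similarity. The model name, query and passages are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "how do I improve brand visibility in AI search engines"
passages = [
    "Brand visibility in AI search breaks into selection, absorption and mention.",
    "We are passionate about innovative synergies for forward-looking brands.",
]

query_vec = model.encode(query, convert_to_tensor=True)
passage_vecs = model.encode(passages, convert_to_tensor=True)
scores = util.cos_sim(query_vec, passage_vecs)[0]

for passage, score in zip(passages, scores):
    # the specific, on-topic passage should score well above the vague one
    print(f"{float(score):.2f}  {passage[:60]}")
```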

3. Encyclopedia-style explainer pages outperform news

Wikipedia-style "what is X / how does X work" pages have roughly 3× the influence per citation of news pages in published samples. The mechanism: an explainer page is reusable across many prompts; a news page is locked to a single window of relevance.

4. What does NOT work

FAQ format, FAQPage schema, generic "increase content length" advice, and JSON-LD treated as a universal AI signal: none of these appears in the validated set above. Put the editorial hours into citations, direct quotes and named statistics first.

Layer 3 — Off-site source-graph (multi-quarter)

The compounding moat. The single highest-leverage finding in the published research:

Brands cited by third parties are referenced roughly 6.5× more often than brands cited only on their own domain. A Reddit thread saying "I use BrandX because…" carries more weight than the same sentence on the brand's About page. A Wikipedia entry with citations to the brand's work is the highest-trust signal short of academic citation.

This is the structural reason brand visibility in AI search is a brand-presence play, not a content-marketing play. The infrastructure work (Layer 1) is a one-hour layer. The content work (Layer 2) is a multi-week layer. The third-party source-graph work — Wikipedia, Reddit, Hacker News, vertical media for Western engines, and the Mainland Chinese stack (百度百科, 知乎, 微信公众号, 36氪, 虎嗅, 小红书, SMZDM) for Chinese engines — is a multi-quarter layer that compounds.

Western source-graph priorities

Wikipedia entries that cite the brand's work, Reddit and Hacker News threads where users describe choosing the brand, and coverage in vertical media: the third-party surfaces Western engines draw on most heavily.

Chinese source-graph priorities (different ecosystem entirely)

百度百科, 知乎, 微信公众号, 36氪, 虎嗅, 小红书 and SMZDM: a platform stack with essentially no overlap with the Western list, which is why a source-graph plan built for Western engines has to be rebuilt rather than translated.

For the long version see Traditional SEO won't get you into Chinese AI answers — the 3,562-word study on which third-party signals matter for Chinese engines specifically.

How to measure progress

Three metrics, each measured separately:

  1. Selection rate. What percentage of relevant prompts pull your domain into the engine's source pool? (Top-15 cited sources.)
  2. Absorption rate. What percentage of selections produce extractable content in the answer? (Engine quotes you, paraphrases your phrasing, or pulls a number from your page.)
  3. Mention rate. What percentage of relevant prompts result in a user-visible mention of your brand name?
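
A minimal sketch of how the three rates fall out of a panel of engine calls. The record structure and field names are illustrative, not Eastbound's actual schema:

```python
# Each record describes one prompt sent to one engine.
from dataclasses import dataclass

@dataclass
class CallRecord:
    selected: bool   # our domain appeared in the engine's cited-source pool
    absorbed: bool   # the answer quoted, paraphrased or used a number from our page
    mentioned: bool  # the brand name appeared in the user-visible answer

def rates(records: list[CallRecord]) -> dict[str, float]:
    total = len(records)
    selected = [r for r in records if r.selected]
    return {
        "selection_rate": len(selected) / total,
        # absorption is measured among selected calls only
        "absorption_rate": (sum(r.absorbed for r in selected) / len(selected)) if selected else 0.0,
        "mention_rate": sum(r.mentioned for r in records) / total,
    }

panel = [
    CallRecord(selected=True, absorbed=True, mentioned=True),
    CallRecord(selected=True, absorbed=False, mentioned=False),
    CallRecord(selected=False, absorbed=False, mentioned=False),
]
print(rates(panel))
```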

Eastbound's measurement methodology details how we measure all three: stratified zh-CN consumer prompt panels, top-5 + top-15 reliability stats (κ), and measured / hypothesis / planned-test labels for every claim. The free AI visibility audit runs your domain through this methodology against DeepSeek, Qwen and Doubao.

Realistic timeline

  1. Technical infrastructure. Time investment: 1 hour to 1 day. Compounding: no (set-and-forget). Single biggest blocker: JavaScript-only rendering.
  2. Content design. Time investment: 2–8 weeks per reference page. Compounding: yes per page, and across pages for evergreen content. Single biggest blocker: editorial discipline.
  3. Off-site source-graph. Time investment: 1–3 quarters minimum. Compounding: yes, strongly. Single biggest blocker: cannot be bought; relationship-driven.

Most brands rush Layer 2, skip Layer 3, and never circle back to Layer 1. Work the layers in sequence instead: Layer 1 first, Layer 2 second, Layer 3 last, but plan Layer 3 from day one.

China is a separate execution problem

The framework above (selection / absorption / mention) is engine-agnostic. The platforms and source-graph in Layer 3 are not. In our 540-call panel (May 2026), top-15 cited-source overlap (Jaccard) between any two Chinese engines was 0.20–0.30 — and overlap between Western and Chinese engines is lower still. A source-graph plan built for ChatGPT cannot be ported to DeepSeek without rebuilding from scratch — different language, different platforms, different community norms. See China AI visibility for global brands for the dedicated Chinese-engine treatment.
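
For reference, the overlap statistic itself is simple; the sketch below computes it with invented domain sets, not figures from the panel:

```python
# Jaccard similarity between the top-15 cited-source sets of two engines.
# The domain lists are invented to show the calculation.
def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

deepseek_top15 = {"zhihu.com", "baike.baidu.com", "36kr.com", "huxiu.com"}
doubao_top15 = {"zhihu.com", "xiaohongshu.com", "smzdm.com", "baike.baidu.com"}

print(f"Jaccard overlap: {jaccard(deepseek_top15, doubao_top15):.2f}")
```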

Run the audit, then work the layers

The free Eastbound audit reports your selection / absorption / mention scores against DeepSeek, Qwen and Doubao on a stratified zh-CN consumer prompt panel. From there the next move is concrete.

Run AI visibility audit  or  book a 30-minute fit check.