Reference · Updated May 2026
How Eastbound measures China AI visibility.
A reference page on the methodology behind Eastbound's audits and research — the two-stage citation framework, our stratified zh-CN prompt panels, the engine endpoints we query, the reliability discipline we apply, and the labelling system we use for every recommendation.
The two-stage citation framework
Citation in generative search has two stages, not one. The framework Eastbound uses is grounded in Zhang Kai & Yao Jingang's 2026 GEO measurement paper (arXiv:2604.25707v1), which separated citation selection (whether a page enters the engine's source pool) from citation absorption (whether the page actually shapes the answer language). The user-visible mention is a third stage downstream of both: a page can be selected often but absorbed weakly; a brand can be absorbed but mentioned only in a long-tail position.
Most generic AI-visibility tools collapse the three stages into one number. We do not, because the fix for each stage is different and a single score hides which layer is actually limiting your visibility. Our reports break out:
- Selection — per-engine yes/no on whether your domain entered the source pool for category-relevant prompts.
- Absorption — recall depth and language reuse, scored separately from selection.
- Mention — whether the final answer named your brand, and how prominently.
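To make the separation concrete, here is a minimal sketch of how a per-prompt, per-engine observation can keep the three stages apart. The field names and the 0-1 absorption scale are illustrative assumptions, not Eastbound's internal schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StageObservation:
    """One prompt x one engine, with the three stages kept separate."""
    engine: str                         # "deepseek" | "qwen" | "doubao"
    prompt_id: str
    selected: bool                      # domain entered the engine's cited source pool
    absorption: float                   # 0-1 recall-depth / language-reuse score
    mentioned: bool                     # brand named in the final answer
    mention_rank: Optional[int] = None  # 1 = named first; None if never named

def stage_rates(obs: list[StageObservation]) -> dict[str, float]:
    """Aggregate each stage on its own rather than collapsing to one score."""
    n = len(obs)
    return {
        "selection_rate": sum(o.selected for o in obs) / n,
        "mean_absorption": sum(o.absorption for o in obs) / n,
        "mention_rate": sum(o.mentioned for o in obs) / n,
    }
```

A report built on these rates can say which stage is limiting visibility, which is the whole point of keeping them apart.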
Prompt-panel design
We run stratified zh-CN consumer-voice prompt panels — Mainland-Chinese natural-language questions a real consumer would ask their AI assistant in your category. The panel is stratified at two levels:
L1 — broad category
Questions a consumer asks at the category level: "best moisturiser for sensitive skin", "carry-on luggage under 1.5kg", "which mechanical-keyboard switch for typing". L1 prompts surface broad-category brand recommendations and capture how your category is mapped at the highest level.
L2 — positioning niche
Questions a consumer asks at your specific positioning niche: "Korean-style hyaluronic moisturiser for over-thirties", "polycarbonate hardshell with TSA lock under HK$1,500", "linear switches with pre-lubed stems for office use". L2 prompts capture whether your brand surfaces inside the more specific frame your positioning targets.
Each prompt is repeated multiple times per engine to control for run-to-run variance. Consumer-voice prompts are kept distinct from developer/B2B prompts because the source-mix patterns differ materially between consumer and developer queries, on DeepSeek in particular.
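As a rough sketch of the repetition discipline, the loop below replays every prompt a fixed number of times per engine. The strata, prompt texts, repeat count and the query_engine helper are illustrative assumptions, not the production panel.

```python
import itertools

# Illustrative strata and prompts; real panels are drawn per category.
PANEL = {
    "L1": ["敏感肌用哪款保湿面霜比较好？"],                    # broad category
    "L2": ["适合三十岁以上的韩系玻尿酸保湿面霜有哪些推荐？"],  # positioning niche
}
ENGINES = ["deepseek", "qwen", "doubao"]
REPEATS = 3  # assumption: each prompt replayed a fixed number of times per engine

def run_panel(query_engine):
    """query_engine(engine, prompt) -> answer text; injected so the loop stays engine-agnostic."""
    rows = []
    for stratum, prompts in PANEL.items():
        for engine, prompt, rep in itertools.product(ENGINES, prompts, range(REPEATS)):
            rows.append({
                "stratum": stratum, "engine": engine, "prompt": prompt,
                "repeat": rep, "answer": query_engine(engine, prompt),
            })
    return rows
```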
Travel and hospitality categories use multi-turn panels (first turn is "where to go", follow-ups dig into accommodation, authenticity, payment, language), because single-shot prompt panels under-report how recommendation funnels actually unfold for these categories.
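Multi-turn panels add one mechanic to the single-shot loop: conversation history is carried forward so later turns see earlier answers. A minimal sketch, with illustrative turn texts and an assumed ask(messages) helper:

```python
# Illustrative travel turns: broad first, then accommodation, authenticity, payment/language.
TRAVEL_TURNS = [
    "十一月去东南亚哪个海岛度假比较好？",
    "那边有哪些适合家庭的海边酒店？",
    "当地有什么不那么游客化的体验？",
    "移动支付和语言沟通方便吗？",
]

def run_multi_turn(ask):
    """ask(messages) -> answer text. History is carried so later turns see earlier answers."""
    messages, answers = [], []
    for turn in TRAVEL_TURNS:
        messages.append({"role": "user", "content": turn})
        answer = ask(messages)
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers
```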
Engine endpoints and provider notes
We measure the live API endpoints of each engine, not scraped chat-product output. Provider details matter for reproducibility:
| Engine | API endpoint | Model ID convention |
|---|---|---|
| DeepSeek | DeepSeek API (api.deepseek.com) | deepseek-chat (default), deepseek-reasoner (R1) when explicitly tested |
| Qwen | DashScope international (dashscope-intl.aliyuncs.com/compatible-mode/v1) | qwen-plus (default), qwen-max for high-reasoning runs |
| Doubao | BytePlus ModelArk international (ark.ap-southeast.bytepluses.com/api/v3) | Model IDs logged at session start and end |
Two practical caveats we publish loudly:
- Provider labels are commonly confused. Qwen runs on DashScope (Alibaba's API surface). Doubao runs on BytePlus ModelArk (ByteDance's). They are different engines on different infrastructure. Findings on one do not transfer to the other.
- Neither endpoint exposes pinned-version handles. We log the model IDs at session start and at session end and report them in every readout, but we cannot guarantee identical model snapshots across runs over weeks. Test-retest reliability runs (described below) are how we control for this.
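For reproducibility, all three bases in the table can be driven through an OpenAI-compatible chat client. This is a hedged sketch: the environment-variable names are assumptions, the Doubao model ID is deliberately left to configuration because no pinned default is listed here, and the logging behaviour simply records whatever model identifier the provider reports back.

```python
import os
from openai import OpenAI  # each endpoint exposes an OpenAI-compatible chat surface

# Base URLs from the table above; env-var names and the Doubao model slot are assumptions.
ENDPOINTS = {
    "deepseek": {"base_url": "https://api.deepseek.com",
                 "key_env": "DEEPSEEK_API_KEY", "model": "deepseek-chat"},
    "qwen":     {"base_url": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
                 "key_env": "DASHSCOPE_API_KEY", "model": "qwen-plus"},
    "doubao":   {"base_url": "https://ark.ap-southeast.bytepluses.com/api/v3",
                 "key_env": "ARK_API_KEY", "model": os.environ.get("DOUBAO_MODEL_ID", "")},
}

def ask_once(engine: str, prompt: str) -> tuple[str, str]:
    cfg = ENDPOINTS[engine]
    client = OpenAI(base_url=cfg["base_url"], api_key=os.environ[cfg["key_env"]])
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    # resp.model is whatever the provider actually served; with no pinned-version
    # handles, logging it at session start and end is the only version record available.
    return resp.choices[0].message.content, resp.model
```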
Reliability discipline
Any AI-visibility number is only as good as its reproducibility. We re-run identical prompt panels at controlled intervals and report multiple reliability statistics in every readout — not just the headline number that flatters the result.
Test-retest reliability we report
- Top-5 source membership stability (κ_top-5). Are the same five sources cited at the highest rates across consecutive runs? Across our most recent 30-prompt panel re-run, all three engines hit κ_top-5 = 1.00 — the top-5 sources are perfectly stable.
- Top-15 source membership stability (κ_top-15). Are the next ten sources (ranks 6–15) stable? Here the engines differ: DeepSeek κ_top-15 = 0.89, Qwen κ_top-15 = 0.78, Doubao κ_top-15 = 0.46. Doubao's long-tail source ranking is materially less stable than the other two, a granular-tag normalisation issue we document explicitly. We treat the top-5 with high confidence and the long tail with the appropriate hedge.
- Pearson r and ICC. Source mention rates correlated at Pearson r 0.97–0.99 across all three LLMs (ICC(2,1) 0.97–0.99) on identical 30-prompt re-runs.
A reliability table that reports only κ_top-5 (where everyone scores 1.00) and hides κ_top-15 (where Doubao shows the granular instability) is reporting selectively. We disclose Doubao's κ_top-15 = 0.46 even though it is the harder story to tell, because anyone paying for our work deserves to know.
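The statistics themselves are standard. The sketch below shows one reasonable way to compute a top-k membership kappa and the Pearson r between per-source rates across two runs; the exact estimators behind a given readout may differ, and ICC(2,1) is omitted here for brevity.

```python
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

def top_k_membership_kappa(rates_a: dict[str, float], rates_b: dict[str, float], k: int) -> float:
    """Cohen's kappa on 'is this source in the top-k' indicators across two runs.

    rates_* map source domain -> citation rate observed in that run.
    """
    top_a = set(sorted(rates_a, key=rates_a.get, reverse=True)[:k])
    top_b = set(sorted(rates_b, key=rates_b.get, reverse=True)[:k])
    universe = sorted(set(rates_a) | set(rates_b))
    return cohen_kappa_score([s in top_a for s in universe],
                             [s in top_b for s in universe])

def mention_rate_pearson(rates_a: dict[str, float], rates_b: dict[str, float]) -> float:
    """Pearson r between per-source rates on the shared source set."""
    shared = sorted(set(rates_a) & set(rates_b))
    return pearsonr([rates_a[s] for s in shared], [rates_b[s] for s in shared])[0]
```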
Sample-size discipline
A 5-niche probe with 125 calls and a 540-call panel are not the same evidence base. We always report n, panel coverage and the categories the sample was drawn from. We do not generalise findings from one category panel to others without separate measurement; for example, the SMZDM 72% mention rate we observed on a handbag panel does not transfer to watches or luggage, and within handbags it collapses at the ultra-luxury price tier (33% in our re-cut).
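One hedged way to see why n matters is to put a simple binomial interval around an observed mention rate. The counts below are illustrative, not observed data, and a Wilson interval is just one convenient choice.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# The same 72% point estimate carries very different uncertainty at the two scales:
print(wilson_interval(90, 125))   # ~ (0.64, 0.79) on a 125-call probe
print(wilson_interval(389, 540))  # ~ (0.68, 0.76) on a 540-call panel
```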
Recommendation labelling — measured / hypothesis / intervention
Every public recommendation Eastbound makes is labelled as one of three states. We do not collapse the three because the evidence cost of each is materially different:
- Measured evidence. We observed this in our own panel. We state n, panel structure, and the engines tested. We disclose limitations (single-LLM probe vs multi-LLM, descriptive vs causal, category coverage).
- Prior-knowledge hypothesis. Consistent with published research (Aggarwal et al. KDD 2024, Zhang Kai & Yao Jingang arXiv 2604.25707v1, the geo-citation-lab dataset, Tw93's practitioner article, etc.) but Eastbound has not measured it directly. Cited with attribution; framed as hypothesis.
- Planned intervention test. We expect the change to help but the only evidence that proves it is before/after measurement on your own brand. We design the test, set the measurement date, and report the result honestly — including null and negative outcomes.
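One way to keep the three states from blurring is to make the label a required field on every recommendation record. A small illustrative sketch, not our internal tooling:

```python
from dataclasses import dataclass, field
from enum import Enum

class EvidenceLabel(Enum):
    MEASURED = "measured"          # observed in our own panel; n and engines disclosed
    HYPOTHESIS = "hypothesis"      # consistent with cited research; not yet measured by us
    INTERVENTION = "intervention"  # planned before/after test on the client's own brand

@dataclass
class Recommendation:
    text: str
    label: EvidenceLabel
    citations: list[str] = field(default_factory=list)  # attribution when label is HYPOTHESIS
    panel_n: int | None = None                          # sample size when label is MEASURED
```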
When a vendor claims "AI visibility lift in 7 days" or "guaranteed mentions", they are almost always conflating these three categories. Marketing pressure tends to convert cited research into "we proved this", and untested intervention into "this works". Eastbound's edge is the discipline of refusing those conversions.
What we do not claim
For completeness, the things we deliberately do not claim:
- We do not inspect any engine's training corpus. We can only measure what each engine self-attributes when answering. Self-attribution is not the same as training-data composition.
- We do not measure ChatGPT, Claude, Gemini or Perplexity with the same methodology used for the Chinese engines. Our China-engine work does not transfer to those Western engines without separate measurement against an English-language prompt panel.
- We do not measure sales, conversion or attributable revenue. AI mention is a brand-visibility signal. Referral traffic from AI search is currently under 1% of total referral traffic globally, though intent quality is materially higher when it does occur.
- We do not claim that JSON-LD schema makes Chinese AI engines find you. JSON-LD is a Bing/Copilot index-enrichment signal in our experimental sample, not observed driving DeepSeek / Qwen / Doubao citations.
- We do not promise specific ranking positions, ever. The fastest-moving layer (technical hygiene) takes days–weeks to register; the highest-leverage layer (third-party source-graph) compounds over quarters.
Run the audit
The free Eastbound audit applies the methodology described above to your specific URL, on a smaller prompt panel run across DeepSeek + Qwen + Doubao. It returns the per-stage selection / absorption / mention scores plus the highest-leverage fixes.
Or read the China AI visibility pillar, our research index, or the free AI visibility audit.