Modern HTML pages carry a lot of structural overhead: header navigation, sidebar links, related-article carousels, marketing CTAs, footer columns, GDPR banners, structured-data scripts, analytics tags. When an AI engine fetches the page to extract content for citation, it has to identify and discard all of that before reaching the body text. Some engines do this well; some do not. Even the ones that do well sometimes carry boilerplate strings into their answer paraphrases.
A Markdown alternate is the same content rendered to clean Markdown (body copy, headings, lists, code blocks, internal links) without the chrome. The engine can fetch it, parse it with near-zero noise, and ground its answer in the actual content. Several AI assistants and indexers look for Markdown alternates explicitly. Engines that do not look for them can still benefit from the linked declaration in the HTML head when their content extractors follow it to a cleaner extraction path.
What this is not. Markdown alternates are not a ranking signal in the search-engine sense. Google does not boost pages with Markdown alternates. Bing does not. The benefit is downstream: when an engine cites your page, the citation is grounded in cleaner text. Treat this as extraction quality, not retrieval boost.
What to put in the .md file
The Markdown file is not a literal HTML-to-Markdown transform of the rendered page. It is a clean, content-only rendering. Strip:
- Site header, navigation, breadcrumbs (these belong in llms.txt, not in every page)
- CTAs, banners, modal triggers, GDPR consent strings
- Sidebar promotional content, related-article carousels
- Footer link columns, social links, copyright lines
- Inline tracking, analytics scripts, ad slots
Keep:
- The page title as H1
- A short canonical/updated metadata block at the top (1–3 lines)
- Body headings and paragraphs as Markdown
- Lists, tables, code blocks, blockquotes
- Internal links (full URLs, not relative; engines parsing standalone .md files don't always have base URL context)
- External citations as proper Markdown links
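The full-URL requirement for internal links is easy to enforce at build time. The following is a minimal sketch, not a definitive implementation: the function name absolutize_links and the example URLs are hypothetical, and the regular expression only handles inline links of the form [text](target). Reference-style links are not handled, and links inside code blocks would be rewritten too.

```python
import re
from urllib.parse import urljoin

# Matches the opening of an inline Markdown link or image: [text](target
# Group 1 is "[text](", group 2 is the link target up to whitespace or ")".
_INLINE_LINK = re.compile(r'(\[[^\]]*\]\()([^)\s]+)')


def absolutize_links(markdown: str, page_url: str) -> str:
    """Rewrite relative link targets as full URLs resolved against page_url."""

    def fix(match: re.Match) -> str:
        prefix, target = match.group(1), match.group(2)
        # urljoin returns targets that already carry a different scheme
        # (for example mailto:) or a full https:// URL unchanged.
        return prefix + urljoin(page_url, target)

    return _INLINE_LINK.sub(fix, markdown)
```

Calling absolutize_links(text, "https://www.example.com/foo/") would turn a target such as /pricing/ into https://www.example.com/pricing/ while leaving links to other hosts alone.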
The Markdown file should be content-equivalent to the HTML page, not a stripped-down summary. If the HTML body has 1,800 words, the Markdown file should have ~1,800 words. Truncating defeats the purpose: engines that fetch the alternate will use what is there. Short alternates produce short citations.
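A rough parity check can catch truncated alternates before they ship. The sketch below is an illustration under stated assumptions: the set of skipped tags, the 10% tolerance, and all function names are hypothetical choices, not part of any standard. It counts words in the HTML outside obvious chrome elements and compares that with the word count of the Markdown file. URLs inside Markdown links inflate the Markdown count slightly, which is one reason for the tolerance.

```python
import re
from html.parser import HTMLParser

# Assumption: these elements hold chrome rather than body content.
# A page that wraps its title in <header> would need a narrower set.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class _BodyText(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # greater than 0 while inside a skipped element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def word_count(text: str) -> int:
    return len(re.findall(r"\w+", text))


def html_word_count(html: str) -> int:
    parser = _BodyText()
    parser.feed(html)
    parser.close()
    return word_count(" ".join(parser.chunks))


def is_content_equivalent(html: str, markdown: str, tolerance: float = 0.10) -> bool:
    """True when the Markdown word count is within tolerance of the HTML body."""
    html_words = html_word_count(html)
    md_words = word_count(markdown)
    if html_words == 0:
        return md_words == 0
    return abs(md_words - html_words) / html_words <= tolerance
```

A build step could run is_content_equivalent on each page and its alternate and fail the build when it returns False.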
Content-Type and HTTP details
The HTTP response for a Markdown alternate must declare Content-Type: text/markdown (or text/markdown; charset=utf-8). Several engines refuse to parse a .md file served as application/octet-stream or text/plain; they expect the registered media type.
On S3 the content type is set per object at upload time. Use aws s3 cp ... --content-type "text/markdown; charset=utf-8" explicitly for every .md file. On Cloudflare R2, set the same Content-Type header on upload. On Netlify, configure _headers:
/*.md
  Content-Type: text/markdown; charset=utf-8
On Vercel, configure vercel.json:
{
  "headers": [
    {
      "source": "/(.*)\\.md",
      "headers": [
        { "key": "Content-Type", "value": "text/markdown; charset=utf-8" }
      ]
    }
  ]
}
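On nginx or Apache, add a MIME-type mapping in the server config: on nginx, add text/markdown md; to the types block (usually in mime.types); on Apache, add AddType text/markdown .md. Verify after deploy with curl -I https://www.example.com/foo/index.md; the response header should read Content-Type: text/markdown; charset=utf-8.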
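The same check can be scripted so it runs against every alternate after a deploy. This is a minimal sketch using only the Python standard library; the function names are hypothetical. It accepts text/markdown with or without a charset parameter, matching the two forms allowed above.

```python
from email.message import Message
from urllib.request import Request, urlopen


def is_markdown_content_type(header_value: str) -> bool:
    """True when a Content-Type header value names the text/markdown media type."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() lowercases the type and drops parameters such as
    # charset; it falls back to text/plain for empty or malformed values.
    return msg.get_content_type() == "text/markdown"


def check_markdown_alternate(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and inspect the Content-Type of the response."""
    request = Request(url, method="HEAD")
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urlopen(request, timeout=timeout) as response:
        return is_markdown_content_type(response.headers.get("Content-Type", ""))
```

For example, check_markdown_alternate("https://www.example.com/foo/index.md") is expected to return True once the header is configured correctly.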
Per-page alternates vs a single llms-full.txt bundle
Per-page alternates and a single-file llms-full.txt bundle solve overlapping but distinct problems. Per-page alternates let an engine fetch the specific page it's citing, which keeps payload small and matches the URL it would otherwise cite. A bundled llms-full.txt lets an engine ingest the whole site in one fetch, which suits engines that prefer to context-load before answering rather than fetch-on-demand.
The recommendation: ship per-page alternates as the baseline. Add llms-full.txt as a supplement when the site has fewer than ~50 pages and the total Markdown corpus fits in a few hundred kilobytes. Don't ship llms-full.txt for sites with thousands of pages — engines that fetch a several-megabyte bundle on every visit will rate-limit themselves and cache aggressively.
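The recommendation can be expressed as a small build-time check. The sketch below is illustrative only: the 50-page limit comes from the text, while the 300 KB limit is an assumed stand-in for "a few hundred kilobytes", and the function name is hypothetical.

```python
MAX_BUNDLE_PAGES = 50          # "fewer than ~50 pages" from the recommendation
MAX_BUNDLE_BYTES = 300 * 1024  # assumption: one reading of "a few hundred kilobytes"


def should_ship_bundle(markdown_files: dict[str, str]) -> bool:
    """Decide whether to emit llms-full.txt alongside the per-page alternates.

    markdown_files maps each alternate's path to its Markdown body.
    """
    total_bytes = sum(len(body.encode("utf-8")) for body in markdown_files.values())
    return len(markdown_files) < MAX_BUNDLE_PAGES and total_bytes <= MAX_BUNDLE_BYTES
```

Per-page alternates are written in either case; the check only gates the supplementary bundle.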