Does robots.txt stop AI from crawling or using my content?

Only partially, and only for compliant bots. Robots.txt is voluntary by design (RFC 9309 standardizes the format, not enforcement), so well-behaved declared crawlers like GPTBot honor it while nothing compels others. The operators themselves document carve-outs: OpenAI states robots.txt rules 'may not apply' to its user-initiated ChatGPT-User fetcher, and Perplexity states Perplexity-User 'generally ignores' robots.txt. Cloudflare additionally documented undeclared stealth crawling in August 2025. And blocking is not retroactive: disallowing a training crawler today does not remove content from models already trained.

Does blocking a page in robots.txt remove it from Google or AI search?

No. Google's own documentation is explicit that robots.txt 'is not a mechanism for keeping a web page out of Google': a disallowed URL can still be indexed if other sites link to it, appearing in results without a description. There is also a structural trap: a noindex directive only works if the crawler can fetch the page to see it, so robots.txt-blocking a page prevents the engine from reading the very directive that would deindex it. To keep a page out of results, use noindex or authentication, not a Disallow rule.

Should I block AI crawlers in robots.txt?

Decide per crawler purpose, not as one switch. Blocking retrieval crawlers (OAI-SearchBot, PerplexityBot) removes you from those engines' citation surfaces; OpenAI states blocked sites 'will not be shown in ChatGPT search answers.' Blocking training crawlers (GPTBot, ClaudeBot) or setting control tokens (Google-Extended) is a separate licensing and IP decision that does not affect search visibility, with one trap: Google-Extended governs Gemini and Vertex training, not AI Overviews, which use Googlebot. A common deliberate policy is to allow retrieval and block training.

What is the difference between robots.txt, llms.txt, and AIPREF?

They answer different questions at different layers. Robots.txt is fetch access: may a crawler retrieve this URL at all. llms.txt is guidance: a curated, AI-readable map of a site's important content, with no access semantics. AIPREF (the Content-Usage signal) is usage preference: given the content, what may it be used for, such as training versus search. AIPREF can even ride inside robots.txt as a proposed Content-Usage rule, so the layers coexist in one file. None of the three verify who is asking; that is Web Bot Auth's lane.


Robots.txt (Robots Exclusion Protocol)

Robots.txt is the plain-text file at a site's root that tells crawlers which URL paths they may fetch, standardized as the Robots Exclusion Protocol in RFC 9309 (2022). It is a voluntary request, not an enforcement mechanism. Compliant bots honor it, but blocking crawling neither removes already-indexed URLs nor expresses usage preferences, blocking is not retroactive for model training, and OpenAI and Perplexity document their own user-initiated fetchers as partially exempt. In the AI era it is the right first lever but structurally incapable of being the whole answer.


Robots.txt is the plain-text file at a site's root (https://example.com/robots.txt) that tells crawlers which URL paths they may fetch. Proposed by Martijn Koster in February 1994 and a de facto convention for nearly three decades, it was standardized as the Robots Exclusion Protocol (REP) in RFC 9309 in September 2022¹ ². The file's vocabulary is small: User-agent lines name which bot the following rules address, Disallow and Allow lines list path prefixes, and a Sitemap line points to the XML sitemap. That is essentially the whole protocol.

What robots.txt is not is an enforcement mechanism, and in the AI era that distinction carries most of the weight. RFC 9309 standardizes the file format and parsing rules; compliance is voluntary, and the protocol cannot compel any crawler to obey it¹. Google's own documentation states plainly that robots.txt "is not a mechanism for keeping a web page out of Google"³. A Disallow line is a request that well-behaved bots honor. Everything beyond that (who is actually asking, what they do with content they already have, and what happens when they ignore the request) belongs to other layers of AI access control.

Minimum viable robots.txt

# https://example.com/robots.txt
User-agent: *            # rules for every bot
Disallow: /drafts/       # please do not fetch these paths

User-agent: GPTBot       # rules for one named crawler
Disallow: /              # asks OpenAI's training crawler to stay out

Sitemap: https://example.com/sitemap.xml

Rules bind (voluntarily) per named user agent; the full per-engine table of AI crawler names and their documented purposes lives in the AI crawler bots entry.
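How a compliant crawler would read the file above can be sketched with Python's standard-library urllib.robotparser. This is only an illustration of rule matching: the helper name allowed and the agent name ExampleBot are hypothetical, and the standard-library parser is a simplified reader rather than a reference implementation of RFC 9309.

```python
from urllib.robotparser import RobotFileParser

# The minimum viable file from above, inlined so the sketch needs no network.
ROBOTS_TXT = """\
# https://example.com/robots.txt
User-agent: *            # rules for every bot
Disallow: /drafts/       # please do not fetch these paths

User-agent: GPTBot       # rules for one named crawler
Disallow: /              # asks OpenAI's training crawler to stay out

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Hypothetical helper: would a compliant bot named `agent` fetch `path`?"""
    return parser.can_fetch(agent, "https://example.com" + path)


print(allowed("GPTBot", "/blog/post"))       # False: GPTBot's own group disallows everything
print(allowed("ExampleBot", "/blog/post"))   # True: falls back to the * group, path not listed
print(allowed("ExampleBot", "/drafts/wip"))  # False: the * group disallows /drafts/
print(parser.site_maps())                    # ['https://example.com/sitemap.xml']
```

The result describes what the file requests. Whether a given crawler acts on it is outside the parser's knowledge, and outside the protocol's control.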

Status in 2026

Robots.txt is simultaneously at its most standardized and its most stressed. The standardized half: RFC 9309 has been a Proposed Standard since September 2022, and the major search and AI operators document robots.txt support for their declared crawlers (xAI's Grok is the documented exception, with no first-party crawler contract). New machine-readable usage-preference signals are also being designed to ride inside the same file: the IETF's AIPREF work attaches its proposed Content-Usage rule to robots.txt as a carrier, three decades after the format was sketched on a mailing list².

The stressed half is specific and well documented, mostly by the AI operators themselves. The "does robots.txt stop AI" question decomposes into four gaps. First, compliance is voluntary by design: declared crawlers like GPTBot and ClaudeBot document that they honor it, but nothing in the protocol compels anyone¹. Second, OpenAI and Perplexity document partial exemptions for their user-initiated fetchers: OpenAI states robots.txt rules "may not apply" to ChatGPT-User because a human asked for the page, and Perplexity states Perplexity-User "generally ignores" robots.txt for the same reason⁴. Note that this exemption is the operators' own classification (user-initiated fetches framed as browser-like use outside the protocol's scope), not a neutral reading of RFC 9309; the dispute below challenges exactly that boundary. Third, beyond the documented carve-outs sits undeclared crawling: Cloudflare's August 2025 investigation found Perplexity fetching never-indexed test domains through an undeclared Chrome-impersonating user agent after the declared crawler was disallowed; Perplexity disputed the framing, and it remains a single documented incident rather than a measured general rule⁵. Fourth, blocking is not retroactive: a Disallow: / for a training crawler stops future collection by that compliant bot; it removes nothing from models already trained.
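The first and third gaps share a mechanical root: rules are keyed on a self-declared user-agent token. A minimal sketch of that, again using urllib.robotparser, follows. The browser-like string is an illustrative assumption, not a string taken from any report, and the code models only what the file matches, not how any operator's crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# A file that disallows one declared crawler and says nothing about anyone else.
parser = RobotFileParser()
parser.parse([
    "User-agent: PerplexityBot",
    "Disallow: /",
])

URL = "https://example.com/article"

# Hypothetical request identities: the declared token, and a generic
# browser-style string (made up for illustration).
declared = "PerplexityBot"
browser_like = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

declared_ok = parser.can_fetch(declared, URL)
browser_ok = parser.can_fetch(browser_like, URL)

print(declared_ok)  # False: the rule matches the declared token
print(browser_ok)   # True: no group names this identity, so nothing is requested of it
```

The file has no way to express "whoever you really are"; a rule binds only a client that chooses to present the named token and chooses to honor the answer.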

There is also a quieter trap on the search side: blocking a URL in robots.txt does not deindex it. Google documents that a disallowed page can still be indexed if linked from elsewhere, and a page the crawler cannot fetch cannot show the engine its noindex directive, so robots.txt-blocking a page disables the one mechanism that would actually remove it³.
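That conflict can be checked mechanically. The sketch below assumes you already keep a list of URLs you intend to deindex with noindex; the function name noindex_conflicts, the sample file, and the sample URLs are all hypothetical. It flags any such URL that the same site's robots.txt asks the crawler not to fetch.

```python
from urllib.robotparser import RobotFileParser


def noindex_conflicts(robots_txt: str, noindex_urls: list[str],
                      agent: str = "Googlebot") -> list[str]:
    """Return the noindex URLs that `agent` is asked not to fetch.

    A compliant crawler will not fetch these pages, so it cannot see their
    noindex directive.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(agent, url)]


# Hypothetical site policy and deindex list.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

to_deindex = [
    "https://example.com/private/old-page",   # disallowed: noindex is unreadable
    "https://example.com/archive/old-page",   # fetchable: noindex can be read
]

conflicts = noindex_conflicts(SAMPLE_ROBOTS, to_deindex)
print(conflicts)  # ['https://example.com/private/old-page']
```

For any URL this reports, the usual remedy under the text above is to lift the Disallow so the noindex can be read, or to put the page behind authentication.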

How to apply

Robots.txt is the first lever of AI access control, not the last. Three concrete moves:

Audit what your file actually asks. Fetch your own /robots.txt and read it as a bot would: which user agents are named, which paths are disallowed for *. Validate behavior with Search Console's robots.txt report, and remember rules are matched per named agent, not inherited across them.
Split retrieval from training deliberately. Allowing retrieval bots (OAI-SearchBot, PerplexityBot) keeps you eligible for the citation surfaces those crawlers feed (eligibility, not a guarantee of citation); blocking training bots (GPTBot, ClaudeBot) or setting the Google-Extended control token is an independent licensing decision. Watch the trap: Google-Extended governs Gemini/Vertex training, not AI Overviews, which use Googlebot. Per-engine names and purposes: AI crawler bots.
Pair it with enforcement where it matters. For actors that ignore requests, the protocol has no answer; that layer is bot management and WAF rules (and, for verifying who is actually asking, the emerging Web Bot Auth). Treat robots.txt as the documented, machine-readable statement of crawl policy that compliant bots will follow, and assume nothing stronger.
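The audit step can be sketched as a small reader that lists which user agents a file names and what it asks of each. This is a simplified, hand-written illustration (the function name summarize_robots and the sample file are hypothetical), not a complete RFC 9309 parser: it ignores wildcards, precedence, and unknown directives.

```python
from collections import defaultdict


def summarize_robots(text: str) -> dict:
    """Group Allow/Disallow rules by user agent and collect Sitemap lines."""
    groups = defaultdict(list)   # lowercase agent token -> [(directive, path), ...]
    sitemaps = []
    current_agents = []          # agents named by the group being read
    reading_rules = False        # True once the current group has a rule line

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if reading_rules:                 # a User-agent after rules starts a new group
                current_agents, reading_rules = [], False
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            reading_rules = True
            for agent in current_agents:      # consecutive User-agent lines share rules
                groups[agent].append((field, value))
        elif field == "sitemap":
            sitemaps.append(value)
    return {"groups": dict(groups), "sitemaps": sitemaps}


SAMPLE = """\
User-agent: *
Disallow: /drafts/

User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

summary = summarize_robots(SAMPLE)
for agent, rules in summary["groups"].items():
    print(agent, rules)
print(summary["sitemaps"])
```

Reading the output per agent makes the non-inheritance point visible: each named agent carries only the rules of its own group, and the * group applies only to agents with no group of their own.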

What to skip: a blanket Disallow: / for every AI user agent as a reflex "AI opt-out." It removes your citation eligibility on the surfaces that depend on those crawlers (OpenAI states sites blocking OAI-SearchBot "will not be shown in ChatGPT search answers"), does not untrain existing models, and does not bind user-initiated fetchers. If the actual goal is "stay searchable, don't train on me," that is a per-agent split plus an emerging usage-preference signal, not a wall.
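The per-agent split might be generated and sanity-checked as follows. This is a sketch under stated assumptions: the agent names come from this entry, the grouping into retrieval and training lists is this example's own, build_split_policy is a hypothetical helper, and the usage-preference signal is not modeled at all.

```python
from urllib.robotparser import RobotFileParser

# Agent names as given in this entry. Google-Extended is a control token
# rather than a crawler, but it is set with the same User-agent syntax.
RETRIEVAL_BOTS = ["OAI-SearchBot", "PerplexityBot"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended"]


def build_split_policy(retrieval: list[str], training: list[str]) -> str:
    """Hypothetical helper: allow retrieval agents, disallow training agents."""
    lines = []
    for agent in retrieval:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    for agent in training:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    return "\n".join(lines)


policy = build_split_policy(RETRIEVAL_BOTS, TRAINING_BOTS)
print(policy)

# Check the generated file the way a compliant bot would read it.
parser = RobotFileParser()
parser.parse(policy.splitlines())
decisions = {
    agent: parser.can_fetch(agent, "https://example.com/guide")
    for agent in RETRIEVAL_BOTS + TRAINING_BOTS
}
print(decisions)
# {'OAI-SearchBot': True, 'PerplexityBot': True, 'GPTBot': False,
#  'ClaudeBot': False, 'Google-Extended': False}
```

As with any robots.txt, this expresses a request to compliant declared agents. It does not cover user-initiated fetchers, and it does not affect models already trained.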

How it relates to other concepts

AI access control is the umbrella this entry sits under: four questions, four mechanisms. Robots.txt answers only the first (may you fetch this URL), and the umbrella entry's worked example shows it combined with the other layers into one coherent policy.
AI crawler bots are the agents robots.txt addresses. That entry carries the per-engine user-agent table (training vs retrieval vs user-initiated, control tokens like Google-Extended) and the copy-paste ruleset; this entry covers what the protocol itself can and cannot promise.
AIPREF is the usage-preference layer robots.txt lacks: a Disallow cannot say "fetch this but don't train on it." The proposed Content-Usage rule rides inside robots.txt, making the same file a carrier for both access and preference signals.
llms.txt is frequently confused with robots.txt but is guidance, not access: a curated AI-readable map with no exclusion semantics and, unlike robots.txt, no confirmed engine support.
Web Bot Auth addresses the identity gap that makes robots.txt spoofable: user-agent strings are self-declared, so a rule binds whoever chooses to identify as that agent. Cryptographic crawler identity is what would let a site enforce per-bot policy rather than request it.

1. Koster, Illyes, Zeller, Sassman. "Robots Exclusion Protocol (REP)." RFC 9309, IETF Proposed Standard, September 2022. Standardizes the file location, User-agent / Allow / Disallow syntax, longest-match precedence, and caching behavior. The protocol is advisory: it defines how rules are expressed and parsed, and compliance by any given crawler is voluntary.
2. Martijn Koster proposed the robots.txt convention on the www-talk mailing list on February 25, 1994, while at Nexor; it became a de facto standard within months and remained one for 28 years before RFC 9309. History per Wikipedia: robots.txt. Google announced the push for IETF standardization in "Formalizing the Robots Exclusion Protocol Specification" (Google Search Central Blog, July 1, 2019).
3. Google Search Central, "Introduction to robots.txt": robots.txt "is used mainly to avoid overloading your site with requests" and "is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page." The same page documents that a disallowed URL "can still be indexed if linked to from other sites," appearing in results without a description.
4. OpenAI, bots documentation: ChatGPT-User performs user-initiated fetches and, "because these actions are initiated by a user, robots.txt rules may not apply"; GPTBot (training) and OAI-SearchBot (search retrieval) are documented as independent, robots.txt-respecting toggles, and sites blocking OAI-SearchBot "will not be shown in ChatGPT search answers, though can still appear as navigational links." Perplexity, crawler documentation: Perplexity-User supports user-initiated actions and "generally ignores" robots.txt rules.
5. Cloudflare, "Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives", August 4, 2025. Documented fetches of never-indexed test domains, whose robots.txt disallowed Perplexity's declared crawlers, via an undeclared Chrome-impersonating user agent on rotating IPs.




Last fact-checked 2026-06-12. Spotted an error or stale claim? See editorial methodology.

Changelog (2 entries)

2026-06-12: Initial publish: robots.txt is the crawl-access file standardized as RFC 9309 (2022), and this entry's focus is what it cannot do in the AI era: compliance is voluntary by the protocol's own design; blocking crawling neither removes already-indexed URLs nor expresses usage preferences; blocking is not retroactive for model training; and the major engines' user-initiated fetchers are documented by their own operators as partially exempt. Joins the infrastructure cluster as the fetch-access layer under AI access control, with the per-engine crawler detail kept in the AI crawler bots entry.
2026-06-12: Calibrated three overbroad claims after review: 'every major engine documents robots.txt support' corrected to declared-crawler scope with xAI's Grok as the documented exception; the user-initiated exemption is now attributed to the two operators that actually document it (OpenAI, Perplexity) and flagged as the operators' own classification rather than a neutral reading of the protocol; Cloudflare's stealth-crawling finding is noted as a single incident that Perplexity disputed. Allow-rule benefits restated as citation eligibility rather than a guarantee, and Google's July 2019 standardization announcement added as a primary source.