Robots.txt (Robots Exclusion Protocol)
Citation status: last checked 2026-06-12
Robots.txt is the plain-text file at a site's root (https://example.com/robots.txt) that tells crawlers which URL paths they may fetch. Proposed by Martijn Koster in February 1994 and a de facto convention for nearly three decades, it was standardized as the Robots Exclusion Protocol (REP) in RFC 9309 in September 2022 [1][2]. The file's vocabulary is small: User-agent lines name which bot the following rules address, Disallow and Allow lines list path prefixes, and a Sitemap line points to the XML sitemap. That is essentially the whole protocol.
Robots.txt is not an enforcement mechanism, and in the AI era that distinction carries most of the weight. RFC 9309 standardizes the file format and parsing rules; compliance is voluntary, and the protocol cannot compel any crawler to obey it [1]. Google's own documentation states plainly that robots.txt "is not a mechanism for keeping a web page out of Google" [3]. A Disallow line is a request that well-behaved bots honor. Everything beyond that (who is actually asking, what they do with content they already have, and what happens when they ignore the request) belongs to other layers of AI access control.
Minimum viable robots.txt
# https://example.com/robots.txt

User-agent: * # rules for every bot
Disallow: /drafts/ # please do not fetch these paths

User-agent: GPTBot # rules for one named crawler
Disallow: / # asks OpenAI's training crawler to stay out

Sitemap: https://example.com/sitemap.xml
Rules bind (voluntarily) per named user agent; the full per-engine table of AI crawler names and their documented purposes lives in the AI crawler bots entry.
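One way to see how a compliant parser reads the file above is Python's standard-library `urllib.robotparser`. This is a minimal sketch, not a description of how any particular crawler evaluates rules. `SomeOtherBot` is a made-up agent name, and the standard-library parser applies rules in file order rather than RFC 9309's longest-match precedence, which makes no difference for this file.

```python
from urllib.robotparser import RobotFileParser

# The sample file from this entry, inline comments included.
ROBOTS_TXT = """\
User-agent: *        # rules for every bot
Disallow: /drafts/   # please do not fetch these paths

User-agent: GPTBot   # rules for one named crawler
Disallow: /          # asks OpenAI's training crawler to stay out

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "SomeOtherBot" is a hypothetical agent that falls through to the * group.
for agent, url in [
    ("GPTBot", "https://example.com/post"),
    ("SomeOtherBot", "https://example.com/post"),
    ("SomeOtherBot", "https://example.com/drafts/idea"),
]:
    print(f"{agent:13} {url:35} allowed={rules.can_fetch(agent, url)}")

print("sitemaps:", rules.site_maps())  # site_maps() needs Python 3.8+
```

GPTBot matches its own group and is refused everywhere; the hypothetical bot gets only the `*` rules. The parser answers "what does the file ask", never "what will the crawler do".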
Status in 2026
Robots.txt is simultaneously at its most standardized and its most stressed. The standard half: RFC 9309 has been a Proposed Standard since September 2022, and the major search and AI operators document robots.txt support for their declared crawlers (xAI's Grok is the documented exception: no first-party crawler contract). New machine-readable usage-preference signals are also being designed to ride inside the same file: the IETF's AIPREF work attaches its proposed Content-Usage rule to robots.txt as a carrier, three decades after the format was sketched on a mailing list [2].
The stressed half is specific and well documented, mostly by the AI operators themselves. The "does robots.txt stop AI" question decomposes into four gaps:
- Compliance is voluntary by design. Declared crawlers like GPTBot and ClaudeBot document that they honor it, but nothing in the protocol compels anyone [1].
- User-initiated fetchers are partly exempt, by the operators' own account. OpenAI states robots.txt rules "may not apply" to ChatGPT-User because a human asked for the page, and Perplexity states Perplexity-User "generally ignores" robots.txt for the same reason [4]. This exemption is the operators' own classification (user-initiated fetches framed as browser-like use outside the protocol's scope), not a neutral reading of RFC 9309; the dispute in the next item challenges exactly that boundary.
- Undeclared crawling sits beyond the documented carve-outs. Cloudflare's August 2025 investigation found Perplexity fetching never-indexed test domains through an undeclared Chrome-impersonating user agent after the declared crawler was disallowed; Perplexity disputed the framing, and it remains a single documented incident rather than a measured general rule [5].
- Blocking is not retroactive. A Disallow: / for a training crawler stops future collection by that compliant bot; it removes nothing from models already trained.
There is also a quieter trap on the search side: blocking a URL in robots.txt does not deindex it. Google documents that a disallowed page can still be indexed if linked from elsewhere, and a page the crawler cannot fetch cannot show the engine its noindex directive, so robots.txt-blocking a page disables the one mechanism that would actually remove it [3].
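The noindex trap can be checked mechanically. The sketch below assumes you already have a list of URLs that carry a noindex directive (for example from a site crawl); `find_noindex_conflicts`, the sample rules, and the URLs are all hypothetical names chosen for illustration. It flags pages whose noindex a compliant crawler could never read because the same page is disallowed.

```python
from urllib.robotparser import RobotFileParser


def find_noindex_conflicts(robots_txt, noindex_urls, agent="Googlebot"):
    """Return noindex URLs that `agent` is also asked not to fetch.

    Each hit is a page whose noindex directive stays invisible to a
    compliant crawler, because the crawler never downloads the page.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(agent, url)]


# Hypothetical inputs: one rule group and two pages tagged noindex.
robots_txt = """\
User-agent: *
Disallow: /old/
"""
noindex_urls = [
    "https://example.com/old/pricing",  # disallowed AND noindex: conflict
    "https://example.com/thanks",       # fetchable, so noindex can be read
]

conflicts = find_noindex_conflicts(robots_txt, noindex_urls)
print(conflicts)  # ['https://example.com/old/pricing']
```

For each conflict, the usual fix is to lift the Disallow so the noindex can be seen, or to put the page behind authentication.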
How to apply
Robots.txt is the first lever of AI access control, not the last. Three concrete moves:
- Audit what your file actually asks. Fetch your own /robots.txt and read it as a bot would: which user agents are named, which paths are disallowed for *. Validate behavior with Search Console's robots.txt report, and remember rules are matched per named agent, not inherited across them.
- Split retrieval from training deliberately. Allowing retrieval bots (OAI-SearchBot, PerplexityBot) keeps you eligible for the citation surfaces those crawlers feed (eligibility, not a guarantee of citation); blocking training bots (GPTBot, ClaudeBot) or setting the Google-Extended control token is an independent licensing decision. Watch the trap: Google-Extended governs Gemini/Vertex training, not AI Overviews, which use Googlebot. Per-engine names and purposes: AI crawler bots.
- Pair it with enforcement where it matters. For actors that ignore requests, the protocol has no answer; that layer is bot management and WAF rules (and, for verifying who is actually asking, the emerging Web Bot Auth). Treat robots.txt as the documented, machine-readable statement of crawl policy that compliant bots will follow, and assume nothing stronger.
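The audit step can be roughed out in a few lines. This is a simplified sketch, not a full RFC 9309 parser: `audit_robots` is a hypothetical helper that only groups Allow and Disallow lines under the user agents that precede them, and the sample text is illustrative.

```python
def audit_robots(text):
    """Map each named user agent (lowercased) to its (directive, path) rules."""
    groups = {}
    current_agents = []
    in_rules = False  # True once the current group has seen a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


sample = """\
User-agent: *
Disallow: /drafts/

User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /
"""

for agent, rules in audit_robots(sample).items():
    print(agent, rules)
```

In the output, `gptbot` and `claudebot` carry only their own `Disallow: /`. The `/drafts/` rule belongs to `*` alone, which is the "not inherited" point in practice.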
What to skip: a blanket Disallow: / for every AI user agent as a reflex "AI opt-out." It removes your citation eligibility on the surfaces that depend on those crawlers (OpenAI states sites blocking OAI-SearchBot "will not be shown in ChatGPT search answers"), does not untrain existing models, and does not bind user-initiated fetchers. If the actual goal is "stay searchable, don't train on me," that is a per-agent split plus an emerging usage-preference signal, not a wall.
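The per-agent split can be generated and sanity-checked rather than hand-written. The sketch below is one possible shape, under assumptions: the agent names are the ones this entry mentions and should be confirmed against each operator's current documentation, `build_split_policy` is a hypothetical helper, and the check uses Python's standard-library parser, not any engine's own.

```python
from urllib.robotparser import RobotFileParser

# Names taken from this entry; verify them against the operators' docs.
RETRIEVAL_AGENTS = ["OAI-SearchBot", "PerplexityBot"]
TRAINING_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended"]


def build_split_policy(retrieval, training):
    """Emit one group per agent: allow retrieval, disallow training."""
    lines = []
    for agent in retrieval:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    for agent in training:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    return "\n".join(lines)


policy = build_split_policy(RETRIEVAL_AGENTS, TRAINING_AGENTS)
print(policy)

# Read the generated file back the way a compliant parser would.
parser = RobotFileParser()
parser.parse(policy.splitlines())
page = "https://example.com/guide"
for agent in RETRIEVAL_AGENTS + TRAINING_AGENTS:
    print(f"{agent:16} allowed={parser.can_fetch(agent, page)}")
```

The file states the policy for compliant bots only; it does not bind user-initiated fetchers or remove anything from models already trained.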
How it relates to other concepts
- AI access control is the umbrella this entry sits under: four questions, four mechanisms. Robots.txt answers only the first (may you fetch this URL), and the umbrella entry's worked example shows it combined with the other layers into one coherent policy.
- AI crawler bots are the agents robots.txt addresses. That entry carries the per-engine user-agent table (training vs retrieval vs user-initiated, control tokens like Google-Extended) and the copy-paste ruleset; this entry covers what the protocol itself can and cannot promise.
- AIPREF is the usage-preference layer robots.txt lacks: a Disallow cannot say "fetch this but don't train on it." The proposed Content-Usage rule rides inside robots.txt, making the same file a carrier for both access and preference signals.
- llms.txt is frequently confused with robots.txt but is guidance, not access: a curated AI-readable map with no exclusion semantics and, unlike robots.txt, no confirmed engine support.
- Web Bot Auth addresses the identity gap that makes robots.txt spoofable: user-agent strings are self-declared, so a rule binds whoever chooses to identify as that agent. Cryptographic crawler identity is what would let a site enforce per-bot policy rather than request it.
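The identity gap is easy to demonstrate: rule matching keys entirely on the name a client chooses to present. This is a minimal sketch using Python's standard-library parser; the version suffix and the browser-like string are illustrative values, not strings any real crawler is documented to send.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "https://example.com/article"

# Both strings are illustrative. The only thing the rule can key on is
# the self-declared product name at the front of the user-agent string.
declared = "PerplexityBot/1.0"
generic = "Mozilla/5.0 (X11; Linux x86_64) Chrome/120.0 Safari/537.36"

print(parser.can_fetch(declared, URL))  # False: the named group applies
print(parser.can_fetch(generic, URL))   # True: no group names this agent
```

Nothing in the file can tell who is really behind either string, which is why enforcement has to happen at a different layer.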
Footnotes
1. Koster, Illyes, Zeller, Sassman. "Robots Exclusion Protocol (REP)." RFC 9309, IETF Proposed Standard, September 2022. Standardizes the file location, User-agent/Allow/Disallow syntax, longest-match precedence, and caching behavior. The protocol is advisory: it defines how rules are expressed and parsed, and compliance by any given crawler is voluntary.
2. Martijn Koster proposed the robots.txt convention on the www-talk mailing list on February 25, 1994, while at Nexor; it became a de facto standard within months and remained one for 28 years before RFC 9309. History per Wikipedia: robots.txt. Google announced the push for IETF standardization in "Formalizing the Robots Exclusion Protocol Specification" (Google Search Central Blog, July 1, 2019).
3. Google Search Central, "Introduction to robots.txt": robots.txt "is used mainly to avoid overloading your site with requests" and "is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page." The same page documents that a disallowed URL "can still be indexed if linked to from other sites," appearing in results without a description.
4. OpenAI, bots documentation: ChatGPT-User performs user-initiated fetches and, "because these actions are initiated by a user, robots.txt rules may not apply"; GPTBot (training) and OAI-SearchBot (search retrieval) are documented as independent, robots.txt-respecting toggles, and sites blocking OAI-SearchBot "will not be shown in ChatGPT search answers, though can still appear as navigational links." Perplexity, crawler documentation: Perplexity-User supports user-initiated actions and "generally ignores" robots.txt rules.
5. Cloudflare, "Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives", August 4, 2025. Documented fetches of never-indexed test domains, whose robots.txt disallowed Perplexity's declared crawlers, via an undeclared Chrome-impersonating user agent on rotating IPs.
FAQ
- Does robots.txt stop AI from crawling or using my content?
- Only partially, and only for compliant bots. Robots.txt is voluntary by design (RFC 9309 standardizes the format, not enforcement), so well-behaved declared crawlers like GPTBot honor it while nothing compels others. The operators themselves document carve-outs: OpenAI states robots.txt rules 'may not apply' to its user-initiated ChatGPT-User fetcher, and Perplexity states Perplexity-User 'generally ignores' robots.txt. Cloudflare additionally documented undeclared stealth crawling in August 2025. And blocking is not retroactive: disallowing a training crawler today does not remove content from models already trained.
- Does blocking a page in robots.txt remove it from Google or AI search?
- No. Google's own documentation is explicit that robots.txt 'is not a mechanism for keeping a web page out of Google': a disallowed URL can still be indexed if other sites link to it, appearing in results without a description. There is also a structural trap: a noindex directive only works if the crawler can fetch the page to see it, so robots.txt-blocking a page prevents the engine from reading the very directive that would deindex it. To keep a page out of results, use noindex or authentication, not a Disallow rule.
- Should I block AI crawlers in robots.txt?
- Decide per crawler purpose, not as one switch. Blocking retrieval crawlers (OAI-SearchBot, PerplexityBot) removes you from those engines' citation surfaces; OpenAI states blocked sites 'will not be shown in ChatGPT search answers.' Blocking training crawlers (GPTBot, ClaudeBot) or setting control tokens (Google-Extended) is a separate licensing and IP decision that does not affect search visibility, with one trap: Google-Extended governs Gemini and Vertex training, not AI Overviews, which use Googlebot. A common deliberate policy is to allow retrieval and block training.
- What is the difference between robots.txt, llms.txt, and AIPREF?
- They answer different questions at different layers. Robots.txt is fetch access: may a crawler retrieve this URL at all. llms.txt is guidance: a curated, AI-readable map of a site's important content, with no access semantics. AIPREF (the Content-Usage signal) is usage preference: given the content, what may it be used for, such as training versus search. AIPREF can even ride inside robots.txt as a proposed Content-Usage rule, so the layers coexist in one file. None of the three verify who is asking; that is Web Bot Auth's lane.
Sources & further reading
- RFC 9309: Robots Exclusion Protocol (Koster, Illyes, Zeller, Sassman; Proposed Standard, September 2022)
- Google Search Central: Introduction to robots.txt ('not a mechanism for keeping a web page out of Google')
- Google Search Central Blog: Formalizing the Robots Exclusion Protocol Specification (July 2019, the IETF standardization push)
- OpenAI: bots documentation (GPTBot / OAI-SearchBot / ChatGPT-User; 'robots.txt rules may not apply' to user-initiated fetches)
- Perplexity: crawler documentation (PerplexityBot / Perplexity-User; user-initiated fetches 'generally ignore' robots.txt)
- Cloudflare: Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives (August 4, 2025)
- Wikipedia: robots.txt (Martijn Koster's February 1994 proposal and protocol history)