AI access control
Last checked 2026-06-04
AI access control is the umbrella term for the set of signals a website uses to govern how AI systems discover and fetch its content, identify themselves when requesting it, and use it afterwards. There is no single switch. Instead there is a small family of mechanisms, each answering a different question, and the recurring mistake is to collapse them into one imagined "AI opt-out" when they are genuinely separate actions.
"Access control" is used here in a broad publisher-policy sense: most of the mechanisms below are signals or preferences, not hard technical access controls (authentication, paywalls, WAF rules). The surface also spans both restriction (access, usage, identity) and discovery (notifying engines of new content), so it is better read as a control-and-discovery map than as a set of blocks.
The control surface breaks into distinct questions:
| Question | Signal | What it does | Honest status |
|---|---|---|---|
| May a bot fetch this URL? | robots.txt (RFC 9309) | Legacy crawl-access directives | Standardized, but voluntary: compliant bots honor it, others may not |
| What is worth reading here? | llms.txt | A curated, AI-readable map of the site | Community proposal; no major engine has confirmed using it |
| What may the content be used for? | AIPREF / Content-Usage | Usage-preference vocabulary (train-ai, search; y/n) | IETF working group, pre-standardization; voluntary |
| Who is the requester? | Web Bot Auth | Cryptographic verification of a bot's declared identity | Emerging IETF-track; verifies identity, does not itself block |
| How do I notify engines of changed URLs? | IndexNow | Instant URL push to participating engines | Live and adopted, but a discovery push, not a control |
All of these signals are aimed at AI crawler bots (GPTBot, PerplexityBot, ClaudeBot and others), a group that spans training crawlers, search and retrieval crawlers, and user-triggered agents.
Two rows sit in the map for context rather than because they restrict anything: llms.txt only guides what AI is pointed to read, and IndexNow only accelerates discovery. Neither controls access or usage; they are included because publishers weigh them alongside the genuine access, usage, and identity signals.
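The fetch-access row of the table can be checked locally. The following is a minimal sketch, assuming a hypothetical robots.txt policy (the rules and URLs are invented for illustration); it uses Python's standard `urllib.robotparser` to show what a compliant crawler would conclude. It says nothing about what a non-compliant crawler will do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; the rules are assumptions,
# not a recommended policy.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return what a *compliant* crawler would conclude about fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "PerplexityBot"):
        allowed = may_fetch(ROBOTS_TXT, agent, "https://example.com/article")
        print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

The result only answers "may this bot fetch this URL?"; usage and identity remain separate questions handled by other signals.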
Status in 2026
The defining feature of AI access control in 2026 is that it is fragmented and mostly voluntary. Each question above has its own signal at its own maturity: robots.txt is a settled standard, IndexNow is live, Web Bot Auth and AIPREF are emerging IETF-track work, and llms.txt is a community proposal with no confirmed engine support. There is no unified "AI policy" file that answers all of these questions at once, and the efforts that exist deliberately stay in their own lanes (the AIPREF charter, for instance, puts crawler authentication explicitly out of scope, leaving that to Web Bot Auth).
The practical consequence is that "controlling AI access" is not one decision but several, and most of the levers are requests rather than guarantees. robots.txt, llms.txt, and AIPREF all rely on the AI operator choosing to comply. Web Bot Auth changes the picture only in that it lets a site verify identity (so it can decide whom to serve), but verification is not itself a block. The one thing none of these provide is enforcement against an operator who ignores them; that requires technical access control (authentication, rate limits, WAF rules) and, ultimately, contracts and licensing.
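To see why a preference signal is a request rather than a guarantee, it helps to notice where the check actually runs: inside the AI operator's own code. The sketch below is illustrative only. It assumes a simplified `key=y/n` comma-separated string like the `train-ai=n, search=y` form used on this page; the real AIPREF attachment syntax is still pre-standardization and may differ. The function names are hypothetical.

```python
from typing import Optional


def parse_usage_preferences(value: str) -> dict[str, bool]:
    """Parse a simplified 'train-ai=n, search=y' string into {category: allowed}.

    Assumption: a flat comma-separated list of key=y|n pairs. Malformed items
    are skipped rather than guessed at.
    """
    prefs: dict[str, bool] = {}
    for item in value.split(","):
        key, sep, flag = item.strip().partition("=")
        key, flag = key.strip().lower(), flag.strip().lower()
        if sep and key and flag in ("y", "n"):
            prefs[key] = flag == "y"
    return prefs


def stated_preference(value: str, purpose: str) -> Optional[bool]:
    """Return the publisher's stated preference, or None if none was expressed."""
    return parse_usage_preferences(value).get(purpose)


if __name__ == "__main__":
    header = "train-ai=n, search=y"
    print(stated_preference(header, "train-ai"))
    print(stated_preference(header, "search"))
```

Nothing on the publisher's side forces an operator to call a function like this or to act on its result, which is the sense in which the signal is voluntary.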
How to apply
The useful move is to pick the signal that matches the action you actually want, and to stop expecting one of them to do another's job:
- To limit fetching by compliant crawlers: use robots.txt. Understand it is honored by well-behaved bots and ignored by others, and that it governs fetching only, not downstream use.
- To declare a usage preference (e.g. allow search, disallow training): use AIPREF (train-ai=n, search=y). Treat it as a machine-readable statement of intent whose effect depends on voluntary adoption, and note the attachment syntax is still pre-standardization.
- To guide what AI reads, not whether it may: use llms.txt as curation, while remembering no major engine has confirmed consuming it.
- To know who is actually requesting: adopt Web Bot Auth as it matures, so you can distinguish a verified operator from a spoofed user agent before deciding how to respond.
- To get new content noticed faster: use IndexNow. This is the discovery side of the surface, the opposite of restriction.
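Of the options in this list, the IndexNow push is the most straightforward to show concretely. The following is a hedged sketch that only builds the request and does not send it. The endpoint and JSON field names reflect the IndexNow documentation as understood at the time of writing and should be verified against the official site; the host, key, and URLs are placeholders.

```python
import json
from urllib.request import Request

# Assumed shared endpoint; participating engines also expose their own.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host: str, key: str, urls: list[str]) -> Request:
    """Build (but do not send) a bulk URL submission request."""
    payload = {
        "host": host,
        "key": key,
        # Assumption: the key file is hosted at the site root as <key>.txt.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": urls,
    }
    return Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_indexnow_request(
        "example.com", "placeholder-key", ["https://example.com/new-post"]
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Sending the request (for example with `urllib.request.urlopen`) is left out on purpose, since it requires a real key file hosted on the site.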
Worked example, to make the disambiguation concrete: to stay citable in AI search while opting out of training, you might set AIPREF train-ai=n, search=y, keep robots.txt allowing search crawlers, and adopt Web Bot Auth so a verified search crawler can be told apart from a spoofed trainer. Three signals, one coherent policy, and not a single "opt-out" switch.
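The worked example can be sketched as two separate artifacts produced from one policy decision. This is an illustration under stated assumptions: `build_policy` is a hypothetical helper, the list of search crawlers is supplied by the caller, and the usage-preference string follows the simplified `train-ai=n, search=y` form whose attachment syntax is not yet standardized. The Web Bot Auth step is omitted because it needs real signature verification.

```python
from urllib.robotparser import RobotFileParser


def build_policy(search_agents: list[str]) -> dict[str, str]:
    """Return the robots.txt text and usage-preference value for the example.

    The two outputs answer different questions: robots.txt covers fetching,
    the preference string covers downstream use.
    """
    robots_lines: list[str] = []
    for agent in search_agents:
        robots_lines += [f"User-agent: {agent}", "Allow: /", ""]
    return {
        "robots_txt": "\n".join(robots_lines),
        # Assumed value format; verify against the current AIPREF drafts.
        "content_usage": "train-ai=n, search=y",
    }


if __name__ == "__main__":
    policy = build_policy(["PerplexityBot"])
    parser = RobotFileParser()
    parser.parse(policy["robots_txt"].splitlines())
    print(parser.can_fetch("PerplexityBot", "https://example.com/post"))
    print(policy["content_usage"])
```

Keeping the outputs separate mirrors the point of the example: the training preference lives in the usage signal, not in the crawl rules.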
What to skip:
- Treating any one signal as a complete "AI opt-out." Blocking crawling, opting out of training, guiding reading, and verifying identity are four different actions; no single file does all four.
- Assuming a preference signal is enforcement. robots.txt, llms.txt, and AIPREF are requests; an operator can ignore them. Reserve "control" language for technical access control and contracts.
- Waiting for one unified standard before acting. The space is fragmented by design; use the mature signal for each goal now rather than expecting convergence soon.
How it relates to other concepts
- AIPREF is the usage-preference layer: it answers "what may the content be used for," and its charter deliberately excludes the identity question, which is why it and Web Bot Auth are complementary rather than overlapping.
- Web Bot Auth is the identity layer: cryptographic verification of which bot is asking, the prerequisite for any access decision that depends on trusting the requester.
- llms.txt is the guidance layer: it shapes what AI reads rather than whether it may, and sits alongside (not inside) the access and preference signals.
- IndexNow is the discovery side of the same surface: where the others restrict or qualify access, IndexNow accelerates it, which is why it belongs in the same map even though it points the opposite direction.
- AI crawler bots are the agents these signals target: the family of signals only makes sense relative to the user agents (GPTBot, PerplexityBot, ClaudeBot, and others) they are meant to govern.
- Upstream of generative engine optimization strategy: deciding which signals to set (stay visible in AI search while declaring a training preference, for instance) is a GEO-policy decision that this map is meant to make legible.
Part of Infrastructure· editorial cluster, not a semantic link
Also in this cluster: AI crawler bots · AIPREF (AI usage preferences) · IndexNow Protocol · LLMS.txt · Web Bot Auth
FAQ
- What is AI access control?
- AI access control is the umbrella term for the signals a website uses to govern how AI systems discover and fetch its content, identify themselves when requesting it, and use it afterwards. There is no single switch; instead there are several distinct mechanisms answering distinct questions: robots.txt controls whether a crawler may fetch a URL, llms.txt offers a curated AI-readable map of the site, AIPREF (the Content-Usage signal) declares what the content may be used for (such as training versus search), and Web Bot Auth cryptographically verifies which bot is making a request. Most are voluntary, and they are easily conflated.
- How do I stop AI from training on my content?
- There is no fully reliable technical switch, and this is exactly where the signals get conflated. Blocking a crawler in robots.txt may stop fetching by a compliant bot but does not by itself declare a training-use preference; declaring train-ai=n through AIPREF states a preference about training use but does not enforce it; and neither verifies who is actually requesting (that is Web Bot Auth). All of these depend on voluntary compliance by the AI operator. The honest answer is that you can express preferences and block compliant crawlers, but technical access control plus contracts and licensing, not a single tag, are what create an actual barrier. If content must not be accessed or used at all, the reliable path is not to publish it publicly: put it behind authentication, paywalls, or contractual licensing with technical controls.
- Is robots.txt enough to control AI access?
- No, for two reasons. First, robots.txt only governs fetch access (may a crawler retrieve the URL); it says nothing about what an AI system may do with content it has already obtained, which is the usage-preference question AIPREF addresses, nor about verifying the requester's identity, which Web Bot Auth addresses. Second, robots.txt is a voluntary standard: well-behaved bots honor it, but it is a request, not an enforcement mechanism. Treating robots.txt as a complete AI-control solution conflates fetch access with usage preference and identity, which are separate problems.
Sources & further reading
- RFC 9309: Robots Exclusion Protocol (the standardized robots.txt, the legacy fetch-access layer), 2022-09-01
- IETF AI Preferences (aipref) working group (the usage-preference layer; see the AIPREF entry)
- RFC 9421: HTTP Message Signatures (the basis of Web Bot Auth, the identity layer; see the Web Bot Auth entry), 2024-02-01
- IndexNow protocol official site (the discovery-side push signal; backed by Bing, Naver, Seznam, Yandex, Yep)
- llms.txt proposal (Jeremy Howard, September 2024; the guidance layer; see the llms.txt entry), 2024-09-03