In 2026, the server logs of a typical enterprise website show a new pattern. Alongside Googlebot, Bingbot and the classical SEO bots, new user agents appear with rising frequency: GPTBot, OAI-SearchBot, ClaudeBot, anthropic-ai, PerplexityBot, CCBot, Google-Extended, Applebot-Extended. On many domains, these crawlers now make up 15-35% of bot traffic — and the trend is upward.
The widespread reflex — blanket blocking via robots.txt or Cloudflare rules — is strategically fatal. Block GPTBot and you are telling OpenAI: "Please train your next model without our content." That may be legally sensible for publisher properties; for almost every other business model it is existentially threatening in the long run.
The most relevant AI crawlers at a glance
The overview below shows the twelve most relevant AI crawlers in 2026 with operator, purpose and exact user-agent string for robots.txt and log-file configuration.
| Crawler | Operator | Training data | Live search | User-agent string |
|---|---|---|---|---|
| GPTBot | OpenAI | yes | no | GPTBot |
| OAI-SearchBot | OpenAI | no | yes (ChatGPT Search) | OAI-SearchBot |
| ChatGPT-User | OpenAI | no | yes (browsing mode) | ChatGPT-User |
| ClaudeBot | Anthropic | yes | no | ClaudeBot |
| Claude-Web | Anthropic | no | yes (Claude.ai browsing) | Claude-Web |
| PerplexityBot | Perplexity | no | yes | PerplexityBot |
| Perplexity-User | Perplexity | no | yes (user-initiated) | Perplexity-User |
| Google-Extended | Google | yes (Gemini, AI Overviews) | no | Google-Extended |
| Applebot-Extended | Apple | yes (Apple Intelligence) | no | Applebot-Extended |
| CCBot | Common Crawl | yes (training corpus) | no | CCBot |
| Meta-ExternalAgent | Meta | yes (Llama) | no | Meta-ExternalAgent |
| Bytespider | ByteDance | yes | no | Bytespider |
Note on classification: "Training data: yes" means the crawler collects content for future model iterations — a block only affects your brand's model representation months later. "Live search: yes" marks crawlers that pull content in real time for answers inside search products (ChatGPT Search, Perplexity, Claude.ai browsing) — a block removes you immediately from citable sources. This distinction is the strategic basis for any differentiated robots.txt.
OpenAI: GPTBot and OAI-SearchBot
OpenAI runs several crawlers with distinct functions. GPTBot collects training data for future model iterations; OAI-SearchBot and ChatGPT-User perform live retrieval when ChatGPT answers search queries. The distinction is fundamental: block GPTBot but allow OAI-SearchBot, and you can still be cited in ChatGPT, but your content no longer flows into the model's long-term memory.
Anthropic: ClaudeBot & anthropic-ai
Anthropic runs ClaudeBot for training-data collection and anthropic-ai / Claude-Web for specific product features. Its robots.txt compliance is better than that of some competitors, which makes it easier to verify that blocks actually take effect.
Google-Extended
A separate user agent that applies exclusively to training for Google's Bard/Gemini products — not for the classical Googlebot index. Block Google-Extended and your site disappears from Gemini training, but remains in the Google search index.
Apple: Applebot-Extended
Analogous to Google-Extended: an opt-out user agent that Apple introduced in 2024. It blocks the use of content for training Apple Intelligence products without touching the regular Applebot index that powers Siri and Spotlight suggestions.
PerplexityBot & CCBot
PerplexityBot collects for Perplexity's hybrid search system. CCBot is the crawler of the Common Crawl project, whose data in turn serves as a training base for nearly every large LLM. Blocking CCBot has cascading effects across many models.
The strategic core decision: what to allow, what to block?
The answer depends on the business model. Three main scenarios:
Scenario 1: brands and service providers (default recommendation)
For brands, service providers, B2B vendors and most corporate sites, AI visibility is a marketing asset, not a content loss. Allow every relevant AI crawler, control crawl budget, monitor server load.
Scenario 2: publishers and news outlets
A more complex trade-off space. Blanket blocking protects current content value but costs future relevance in the AI era. Many top publishers now run a hybrid course: block GPTBot (training) and allow OAI-SearchBot (live retrieval). You stay citable but prevent training appropriation.
Scenario 3: sensitive or legally exposed content
Pages with personal data, legally protected content or paywalled material: blanket block, plus IP-based rate limits.
Robots.txt: the right configuration
A clean, differentiated robots.txt for the default scenario:
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
# Sensitive areas
User-agent: *
Disallow: /checkout/
Disallow: /account/
Disallow: /internal/
A frequent mistake: User-agent: * with Disallow: / blocks every crawler — including every AI crawler that respects robots.txt. Differentiation is essential.
Managing crawl budget and server load
AI crawlers create real costs. On an enterprise site with 50,000 pages, combined crawling by several AI bots can add multiple gigabytes of traffic per day (a log-based sketch for quantifying this follows the list below). Without control, this leads to:
- Higher cloud bills (CDN bandwidth, origin requests)
- Rate-limit issues on backend APIs
- Degraded user experience under insufficient capacity
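Before tuning anything, it helps to put a number on that cost. A minimal sketch in Python, assuming a combined-format Apache/Nginx access log named `access.log`; the bot list and file path are placeholders to adapt:

```python
import re
from collections import defaultdict

# AI crawler substrings to look for in the User-Agent field (adjust to your own list)
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
           "Google-Extended", "CCBot", "Bytespider"]

# Combined log format: IP ... [timestamp] "request" status bytes "referer" "user-agent"
LINE = re.compile(r'\[(\d{2}/\w{3}/\d{4}):[^\]]+\] "[^"]*" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"')

hits = defaultdict(int)     # (bot, day) -> request count
volume = defaultdict(int)   # (bot, day) -> bytes served

with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE.search(line)
        if not match:
            continue
        day, _status, size, user_agent = match.groups()
        for bot in AI_BOTS:
            if bot in user_agent:
                hits[(bot, day)] += 1
                volume[(bot, day)] += 0 if size == "-" else int(size)
                break

for (bot, day), count in sorted(hits.items()):
    print(f"{day}  {bot:<16} {count:>6} hits  {volume[(bot, day)] / 1e6:8.1f} MB")
```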
Practical measures:
- Optimize caching for bot traffic: aggressive edge-caching strategies for HTML (CDN level), because AI bots usually parse only HTML and do not need JS execution
- Crawl-delay in robots.txt: `Crawl-delay: 5` (seconds between requests) is respected by many AI crawlers
- Cloudflare/Fastly bot management: differentiated rate limits per user agent
- Sitemap optimization: prioritize the most important content; do not list less important pages in the sitemap
JavaScript rendering and AI crawlers
A critical technical point that is often underestimated: most AI crawlers do not execute JavaScript. While Googlebot renders complex pages through Chromium Headless, GPTBot, ClaudeBot and PerplexityBot see only the initially served HTML. Dynamic content loaded client-side via React/Vue/Angular is invisible to these crawlers.
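A quick way to see what a non-rendering crawler actually receives is to fetch the page without any JavaScript execution and check whether the content you care about is already present in the raw HTML. A minimal sketch using only the Python standard library; the URL and the phrases to check are placeholders:

```python
import urllib.request

# Placeholders: your own URL and the key content a non-rendering crawler must see
URL = "https://example.com/product/widget"
MUST_CONTAIN = [
    "Widget Pro 3000",             # product name
    "application/ld+json",         # structured data present before rendering?
    "Frequently asked questions",  # section that is often lazy-loaded
]

request = urllib.request.Request(URL, headers={"User-Agent": "raw-html-check/0.1"})
with urllib.request.urlopen(request, timeout=10) as response:
    raw_html = response.read().decode("utf-8", errors="replace")

# This is roughly what GPTBot or ClaudeBot sees: no JS, no lazy loading, no hydration
for phrase in MUST_CONTAIN:
    status = "OK " if phrase in raw_html else "MISSING"
    print(f"{status:<8} {phrase}")
```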
The concrete consequences:
- Single-page applications (SPAs) must use server-side rendering (SSR) or static site generation (SSG) to be visible to LLMs
- Infinite-scroll content is mostly missed — relevant content must be delivered initially
- Lazy-loaded content (images, sections) needs fallback structures in the source HTML
- JSON-LD in the source HTML works more reliably than dynamically injected schema markup
[Infographic: bot traffic from AI crawlers on enterprise sites (2026); JavaScript rendering support across AI crawlers; typical crawl interval for top pages]
The status code that hurts
An often-overlooked factor: 429 Too Many Requests and 503 Service Unavailable to AI crawlers signal to the system, over time, that the source is unreliable. Several large LLM providers reduce crawl frequency after repeated errors or deprioritize the source for future training runs. An under-dimensioned server can systematically erode your AI visibility — without classical SEO reports catching it.
Structured data: the LLM accelerator
Where classical SEO teams treat schema markup as a CTR booster for rich snippets, structured markup has a more fundamental function in the AI era: it reduces ambiguity for models and raises the probability of correct information extraction.
Especially effective:
- `Organization` with a full `sameAs` array (Wikipedia, Wikidata, LinkedIn, Crunchbase)
- `Article` with a clear `author` entity (as a `Person` schema, not just a name)
- `DefinedTerm` for concept definitions
- `FAQPage` with clearly answered questions
- `HowTo` with structured steps
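If your schema markup is currently injected client-side, moving it into the server-rendered HTML is usually a small change. A minimal sketch that builds an `Organization` block with a `sameAs` array as a JSON-LD snippet ready for the page head; all names and URLs below are placeholders:

```python
import json

# Placeholder organization data; replace with your own entity details
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example GmbH",
    "url": "https://example.com",
    "logo": "https://example.com/assets/logo.png",
    "sameAs": [
        "https://en.wikipedia.org/wiki/Example_GmbH",
        "https://www.wikidata.org/wiki/Q00000000",
        "https://www.linkedin.com/company/example-gmbh",
        "https://www.crunchbase.com/organization/example-gmbh",
    ],
}

# Emit the <script> tag to place in the server-rendered <head>,
# so it is visible to crawlers that do not execute JavaScript.
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(organization, indent=2, ensure_ascii=False)
    + "\n</script>"
)
print(snippet)
```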
Monitoring: what you should measure
A modern technical-SEO monitor actively includes AI crawlers:
- Bot traffic by user agent: log daily, review monthly
- Response-code distribution per bot: 200s should be > 95%
- Crawl depth per bot: which directories are visited? Are important sections missing?
- Crawl-frequency trends: is attention from specific AI systems rising or falling?
- Correlation with LLM visibility: reconcile prompt-audit results with crawl activity
The complete robots.txt for differentiated AI-crawler access
A typical production setup for a B2B brand with high reputation interest that also protects monetized archives:
# SUMAX Enterprise Reference Configuration
# Last updated: 2026-03-01
User-agent: Googlebot
Allow: /
User-agent: Google-Extended
Allow: /
Disallow: /members/
Disallow: /internal/
User-agent: GPTBot
Allow: /
Disallow: /members/
Disallow: /pricing-calculator/
Disallow: /internal/
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
Disallow: /members/
User-agent: anthropic-ai
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Allow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: *
Allow: /
Disallow: /cgi-bin/
Disallow: /search?
Disallow: /*?utm_
Disallow: /print/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
The distinctions matter: GPTBot (training) must not see monetizable assets; OAI-SearchBot (live retrieval for ChatGPT Search) sees everything, because that is where citation value is created. Google-Extended is not blanket-blocked: block it and you disappear from AI Overviews even though regular ranking stays intact, one of the most common strategic mistakes of 2024.
Log-file analysis: the operational gold standard
Crawler behaviour cannot be measured with SEO tools — only with server logs. A minimal setup for AI-crawler analysis:
# Extract AI-crawler hits from an Apache/Nginx access log (combined log format)
# Output columns: client IP, date, request path, status code, bot name
grep -E "GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot|Google-Extended" access.log \
  | awk -v OFS='\t' '{
      match($0, /GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot|Google-Extended/)
      print $1, substr($4, 2, 11), $7, $9, substr($0, RSTART, RLENGTH)
    }' > ai_crawler_hits.tsv
# Aggregation: hits per bot per day per path pattern
# Target metrics:
# - hit rate per path cluster (/blog/*, /product/*, /case-study/*)
# - 2xx rate per bot (target > 97%)
# - median response time per bot (target < 600 ms)
# - re-crawl interval (median delta between two hits of the same path)
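The aggregation described in the comments can then run on that TSV. A minimal Python sketch covering hits and 2xx rate per bot, day and path cluster, assuming the five-column file produced above; the cluster prefixes are assumptions to adapt to your own URL structure:

```python
from collections import defaultdict

# Path clusters are assumptions; replace them with your own URL structure
CLUSTERS = ["/blog/", "/product/", "/case-study/"]

def cluster_of(path: str) -> str:
    for prefix in CLUSTERS:
        if path.startswith(prefix):
            return prefix + "*"
    return "other"

hits = defaultdict(int)      # (bot, day, cluster) -> total hits
success = defaultdict(int)   # (bot, day, cluster) -> 2xx hits

with open("ai_crawler_hits.tsv", encoding="utf-8") as tsv:
    for line in tsv:
        _ip, day, path, status, bot = line.rstrip("\n").split("\t")
        key = (bot, day, cluster_of(path))
        hits[key] += 1
        if status.startswith("2"):
            success[key] += 1

for (bot, day, cluster), count in sorted(hits.items()):
    rate = 100 * success[(bot, day, cluster)] / count
    print(f"{day}  {bot:<16} {cluster:<16} {count:>5} hits  {rate:5.1f}% 2xx")
```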
A healthy crawl pattern for enterprise domains:
[Benchmark figures: GPTBot hits/day for a mid-sized enterprise domain (~5k URLs); typical re-crawl interval for top content; target 2xx rate per AI crawler]
JavaScript rendering: the invisible citation barrier
With the exception of OAI-SearchBot and PerplexityBot, the AI crawlers in the wild do not render JavaScript. They read only the initial HTML document; anything loaded client-side simply does not exist for them.
Practical consequences:
- SPA architectures without SSR are invisible to training crawlers. React pages with CSR only deliver an empty
<div id="root"></div>to GPTBot. - Cookie walls in front of content prevent any citation. Even if Google sees the content later, the training crawl already left empty-handed.
- Lazy-loaded text blocks are not captured. Anything that is faded in "further down" via JS is invisible to trainers.
- Web components without a light-DOM fallback are equally opaque.
Solutions, ranked by effort:
- Activate SSR. Next.js, Nuxt and Remix deliver server-rendered HTML out of the box. Minimum effort, maximum effect.
- Dynamic rendering (server-side renders for bots, client-side for users). Acceptable as a bridge, not recommended long term; a minimal sketch follows this list.
- Prerendering. Static HTML snapshots on the CDN served on bot detection. Tools: Prerender.io, Rendertron.
- Content migration to MDX/Markdown sources with static build. The cleanest solution for content platforms.
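For option 2, a sketch of what dynamic rendering looks like at the application layer: a tiny WSGI app that serves a pre-rendered HTML snapshot when the user agent matches a known AI crawler and falls back to the normal client-side shell otherwise. The snapshot directory, shell path and bot list are assumptions; in production this logic usually lives at the CDN or reverse-proxy layer instead:

```python
from pathlib import Path
from wsgiref.simple_server import make_server

# Assumed locations: pre-rendered snapshots and the CSR shell of the SPA
SNAPSHOT_DIR = Path("prerendered")     # e.g. prerendered/blog/post.html
SPA_SHELL = Path("dist/index.html")    # client-side rendered shell
AI_BOTS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Bytespider")

def app(environ, start_response):
    user_agent = environ.get("HTTP_USER_AGENT", "")
    path = environ.get("PATH_INFO", "/").strip("/") or "index"
    snapshot = SNAPSHOT_DIR / f"{path}.html"   # note: no path sanitization, illustration only

    if any(bot in user_agent for bot in AI_BOTS) and snapshot.is_file():
        body = snapshot.read_bytes()   # static HTML for non-rendering crawlers
    else:
        body = SPA_SHELL.read_bytes()  # normal users get the client-side app

    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body]

if __name__ == "__main__":
    make_server("127.0.0.1", 8000, app).serve_forever()
```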
Rate limiting, CDN policy and the 429 dead-zone effect
Aggressive WAF/CDN rules (Cloudflare, Akamai, Fastly) often block AI crawlers unnoticed. Typical scenario: the WAF sees an unusual user-agent pattern, classifies it as bot traffic, throttles to 10 req/min. GPTBot hits the limit, receives 429 Too Many Requests and backs off — for weeks. The domain disappears from LLM outputs even though robots.txt is clean.
Controls:
- Add explicit WAF allowlist rules for verified AI-crawler IP ranges (OpenAI publishes its ranges; Anthropic does too)
- Verify crawlers via reverse DNS plus forward DNS, not just the UA string (UA spoofing is trivial); a sketch follows this list
- Rate limits for AI crawlers at least 10× higher than standard bot limits
- Monitoring: review 4xx/5xx rates per bot weekly
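The reverse-plus-forward DNS check can be scripted with the standard library: reverse-resolve the IP, compare the hostname against the operator's domain suffixes, then forward-resolve that hostname and confirm it maps back to the same IP. The suffix list below is an assumption to validate against each operator's current documentation:

```python
import socket

# Assumed hostname suffixes per operator; verify against current provider docs
BOT_HOST_SUFFIXES = {
    "GPTBot": (".openai.com",),
    "ClaudeBot": (".anthropic.com",),
    "PerplexityBot": (".perplexity.ai",),
}

def verify_crawler_ip(ip: str, bot: str) -> bool:
    """Reverse DNS + forward DNS check for a claimed crawler IP."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)   # reverse lookup
    except socket.herror:
        return False
    if not hostname.endswith(BOT_HOST_SUFFIXES.get(bot, ())):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]      # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips

if __name__ == "__main__":
    print(verify_crawler_ip("203.0.113.10", "GPTBot"))  # documentation IP, prints False
```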
Sitemap strategy: separate signals for separate purposes
A single sitemap.xml is no longer sufficient for modern AI infrastructure. We recommend a three-sitemap structure:
- sitemap-core.xml: canonical, durable URLs. `changefreq` weekly, `priority` 0.8-1.0. For training crawlers.
- sitemap-news.xml: news format with a publication node. For OAI-SearchBot and PerplexityBot. Dynamic, only the last 72 hours.
- sitemap-knowledge.xml: definitional and evergreen content (pillar pages, glossary, studies). Especially important for LLM training.
The split helps crawlers prioritize content by lifecycle and purpose. GPTBot spends disproportionate budget in sitemap-knowledge, OAI-SearchBot in sitemap-news. A monolithic sitemap forces identical prioritization on both scenarios — suboptimal.
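A minimal sketch of the 72-hour window for sitemap-news.xml, using only the standard library and without the news namespace extension; the article list stands in for whatever CMS query you actually run:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

# Stand-in for a CMS query; each entry is (URL, publication datetime)
ARTICLES = [
    ("https://example.com/news/ai-crawler-update", datetime(2026, 3, 1, 8, 0, tzinfo=timezone.utc)),
    ("https://example.com/news/older-story", datetime(2026, 2, 10, 9, 30, tzinfo=timezone.utc)),
]

cutoff = datetime.now(timezone.utc) - timedelta(hours=72)

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for url, published in ARTICLES:
    if published < cutoff:
        continue  # only the last 72 hours belong in sitemap-news.xml
    entry = ET.SubElement(urlset, "url")
    ET.SubElement(entry, "loc").text = url
    ET.SubElement(entry, "lastmod").text = published.isoformat()

ET.ElementTree(urlset).write("sitemap-news.xml", encoding="utf-8", xml_declaration=True)
```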
Monitoring dashboard: what gets reviewed weekly
Technical AI-crawler governance needs its own dashboard. Six core metrics:
- Crawler coverage: share of the URL population visited at least once by every relevant AI crawler in the past 30 days. Target: > 85%.
- Response quality: 2xx rate per bot. Target: > 97%.
- Re-crawl latency: median interval between updates and the first re-crawl. Target: < 7 days for top content.
- Blocked ratio: 4xx/5xx or 429 responses per bot. Target: < 2%.
- Rendered-content ratio: Lighthouse-based check on which share of content is visible pre-JS. Target: > 90%.
- Citation correlation: match between heavily crawled paths and LLM citation outcomes from prompt audits.
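The coverage metric from point 1 can be computed directly from the crawl TSV and a list of canonical URLs. A sketch, assuming the five-column ai_crawler_hits.tsv from the log-file section and a plain-text file of canonical paths (one per line), over whatever window the log covers:

```python
from collections import defaultdict

# Assumed inputs: canonical paths (one per line) and the TSV from the log-file section
with open("canonical_paths.txt", encoding="utf-8") as f:
    all_paths = {line.strip() for line in f if line.strip()}

crawled = defaultdict(set)   # bot -> set of crawled paths
with open("ai_crawler_hits.tsv", encoding="utf-8") as tsv:
    for line in tsv:
        _ip, _day, path, _status, bot = line.rstrip("\n").split("\t")
        crawled[bot].add(path.split("?")[0])   # ignore query strings

for bot, paths in sorted(crawled.items()):
    coverage = 100 * len(paths & all_paths) / len(all_paths)
    print(f"{bot:<16} coverage: {coverage:5.1f}%  (target > 85%)")
```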
The invisible 15% domain
Our audits regularly reveal enterprise domains where 15-30% of all URLs are effectively unreachable for AI crawlers — not because of robots.txt, but because of WAF throttling, outdated SSL configuration or false JS-rendering assumptions. This gap is often unknown internally because classical SEO tools do not surface it. Only the combination of log-file analysis, prompt audit and infrastructure check exposes it.
Conclusion
Technical SEO is not a settled topic in the AI era — it is a strategically upgraded field. The infrastructure decisions you make today determine whether your brand is stored as a reliable source in the next model generations — or remains a fragmented, contradictory entity in the noise.
Blanket blocking may feel defensively correct. For most business models it is a strategic self-limitation with a long downstream effect.