In 2026, the server logs of a typical enterprise website show a new pattern. Alongside Googlebot, Bingbot and the classical SEO bots, new user agents appear with rising frequency: GPTBot, OAI-SearchBot, ClaudeBot, anthropic-ai, PerplexityBot, CCBot, Google-Extended, Applebot-Extended. On many domains, these crawlers now make up 15-35% of bot traffic — and the trend is upward.

The widespread reflex — blanket blocking via robots.txt or Cloudflare rules — is strategically fatal. Block GPTBot and you are telling OpenAI: "Please train your next model without our content." That may be legally sensible for publisher properties; for almost every other business model it is existentially threatening in the long run.

The most relevant AI crawlers at a glance

The overview below shows the twelve most relevant AI crawlers in 2026 with operator, purpose and exact user-agent string for robots.txt and log-file configuration.

| Crawler | Operator | Training data | Live search | User-agent string |
|---|---|---|---|---|
| GPTBot | OpenAI | yes | no | GPTBot |
| OAI-SearchBot | OpenAI | no | yes (ChatGPT Search) | OAI-SearchBot |
| ChatGPT-User | OpenAI | no | yes (browsing mode) | ChatGPT-User |
| ClaudeBot | Anthropic | yes | no | ClaudeBot |
| Claude-Web | Anthropic | no | yes (Claude.ai browsing) | Claude-Web |
| PerplexityBot | Perplexity | no | yes | PerplexityBot |
| Perplexity-User | Perplexity | no | yes (user-initiated) | Perplexity-User |
| Google-Extended | Google | yes (Gemini, AI Overviews) | no | Google-Extended |
| Applebot-Extended | Apple | yes (Apple Intelligence) | no | Applebot-Extended |
| CCBot | Common Crawl | yes (training corpus) | no | CCBot |
| Meta-ExternalAgent | Meta | yes (Llama) | no | Meta-ExternalAgent |
| Bytespider | ByteDance | yes | no | Bytespider |

Note on classification: "Training data: yes" means the crawler collects content for future model iterations — a block only affects your brand's model representation months later. "Live search: yes" marks crawlers that pull content in real time for answers inside search products (ChatGPT Search, Perplexity, Claude.ai browsing) — a block removes you immediately from citable sources. This distinction is the strategic basis for any differentiated robots.txt.

OpenAI: GPTBot and OAI-SearchBot

OpenAI operates separate crawlers with distinct functions. GPTBot collects training data for future model iterations. OAI-SearchBot performs live retrieval for ChatGPT Search, and ChatGPT-User fetches pages when a user asks ChatGPT to browse. The distinction is fundamental: block GPTBot but allow OAI-SearchBot, and you can still be cited in ChatGPT, but your content will no longer migrate into model memory.

Anthropic: ClaudeBot & anthropic-ai

Anthropic runs ClaudeBot for training-data collection and anthropic-ai / Claude-Web for specific product features. Anthropic's robots.txt compliance is more consistent than that of some competitors, which makes blocks easy to verify and enforce.
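
That verification is simple to do in the logs. A minimal check, assuming the Apache/Nginx combined log format (field 7 is the request path) and a Disallow rule for /members/ as in the reference configuration further below:

# Count ClaudeBot hits on a disallowed path
grep "ClaudeBot" access.log | awk '$7 ~ /^\/members\//' | wc -l
# expected: 0 once the robots.txt change has propagated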

Google-Extended

A separate user agent that applies exclusively to AI training for Google's Gemini products (formerly Bard), not to the classical Googlebot index. Block Google-Extended and your site disappears from Gemini training, but remains in the Google search index.

Apple: Applebot-Extended

Analogous to Google-Extended: an opt-out user agent that Apple introduced in 2024. It excludes content from training for Apple Intelligence without touching the regular Applebot crawl that feeds Siri and Spotlight.

PerplexityBot & CCBot

PerplexityBot crawls for Perplexity's hybrid search system, feeding live answers rather than training runs. CCBot is the crawler of the Common Crawl project, whose corpus in turn serves as a training base for nearly every large LLM, so blocking CCBot has cascading effects across many models.

The strategic core decision: what to allow, what to block?

The answer depends on the business model. Three main scenarios:

Scenario 1: brands and service providers (default recommendation)

For brands, service providers, B2B vendors and most corporate sites, AI visibility is a marketing asset, not a content loss. Allow every relevant AI crawler, control crawl budget, monitor server load.

Scenario 2: publishers and news outlets

A more complex trade-off space. Blanket blocking protects current content value but costs future relevance in the AI era. Many top publishers therefore run a hybrid approach: block GPTBot (training) and allow OAI-SearchBot (live retrieval). You stay citable but keep your content out of the training corpus.
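
A minimal robots.txt for this hybrid approach:

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /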

Scenario 3: sensitive or legally exposed content

Pages with personal data, legally protected content or paywalled material warrant a blanket block, plus IP-based rate limits, because robots.txt is only advisory and non-compliant crawlers ignore it.

Robots.txt: the right configuration

A clean, differentiated robots.txt for the default scenario:

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

# Sensitive areas
User-agent: *
Disallow: /checkout/
Disallow: /account/
Disallow: /internal/

A frequent mistake: User-agent: * with Disallow: / blocks every crawler — including every AI crawler that respects robots.txt. Differentiation is essential.
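
For contrast, the anti-pattern in full:

User-agent: *
Disallow: /

Two lines that remove the domain from every compliant crawler, classical and AI alike.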

Managing crawl budget and server load

AI crawlers create real costs. On an enterprise site with 50,000 pages, the combined crawl volume of several AI bots can add multiple gigabytes of traffic per day. Without control, this leads to:

  - rising bandwidth and infrastructure costs,
  - degraded response times for human visitors during crawl peaks,
  - 429/503 responses under load, which erode AI visibility over time (see the operator insight below).

Practical measures: per-bot rate limits at the CDN that stay above normal crawl bursts, cached HTML delivery for known bot user agents, and a log-based alert when a single crawler exceeds its expected daily volume.
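
A minimal sketch for quantifying the load, assuming the combined log format (field 4 is the timestamp, field 10 the response size in bytes):

# Total AI-crawler bytes per day
grep -E "GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot|CCBot" access.log \
  | awk '{gsub(/\[/, "", $4); split($4, t, ":"); bytes[t[1]] += $10}
         END {for (d in bytes) printf "%s  %.1f MB\n", d, bytes[d]/1048576}' \
  | sort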

JavaScript rendering and AI crawlers

A critical technical point that is often underestimated: most AI crawlers do not execute JavaScript. While Googlebot renders complex pages through headless Chromium, GPTBot, ClaudeBot and PerplexityBot see only the initially served HTML. Dynamic content loaded client-side via React/Vue/Angular is invisible to these crawlers.

The concrete consequences:

  - single-page applications deliver nearly empty documents to these bots,
  - content behind client-side tabs, accordions and infinite scroll never reaches a model,
  - pages that rank well in Google can be entirely absent from AI answers.
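
A quick way to see what a non-rendering crawler receives: fetch the raw HTML with a simulated bot user agent and search for a phrase your page only renders client-side ("Request a demo" is a placeholder):

# A count of 0 means the phrase only exists after JavaScript runs
curl -s -A "GPTBot" https://example.com/product \
  | grep -c "Request a demo"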

15-35%: share of bot traffic from AI crawlers on enterprise sites (2026)

0%: JavaScript rendering by most AI crawlers

7-14 days: typical crawl interval for top pages

Operator Insight

The status code that hurts

An often-overlooked factor: 429 Too Many Requests and 503 Service Unavailable responses to AI crawlers signal, over time, that the source is unreliable. Several large LLM providers reduce crawl frequency after repeated errors or deprioritize the source for future training runs. An under-provisioned server can therefore systematically erode your AI visibility, without classical SEO reports catching it.
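
A minimal sketch for tracking this failure mode, again assuming the combined log format with the status code in field 9:

# Share of 429/503 responses per AI crawler
for bot in GPTBot OAI-SearchBot ClaudeBot PerplexityBot; do
  total=$(grep -c "$bot" access.log)
  throttled=$(grep "$bot" access.log | awk '$9 == 429 || $9 == 503' | wc -l)
  [ "$total" -gt 0 ] && echo "$bot: $throttled/$total throttled"
done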

Structured data: the LLM accelerator

Where classical SEO teams treat schema markup as a CTR booster for rich snippets, structured markup has a more fundamental function in the AI era: it reduces ambiguity for models and raises the probability of correct information extraction.

Especially effective:

  - Organization and Person markup for unambiguous entity resolution,
  - FAQPage and HowTo for directly quotable question-answer pairs,
  - Article with author and date information for provenance signals,
  - Product with concrete, machine-readable attributes.
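
A minimal example of the pattern, an Organization snippet in JSON-LD; names and URLs are placeholders:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example GmbH",
  "url": "https://example.com",
  "logo": "https://example.com/logo.png",
  "sameAs": [
    "https://www.linkedin.com/company/example",
    "https://github.com/example"
  ]
}
</script>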

Monitoring: what you should measure

A modern technical-SEO monitor actively includes AI crawlers:

  - hits per bot per day, split by path cluster,
  - 2xx rate and median response time per bot,
  - re-crawl interval for top content,
  - alerts on sudden drops in crawl frequency per bot.

The complete robots.txt for differentiated AI-crawler access

A typical production setup for a B2B brand with a strong interest in AI visibility that also needs to protect monetized archives:

# SUMAX Enterprise Reference Configuration
# Last updated: 2026-03-01

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Allow: /
Disallow: /members/
Disallow: /internal/

User-agent: GPTBot
Allow: /
Disallow: /members/
Disallow: /pricing-calculator/
Disallow: /internal/

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /
Disallow: /members/

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Allow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: *
Allow: /
Disallow: /cgi-bin/
Disallow: /search?
Disallow: /*?utm_
Disallow: /print/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

The distinctions matter: GPTBot (training) must not see monetizable assets; OAI-SearchBot (live retrieval for ChatGPT Search) sees everything, because that is where citation value is created. Google-Extended is not blanket-blocked: blocking it removes you from Gemini and AI Overviews even though regular ranking stays intact, one of the most common strategic mistakes of 2024.

Log-file analysis: the operational gold standard

Crawler behaviour cannot be measured with SEO tools — only with server logs. A minimal setup for AI-crawler analysis:

# Extract AI-crawler hits from an Apache/Nginx access log (combined format)
# Output columns: IP, path, status code, timestamp

grep -E "GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot|Google-Extended" access.log \
  | awk -v OFS='\t' '{gsub(/\[/, "", $4); print $1, $7, $9, $4}' \
  > ai_crawler_hits.tsv

# Aggregation: hits per bot per day per path pattern
# Target metrics:
#   - hit rate per path cluster (/blog/*, /product/*, /case-study/*)
#   - 2xx rate per bot (target > 97%)
#   - median response time per bot (target < 600 ms)
#   - re-crawl interval (median delta between two hits of the same path)
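
A minimal sketch for the aggregation step, hits per bot per day (combined log format assumed; path-cluster splitting omitted for brevity):

# Hits per bot per day
awk 'match($0, /GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot|Google-Extended/) {
       bot = substr($0, RSTART, RLENGTH)
       gsub(/\[/, "", $4); split($4, t, ":")
       count[bot "\t" t[1]]++
     }
     END {for (k in count) print k "\t" count[k]}' access.log | sort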

A healthy crawl pattern for enterprise domains:

4-12k: GPTBot hits/day for mid-sized enterprise domains (~5k URLs)

3-7 days: healthy re-crawl interval for top content

> 97%: target 2xx rate per AI crawler

JavaScript rendering: the invisible citation barrier

With few exceptions, AI crawlers in the wild do not render JavaScript: they read only the initial HTML document. Anything loaded client-side simply does not exist for them.

Practical consequences: React/Vue/Angular SPAs without server-side rendering are effectively blank pages for AI crawlers; lazy-loaded product data, reviews and FAQ widgets never enter a training corpus or a live answer; and the brand gets cited less, not because the content is weak, but because it is technically invisible.

Solutions, ranked by effort:

  1. Activate SSR. Next.js, Nuxt and Remix deliver server-rendered HTML out of the box. Minimum effort, maximum effect.
  2. Dynamic rendering (server-side renders for bots, client-side for users). Acceptable as a bridge, not recommended long term; see the nginx sketch after this list.
  3. Prerendering. Static HTML snapshots on the CDN served on bot detection. Tools: Prerender.io, Rendertron.
  4. Content migration to MDX/Markdown sources with static build. The cleanest solution for content platforms.
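
For option 2, a minimal nginx sketch. It assumes a prerender service listens on 127.0.0.1:3000; the user-agent list is illustrative, not complete:

# In the http {} context: flag AI-crawler user agents
map $http_user_agent $ai_bot {
    default                                             0;
    "~*(GPTBot|ClaudeBot|PerplexityBot|OAI-SearchBot)"  1;
}

server {
    listen 80;
    server_name example.com;

    location / {
        # Bots get prerendered HTML; humans get the normal SPA delivery
        if ($ai_bot) {
            proxy_pass http://127.0.0.1:3000;
        }
        try_files $uri $uri/ /index.html;
    }
}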

Rate limiting, CDN policy and the 429 dead-zone effect

Aggressive WAF/CDN rules (Cloudflare, Akamai, Fastly) often block AI crawlers unnoticed. Typical scenario: the WAF sees an unusual user-agent pattern, classifies it as bot traffic, throttles to 10 req/min. GPTBot hits the limit, receives 429 Too Many Requests and backs off — for weeks. The domain disappears from LLM outputs even though robots.txt is clean.

Controls:

  - whitelist verified AI-crawler user agents in WAF rules, ideally validated against the providers' published IP ranges,
  - set per-bot rate limits well above normal crawl bursts,
  - alert on a rising share of 429/503 responses per bot in the weekly log review,
  - test the response from outside at regular intervals.
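
A quick outside-in check with simulated user agents (a 403 or 429 here is a red flag; a 200 does not prove that the providers' verified IP ranges are allowed):

for ua in "GPTBot" "ClaudeBot" "PerplexityBot" "OAI-SearchBot"; do
  code=$(curl -s -o /dev/null -w "%{http_code}" -A "$ua" https://example.com/)
  echo "$ua: HTTP $code"
done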

Sitemap strategy: separate signals for separate purposes

A single sitemap.xml is no longer sufficient for modern AI infrastructure. We recommend a three-sitemap structure:

  1. sitemap-core.xml — canonical, durable URLs. changefreq weekly, priority 0.8-1.0. For training crawlers.
  2. sitemap-news.xml — news-sitemap format with publication metadata. For OAI-SearchBot, PerplexityBot. Dynamic, covering only the last 72 hours.
  3. sitemap-knowledge.xml — definitional/evergreen content (pillar pages, glossary, studies). Especially important for LLM training.

The split helps crawlers prioritize content by lifecycle and purpose. GPTBot spends disproportionate budget in sitemap-knowledge, OAI-SearchBot in sitemap-news. A monolithic sitemap forces identical prioritization on both scenarios, which is suboptimal. The index file tying the three together is sketched below.
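
URLs and dates are placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-core.xml</loc>
    <lastmod>2026-03-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-news.xml</loc>
    <lastmod>2026-03-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-knowledge.xml</loc>
    <lastmod>2026-03-01</lastmod>
  </sitemap>
</sitemapindex>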

Monitoring dashboard: what gets reviewed weekly

Technical AI-crawler governance needs its own dashboard. Six core metrics:

  1. hits per AI bot per day, trended against the previous four weeks,
  2. 2xx rate per bot (target > 97%),
  3. share of 429/503 responses per bot,
  4. median response time per bot (target < 600 ms),
  5. re-crawl interval for the top 100 URLs,
  6. share of strategically important URLs with at least one AI-crawler hit in the last 14 days.

Operator Insight

The invisible 15-30%

Our audits regularly reveal enterprise domains where 15-30% of all URLs are effectively unreachable for AI crawlers — not because of robots.txt, but because of WAF throttling, outdated SSL configuration or false JS-rendering assumptions. This gap is often unknown internally because classical SEO tools do not surface it. Only the combination of log-file analysis, prompt audit and infrastructure check exposes it.

Conclusion

Technical SEO is not a settled topic in the AI era — it is a strategically upgraded field. The infrastructure decisions you make today determine whether your brand is stored as a reliable source in the next model generations — or remains a fragmented, contradictory entity in the noise.

Blanket blocking may feel defensively correct. For most business models it is a strategic self-limitation with a long downstream effect.