In 2026, the server logs of a typical enterprise website show a new pattern. Alongside Googlebot, Bingbot and the classical SEO bots, new user agents appear with rising frequency: GPTBot, OAI-SearchBot, ClaudeBot, anthropic-ai, PerplexityBot, CCBot, Google-Extended, Applebot-Extended. On many domains, these crawlers now make up 15-35% of bot traffic — and the trend is upward.
The widespread reflex — blanket blocking via robots.txt or Cloudflare rules — is strategically fatal. Block GPTBot and you are telling OpenAI: "Please train your next model without our content." That may be legally sensible for publisher properties; for almost every other business model it is existentially threatening in the long run.
The most relevant AI crawlers at a glance
The overview below shows the twelve most relevant AI crawlers in 2026 with operator, purpose and exact user-agent string for robots.txt and log-file configuration.
| Crawler | Operator | Training data | Live search | User-agent string |
|---|---|---|---|---|
| GPTBot | OpenAI | yes | no | GPTBot |
| OAI-SearchBot | OpenAI | no | yes (ChatGPT Search) | OAI-SearchBot |
| ChatGPT-User | OpenAI | no | yes (browsing mode) | ChatGPT-User |
| ClaudeBot | Anthropic | yes | no | ClaudeBot |
| Claude-Web | Anthropic | no | yes (Claude.ai browsing) | Claude-Web |
| PerplexityBot | Perplexity | no | yes | PerplexityBot |
| Perplexity-User | Perplexity | no | yes (user-initiated) | Perplexity-User |
| Google-Extended | Google | yes (Gemini, AI Overviews) | no | Google-Extended |
| Applebot-Extended | Apple | yes (Apple Intelligence) | no | Applebot-Extended |
| CCBot | Common Crawl | yes (training corpus) | no | CCBot |
| Meta-ExternalAgent | Meta | yes (Llama) | no | Meta-ExternalAgent |
| Bytespider | ByteDance | yes | no | Bytespider |
Note on classification: "Training data: yes" means the crawler collects content for future model iterations — a block only affects your brand's model representation months later. "Live search: yes" marks crawlers that pull content in real time for answers inside search products (ChatGPT Search, Perplexity, Claude.ai browsing) — a block removes you immediately from citable sources. This distinction is the strategic basis for any differentiated robots.txt.
OpenAI: GPTBot and OAI-SearchBot
OpenAI runs several crawlers with distinct functions. GPTBot collects training data for future model iterations; OAI-SearchBot and ChatGPT-User perform live retrieval when ChatGPT answers search queries. The distinction is fundamental: block GPTBot but allow OAI-SearchBot, and you can still be cited in ChatGPT, but your content no longer flows into the model's long-term memory.
Anthropic: ClaudeBot & anthropic-ai
Anthropic runs ClaudeBot for training-data collection and anthropic-ai / Claude-Web for specific product features. Its robots.txt compliance is better than that of some competitors, which makes it easier to verify that blocks actually take effect.
Google-Extended
A separate user agent that applies exclusively to training for Google's Bard/Gemini products — not for the classical Googlebot index. Block Google-Extended and your site disappears from Gemini training, but remains in the Google search index.
Apple: Applebot-Extended
Analogous to Google-Extended: an opt-out user agent that Apple introduced in 2024. It blocks the use of content for training Apple Intelligence products without touching the regular Applebot index that powers Siri and Spotlight suggestions.
PerplexityBot & CCBot
PerplexityBot collects for Perplexity's hybrid search system. CCBot is the crawler of the Common Crawl project, whose data in turn serves as a training base for nearly every large LLM. Blocking CCBot has cascading effects across many models.
The strategic core decision: what to allow, what to block?
The answer depends on the business model. Three main scenarios:
Scenario 1: brands and service providers (default recommendation)
For brands, service providers, B2B vendors and most corporate sites, AI visibility is a marketing asset, not a content loss. Allow every relevant AI crawler, control crawl budget, monitor server load.
Scenario 2: publishers and news outlets
A more complex trade-off space. Blanket blocking protects current content value but costs future relevance in the AI era. Many top publishers now run a hybrid course: block GPTBot (training) and allow OAI-SearchBot (live retrieval). You stay citable but prevent training appropriation.
Scenario 3: sensitive or legally exposed content
Pages with personal data, legally protected content or paywalled material: blanket block, plus IP-based rate limits.
Robots.txt: the right configuration
A clean, differentiated robots.txt for the default scenario:
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
# Sensitive areas
User-agent: *
Disallow: /checkout/
Disallow: /account/
Disallow: /internal/
A frequent mistake: User-agent: * with Disallow: / blocks every crawler — including every AI crawler that respects robots.txt. Differentiation is essential.
Managing crawl budget and server load
AI crawlers create real costs. On an enterprise site with 50,000 pages, combined crawling by several AI bots can add multiple gigabytes of traffic per day (a log-based sketch for quantifying this follows the list below). Without control, this leads to:
- Higher cloud bills (CDN bandwidth, origin requests)
- Rate-limit issues on backend APIs
- Degraded user experience under insufficient capacity
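Before tuning anything, it helps to put a number on that cost. A minimal sketch in Python, assuming a combined-format Apache/Nginx access log named `access.log`; the bot list and file path are placeholders to adapt:

```python
import re
from collections import defaultdict

# AI crawler substrings to look for in the User-Agent field (adjust to your own list)
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
           "Google-Extended", "CCBot", "Bytespider"]

# Combined log format: IP ... [timestamp] "request" status bytes "referer" "user-agent"
LINE = re.compile(r'\[(\d{2}/\w{3}/\d{4}):[^\]]+\] "[^"]*" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"')

hits = defaultdict(int)     # (bot, day) -> request count
volume = defaultdict(int)   # (bot, day) -> bytes served

with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE.search(line)
        if not match:
            continue
        day, _status, size, user_agent = match.groups()
        for bot in AI_BOTS:
            if bot in user_agent:
                hits[(bot, day)] += 1
                volume[(bot, day)] += 0 if size == "-" else int(size)
                break

for (bot, day), count in sorted(hits.items()):
    print(f"{day}  {bot:<16} {count:>6} hits  {volume[(bot, day)] / 1e6:8.1f} MB")
```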
Practical measures:
- Optimize caching for bot traffic: aggressive edge-caching strategies for HTML (CDN level), because AI bots usually parse only HTML and do not need JS execution
- Crawl-delay in robots.txt: `Crawl-delay: 5` (seconds between requests) is respected by many AI crawlers
- Cloudflare/Fastly bot management: differentiated rate limits per user agent
- Sitemap optimization: prioritize the most important content; do not list less important pages in the sitemap
JavaScript rendering and AI crawlers
A critical technical point that is often underestimated: most AI crawlers do not execute JavaScript. While Googlebot renders complex pages through Chromium Headless, GPTBot, ClaudeBot and PerplexityBot see only the initially served HTML. Dynamic content loaded client-side via React/Vue/Angular is invisible to these crawlers.
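A quick way to see what a non-rendering crawler actually receives is to fetch the page without any JavaScript execution and check whether the content you care about is already present in the raw HTML. A minimal sketch using only the Python standard library; the URL and the phrases to check are placeholders:

```python
import urllib.request

# Placeholders: your own URL and the key content a non-rendering crawler must see
URL = "https://example.com/product/widget"
MUST_CONTAIN = [
    "Widget Pro 3000",             # product name
    "application/ld+json",         # structured data present before rendering?
    "Frequently asked questions",  # section that is often lazy-loaded
]

request = urllib.request.Request(URL, headers={"User-Agent": "raw-html-check/0.1"})
with urllib.request.urlopen(request, timeout=10) as response:
    raw_html = response.read().decode("utf-8", errors="replace")

# This is roughly what GPTBot or ClaudeBot sees: no JS, no lazy loading, no hydration
for phrase in MUST_CONTAIN:
    status = "OK " if phrase in raw_html else "MISSING"
    print(f"{status:<8} {phrase}")
```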
The concrete consequences:
- Single-page applications (SPAs) must use server-side rendering (SSR) or static site generation (SSG) to be visible to LLMs
- Infinite-scroll content is mostly missed — relevant content must be delivered initially
- Lazy-loaded content (images, sections) needs fallback structures in the source HTML
- JSON-LD in the source HTML works more reliably than dynamically injected schema markup
[Infographic: bot traffic from AI crawlers on enterprise sites (2026); JavaScript rendering support across AI crawlers; typical crawl interval for top pages]
The status code that hurts
An often-overlooked factor: 429 Too Many Requests and 503 Service Unavailable to AI crawlers signal to the system, over time, that the source is unreliable. Several large LLM providers reduce crawl frequency after repeated errors or deprioritize the source for future training runs. An under-dimensioned server can systematically erode your AI visibility — without classical SEO reports catching it.
Structured data: the LLM accelerator
Where classical SEO teams treat schema markup as a CTR booster for rich snippets, structured markup has a more fundamental function in the AI era: it reduces ambiguity for models and raises the probability of correct information extraction.
Especially effective:
- `Organization` with a full `sameAs` array (Wikipedia, Wikidata, LinkedIn, Crunchbase)
- `Article` with a clear `author` entity (as a `Person` schema, not just a name)
- `DefinedTerm` for concept definitions
- `FAQPage` with clearly answered questions
- `HowTo` with structured steps
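If your schema markup is currently injected client-side, moving it into the server-rendered HTML is usually a small change. A minimal sketch that builds an `Organization` block with a `sameAs` array as a JSON-LD snippet ready for the page head; all names and URLs below are placeholders:

```python
import json

# Placeholder organization data; replace with your own entity details
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example GmbH",
    "url": "https://example.com",
    "logo": "https://example.com/assets/logo.png",
    "sameAs": [
        "https://en.wikipedia.org/wiki/Example_GmbH",
        "https://www.wikidata.org/wiki/Q00000000",
        "https://www.linkedin.com/company/example-gmbh",
        "https://www.crunchbase.com/organization/example-gmbh",
    ],
}

# Emit the <script> tag to place in the server-rendered <head>,
# so it is visible to crawlers that do not execute JavaScript.
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(organization, indent=2, ensure_ascii=False)
    + "\n</script>"
)
print(snippet)
```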
Monitoring: what you should measure
A modern technical-SEO monitor actively includes AI crawlers:
- Bot traffic by user agent: log daily, review monthly
- Response-code distribution per bot: 200s should be > 95%
- Crawl depth per bot: which directories are visited? Are important sections missing?
- Crawl-frequency trends: is attention from specific AI systems rising or falling?
- Correlation with LLM visibility: reconcile prompt-audit results with crawl activity
The complete robots.txt for differentiated AI-crawler access
A typical production setup for a B2B brand with high reputation interest that also protects monetized archives:
# SUMAX Enterprise Reference Configuration
# Last updated: 2026-03-01
User-agent: Googlebot
Allow: /
User-agent: Google-Extended
Allow: /
Disallow: /members/
Disallow: /internal/
User-agent: GPTBot
Allow: /
Disallow: /members/
Disallow: /pricing-calculator/
Disallow: /internal/
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
Disallow: /members/
User-agent: anthropic-ai
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Allow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: *
Allow: /
Disallow: /cgi-bin/
Disallow: /search?
Disallow: /*?utm_
Disallow: /print/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
The distinctions matter: GPTBot (training) must not see monetizable assets; OAI-SearchBot (live retrieval for ChatGPT Search) sees everything, because that is where citation value is created. Google-Extended is not blanket-blocked: block it and you disappear from AI Overviews even though regular ranking stays intact, one of the most common strategic mistakes of 2024.
Log-file analysis: the operational gold standard
Crawler behaviour cannot be measured with SEO tools — only with server logs. A minimal setup for AI-crawler analysis:
# Extract AI-crawler hits from an Apache/Nginx access log (combined log format)
# Output columns: client IP, date, request path, status code, bot name
grep -E "GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot|Google-Extended" access.log \
  | awk -v OFS='\t' '{
      match($0, /GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot|Google-Extended/)
      print $1, substr($4, 2, 11), $7, $9, substr($0, RSTART, RLENGTH)
    }' > ai_crawler_hits.tsv
# Aggregation: hits per bot per day per path pattern
# Target metrics:
# - hit rate per path cluster (/blog/*, /product/*, /case-study/*)
# - 2xx rate per bot (target > 97%)
# - median response time per bot (target < 600 ms)
# - re-crawl interval (median delta between two hits of the same path)
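The aggregation described in the comments can then run on that TSV. A minimal Python sketch covering hits and 2xx rate per bot, day and path cluster, assuming the five-column file produced above; the cluster prefixes are assumptions to adapt to your own URL structure:

```python
from collections import defaultdict

# Path clusters are assumptions; replace them with your own URL structure
CLUSTERS = ["/blog/", "/product/", "/case-study/"]

def cluster_of(path: str) -> str:
    for prefix in CLUSTERS:
        if path.startswith(prefix):
            return prefix + "*"
    return "other"

hits = defaultdict(int)      # (bot, day, cluster) -> total hits
success = defaultdict(int)   # (bot, day, cluster) -> 2xx hits

with open("ai_crawler_hits.tsv", encoding="utf-8") as tsv:
    for line in tsv:
        _ip, day, path, status, bot = line.rstrip("\n").split("\t")
        key = (bot, day, cluster_of(path))
        hits[key] += 1
        if status.startswith("2"):
            success[key] += 1

for (bot, day, cluster), count in sorted(hits.items()):
    rate = 100 * success[(bot, day, cluster)] / count
    print(f"{day}  {bot:<16} {cluster:<16} {count:>5} hits  {rate:5.1f}% 2xx")
```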
A healthy crawl pattern for enterprise domains:
[Benchmark figures: GPTBot hits/day for a mid-sized enterprise domain (~5k URLs); typical re-crawl interval for top content; target 2xx rate per AI crawler]
JavaScript rendering: the invisible citation barrier
With the exception of OAI-SearchBot and PerplexityBot, the AI crawlers in the wild do not render JavaScript. They read only the initial HTML document; anything loaded client-side simply does not exist for them.
Practical consequences:
- SPA architectures without SSR are invisible to training crawlers. React pages with CSR only deliver an empty
<div id="root"></div>to GPTBot. - Cookie walls in front of content prevent any citation. Even if Google sees the content later, the training crawl already left empty-handed.
- Lazy-loaded text blocks are not captured. Anything that is faded in "further down" via JS is invisible to trainers.
- Web components without a light-DOM fallback are equally opaque.
Solutions, ranked by effort:
- Activate SSR. Next.js, Nuxt and Remix deliver server-rendered HTML out of the box. Minimum effort, maximum effect.
- Dynamic rendering (server-side renders for bots, client-side for users). Acceptable as a bridge, not recommended long term; a minimal sketch follows this list.
- Prerendering. Static HTML snapshots on the CDN served on bot detection. Tools: Prerender.io, Rendertron.
- Content migration to MDX/Markdown sources with static build. The cleanest solution for content platforms.
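For option 2, a sketch of what dynamic rendering looks like at the application layer: a tiny WSGI app that serves a pre-rendered HTML snapshot when the user agent matches a known AI crawler and falls back to the normal client-side shell otherwise. The snapshot directory, shell path and bot list are assumptions; in production this logic usually lives at the CDN or reverse-proxy layer instead:

```python
from pathlib import Path
from wsgiref.simple_server import make_server

# Assumed locations: pre-rendered snapshots and the CSR shell of the SPA
SNAPSHOT_DIR = Path("prerendered")     # e.g. prerendered/blog/post.html
SPA_SHELL = Path("dist/index.html")    # client-side rendered shell
AI_BOTS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Bytespider")

def app(environ, start_response):
    user_agent = environ.get("HTTP_USER_AGENT", "")
    path = environ.get("PATH_INFO", "/").strip("/") or "index"
    snapshot = SNAPSHOT_DIR / f"{path}.html"   # note: no path sanitization, illustration only

    if any(bot in user_agent for bot in AI_BOTS) and snapshot.is_file():
        body = snapshot.read_bytes()   # static HTML for non-rendering crawlers
    else:
        body = SPA_SHELL.read_bytes()  # normal users get the client-side app

    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body]

if __name__ == "__main__":
    make_server("127.0.0.1", 8000, app).serve_forever()
```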
Rate limiting, CDN policy and the 429 dead-zone effect
Aggressive WAF/CDN rules (Cloudflare, Akamai, Fastly) often block AI crawlers unnoticed. Typical scenario: the WAF sees an unusual user-agent pattern, classifies it as bot traffic, throttles to 10 req/min. GPTBot hits the limit, receives 429 Too Many Requests and backs off — for weeks. The domain disappears from LLM outputs even though robots.txt is clean.
Controls:
- Add explicit WAF allowlist rules for verified AI-crawler IP ranges (OpenAI publishes its ranges; Anthropic does too)
- Verify crawlers via reverse DNS plus forward DNS, not just the UA string (UA spoofing is trivial); a sketch follows this list
- Rate limits for AI crawlers at least 10× higher than standard bot limits
- Monitoring: review 4xx/5xx rates per bot weekly
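The reverse-plus-forward DNS check can be scripted with the standard library: reverse-resolve the IP, compare the hostname against the operator's domain suffixes, then forward-resolve that hostname and confirm it maps back to the same IP. The suffix list below is an assumption to validate against each operator's current documentation:

```python
import socket

# Assumed hostname suffixes per operator; verify against current provider docs
BOT_HOST_SUFFIXES = {
    "GPTBot": (".openai.com",),
    "ClaudeBot": (".anthropic.com",),
    "PerplexityBot": (".perplexity.ai",),
}

def verify_crawler_ip(ip: str, bot: str) -> bool:
    """Reverse DNS + forward DNS check for a claimed crawler IP."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)   # reverse lookup
    except socket.herror:
        return False
    if not hostname.endswith(BOT_HOST_SUFFIXES.get(bot, ())):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]      # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips

if __name__ == "__main__":
    print(verify_crawler_ip("203.0.113.10", "GPTBot"))  # documentation IP, prints False
```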
Sitemap strategy: separate signals for separate purposes
A single sitemap.xml is no longer sufficient for modern AI infrastructure. We recommend a three-sitemap structure:
- sitemap-core.xml: canonical, durable URLs. `changefreq` weekly, `priority` 0.8-1.0. For training crawlers.
- sitemap-news.xml: news format with a publication node. For OAI-SearchBot and PerplexityBot. Dynamic, only the last 72 hours.
- sitemap-knowledge.xml: definitional and evergreen content (pillar pages, glossary, studies). Especially important for LLM training.
The split helps crawlers prioritize content by lifecycle and purpose. GPTBot spends disproportionate budget in sitemap-knowledge, OAI-SearchBot in sitemap-news. A monolithic sitemap forces identical prioritization on both scenarios — suboptimal.
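A minimal sketch of the 72-hour window for sitemap-news.xml, using only the standard library and without the news namespace extension; the article list stands in for whatever CMS query you actually run:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

# Stand-in for a CMS query; each entry is (URL, publication datetime)
ARTICLES = [
    ("https://example.com/news/ai-crawler-update", datetime(2026, 3, 1, 8, 0, tzinfo=timezone.utc)),
    ("https://example.com/news/older-story", datetime(2026, 2, 10, 9, 30, tzinfo=timezone.utc)),
]

cutoff = datetime.now(timezone.utc) - timedelta(hours=72)

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for url, published in ARTICLES:
    if published < cutoff:
        continue  # only the last 72 hours belong in sitemap-news.xml
    entry = ET.SubElement(urlset, "url")
    ET.SubElement(entry, "loc").text = url
    ET.SubElement(entry, "lastmod").text = published.isoformat()

ET.ElementTree(urlset).write("sitemap-news.xml", encoding="utf-8", xml_declaration=True)
```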
Monitoring dashboard: what gets reviewed weekly
Technical AI-crawler governance needs its own dashboard. Six core metrics:
- Crawler coverage: share of the URL population visited at least once by every relevant AI crawler in the past 30 days. Target: > 85%.
- Response quality: 2xx rate per bot. Target: > 97%.
- Re-crawl latency: median interval between updates and the first re-crawl. Target: < 7 days for top content.
- Blocked ratio: 4xx/5xx or 429 responses per bot. Target: < 2%.
- Rendered-content ratio: Lighthouse-based check on which share of content is visible pre-JS. Target: > 90%.
- Citation correlation: match between heavily crawled paths and LLM citation outcomes from prompt audits.
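The coverage metric from point 1 can be computed directly from the crawl TSV and a list of canonical URLs. A sketch, assuming the five-column ai_crawler_hits.tsv from the log-file section and a plain-text file of canonical paths (one per line), over whatever window the log covers:

```python
from collections import defaultdict

# Assumed inputs: canonical paths (one per line) and the TSV from the log-file section
with open("canonical_paths.txt", encoding="utf-8") as f:
    all_paths = {line.strip() for line in f if line.strip()}

crawled = defaultdict(set)   # bot -> set of crawled paths
with open("ai_crawler_hits.tsv", encoding="utf-8") as tsv:
    for line in tsv:
        _ip, _day, path, _status, bot = line.rstrip("\n").split("\t")
        crawled[bot].add(path.split("?")[0])   # ignore query strings

for bot, paths in sorted(crawled.items()):
    coverage = 100 * len(paths & all_paths) / len(all_paths)
    print(f"{bot:<16} coverage: {coverage:5.1f}%  (target > 85%)")
```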
The invisible 15% domain
Our audits regularly reveal enterprise domains where 15-30% of all URLs are effectively unreachable for AI crawlers — not because of robots.txt, but because of WAF throttling, outdated SSL configuration or false JS-rendering assumptions. This gap is often unknown internally because classical SEO tools do not surface it. Only the combination of log-file analysis, prompt audit and infrastructure check exposes it.
Conclusion
Technical SEO is not a settled topic in the AI era — it is a strategically upgraded field. The infrastructure decisions you make today determine whether your brand is stored as a reliable source in the next model generations — or remains a fragmented, contradictory entity in the noise.
Blanket blocking may feel defensively correct. For most business models it is a strategic self-limitation with a long downstream effect.