Definition: What is GPTBot?
GPTBot is OpenAI's official web crawler. It was introduced in August 2023 and collects publicly accessible content for future training runs of the GPT model family. The user agent is simply GPTBot; the full user-agent string is documented at platform.openai.com/docs/gptbot, as are the official IP ranges in a JSON file. Because the IP ranges are published, genuine GPTBot traffic can be reliably distinguished from spoofed requests.
An important distinction: GPTBot is not OpenAI's only user agent. ChatGPT Search is served by OAI-SearchBot, and live fetches during a user session by ChatGPT-User. The three user agents have different purposes and must be addressed separately in robots.txt. Blocking GPTBot does not automatically block ChatGPT Search - a common misunderstanding in enterprise policies.
The most common blocking problem is not robots.txt but the WAF
Many domains allow GPTBot in robots.txt - but block it at the WAF layer via Cloudflare, Akamai or AWS WAF rules that answer with 429 rate limits. The result: unintended invisibility in ChatGPT despite an open crawler policy.
Control via robots.txt
The primary control happens via the robots.txt file at the website root. Basic syntax for a full block:
User-agent: GPTBot
Disallow: /
For selective control - opening only the blog, blocking the members area:
User-agent: GPTBot
Allow: /blog/
Disallow: /
For allowing with fine-grained control across multiple OpenAI user agents:
User-agent: GPTBot
Disallow: /internal/
Disallow: /customers/

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
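A multi-agent policy like this can be sanity-checked locally before deployment, for example with Python's standard urllib.robotparser. The example.com URLs below are placeholders:

```python
from urllib.robotparser import RobotFileParser

# The multi-agent policy from above, as it would appear in robots.txt.
RULES = """\
User-agent: GPTBot
Disallow: /internal/
Disallow: /customers/

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Training crawler: public content open, protected areas closed.
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))      # True
print(parser.can_fetch("GPTBot", "https://example.com/internal/tool"))  # False

# Search crawler: everything open.
print(parser.can_fetch("OAI-SearchBot", "https://example.com/internal/tool"))  # True
```

This catches the classic ordering and typo mistakes in robots.txt before a misconfigured file goes live.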
OpenAI states in its own documentation that GPTBot consistently respects the robots.txt directive. Independent traffic analyses have confirmed this since 2023 - documented deviations are not known.
The 429 problem: WAF blocks despite an open robots.txt
In practice GPTBot fails on many domains not at robots.txt but at the web application firewall. Cloudflare, Akamai, AWS WAF, Imperva and Fastly ship default rules that aggressively rate-limit bot traffic from unknown user agents or answer with challenges. GPTBot cannot solve CAPTCHAs and consequently receives HTTP 429 Too Many Requests or 403 Forbidden - even though robots.txt explicitly allows it.
The pattern is widespread. Per Cloudflare data from 2024, the share of domains that block GPTBot via WAF while leaving it open in robots.txt is in the double-digit percent range. For publishers who value LLM visibility this is an operational problem: the crawler is classified as aggressive, the content never reaches the training data, and the brand disappears from future ChatGPT answers.
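Whether a WAF silently blocks the crawler can be approximated from outside by requesting a page with the GPTBot user-agent token and inspecting the status code. A minimal standard-library sketch - note that a WAF which verifies source IPs may treat this spoofed user agent differently from genuine GPTBot traffic, so it is a heuristic, not proof:

```python
import urllib.error
import urllib.request

# Documented token; the full UA string is listed at platform.openai.com/docs/gptbot.
GPTBOT_UA = "GPTBot"

def probe_status(url: str, user_agent: str = GPTBOT_UA) -> int:
    """Return the HTTP status code the server answers for this user agent."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

def is_waf_block(status: int) -> bool:
    """429/403 despite an open robots.txt points to a WAF rule, not a crawler policy."""
    return status in (403, 429)
```

If probe_status returns 429 or 403 for a URL that answers 200 to a browser user agent, the block sits in the WAF layer, not in robots.txt.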
Practice: WAF policy for GPTBot
- Verification. Pull the official IP ranges from OpenAI's JSON list (openai.com/gptbot-ranges.json) and mark them as trusted.
- Raise rate limits. Allow at least 100 requests/minute for verified GPTBot IPs.
- Disable challenges. No JavaScript or CAPTCHA challenges for GPTBot IPs. The crawler does not render JavaScript.
- Logging. Use a separate log category for GPTBot traffic and check 4xx/5xx response rates monthly. Target: under 2 percent 429/403.
- Documentation. Record the WAF exception in the security policy so it does not get wiped during routine updates.
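The monitoring step above can be a simple log pass: count GPTBot requests and the share answered with 429/403. A sketch assuming combined-log-format lines; the sample entries and IPs (203.0.113.x is a documentation range) are made up:

```python
import re

# Hypothetical access-log excerpt in combined log format.
LOG_LINES = [
    '203.0.113.5 - - [01/May/2025:10:00:01 +0000] "GET /blog/a HTTP/1.1" 200 5120 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [01/May/2025:10:00:02 +0000] "GET /blog/b HTTP/1.1" 429 0 "-" "GPTBot/1.0"',
    '198.51.100.7 - - [01/May/2025:10:00:03 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

# Status code sits right after the closing quote of the request line.
STATUS_RE = re.compile(r'" (\d{3}) ')

def gptbot_block_rate(lines) -> float:
    """Share of GPTBot requests answered with 429 or 403 (0.0 to 1.0)."""
    total = blocked = 0
    for line in lines:
        if "GPTBot" not in line:
            continue  # only count the crawler's own traffic
        match = STATUS_RE.search(line)
        if not match:
            continue
        total += 1
        if match.group(1) in ("429", "403"):
            blocked += 1
    return blocked / total if total else 0.0
```

With the sample above the rate is 0.5 - far over the checklist target of below 0.02, which would trigger a WAF review.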
In parallel you decide which areas to keep blocked via robots.txt: members areas, internal tools, customer dashboards. Public content - blog, glossary, product pages, press area - typically belongs in the allowed zone.
Typical mistakes in GPTBot strategies
- Blanket block out of uncertainty. A User-agent: * entry with Disallow: /, in the belief that it only "blocks AI". In reality it also blocks Google and every other legitimate crawler. A common mistake in CMS default configurations.
- Unnoticed WAF block. robots.txt is open, but logs show hundreds of 429s from GPTBot. Without separate WAF monitoring the problem stays invisible.
- Confusion between user agents. Only GPTBot blocked, OAI-SearchBot and ChatGPT-User forgotten. Result: model training blocked, live search still open.
- No IP verification. Requests with user agent "GPTBot" are accepted wholesale. Spoofers abuse this for unprotected scraper access. Trust only IPs from the OpenAI JSON list.
- Undocumented policy. The security team resets everything to defaults in the next WAF housekeeping. GPTBot gets blocked again.
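The IP verification from the list above can be implemented with Python's standard ipaddress module. In this sketch the prefix list is a hard-coded placeholder mirroring the shape of OpenAI's published ranges file; in production it would be fetched from the documented JSON endpoint and refreshed regularly (192.0.2.0/24 is a documentation prefix, not a real GPTBot range):

```python
import ipaddress

# Placeholder data in the assumed shape of OpenAI's ranges JSON.
RANGES = {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}]}

NETWORKS = [ipaddress.ip_network(p["ipv4Prefix"]) for p in RANGES["prefixes"]]

def is_verified_gptbot_ip(ip: str) -> bool:
    """True only if the client IP falls inside a published GPTBot range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in NETWORKS)
```

A request that claims the GPTBot user agent but fails this check should be treated like any other scraper.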
Related terms
GPTBot is part of the AI crawler landscape alongside Google-Extended, ClaudeBot, PerplexityBot and CCBot. Control happens via robots.txt, supplemented by llms.txt and WAF policy. Strategically, GPTBot belongs in any GEO and LLM SEO infrastructure.
FAQ on GPTBot
What is GPTBot?
GPTBot is OpenAI's web crawler, introduced in August 2023. It collects publicly accessible web content as training material for future GPT model generations. The user agent is simply 'GPTBot'. In parallel, OAI-SearchBot operates for the ChatGPT Search functionality and ChatGPT-User for live fetches during a user session.
How do you block GPTBot?
Via an entry in robots.txt: User-agent: GPTBot followed by Disallow: /. OpenAI documents this officially and respects the directive. The block applies only to future crawling; content already used in training remains in existing models. For full control you additionally need to address OAI-SearchBot and ChatGPT-User.
What is the 429 problem?
Many sites unintentionally block GPTBot via WAF rules or rate limits that produce 429 Too Many Requests. The crawler is then classified as suspicious traffic and discarded, even though robots.txt allows access. In 2024-2025 this was the most common reason for missing ChatGPT citations despite an open crawler policy.
Does GPTBot differ from OAI-SearchBot?
Yes. GPTBot collects training data for model updates. OAI-SearchBot indexes for the ChatGPT Search functionality - comparable to a classical search crawler. ChatGPT-User, in turn, fetches content live on a user request. All three user agents must be addressed separately in robots.txt.
Does allowing GPTBot harm your own domain?
As a rule, no. GPTBot respects robots.txt, uses a published IP range (documented on platform.openai.com) and produces manageable load. Blocking excludes the domain entirely from future GPT model generations. For publishers who value LLM visibility, allowing it is the standard recommendation.