Google-Extended: AI Training Control Token

Definition: What is Google-Extended?

Google-Extended is a control token for robots.txt that Google introduced in September 2023. By design it is not a separate crawler with its own HTTP user-agent string but a control signal: it decouples the use of crawled content for training Gemini and associated generative-AI products from its classical use in Google Search. The crawler remains Googlebot; Google-Extended only governs downstream data usage.

The goal is to give publishers a granular choice instead of the binary "block Googlebot or not". A site that sets Google-Extended to Disallow stays in the search rankings and remains eligible for AI Overviews (see below), while opting its content out of future AI model training runs. The token is Google's response to the publisher debate that arose in 2023 around GPTBot and the training question.

Core idea

Google-Extended separates training from search - but not from AI Overviews

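The block stops Gemini training. It does not stop use in AI Overviews, which draw on the live search index. At the robots.txt level, anyone who wants to disappear from AI Overviews would have to block Googlebot itself - and lose search entirely. Page-level snippet controls such as nosnippet are a separate mechanism and have nothing to do with Google-Extended.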

Google-Extended vs. Googlebot

The crucial difference is functional, not in the crawl mechanism. Googlebot is the actual crawler that fetches content for the search index. Blocking Googlebot means: the domain disappears from Google Search. Google-Extended is not a crawler; it is a directive that influences Google's internal data-usage pipeline. The content is still crawled (by Googlebot), but not used for AI training.

Operationally, this means a publisher can allow Googlebot (keeping search), block Google-Extended (no training input), and decide on GPTBot separately. The three decisions are independent. For enterprise domains this is the standard matrix in any crawler-policy review.

Syntax and implementation

Google-Extended is configured as a standalone User-agent block in robots.txt. Example for a full block:

User-agent: Google-Extended
Disallow: /

For selective control - blog content trainable, member area blocked:

User-agent: Google-Extended
Allow: /blog/
Allow: /glossary/
Disallow: /customers/
Disallow: /internal/
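
A selective group like this can be sanity-checked with a standard robots.txt parser before it goes live. The following is a minimal sketch using Python's urllib.robotparser; example.com, the sample paths and the helper name training_allowed are placeholders chosen for illustration. Two caveats: Google-Extended never fetches anything itself, so the check only shows how the group is read, and Python's parser applies the first matching rule while Google applies the most specific one. With non-overlapping rules like the ones above, both give the same answer.

```python
from urllib.robotparser import RobotFileParser

# The selective group from the example above.
ROBOTS_TXT = """\
User-agent: Google-Extended
Allow: /blog/
Allow: /glossary/
Disallow: /customers/
Disallow: /internal/
"""


def training_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the Google-Extended group permits the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Google-Extended", url)


if __name__ == "__main__":
    # Hypothetical sample paths; /pricing is not mentioned in the group at all.
    for path in ("/blog/post", "/glossary/term", "/customers/acme",
                 "/internal/wiki", "/pricing"):
        allowed = training_allowed(ROBOTS_TXT, "https://example.com" + path)
        print(f"{path:18} {'trainable' if allowed else 'blocked'}")
```

Paths that match no rule, such as /pricing here, are allowed by default. The Allow lines therefore document intent rather than restrict anything; only the Disallow lines change behavior.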

The order of the blocks in robots.txt is irrelevant - Googlebot, Google-Extended and other user agents are evaluated independently. For full control in the AI era, a combined directive belongs in the standard template:

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /customers/

User-agent: GPTBot
Disallow: /customers/

User-agent: CCBot
Disallow: /
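
The two claims made for this template, that each token is evaluated independently and that block order does not matter, can be checked mechanically. The sketch below is one way to do it with Python's urllib.robotparser; the sample paths, example.com and the helper names policy_matrix and reverse_groups are assumptions made for illustration, not part of any Google tooling.

```python
from urllib.robotparser import RobotFileParser

# The combined template from the example above.
TEMPLATE = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /customers/

User-agent: GPTBot
Disallow: /customers/

User-agent: CCBot
Disallow: /
"""

AGENTS = ("Googlebot", "Google-Extended", "GPTBot", "CCBot")
PATHS = ("/blog/post", "/customers/acme")  # hypothetical sample paths


def policy_matrix(robots_txt: str) -> dict:
    """Map every (agent, path) pair to True (allowed) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        (agent, path): parser.can_fetch(agent, "https://example.com" + path)
        for agent in AGENTS
        for path in PATHS
    }


def reverse_groups(robots_txt: str) -> str:
    """Return the same file with its blank-line-separated groups reversed."""
    groups = [g for g in robots_txt.strip().split("\n\n") if g.strip()]
    return "\n\n".join(reversed(groups)) + "\n"


if __name__ == "__main__":
    matrix = policy_matrix(TEMPLATE)
    for (agent, path), allowed in matrix.items():
        print(f"{agent:16} {path:18} {'allow' if allowed else 'block'}")
    same = matrix == policy_matrix(reverse_groups(TEMPLATE))
    print("identical with groups reversed:", same)
```

In this template Googlebot may fetch everything, Google-Extended and GPTBot are blocked only under /customers/, and CCBot is blocked everywhere. Reversing the groups leaves the matrix unchanged.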

The strategic trade-off question

The decision to block or allow Google-Extended is not technical but strategic. The two positions:

Block: The publisher withholds training input from Google. This protects content investments from being used without direct compensation. The price: the publisher's own entity and topic expertise do not flow into Gemini. Over the long term this is likely to mean fewer mentions, fewer citations and a lower Share of Model. For paywall publishers and licensing strategies this is acceptable, because they monetize through separate contracts.

Allow: The publisher accepts training without direct monetization. The price: content investments are used without direct compensation. The gain: the brand stays visible in Gemini-based answers, its entity is anchored more firmly, and co-occurrence with relevant topics builds up. For brand-led publishers and B2B trade portals this is the standard choice.

Typical mistakes in Google-Extended strategies

Confusion with Googlebot. Setting User-agent: Googlebot / Disallow: / on the assumption that it only blocks AI training. In reality it excludes the domain from Google Search entirely.
Expecting it to block AI Overviews. Google-Extended concerns training, not live retrieval. AI Overviews use the normal search index.
Unclear corporate policy. Legal, marketing and engineering hold different positions. The result: contradictory configurations after every deployment.
Forgetting other AI crawlers. Google-Extended blocked, but GPTBot, ClaudeBot and PerplexityBot left open. The decision then only takes effect against Google.
No documentation. The next CMS update overwrites the robots.txt directive. Without a documented policy and a review trail, the change goes undetected.
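
Several of these mistakes can be caught by an automated check that runs after every deployment. The following is a rough sketch, not an established tool: the function names audit and declared_agents are hypothetical, the token list simply repeats the crawlers named in this article, and a missing group is reported only as something to review, since leaving a crawler open may be a deliberate decision.

```python
from urllib.robotparser import RobotFileParser

# Tokens named in this article; adjust to the policy your organisation agreed on.
AI_TOKENS = ("Google-Extended", "GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")


def declared_agents(robots_txt: str) -> set:
    """Collect the lower-cased User-agent values that appear in the file."""
    agents = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def audit(robots_txt: str) -> list:
    """Return human-readable findings for the mistakes listed above."""
    findings = []
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Mistake: blocking Googlebot when only AI training was meant.
    if not parser.can_fetch("Googlebot", "https://example.com/"):
        findings.append("Googlebot is blocked from /: the site would drop out of Google Search")
    # Mistake: forgetting other AI crawlers (flagged for review, not as an error).
    agents = declared_agents(robots_txt)
    for token in AI_TOKENS:
        if token.lower() not in agents:
            findings.append(f"no explicit group for {token}")
    return findings


if __name__ == "__main__":
    # Hypothetical file showing the 'confusion with Googlebot' mistake.
    broken = "User-agent: Googlebot\nDisallow: /\n\nUser-agent: Google-Extended\nDisallow: /\n"
    for finding in audit(broken):
        print("-", finding)
```

Run against the robots.txt that is actually being served, a check like this makes a CMS update that silently overwrites the file show up as a failed build rather than as a surprise months later.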

Related terms

Google-Extended belongs to the family of AI crawler and training controls alongside GPTBot, ClaudeBot, CCBot and PerplexityBot. Technically it is controlled via robots.txt, complemented editorially by the proposed llms.txt convention. For inclusion in AI Overviews, the distinction from Googlebot is decisive.

FAQ on Google-Extended

What is Google-Extended?

Google-Extended is not its own crawler but a robots.txt token introduced by Google in September 2023. It lets publishers control the use of their content for training Gemini and related AI products - decoupled from Googlebot and classical Google Search. The crawler user-agent remains Googlebot.

How does Google-Extended differ from Googlebot?

Googlebot crawls for Google Search; the index feeds rankings and classical SERPs. Google-Extended is a separate control token for using the same crawled data in AI training. Blocking Google-Extended keeps the domain in normal search but excludes its content from future Gemini training runs.

Does Google-Extended exclude AI Overviews?

No - AI Overviews use the live index, not the Google-Extended training data. Blocking Google-Extended therefore does not exclude a domain from AI Overviews. At the robots.txt level, anyone who wants to disappear from AI Overviews would have to block Googlebot itself - and lose Google Search entirely. Page-level snippet controls such as nosnippet are a separate mechanism.

How do you set Google-Extended in robots.txt?

As a User-agent block: User-agent: Google-Extended / Disallow: /. The syntax is identical to that used for crawlers. According to Google's own documentation, Google respects the directive. The block applies only to future AI training runs, not to data that has already been collected.

Should I block Google-Extended?

The trade-off is strategic. Blocking means your content does not flow into Gemini training, and the brand is likely to lose AI visibility over the long term. Allowing means your content is used for training without direct monetization. For publishers with strong brand positioning, allowing is usually the better choice. For paywall publishers with licensing monetization, blocking is.
