Google AI Crawler: Track and Control Google-Extended
Google-Extended is Google's AI training crawler that feeds Gemini and Bard. It shares IP addresses with Googlebot, making it impossible to block by IP alone. Learn how to detect, monitor, and control it without affecting your search rankings.
Google-Extended feeds Gemini from your content — while sharing every IP with Googlebot
Track and control Google's AI data collection separately from search indexing
What Is the Google AI Crawler Google-Extended?
Google-Extended is Google's dedicated AI training crawler. Introduced in September 2023, its sole purpose is to collect website content that Google uses to train Gemini, improve Bard responses, and generate AI Overviews in search results. It operates under the user-agent token Google-Extended and is entirely separate from Googlebot's search indexing function.
Unlike Googlebot, which crawls your site to build Google's search index and determine your rankings, Google-Extended crawls your site to feed Google's generative AI products. The distinction matters because blocking one does not affect the other. You can prevent Google from training AI on your content while maintaining full visibility in search results.
Google introduced Google-Extended in response to growing demand from publishers who wanted control over how their content was used for AI training. Before September 2023, there was no way to opt out of Google's AI data collection without also disappearing from search results entirely.
Google-Extended User Agent
Google-Extended identifies itself with the user-agent token "Google-Extended" in robots.txt. However, its HTTP requests use a Googlebot-compatible user-agent string, which is why log-based detection requires more than simple string matching.
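To see why simple string matching fails, consider a minimal sketch in Python. The log line below is a hypothetical sample (real formats vary by server); the point is that the token "Google-Extended" never appears in the HTTP user-agent string, only in robots.txt:

```python
# Hypothetical access-log line for a Google crawl request.
# Note the user-agent is a standard Googlebot string.
log_line = (
    '66.249.66.1 - - [10/May/2025:12:00:00 +0000] '
    '"GET /blog/post HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

# Grepping for the robots.txt token finds nothing in HTTP logs.
print("Google-Extended" in log_line)  # False
print("Googlebot" in log_line)        # True
```

This is why tools that only match user-agent substrings will report zero Google-Extended traffic even on heavily crawled sites.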
Google-Extended vs Googlebot: The Critical Difference
Google operates multiple crawlers, but the two most important for website owners are Googlebot and Google-Extended. Googlebot crawls your site to index pages for Google Search. Google-Extended crawls your site to collect training data for Gemini and other AI products. They are controlled independently via robots.txt.
| Feature | Googlebot | Google-Extended |
|---|---|---|
| Purpose | Search indexing and ranking | AI training data for Gemini, Bard, AI Overviews |
| User-Agent Token | Googlebot | Google-Extended |
| IP Ranges | Google IP pool | Same Google IP pool (shared) |
| Robots.txt Control | User-agent: Googlebot | User-agent: Google-Extended |
| Blocking Impact | Removes pages from Google Search | Stops AI training only — no SEO impact |
| Introduced | 2004 | September 2023 |
This shared-IP architecture is unique to Google. OpenAI's GPTBot and Anthropic's ClaudeBot use their own dedicated IP ranges, making IP-based blocking straightforward. With Google-Extended, you must rely entirely on the robots.txt user-agent directive.
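The difference can be sketched with Python's `ipaddress` module. The CIDR ranges below are illustrative assumptions, not current published values; both vendors publish live JSON lists of their crawler IPs that change over time:

```python
import ipaddress

# Illustrative ranges only (assumptions for this sketch).
# Google's pool serves BOTH Googlebot and Google-Extended;
# GPTBot's ranges are dedicated to that one crawler.
GOOGLE_CRAWL_POOL = [ipaddress.ip_network("66.249.64.0/19")]
GPTBOT_RANGES = [ipaddress.ip_network("52.230.152.0/24")]

def in_ranges(ip: str, ranges) -> bool:
    """Return True if the IP falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)

# A GPTBot hit can be firewalled by IP with no side effects...
print(in_ranges("52.230.152.10", GPTBOT_RANGES))    # True
# ...but a hit from Google's pool could be search indexing
# OR AI training, so an IP block here would also block Googlebot.
print(in_ranges("66.249.66.1", GOOGLE_CRAWL_POOL))  # True
```

An IP-based firewall rule is a clean off switch for GPTBot or ClaudeBot, but for Google the same rule cannot separate the two crawl purposes.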
What Google-Extended Collects and Why
Google-Extended collects text content, structured data, and metadata from your website pages. This content feeds directly into Google's generative AI pipeline. Specifically, it is used to train the Gemini family of models, generate AI Overviews that appear at the top of search results, and improve conversational responses in Bard.
The scope of collection is broad. Google-Extended does not limit itself to blog posts or articles. It crawls product pages, documentation, FAQs, forum threads, and any other publicly accessible content. If Googlebot can see it, Google-Extended can collect it for AI training.
For publishers and content creators, this raises a fundamental question: should your content be used to train AI models that may reduce the need for users to visit your site? AI Overviews, for example, often provide answers directly in search results, potentially reducing click-through rates to the original source.
AI Overviews May Reduce Your Traffic
Content collected by Google-Extended can appear in AI Overviews at the top of search results. These AI-generated summaries often answer user queries without requiring a click to your site, which some publishers report reduces organic click-through rates by 15-30%.
Google-Extended feeds these AI products
- Gemini (formerly Bard) — conversational AI responses trained on web content
- AI Overviews — AI-generated summaries shown directly in Google Search results
- Google SGE (Search Generative Experience) — experimental AI search features
- Vertex AI — enterprise AI models that Google offers to business customers
Bring External Site Data Into Copper
Pull roadmaps, blog metadata, and operational signals into one dashboard without asking every team to learn a new workflow.
How to Block the Google AI Crawler Without Affecting SEO
Blocking Google-Extended is straightforward via robots.txt. Add a User-agent directive for Google-Extended with a Disallow rule, and Google will stop using your content for AI training. Your search rankings, indexing, and Googlebot crawling remain completely unaffected.
User-agent: Google-Extended
Disallow: /

To block only specific sections while leaving the rest available, use selective Disallow and Allow rules instead:

User-agent: Google-Extended
Disallow: /premium/
Disallow: /research/
Disallow: /docs/api/
Allow: /blog/
Allow: /marketing/

Steps to block Google-Extended
- Open your robots.txt file (usually at /robots.txt in your site root)
- Add "User-agent: Google-Extended" on a new line
- Add "Disallow: /" to block all pages, or "Disallow: /premium/" for selective blocking
- Save and deploy the updated robots.txt file
- Verify by visiting yoursite.com/robots.txt in a browser
- Monitor with Copper Analytics to confirm Google-Extended crawl requests stop within 48 hours
After adding the robots.txt rule, Google typically begins honoring it within 24-48 hours. However, like all robots.txt directives, blocking is not retroactive. Content that Google-Extended already collected before you added the rule may still be used in existing AI model training sets.
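You can sanity-check the deployed rules with Python's standard-library `urllib.robotparser`. This sketch parses the rules from a string so it runs offline; in practice you would point the parser at your live /robots.txt URL instead:

```python
from urllib import robotparser

# The same rules you would deploy in robots.txt:
# block AI training for the whole site, leave search crawling alone.
rules = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# AI training blocked, search indexing untouched.
print(rp.can_fetch("Google-Extended", "/blog/post"))  # False
print(rp.can_fetch("Googlebot", "/blog/post"))        # True
```

A quick check like this catches the most common mistake: a typo in the user-agent token, which would silently leave AI collection enabled.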
Detecting Google-Extended Activity in Your Server Logs
Detecting Google-Extended in server logs is tricky because it shares user-agent strings and IP addresses with Googlebot. You will not see a clean "Google-Extended" string in your access logs. Instead, you will see standard Googlebot user-agent strings from Google IP addresses, and there is no reliable way to distinguish which requests are for search indexing versus AI training from logs alone.
This is fundamentally different from detecting GPTBot or ClaudeBot, which have unique user-agent strings that are easy to grep for. Google designed its crawling infrastructure so that Google-Extended operates within the same technical footprint as Googlebot.
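The most you can verify from logs alone is that a request genuinely came from Google at all, using Google's documented reverse-plus-forward DNS check. A minimal sketch (it performs live DNS lookups, and it confirms only the *origin*, not whether the request was for search indexing or AI training):

```python
import socket

def is_google_crawler(ip: str) -> bool:
    """Verify an IP belongs to Google's crawl infrastructure via
    reverse DNS plus forward confirmation. This proves the request
    came from Google, but NOT which crawl purpose it served."""
    try:
        # Reverse lookup, e.g. crawl-66-249-66-1.googlebot.com
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the name must resolve back to the same IP,
        # which defeats spoofed reverse-DNS records.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

Even with this verification in place, a confirmed Google request is still ambiguous between Googlebot and Google-Extended, which is the gap behavioral analysis has to fill.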
Copper Separates the Signal
Copper Analytics uses behavioral analysis and crawl pattern heuristics to distinguish Google-Extended activity from regular Googlebot search indexing. The Crawlers dashboard shows Google-Extended as a separate line item with its own request counts, page targets, and frequency trends.
The practical implication is that log analysis alone cannot tell you how much of your Google crawler traffic is AI training versus search indexing. You need a tool that understands the behavioral differences — crawl patterns, request timing, and page selection heuristics — to separate the two.
Monitoring Google-Extended with Copper Analytics
Copper Analytics provides dedicated Google AI crawler tracking that separates Google-Extended from Googlebot in your analytics dashboard. You get a clear picture of how often Google's AI crawler visits, which pages it targets, and how its activity trends over time — data that is impossible to extract from server logs alone.
The Crawlers dashboard displays Google-Extended alongside other AI crawlers like GPTBot, ClaudeBot, and Bytespider. For each crawler, you see daily request counts, most-visited pages, bandwidth consumption, and crawl frequency patterns. This data is essential for making informed robots.txt decisions.
Setting up tracking takes one line of code. Copper's lightweight tracking script detects all AI crawlers automatically, including Google-Extended, without any additional configuration. Data appears in your dashboard within minutes of installation.
Track Google-Extended on Your Website for Free
Copper Analytics separates Google-Extended from Googlebot traffic. See exactly how Google's AI crawler interacts with your content.
Google AI Crawler FAQ
What is Google-Extended?
Google-Extended is Google's AI training crawler, introduced in September 2023. It collects website content to train Gemini, generate AI Overviews, and improve Bard responses. It is separate from Googlebot, which handles search indexing.
Does blocking Google-Extended affect my Google search rankings?
No. Blocking Google-Extended via robots.txt only stops AI training data collection. Your Googlebot crawling, search indexing, and rankings remain completely unaffected.
Can I block Google-Extended by IP address?
No. Google-Extended shares IP addresses with Googlebot. Blocking by IP would also block Googlebot and remove your site from Google Search. Use robots.txt User-agent directives instead.
How do I know if Google-Extended is crawling my site?
Standard server logs cannot reliably distinguish Google-Extended from Googlebot because they share user-agent strings and IP ranges. Use Copper Analytics for accurate Google-Extended detection and monitoring.
What content does Google-Extended collect?
Google-Extended collects all publicly accessible text content, structured data, and metadata. This includes blog posts, product pages, documentation, FAQs, and forum threads — anything Googlebot can access.
What to Do Next
The right stack depends on how much visibility, workflow control, and reporting depth you need. If you want a simpler way to centralize site reporting and operational data, compare plans on the pricing page and start with a free Copper Analytics account.
You can also keep exploring related guides from the Copper Analytics blog to compare tools, setup patterns, and reporting workflows before making a decision.