How to Block AI Crawlers (GPTBot, ClaudeBot, and Others)
AI training bots are crawling your website right now. Here's how to control which ones get access to your content — and which ones get blocked at the door.
At a Glance
- AI crawlers like GPTBot, ClaudeBot, and Bytespider scrape your content for model training — often without asking.
- You can block AI bots using robots.txt, .htaccess, Nginx configs, Cloudflare WAF, or HTML meta tags.
- Blocking Google-Extended prevents AI feature inclusion but does not affect regular search indexing.
- Not all AI companies respect robots.txt — server-level blocking and analytics monitoring are essential fallbacks.
- The noai and noimageai meta tags provide an additional opt-out signal at the page level.
- Copper Analytics lets you track which AI crawlers are hitting your site so you can verify your blocks are working.
Why You Might Want to Block AI Crawlers
AI companies are crawling the web at unprecedented scale. OpenAI's GPTBot, Anthropic's ClaudeBot, ByteDance's Bytespider, and dozens of others are scraping website content to train large language models. For site owners, this raises three immediate concerns.
Bandwidth and Server Load
AI crawlers are aggressive. Unlike Googlebot, which is designed to crawl politely with rate limits, some AI bots hammer your server with hundreds of requests per minute. Bytespider in particular has earned a reputation for excessive crawling that can spike bandwidth costs and slow down your site for real visitors. If you're on a metered hosting plan, these bots are literally costing you money.
Content Licensing and IP Protection
When an AI model trains on your content, your words become part of its output — without attribution, compensation, or consent. Publishers, journalists, and creators have legitimate reasons to block AI crawlers from accessing their work. Whether you view this as a copyright issue or a business decision, the right to opt out is yours.
Your Right to Opt Out
Most major AI companies now offer opt-out mechanisms via robots.txt. OpenAI, Anthropic, Google, and others have published their crawler user agents and committed to respecting disallow rules. The EU AI Act and similar regulations are reinforcing the principle that content owners should control how their data is used for AI training.
Complete List of AI Crawler User Agents
Before you start blocking, you need to know exactly which bots to target. Here is a comprehensive list of known AI crawler user agents as of early 2026, organized by company:
| User Agent | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Model training data collection |
| ChatGPT-User | OpenAI | Real-time browsing for ChatGPT users |
| OAI-SearchBot | OpenAI | SearchGPT web search indexing |
| ClaudeBot | Anthropic | Model training data collection |
| Bytespider | ByteDance | Training data for TikTok/Doubao AI |
| CCBot | Common Crawl | Open dataset used by many AI projects |
| Google-Extended | Google | Gemini and AI Overviews training |
| Amazonbot | Amazon | Alexa AI and product recommendations |
| FacebookBot | Meta | Llama model training |
| PerplexityBot | Perplexity AI | AI search engine indexing |
| Cohere-ai | Cohere | Enterprise AI model training |
| Diffbot | Diffbot | Knowledge graph and structured data extraction |
| AppleBot-Extended | Apple | Apple Intelligence training |
Important Distinction
GPTBot collects training data. ChatGPT-User is the real-time browsing agent used when ChatGPT users click “Browse the web.” Similarly, Google-Extended is for AI training only — blocking it does NOT affect your regular Google Search ranking. These are separate crawlers with separate purposes.
Method 1: robots.txt Rules
The simplest way to block AI bots from your website is through your robots.txt file. This file lives at the root of your domain (e.g., yoursite.com/robots.txt) and tells crawlers which areas they can and cannot access. Add the following rules to block all major AI crawlers:
# Block OpenAI crawlers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
# Block Anthropic crawler
User-agent: ClaudeBot
Disallow: /
# Block ByteDance crawler
User-agent: Bytespider
Disallow: /
# Block Common Crawl (used by many AI projects)
User-agent: CCBot
Disallow: /
# Block Google AI training (not regular search)
User-agent: Google-Extended
Disallow: /
# Block Amazon AI crawler
User-agent: Amazonbot
Disallow: /
# Block Meta/Facebook AI crawler
User-agent: FacebookBot
Disallow: /
# Block Perplexity AI crawler
User-agent: PerplexityBot
Disallow: /
# Block Cohere AI crawler
User-agent: Cohere-ai
Disallow: /
# Block Diffbot crawler
User-agent: Diffbot
Disallow: /
# Block Apple Intelligence training
User-agent: AppleBot-Extended
Disallow: /
Tip
Paste these rules at the end of your existing robots.txt file. Keep your existing rules for Googlebot, Bingbot, and other search engine crawlers intact — those are separate and unaffected. If you only want to protect certain directories (e.g., your blog), replace / with a specific path like /blog/.
Each User-agent directive targets a specific bot by name, and Disallow: / blocks access to your entire site. The limitation of robots.txt is that it is a request, not enforcement. Well-behaved bots respect it, but nothing technically prevents a rogue crawler from ignoring it entirely. That is why server-level blocking matters.
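You can pair robots.txt with a page-level signal. The noai and noimageai directives are a de facto convention rather than a formal standard, and only some crawlers honor them, but they cost nothing to add to each page's head:

```html
<!-- Advisory opt-out from AI text and image training (not universally honored) -->
<meta name="robots" content="noai, noimageai">
```

Like robots.txt, these tags are a request, not enforcement — treat them as one more layer, not a replacement for server-level blocking.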
Method 2: .htaccess and Nginx Server Rules
For stronger protection, block AI crawlers at the server level. This returns a 403 Forbidden response before any content is served — the bot receives nothing at all, regardless of whether it respects robots.txt.
Apache .htaccess
If you run Apache, add these rewrite rules to your .htaccess file:
# Block AI crawlers at the server level
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ChatGPT-User [NC,OR]
RewriteCond %{HTTP_USER_AGENT} OAI-SearchBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CCBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Google-Extended [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Amazonbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FacebookBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PerplexityBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Cohere-ai [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Diffbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} AppleBot-Extended [NC]
RewriteRule .* - [F,L]
The [NC] flag makes the match case-insensitive, and [F,L] returns a 403 Forbidden and stops processing. The last condition in the chain omits the [OR] flag to properly close the rule.
Nginx Configuration
For Nginx servers, use a map directive in the http context (map blocks must sit outside any server block), then check the resulting variable inside your server block. This is more efficient than chaining multiple if statements because Nginx evaluates the map once per request:
# Block AI crawlers in Nginx
map $http_user_agent $is_ai_bot {
default 0;
~*GPTBot 1;
~*ChatGPT-User 1;
~*OAI-SearchBot 1;
~*ClaudeBot 1;
~*Bytespider 1;
~*CCBot 1;
~*Google-Extended 1;
~*Amazonbot 1;
~*FacebookBot 1;
~*PerplexityBot 1;
~*Cohere-ai 1;
~*Diffbot 1;
~*AppleBot-Extended 1;
}
server {
# ... your existing config ...
if ($is_ai_bot) {
return 403;
}
}
The ~* prefix makes each pattern case-insensitive. After adding these rules, reload Nginx with sudo nginx -s reload. The advantage of server-level blocking is that the bot receives a 403 immediately — no content is ever transferred.
Method 3: Cloudflare WAF Rules
If your site is behind Cloudflare, you have access to built-in AI bot blocking without touching server configs. Cloudflare added a dedicated toggle in 2024 specifically for AI crawlers.
One-Click Blocking
- Go to your Cloudflare dashboard and select your domain.
- Navigate to Security → Bots.
- Enable “Block AI Scrapers and Crawlers.” This single toggle blocks all known AI training bots, and Cloudflare updates the list automatically as new crawlers appear.
Custom WAF Rules
For more granular control, create a custom WAF rule. Go to Security → WAF → Custom rules and create a rule with this expression:
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Google-Extended") or
(http.user_agent contains "PerplexityBot") or
(http.user_agent contains "FacebookBot") or
(http.user_agent contains "Amazonbot") or
(http.user_agent contains "Diffbot") or
(http.user_agent contains "Cohere-ai")
Set the action to Block. This gives you the flexibility to allow some AI bots while blocking others — just remove their line from the expression. The advantage of CDN-level blocking is performance: the request never reaches your origin server, saving bandwidth and compute resources.
How to Verify Your Blocks Are Working
You've added robots.txt rules, server-level blocks, and meta tags. But how do you know they're actually working? Verification requires checking two things: that your rules are syntactically correct, and that blocked bots are actually receiving 403 responses.
Test Your robots.txt
Use the robots.txt report in Google Search Console to confirm Google can fetch and parse your file (the old standalone robots.txt Tester has been retired). You can also manually verify by visiting yoursite.com/robots.txt in a browser and confirming your rules appear correctly.
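For a quick local sanity check, Python's standard-library robotparser simulates how a compliant crawler interprets your rules. This sketch uses a trimmed copy of the Method 1 rules; swap in your real file's contents (or point set_url at your live robots.txt):

```python
from urllib.robotparser import RobotFileParser

# A trimmed version of the robots.txt rules from Method 1.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Listed bots should come back blocked; unlisted bots like Googlebot
# fall through to the (absent) default rules and stay allowed.
for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
    allowed = rp.can_fetch(agent, "https://example.com/blog/post")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

This only tells you what a well-behaved crawler would do — it says nothing about bots that ignore robots.txt, which is what the log checks below are for.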
Check Server Access Logs
Your web server access logs record every request, including the user agent string. Search for AI crawler activity and check the response codes:
# Search access logs for AI crawler activity
grep -iE "GPTBot|ClaudeBot|Bytespider|CCBot|PerplexityBot" \
/var/log/nginx/access.log | tail -20
# Check specifically for 403 responses (successful blocks)
grep -iE "GPTBot|ClaudeBot|Bytespider" \
/var/log/nginx/access.log | grep " 403 "
# Count AI bot requests over the last 24 hours
grep -iE "GPTBot|ClaudeBot|Bytespider|CCBot" \
/var/log/nginx/access.log | wc -l
If blocked bots still appear with 200 status codes, your server-level rules are not working correctly. You should see 403 responses if .htaccess or Nginx blocking is active. If you only have robots.txt rules (no server-level blocking), compliant bots simply will not request your pages — they will not appear in logs at all.
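If you prefer a scripted summary over ad-hoc grep, a short Python sketch can tally AI bot requests by status code from a combined-format access log. The sample lines and user agent strings here are illustrative stand-ins — in practice you would read /var/log/nginx/access.log:

```python
import re
from collections import Counter

# Illustrative sample lines in combined log format; replace with
# open("/var/log/nginx/access.log") in real use.
SAMPLE_LOG = [
    '20.1.2.3 - - [10/Jan/2026:10:01:22 +0000] "GET /blog/post HTTP/1.1" 403 153 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '52.7.8.9 - - [10/Jan/2026:10:02:45 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '66.2.3.4 - - [10/Jan/2026:10:03:10 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]

AI_BOTS = ("GPTBot", "ClaudeBot", "Bytespider", "CCBot", "PerplexityBot")
# Capture the status code after the quoted request, and the trailing UA string.
LINE_RE = re.compile(r'" (\d{3}) .*"([^"]*)"$')

hits = Counter()
for line in SAMPLE_LOG:
    m = LINE_RE.search(line)
    if not m:
        continue
    status, ua = m.groups()
    for bot in AI_BOTS:
        if bot.lower() in ua.lower():
            hits[(bot, status)] += 1

# A healthy block shows AI bots paired with 403, never 200.
for (bot, status), count in sorted(hits.items()):
    print(f"{bot} -> {status}: {count}")
```

Any AI bot appearing alongside a 200 in this tally is a bot your server-level rules are missing.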
For ongoing monitoring without manually digging through log files, Copper Analytics's crawler tracking automates this process. It shows you exactly which bots are visiting, how often, and which pages they target — all in a visual dashboard.
Keep in Mind
Not all AI companies respect robots.txt. Some crawlers use rotating user agent strings or disguise themselves as regular browsers. Monitoring with analytics is the only way to verify compliance and catch bots that slip through your rules.
The Trade-Off: Blocking vs Allowing
Before you block everything, consider whether a blanket block is the right strategy for your site. There are legitimate reasons to allow some AI crawler access:
- AI search visibility: Tools like Perplexity, ChatGPT with browsing, and Google's AI Overviews cite sources. If your content is blocked, you will not appear in AI-generated answers — and your competitors will. For some businesses, AI visibility is becoming as important as traditional SEO.
- Selective blocking: You might want to block training bots (GPTBot, ClaudeBot) while allowing retrieval bots (ChatGPT-User, PerplexityBot) that cite your content with a link back. This gives you exposure without contributing to model training.
- The visibility tradeoff: Every blocked bot is potential traffic you will never see. As AI-powered search grows, being absent from AI results could mean declining organic reach — similar to how ignoring mobile SEO a decade ago cost sites significant traffic.
- Content type matters: A news site with original reporting has stronger reasons to block than a SaaS company whose product pages benefit from broad exposure. Match your blocking strategy to your content's value proposition.
The most pragmatic approach is selective: block training-focused crawlers like GPTBot and Bytespider, but allow retrieval bots that drive referral traffic. Review your AI crawler bandwidth impact regularly and adjust as the landscape evolves.
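In robots.txt terms, the selective approach means listing only the training crawlers. Compliant bots with no matching group fall back to your default User-agent: * rules, so retrieval bots you leave out can still fetch and cite your pages:

```
# Selective blocking: training crawlers only
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

# ChatGPT-User and PerplexityBot are intentionally not listed,
# so they follow the default rules and can still cite your content.
```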
Monitor Before You Block with Copper Analytics
The smartest approach to AI crawlers is not to block blindly — it is to understand what is happening first. Before you add a single rule to your robots.txt, find out which bots are actually visiting your site, how much bandwidth they consume, and which pages they target most.
Copper Analytics's crawler tracking dashboard identifies AI bots by user agent, tracks their request volume over time, and shows you exactly which pages they are accessing. You can monitor AI crawler activity alongside your regular human traffic — all in one place, without digging through raw server logs.
Once you have data, you can make informed decisions: block the high-volume training bots eating your bandwidth, allow the retrieval bots sending you referral traffic, and verify that your blocking rules are actually working. Data-driven blocking beats guesswork every time.
See pricing plans or start your free account to get crawler tracking set up in under two minutes.
Know Which Bots Are Crawling Your Site
Block what you can. Track what you can't. Copper Analytics shows you every AI crawler hitting your site — in real time.
Get Started Free