How to Block AI Crawlers (GPTBot, ClaudeBot, and Others)
AI training bots are crawling your website right now. Here's how to control which ones get access to your content — and which ones get blocked at the door.
Why You Might Want to Block AI Crawlers
AI companies are crawling the web at unprecedented scale. OpenAI's GPTBot, Anthropic's ClaudeBot, ByteDance's Bytespider, and dozens of others are scraping website content to train large language models. For site owners, this raises three immediate concerns.
13+
Known AI crawlers
100s
Requests/min (worst)
Zero
Attribution given
4
Blocking methods
AI crawlers are aggressive. Unlike Googlebot, which is designed to crawl politely with rate limits, some AI bots hammer your server with hundreds of requests per minute. Bytespider in particular has earned a reputation for excessive crawling that can spike bandwidth costs and slow down your site for real visitors. If you're on a metered hosting plan, these bots are literally costing you money.
When an AI model trains on your content, your words become part of its output — without attribution, compensation, or consent. Publishers, journalists, and creators have legitimate reasons to block AI crawlers from accessing their work. Whether you view this as a copyright issue or a business decision, the right to opt out is yours.
What AI companies say
Most major AI companies now publish their crawler user agents and commit to respecting <strong>robots.txt disallow rules</strong>. OpenAI, Anthropic, Google, and others provide official opt-out mechanisms.
What actually happens
Not all crawlers respect robots.txt — some use <strong>rotating user agents</strong> or disguise themselves as regular browsers. The EU AI Act reinforces opt-out rights, but enforcement remains inconsistent.
Complete List of AI Crawler User Agents
Before you start blocking, you need to know exactly which bots to target. Here is a comprehensive list of known AI crawler user agents as of early 2026, organized by company:
| User Agent | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Model training data collection |
| ChatGPT-User | OpenAI | Real-time browsing for ChatGPT users |
| OAI-SearchBot | OpenAI | SearchGPT web search indexing |
| ClaudeBot | Anthropic | Model training data collection |
| Bytespider | ByteDance | Training data for TikTok/Doubao AI |
| CCBot | Common Crawl | Open dataset used by many AI projects |
| Google-Extended | Google | Gemini and AI Overviews training |
| Amazonbot | Amazon | Alexa AI and product recommendations |
| FacebookBot | Meta | Llama model training |
| PerplexityBot | Perplexity AI | AI search engine indexing |
| Cohere-ai | Cohere | Enterprise AI model training |
| Diffbot | Diffbot | Knowledge graph and structured data extraction |
| Applebot-Extended | Apple | Apple Intelligence training |
Important Distinction
<strong>GPTBot</strong> collects training data. <strong>ChatGPT-User</strong> is the real-time browsing agent used when ChatGPT users click “Browse the web.” Similarly, <strong>Google-Extended</strong> is for AI training only — blocking it does <strong>NOT</strong> affect your regular Google Search ranking. These are separate crawlers with separate purposes.
Blocking Methods at a Glance
There are four primary ways to block AI crawlers from your website. Each operates at a different layer, and the strongest protection combines multiple methods:
robots.txt rules
The simplest opt-out. Add <code>Disallow: /</code> directives per bot — respected by compliant crawlers but not enforceable.
.htaccess / Nginx blocking
Server-level enforcement. Returns <code>403 Forbidden</code> before any content is served — bots receive nothing regardless of compliance.
Cloudflare WAF rules
CDN-level blocking. Requests never reach your origin server, saving bandwidth and compute. Cloudflare auto-updates the known bot list.
HTML meta tags (noai, noimageai)
Page-level opt-out signal. Fine-grained control over which pages are excluded from AI training — not yet universally recognized.
Best Practice
Use robots.txt as your first line of defense, add server-level or CDN blocking for enforcement, and include meta tags for page-level control. Layered protection catches crawlers that slip through any single method.
Method 1: robots.txt Rules
The simplest way to block AI bots from your website is through your robots.txt file. This file lives at the root of your domain (e.g., yoursite.com/robots.txt) and tells crawlers which areas they can and cannot access. Add the following rules to block all major AI crawlers:
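A sketch of those rules, covering the user agents from the table above. This is a request-based opt-out, not enforcement; remove the entries for any bots you want to keep, such as retrieval crawlers like ChatGPT-User.

```txt
# Block AI training and scraping crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```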
Tip
Paste these rules at the end of your existing robots.txt file. Keep your existing rules for Googlebot, Bingbot, and other search engine crawlers intact — those are separate and unaffected. If you only want to protect certain directories (e.g., your blog), replace <code>/</code> with a specific path like <code>/blog/</code>.
Each <code>User-agent</code> directive targets a specific bot by name, and <code>Disallow: /</code> blocks access to your entire site. The limitation of robots.txt is that it is a request, not enforcement. Well-behaved bots respect it, but nothing technically prevents a rogue crawler from ignoring it entirely. That is why server-level blocking matters.
Method 2: .htaccess and Nginx Server Rules
For stronger protection, block AI crawlers at the server level. This returns a <code>403 Forbidden</code> response before any content is served — the bot receives nothing at all, regardless of whether it respects robots.txt.
If you run Apache, add these rewrite rules to your <code>.htaccess</code> file:
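A sketch of such rules, grouping the user agents from the table above into a few alternation patterns to keep each line readable:

```apache
# Return 403 to known AI crawlers before any content is served
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Bytespider|CCBot|Google-Extended|Amazonbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (FacebookBot|PerplexityBot|cohere-ai|Diffbot|Applebot-Extended) [NC]
RewriteRule .* - [F,L]
```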
The <code>[NC]</code> flag makes each match case-insensitive, and <code>[F,L]</code> returns a 403 Forbidden response and stops processing further rules. The last condition in the chain omits the <code>[OR]</code> flag to properly close the rule.
For Nginx servers, use a <code>map</code> directive at the <code>http</code> level of your configuration, then check the mapped flag inside your server block. This is more efficient than chaining multiple <code>if</code> statements because Nginx evaluates the map once per request:
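A sketch assuming a standard Nginx layout; the user-agent patterns come from the table above, and the placeholder server block stands in for your existing one:

```nginx
# http {} context: map AI-crawler user agents to a $block_ai flag
map $http_user_agent $block_ai {
    default               0;
    ~*gptbot              1;
    ~*chatgpt-user        1;
    ~*oai-searchbot       1;
    ~*claudebot           1;
    ~*bytespider          1;
    ~*ccbot               1;
    ~*google-extended     1;
    ~*amazonbot           1;
    ~*facebookbot         1;
    ~*perplexitybot       1;
    ~*cohere-ai           1;
    ~*diffbot             1;
    ~*applebot-extended   1;
}

# server {} context: refuse flagged requests before serving anything
server {
    # ... your existing listen / server_name / root directives ...
    if ($block_ai) {
        return 403;
    }
}
```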
The <code>~*</code> prefix makes each pattern case-insensitive. After adding these rules, reload Nginx with <code>sudo nginx -s reload</code>. The advantage of server-level blocking is that the bot receives a 403 immediately — no content is ever transferred.
Method 3: Cloudflare WAF Rules
If your site is behind Cloudflare, you have access to built-in AI bot blocking without touching server configs. Cloudflare added a dedicated toggle in 2024 specifically for AI crawlers.
- Go to your Cloudflare dashboard and select your domain.
- Navigate to <strong>Security → Bots</strong>.
- Enable <strong>“Block AI Scrapers and Crawlers.”</strong> This single toggle blocks all known AI training bots, and Cloudflare updates the list automatically as new crawlers appear.

For more granular control, create a custom WAF rule. Go to <strong>Security → WAF → Custom rules</strong> and create a rule with this expression:
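A sketch of such a custom-rule expression, using user agents from the table above; trim the clauses to match your own policy:

```txt
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ChatGPT-User") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Google-Extended") or
(http.user_agent contains "PerplexityBot")
```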
Set the action to <strong>Block</strong>. This gives you the flexibility to allow some AI bots while blocking others — just remove their clause from the expression. The advantage of CDN-level blocking is performance: the request never reaches your origin server, saving bandwidth and compute resources.
Setup Comparison: Which Method Is Right for You?
Each blocking method has different setup complexity, enforcement strength, and maintenance requirements. Choose based on your technical comfort level and how aggressively you need to block:
robots.txt
Edit a single text file. No server restart needed. Works on any hosting platform including shared hosting and static sites. Setup: ~2 minutes · Enforcement: advisory
.htaccess / Nginx
Edit server config files. Requires server access and a reload. Returns<code>403 Forbidden</code>before any content is served. Setup: ~10 minutes · Enforcement: hard block
Cloudflare WAF
Toggle a switch or create a custom rule in the Cloudflare dashboard. Blocks at the CDN edge — requests never reach your server. Setup: ~5 minutes · Enforcement: hard block
Meta tags
Add <code><meta></code> tags to individual pages. Per-page control over AI training opt-out signals. Setup: ~5 minutes · Enforcement: advisory
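The meta-tag signal looks like the snippet below. Note that <code>noai</code> and <code>noimageai</code> are an emerging convention rather than a ratified standard, so treat them as a supplementary signal on top of robots.txt and server rules:

```html
<!-- Page-level AI opt-out; honored by some, but not all, AI crawlers -->
<meta name="robots" content="noai, noimageai">
```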
Verdict
For most sites,<strong>robots.txt plus Cloudflare WAF</strong>gives the best protection-to-effort ratio. If you run your own Apache or Nginx server and don't use Cloudflare, server-level rules are essential since robots.txt alone is not enforceable.
How to Verify Your Blocks Are Working
You've added robots.txt rules, server-level blocks, and meta tags. But how do you know they're actually working? Verification requires checking two things: that your rules are syntactically correct, and that blocked bots are actually receiving 403 responses.
Use the robots.txt report in Google Search Console (the successor to the old robots.txt Tester) to validate syntax. You can also manually verify by visiting yoursite.com/robots.txt in a browser and confirming your rules appear correctly.
Your web server access logs record every request, including the user agent string. Search for AI crawler activity and check the response codes:
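A quick way to do that, sketched below with two sample log lines standing in for a real combined-format access log; in practice, point the <code>grep</code> at <code>/var/log/nginx/access.log</code> or your server's equivalent:

```shell
# Tally HTTP status codes for AI-crawler requests in an access log.
# The mktemp + heredoc below just fabricates sample data for the demo.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
1.2.3.4 - - [10/Jan/2026:10:00:00 +0000] "GET / HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"
5.6.7.8 - - [10/Jan/2026:10:01:00 +0000] "GET /blog HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"
EOF

# Field 9 of the combined log format is the response status code
grep -iE 'gptbot|claudebot|bytespider|ccbot|perplexitybot' "$LOG" \
  | awk '{print $9}' | sort | uniq -c
```

A healthy blocking setup shows 403s for every matched bot; any 200s mean a crawler is still getting through.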
If blocked bots still appear with 200 status codes, your server-level rules are not working correctly. You should see 403 responses if .htaccess or Nginx blocking is active. If you only have robots.txt rules (no server-level blocking), compliant bots simply will not request your pages — they will not appear in logs at all.
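You can also spot-check a server-level block from the command line by sending a request with a bot's user-agent string (replace yoursite.com with your own domain); a working block prints 403:

```txt
curl -s -o /dev/null -w "%{http_code}\n" -A "GPTBot" https://yoursite.com/
```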
For ongoing monitoring without manually digging through log files, Copper Analytics's crawler tracking automates this process. It shows you exactly which bots are visiting, how often, and which pages they target — all in a visual dashboard.
Keep in Mind
Not all AI companies respect robots.txt. Some crawlers use rotating user agent strings or disguise themselves as regular browsers. Monitoring with analytics is the only way to verify compliance and catch bots that slip through your rules.
The Trade-Off: Blocking vs Allowing
Before you block everything, consider whether a blanket block is the right strategy for your site. There are legitimate reasons to allow some AI crawler access:
AI search visibility
Tools like Perplexity, ChatGPT with browsing, and Google's AI Overviews cite sources. If your content is blocked, you will not appear in AI-generated answers — and your competitors will.
Selective blocking
Block training bots (GPTBot, ClaudeBot) while allowing retrieval bots (ChatGPT-User, PerplexityBot) that cite your content with a link back. Exposure without contributing to model training.
The visibility tradeoff
Every blocked bot is potential traffic you will never see. As AI-powered search grows, being absent from AI results could mean declining organic reach — similar to how ignoring mobile SEO cost sites significant traffic a decade ago.
Content type matters
A news site with original reporting has stronger reasons to block than a SaaS company whose product pages benefit from broad exposure. Match your blocking strategy to your content's value proposition.
Verdict
The most pragmatic approach is<strong>selective blocking</strong>: block training-focused crawlers like GPTBot and Bytespider, but allow retrieval bots that drive referral traffic. Review your <a href="/blog/ai-crawlers-website-impact">AI crawler bandwidth impact</a> regularly and adjust as the landscape evolves.
Monitor Before You Block with Copper Analytics
The smartest approach to AI crawlers is not to block blindly — it is to understand what is happening first. Before you add a single rule to your robots.txt, find out which bots are actually visiting your site, how much bandwidth they consume, and which pages they target most.
Copper Analytics's crawler tracking dashboard identifies AI bots by user agent, tracks their request volume over time, and shows you exactly which pages they are accessing. You can monitor AI crawler activity alongside your regular human traffic — all in one place, without digging through raw server logs.
Bot identification
See every AI crawler by name — GPTBot, ClaudeBot, Bytespider, and more — with request counts and page targets.
Block verification
Confirm your robots.txt and server rules are working. If a blocked bot still appears, you know your rules need fixing.
Bandwidth insights
Quantify how much traffic AI crawlers consume so you can make data-driven blocking decisions.
Free tier included
Get started without a credit card. Crawler tracking is available on all plans, including the permanent free tier.
Once you have data, you can make informed decisions: block the high-volume training bots eating your bandwidth, allow the retrieval bots sending you referral traffic, and verify that your blocking rules are actually working. Data-driven blocking beats guesswork every time.
Block with confidence
Use robots.txt for compliant crawlers, server-level rules or Cloudflare WAF for enforcement, and meta tags for page-level control. Layer all four methods for the strongest protection against unauthorized AI training.
Monitor with Copper Analytics
See <a href="/pricing">pricing plans</a> or <a href="/register">start your free account</a> to get crawler tracking set up in under two minutes. Know exactly which bots are hitting your site — and whether your blocks are working.
Know Which Bots Are Crawling Your Site
Block what you can. Track what you can't. Copper Analytics shows you every AI crawler hitting your site — in real time.
What to Do Next
The right stack depends on how much visibility, workflow control, and reporting depth you need. If you want a simpler way to centralize site reporting and operational data, compare plans on the pricing page and start with a free Copper Analytics account.
You can also keep exploring related guides from the Copper Analytics blog to compare tools, setup patterns, and reporting workflows before making a decision.