AI Crawler Blocking Guide: Every Method Explained
From robots.txt to Cloudflare WAF rules, this is the definitive reference for blocking GPTBot, ClaudeBot, Bytespider, and every other AI crawler hitting your website.
A step-by-step guide to blocking AI crawlers from your website
robots.txt, server configs, WAF rules, and CDN-level blocking — all in one place
Why You Need a Complete AI Crawler Blocking Guide
AI crawlers are not like traditional search engine bots. Googlebot and Bingbot index your pages so users can find you through search results — there is a clear value exchange. AI training crawlers like GPTBot, ClaudeBot, Bytespider, and CCBot scrape your content to feed large language models, often providing no referral traffic in return.
The problem has grown dramatically. AI bot traffic increased over 300% between 2023 and 2025, and for content-heavy sites these crawlers can account for 30 to 50 percent of total bandwidth. Blocking them is not just a preference — it is a practical necessity for controlling server costs and protecting your intellectual property.
This <strong>AI crawler blocking guide</strong> is designed as a single, comprehensive reference. Rather than covering one method in isolation, we walk through every available blocking technique — from the simplest robots.txt directive to advanced CDN-level firewall rules — so you can layer defenses and verify they actually work.
- <strong>GPTBot</strong> (OpenAI) — One of the most aggressive AI crawlers by request volume
- <strong>ClaudeBot</strong> (Anthropic) — Respects robots.txt but crawls frequently
- <strong>Bytespider</strong> (ByteDance/TikTok) — Extremely high request rates, often ignores rate limits
- <strong>Google-Extended</strong> — Google's AI training crawler, separate from Googlebot
- <strong>PerplexityBot</strong> — Perplexity AI's content retrieval crawler
- <strong>CCBot</strong> (Common Crawl) — Feeds open datasets used by many AI companies
- <strong>Meta-ExternalAgent</strong> — Meta's AI training crawler
- <strong>Amazonbot</strong> — Amazon's crawler for Alexa and AI training
Important Distinction
Blocking AI crawlers is different from blocking search engine crawlers. Do not accidentally block Googlebot or Bingbot — those drive your organic search traffic. This guide focuses exclusively on AI training and retrieval bots.
Step 1: Block AI Crawlers with robots.txt Directives
The <code>robots.txt</code> file is the universal starting point for blocking AI crawlers. It sits at the root of your domain and tells well-behaved bots which paths they are allowed to crawl. Every major AI company has committed to respecting robots.txt — though compliance varies in practice.
To block all major AI crawlers, add the following directives to your <code>robots.txt</code> file. Each <code>User-agent</code> line targets a specific bot, and <code>Disallow: /</code> blocks access to your entire site.
Place this file at <code>https://yourdomain.com/robots.txt</code>. If you already have a robots.txt file, append these directives below your existing rules for Googlebot and other search engine crawlers.
- Open or create the <code>robots.txt</code> file in your site's root directory (public folder for most frameworks)
- Add <code>User-agent: GPTBot</code> followed by <code>Disallow: /</code> on the next line — repeat for each AI crawler: ClaudeBot, Bytespider, Google-Extended, PerplexityBot, CCBot, Meta-ExternalAgent, amazonbot
- Verify the file is accessible by visiting <code>https://yourdomain.com/robots.txt</code> in your browser
- Test with the robots.txt report in Google Search Console, or fetch the file with <code>curl https://yourdomain.com/robots.txt</code> to confirm it is served correctly (no user-agent flag is needed; the same robots.txt is served to every client)
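Putting the steps above together, a complete block list might look like the following. The user-agent tokens reflect each vendor's published crawler names at the time of writing; new crawlers appear regularly, so treat this as a starting point rather than an exhaustive list.

```
# Block AI training and retrieval crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Amazonbot
Disallow: /
```

Keep any existing rules for Googlebot and Bingbot above or below this section — each <code>User-agent</code> group is evaluated independently, so these directives will not affect search engine crawlers.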
Step 2: Enforce Blocks at the Server Level with Nginx
The critical limitation of robots.txt is that it is advisory — not enforced. A bot that ignores your robots.txt can still crawl your entire site. Server-level blocking with Nginx returns a <code>403 Forbidden</code> response to AI crawlers, making it impossible for them to retrieve your content regardless of whether they respect robots.txt.
Nginx's <code>if</code> directive with <code>$http_user_agent</code> matching is the most common approach. You add a block inside your <code>server</code> configuration that checks the User-Agent header against known AI crawler strings and returns a 403 if there is a match.
This method is highly performant because Nginx evaluates the condition before any backend processing occurs. The request is rejected at the web server layer, so your application never even sees the traffic. For high-traffic sites receiving thousands of AI crawler requests per hour, this alone can noticeably reduce server load.
- Open your Nginx site configuration file — typically at <code>/etc/nginx/sites-available/yourdomain.conf</code> or <code>/etc/nginx/conf.d/yourdomain.conf</code>
- Inside the <code>server { }</code> block, add an <code>if ($http_user_agent ~* "(GPTBot|ClaudeBot|Bytespider|Google-Extended|PerplexityBot|CCBot|Meta-ExternalAgent|amazonbot)") { return 403; }</code> directive
- Test the configuration with <code>sudo nginx -t</code> to ensure there are no syntax errors
- Reload Nginx with <code>sudo systemctl reload nginx</code> to apply the changes
- Verify by sending a test request: <code>curl -I -A "GPTBot/1.0" https://yourdomain.com/</code> — you should see a 403 status code
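The steps above can be sketched as a minimal server block. The domain and upstream address are placeholders for your own configuration:

```nginx
server {
    listen 80;
    server_name yourdomain.com;

    # Reject known AI crawlers before any backend processing.
    # ~* makes the regex match case-insensitive.
    if ($http_user_agent ~* "(GPTBot|ClaudeBot|Bytespider|Google-Extended|PerplexityBot|CCBot|Meta-ExternalAgent|Amazonbot)") {
        return 403;
    }

    location / {
        proxy_pass http://127.0.0.1:3000;  # placeholder: your application backend
    }
}
```

Because the <code>if</code> block contains only a <code>return</code>, it avoids the well-known pitfalls of Nginx's <code>if</code> directive and is safe to use at the <code>server</code> level.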
Pro Tip
Use a map block instead of nested if statements for better performance when blocking many user agents. Define a $block_bot variable in an Nginx map and reference it in your server block — this is more efficient and easier to maintain as new AI crawlers appear.
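The map approach described in the tip above can be sketched as follows; the file path is a conventional location, not a requirement:

```nginx
# In the http { } context, e.g. /etc/nginx/conf.d/block-ai-bots.conf
map $http_user_agent $block_bot {
    default                 0;
    ~*GPTBot                1;
    ~*ClaudeBot             1;
    ~*Bytespider            1;
    ~*Google-Extended       1;
    ~*PerplexityBot         1;
    ~*CCBot                 1;
    ~*Meta-ExternalAgent    1;
    ~*Amazonbot             1;
}

# In each server { } block you want protected:
server {
    # listen, server_name, locations, etc.
    if ($block_bot) {
        return 403;
    }
}
```

Adding a new crawler later is a one-line change to the map, and the lookup is computed once per request rather than re-evaluating a long regex in every server block.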
Step 3: Block AI Crawlers with Apache and .htaccess
If your site runs on Apache — or you are on shared hosting where you cannot edit the main server configuration — <code>.htaccess</code> rules provide an equivalent server-level block. Like Nginx rules, these enforce the block at the HTTP layer, returning a 403 before your application processes the request.
Apache uses <code>mod_rewrite</code> or <code>mod_setenvif</code> to inspect the User-Agent header. The <code>mod_rewrite</code> approach is more common because most shared hosting providers have it enabled by default. You add <code>RewriteCond</code> directives that match AI crawler user agents and a <code>RewriteRule</code> that returns a 403 Forbidden response.
For sites running Apache directly (not behind a reverse proxy), this is often the simplest blocking method available. Create or edit the <code>.htaccess</code> file in your document root and add the rewrite conditions for each AI crawler you want to block.
- <strong>mod_rewrite method:</strong> Use <code>RewriteCond %{HTTP_USER_AGENT}</code> with a regex matching GPTBot|ClaudeBot|Bytespider and other AI bots, followed by <code>RewriteRule .* - [F,L]</code>
- <strong>mod_setenvif method:</strong> Use <code>SetEnvIfNoCase User-Agent "GPTBot" bad_bot</code> for each crawler, then <code>Deny from env=bad_bot</code> inside an <code>Order Allow,Deny</code> block (Apache 2.2 syntax; on Apache 2.4 use <code>Require not env bad_bot</code> inside a <code>&lt;RequireAll&gt;</code> block instead)
- <strong>Virtual host method:</strong> If you have access to the Apache virtual host config, place the rules there instead of .htaccess for better performance
- <strong>Shared hosting note:</strong> Most cPanel and Plesk hosts support .htaccess rules — check that <code>AllowOverride All</code> is enabled for your directory
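The mod_rewrite method from the list above can be sketched as a short <code>.htaccess</code> fragment. The bot list mirrors the one used elsewhere in this guide; <code>[NC]</code> makes the match case-insensitive and <code>[F]</code> returns 403 Forbidden:

```apache
# .htaccess in your document root — requires mod_rewrite
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|Google-Extended|PerplexityBot|CCBot|Meta-ExternalAgent|Amazonbot) [NC]
RewriteRule .* - [F,L]
```

If you have access to the virtual host configuration, placing these same directives there avoids the per-request <code>.htaccess</code> lookup and performs better.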
Step 4: Block AI Crawlers at the Edge with Cloudflare WAF and CDN Rules
CDN-level blocking is the most powerful layer in your blocking strategy because it stops requests at the network edge — before they ever reach your origin server. Cloudflare, Fastly, AWS CloudFront, and other CDN providers offer Web Application Firewall (WAF) rules that can match on user-agent strings, IP ranges, ASNs, and even behavioral patterns.
Cloudflare is the most widely used option for AI crawler blocking. Their free tier includes five custom WAF rules, which is enough to cover the major AI crawlers. You create a rule that matches on the <code>User-Agent</code> header containing any of the known AI crawler strings and set the action to <strong>Block</strong> or <strong>Managed Challenge</strong>.
The advantage of CDN-level blocking over server-level blocking is scale and intelligence. Cloudflare maintains its own bot intelligence database and can identify AI crawlers even when they rotate user-agent strings or crawl from residential IP addresses. Their "AI Scrapers and Crawlers" managed ruleset can be enabled with a single toggle and covers dozens of known AI bots.
If you use a different CDN provider, the principle is the same: create a WAF rule or edge function that inspects the User-Agent header and blocks requests matching AI crawler patterns. AWS WAF, Fastly VCL, and Vercel Edge Middleware all support this pattern.
- Log in to your Cloudflare dashboard and select your domain
- Navigate to <strong>Security > WAF > Custom rules</strong> and click <strong>Create rule</strong>
- Set the expression to match User-Agent containing GPTBot, ClaudeBot, Bytespider, Google-Extended, PerplexityBot, CCBot, Meta-ExternalAgent, or Amazonbot using the <code>OR</code> operator
- Set the action to <strong>Block</strong> (returns 403) or <strong>Managed Challenge</strong> (presents a browser challenge that automated clients typically cannot pass)
- Alternatively, go to <strong>Security > Bots</strong> and enable the <strong>AI Scrapers and Crawlers</strong> toggle to use Cloudflare's managed ruleset
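In Cloudflare's expression editor, the custom rule from the steps above looks roughly like this. Note that <code>contains</code> is case-sensitive in the Cloudflare Rules language, so the tokens below use each crawler's official casing; if you prefer a case-insensitive match, wrap the field in the <code>lower()</code> transformation and use lowercase tokens.

```
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "Google-Extended") or
(http.user_agent contains "PerplexityBot") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Meta-ExternalAgent") or
(http.user_agent contains "Amazonbot")
```

This entire expression counts as a single custom rule, so it fits within the free tier's rule quota.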
Building a Layered AI Crawler Blocking Strategy
No single blocking method is sufficient on its own. Robots.txt is widely respected but unenforceable. Server-level blocks catch bots that ignore robots.txt but only work if the request reaches your origin. CDN rules stop bots at the edge but require a CDN subscription. The most effective approach layers all three methods together.
Start with robots.txt as your public declaration of intent — this establishes your legal position and blocks compliant crawlers. Add server-level rules (Nginx or Apache) as your enforcement layer for bots that reach your origin despite CDN protections. Finally, deploy CDN-level WAF rules as your outermost defense, stopping the majority of AI crawler traffic before it consumes any of your server resources.
You should also consider rate limiting as a complementary strategy. Rather than outright blocking certain crawlers, you can throttle them to a small number of requests per minute. This is useful for crawlers like Google-Extended where you might want to allow limited access while preventing the aggressive crawl rates that spike your bandwidth.
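As a sketch of the rate-limiting idea, here is one way to throttle a single crawler in Nginx using <code>limit_req</code>. The zone name, rate, and burst values are illustrative assumptions, not recommendations; requests whose key maps to an empty string are not rate limited at all:

```nginx
# http { } context: give throttled crawlers their own rate-limit key
map $http_user_agent $ai_limit_key {
    default            "";                   # empty key = not rate limited
    ~*Google-Extended  $binary_remote_addr;  # throttle this crawler per source IP
}

# Allow roughly 6 requests per minute per source IP for matched crawlers
limit_req_zone $ai_limit_key zone=ai_bots:10m rate=6r/m;

server {
    # listen, server_name, etc.
    location / {
        limit_req zone=ai_bots burst=3 nodelay;
        # root / proxy_pass as usual
    }
}
```

Excess requests receive a 503 by default (configurable via <code>limit_req_status</code>), which well-behaved crawlers interpret as a signal to back off.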
- <strong>Layer 1 — robots.txt:</strong> Blocks compliant crawlers and establishes your legal position on AI scraping
- <strong>Layer 2 — Server config (Nginx/Apache/.htaccess):</strong> Hard enforcement returning 403 for known AI user agents
- <strong>Layer 3 — CDN/WAF rules (Cloudflare, AWS WAF, etc.):</strong> Edge-level blocking with bot intelligence and behavioral analysis
- <strong>Layer 4 — Rate limiting:</strong> Throttle rather than block for crawlers you want to allow at reduced volume
- <strong>Layer 5 — Monitoring and verification:</strong> Confirm blocks are working and detect new crawlers that bypass your rules
Defense in Depth
Think of AI crawler blocking like network security — you want defense in depth. Even if one layer fails (a bot ignores robots.txt) or is bypassed (traffic bypasses your CDN), the next layer catches it. Update your rules quarterly as new AI crawlers emerge.
Step 5: Verify Your AI Crawler Blocks Are Actually Working
The most common mistake when blocking AI crawlers is assuming the blocks work without verification. You add robots.txt rules, update your Nginx config, and move on — only to discover weeks later that Bytespider has been ignoring your robots.txt the entire time, or that a new AI crawler you did not account for is hammering your site.
Verification requires monitoring your actual traffic data. Check your server access logs for AI crawler user-agent strings after deploying your blocks. Use <code>grep</code> or a log analysis tool to search for GPTBot, ClaudeBot, Bytespider, and the other user agents you blocked. If you still see 200 status codes for these bots, your blocks are not working correctly.
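As a minimal sketch of that log check, the following Python snippet scans combined-format access log lines (the default for Nginx and Apache) and reports any AI crawler that still received a 2xx response. The bot list and helper name are this guide's own, not part of any library:

```python
import re

# User-agent substrings for the AI crawlers blocked in this guide.
AI_BOTS = ["GPTBot", "ClaudeBot", "Bytespider", "Google-Extended",
           "PerplexityBot", "CCBot", "Meta-ExternalAgent", "Amazonbot"]

# Matches '" <status> <bytes> "<referer>" "<user-agent>"' in a combined-format line.
LOG_LINE = re.compile(r'" (\d{3}) (?:\d+|-) "[^"]*" "([^"]*)"')

def unblocked_bot_hits(log_lines):
    """Return (bot, status) pairs where a known AI crawler got a 2xx response,
    meaning no blocking layer caught it."""
    hits = []
    for line in log_lines:
        m = LOG_LINE.search(line)
        if not m:
            continue
        status, user_agent = int(m.group(1)), m.group(2)
        for bot in AI_BOTS:
            if bot.lower() in user_agent.lower() and 200 <= status < 300:
                hits.append((bot, status))
    return hits
```

Run it over your access log (for example, <code>unblocked_bot_hits(open("/var/log/nginx/access.log"))</code>); an empty result means every known AI crawler is being rejected, while any 200-level hits pinpoint which bot slipped through.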
<a href="/features/crawler-tracking">Copper Analytics</a> provides a dedicated crawler tracking dashboard that identifies AI bots by user agent, tracks their request volume over time, and shows you exactly which pages they are accessing. After deploying your blocking rules, you can monitor the dashboard to confirm that AI crawler traffic drops to zero — or shifts from 200 responses to 403 responses, depending on which layer caught them.
Set up a recurring check — monthly at minimum — to review your AI crawler traffic. New bots appear regularly, and existing bots update their user-agent strings. A blocking strategy that worked six months ago may have gaps today. Copper Analytics makes this easy by alerting you when new crawler user agents appear on your site.
What to Do Next
The right stack depends on how much visibility, workflow control, and reporting depth you need. If you want a simpler way to centralize site reporting and operational data, compare plans on the pricing page and start with a free Copper Analytics account.
You can also keep exploring related guides from the Copper Analytics blog to compare tools, setup patterns, and reporting workflows before making a decision.