March 5, 2026 · 9 min read
Technical

How to Block AI Crawlers (GPTBot, ClaudeBot, and Others)

AI training bots are crawling your website right now. Here's how to control which ones get access to your content — and which ones get blocked at the door.

[Illustration: a server blocking AI crawler bots, with a digital shield]

At a Glance

  • AI crawlers like GPTBot, ClaudeBot, and Bytespider scrape your content for model training — often without asking.
  • You can block AI bots using robots.txt, .htaccess, Nginx configs, Cloudflare WAF, or HTML meta tags.
  • Blocking Google-Extended prevents AI feature inclusion but does not affect regular search indexing.
  • Not all AI companies respect robots.txt — server-level blocking and analytics monitoring are essential fallbacks.
  • The noai and noimageai meta tags provide an additional opt-out signal at the page level.
  • Copper Analytics lets you track which AI crawlers are hitting your site so you can verify your blocks are working.

Why You Might Want to Block AI Crawlers

AI companies are crawling the web at unprecedented scale. OpenAI's GPTBot, Anthropic's ClaudeBot, ByteDance's Bytespider, and dozens of others are scraping website content to train large language models. For site owners, this raises three immediate concerns.

Bandwidth and Server Load

AI crawlers are aggressive. Unlike Googlebot, which is designed to crawl politely with rate limits, some AI bots hammer your server with hundreds of requests per minute. Bytespider in particular has earned a reputation for excessive crawling that can spike bandwidth costs and slow down your site for real visitors. If you're on a metered hosting plan, these bots are literally costing you money.

Content Licensing and IP Protection

When an AI model trains on your content, your words become part of its output — without attribution, compensation, or consent. Publishers, journalists, and creators have legitimate reasons to block AI crawlers from accessing their work. Whether you view this as a copyright issue or a business decision, the right to opt out is yours.

Your Right to Opt Out

Most major AI companies now offer opt-out mechanisms via robots.txt. OpenAI, Anthropic, Google, and others have published their crawler user agents and committed to respecting disallow rules. The EU AI Act and similar regulations are reinforcing the principle that content owners should control how their data is used for AI training.

Complete List of AI Crawler User Agents

Before you start blocking, you need to know exactly which bots to target. Here is a comprehensive list of known AI crawler user agents as of early 2026, organized by company:

| User Agent | Company | Purpose |
| --- | --- | --- |
| GPTBot | OpenAI | Model training data collection |
| ChatGPT-User | OpenAI | Real-time browsing for ChatGPT users |
| OAI-SearchBot | OpenAI | SearchGPT web search indexing |
| ClaudeBot | Anthropic | Model training data collection |
| Bytespider | ByteDance | Training data for TikTok/Doubao AI |
| CCBot | Common Crawl | Open dataset used by many AI projects |
| Google-Extended | Google | Gemini and AI Overviews training |
| Amazonbot | Amazon | Alexa AI and product recommendations |
| FacebookBot | Meta | Llama model training |
| PerplexityBot | Perplexity AI | AI search engine indexing |
| Cohere-ai | Cohere | Enterprise AI model training |
| Diffbot | Diffbot | Knowledge graph and structured data extraction |
| AppleBot-Extended | Apple | Apple Intelligence training |

Important Distinction

GPTBot collects training data. ChatGPT-User is the real-time browsing agent used when ChatGPT users click “Browse the web.” Similarly, Google-Extended is for AI training only — blocking it does NOT affect your regular Google Search ranking. These are separate crawlers with separate purposes.

Method 1: robots.txt Rules

The simplest way to block AI bots from your website is through your robots.txt file. This file lives at the root of your domain (e.g., yoursite.com/robots.txt) and tells crawlers which areas they can and cannot access. Add the following rules to block all major AI crawlers:

robots.txt
# Block OpenAI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

# Block Anthropic crawler
User-agent: ClaudeBot
Disallow: /

# Block ByteDance crawler
User-agent: Bytespider
Disallow: /

# Block Common Crawl (used by many AI projects)
User-agent: CCBot
Disallow: /

# Block Google AI training (not regular search)
User-agent: Google-Extended
Disallow: /

# Block Amazon AI crawler
User-agent: Amazonbot
Disallow: /

# Block Meta/Facebook AI crawler
User-agent: FacebookBot
Disallow: /

# Block Perplexity AI crawler
User-agent: PerplexityBot
Disallow: /

# Block Cohere AI crawler
User-agent: Cohere-ai
Disallow: /

# Block Diffbot crawler
User-agent: Diffbot
Disallow: /

# Block Apple Intelligence training
User-agent: AppleBot-Extended
Disallow: /

Tip

Paste these rules at the end of your existing robots.txt file. Keep your existing rules for Googlebot, Bingbot, and other search engine crawlers intact — those are separate and unaffected. If you only want to protect certain directories (e.g., your blog), replace / with a specific path like /blog/.

Each User-agent directive targets a specific bot by name, and Disallow: / blocks access to your entire site. The limitation of robots.txt is that it is a request, not enforcement. Well-behaved bots respect it, but nothing technically prevents a rogue crawler from ignoring it entirely. That is why server-level blocking matters.
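Before deploying, you can check how a compliant parser will interpret your rules using Python's standard-library robotparser. The excerpt below is a shortened version of the rules above, and the example.com URLs are placeholders:

```python
from urllib import robotparser

# A minimal excerpt of the robots.txt rules above.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Compliant AI crawlers are denied everywhere...
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))     # False
print(rp.can_fetch("ClaudeBot", "https://example.com/"))           # False
# ...while unlisted crawlers such as Googlebot remain unaffected.
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
```

This mirrors what a well-behaved bot does: fetch the file, find its own User-agent entry, and honor the Disallow rules. A non-compliant bot skips this step entirely, which is exactly the gap server-level blocking closes.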

Method 2: .htaccess and Nginx Server Rules

For stronger protection, block AI crawlers at the server level. This returns a 403 Forbidden response before any content is served — the bot receives nothing at all, regardless of whether it respects robots.txt.

Apache .htaccess

If you run Apache, add these rewrite rules to your .htaccess file:

.htaccess
# Block AI crawlers at the server level
RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} GPTBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ChatGPT-User [NC,OR]
RewriteCond %{HTTP_USER_AGENT} OAI-SearchBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CCBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Google-Extended [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Amazonbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FacebookBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PerplexityBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Cohere-ai [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Diffbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} AppleBot-Extended [NC]
RewriteRule .* - [F,L]

The [NC] flag makes the match case-insensitive, and [F,L] returns a 403 Forbidden and stops processing. The last condition in the chain omits the [OR] flag to properly close the rule.
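You can sanity-check the case-insensitive substring matching these conditions perform with grep, whose -i flag mirrors [NC]. The user-agent string below is an invented example, not a real bot signature:

```shell
# Simulate the case-insensitive substring match that [NC] gives each
# RewriteCond. The user-agent string is a made-up example.
ua="Mozilla/5.0 (compatible; gptbot/1.2; +https://example.com/bot)"
if echo "$ua" | grep -qiE "GPTBot|ClaudeBot|Bytespider|CCBot"; then
  echo "blocked: 403"
else
  echo "allowed"
fi
```

Note that lowercase "gptbot" still matches, just as it would under [NC]; bots that vary their user-agent casing do not slip through.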

Nginx Configuration

For Nginx servers, use a map directive. Note that map blocks must be declared at the http level of your configuration, while the corresponding check lives inside your server block. This is more efficient than chaining multiple if statements because Nginx evaluates the map once per request:

nginx.conf
# Block AI crawlers in Nginx.
# Note: this map block must be placed at the http level of your config.
map $http_user_agent $is_ai_bot {
    default 0;
    ~*GPTBot 1;
    ~*ChatGPT-User 1;
    ~*OAI-SearchBot 1;
    ~*ClaudeBot 1;
    ~*Bytespider 1;
    ~*CCBot 1;
    ~*Google-Extended 1;
    ~*Amazonbot 1;
    ~*FacebookBot 1;
    ~*PerplexityBot 1;
    ~*Cohere-ai 1;
    ~*Diffbot 1;
    ~*AppleBot-Extended 1;
}

server {
    # ... your existing config ...

    if ($is_ai_bot) {
        return 403;
    }
}

The ~* prefix makes each pattern case-insensitive. After adding these rules, check the configuration with sudo nginx -t, then reload with sudo nginx -s reload. The advantage of server-level blocking is that the bot receives a 403 immediately — no content is ever transferred.

Method 3: Cloudflare WAF Rules

If your site is behind Cloudflare, you have access to built-in AI bot blocking without touching server configs. Cloudflare added a dedicated toggle in 2024 specifically for AI crawlers.

One-Click Blocking

  1. Go to your Cloudflare dashboard and select your domain.
  2. Navigate to Security → Bots.
  3. Enable “Block AI Scrapers and Crawlers.” This single toggle blocks all known AI training bots, and Cloudflare updates the list automatically as new crawlers appear.

Custom WAF Rules

For more granular control, create a custom WAF rule. Go to Security → WAF → Custom rules and create a rule with this expression:

Cloudflare WAF Expression
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Google-Extended") or
(http.user_agent contains "PerplexityBot") or
(http.user_agent contains "FacebookBot") or
(http.user_agent contains "Amazonbot") or
(http.user_agent contains "Diffbot") or
(http.user_agent contains "Cohere-ai")

Set the action to Block. This gives you the flexibility to allow some AI bots while blocking others — just remove their line from the expression. The advantage of CDN-level blocking is performance: the request never reaches your origin server, saving bandwidth and compute resources.

Method 4: Meta Tags (noai, noimageai)

In addition to robots.txt and server rules, you can add HTML meta tags that signal to AI crawlers that your content should not be used for training. These tags work at the page level, giving you fine-grained control over which pages opt out.

The robots noai Tag

Add this meta tag to the <head> section of any page you want to protect from AI training:

HTML
<!-- Block AI training on text content -->
<meta name="robots" content="noai" />

<!-- Block AI training on images specifically -->
<meta name="robots" content="noimageai" />

<!-- Combine both directives -->
<meta name="robots" content="noai, noimageai" />

<!-- Target specific bots -->
<meta name="GPTBot" content="noindex, nofollow" />
<meta name="ClaudeBot" content="noindex, nofollow" />

The noai directive tells AI crawlers not to use the page's text content for training. The noimageai directive specifically protects images from being used in image generation training datasets. These directives originated with DeviantArt's AI opt-out initiative rather than with any search engine, and some other platforms and crawlers have since begun recognizing them.

You can also use bot-specific meta tags to target individual crawlers. Setting noindex, nofollow for a specific bot name prevents that bot from indexing the page or following its links.
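To confirm a deployed page actually carries the opt-out tags, you can parse its HTML. This sketch uses Python's standard-library html.parser on a sample document (the sample markup is invented for illustration):

```python
from html.parser import HTMLParser

class RobotsMetaScanner(HTMLParser):
    """Collect the content values of <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives.extend(
                d.strip() for d in a.get("content", "").split(","))

# Invented sample page carrying both opt-out directives.
sample = '<html><head><meta name="robots" content="noai, noimageai" /></head></html>'
scanner = RobotsMetaScanner()
scanner.feed(sample)
print(scanner.directives)  # ['noai', 'noimageai']
```

In practice you would fetch each page you care about and run its HTML through the scanner, flagging any page where "noai" is missing from the list.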

Adoption Status

The noai and noimageai directives are not yet a universal standard, and compliance varies widely between AI companies. Use meta tags as an additional layer alongside robots.txt and server-level rules, not as your only defense.

How to Verify Your Blocks Are Working

You've added robots.txt rules, server-level blocks, and meta tags. But how do you know they're actually working? Verification requires checking two things: that your rules are syntactically correct, and that blocked bots are actually receiving 403 responses.

Test Your robots.txt

Use the robots.txt report in Google Search Console (under Settings), which replaced the retired robots.txt Tester, to confirm your file fetches and parses correctly. You can also manually verify by visiting yoursite.com/robots.txt in a browser and confirming your rules appear correctly.

Check Server Access Logs

Your web server access logs record every request, including the user agent string. Search for AI crawler activity and check the response codes:

bash
# Search access logs for AI crawler activity
grep -iE "GPTBot|ClaudeBot|Bytespider|CCBot|PerplexityBot" \
  /var/log/nginx/access.log | tail -20

# Check specifically for 403 responses (successful blocks)
grep -iE "GPTBot|ClaudeBot|Bytespider" \
  /var/log/nginx/access.log | grep " 403 "

# Count AI bot requests over the last 24 hours
grep -iE "GPTBot|ClaudeBot|Bytespider|CCBot" \
  /var/log/nginx/access.log | wc -l

If blocked bots still appear with 200 status codes, your server-level rules are not working correctly. You should see 403 responses if .htaccess or Nginx blocking is active. If you only have robots.txt rules (no server-level blocking), compliant bots simply will not request your pages — they will not appear in logs at all.
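For a per-bot summary rather than raw grep output, a short script can tally AI bot requests by status code. This is a sketch that assumes the common combined log format; the sample log lines below are invented:

```python
import re
from collections import Counter

AI_BOTS = ["GPTBot", "ClaudeBot", "Bytespider", "CCBot", "PerplexityBot"]

# Invented sample lines in the combined log format. In practice you would
# read these from /var/log/nginx/access.log or your Apache equivalent.
log_lines = [
    '1.2.3.4 - - [05/Mar/2026:10:00:01 +0000] "GET /blog/ HTTP/1.1" 403 153 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '5.6.7.8 - - [05/Mar/2026:10:00:02 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '9.9.9.9 - - [05/Mar/2026:10:00:03 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (regular browser)"',
]

# In the combined format, the status code follows the quoted request line.
status_re = re.compile(r'" (\d{3}) ')

tally = Counter()
for line in log_lines:
    for bot in AI_BOTS:
        if bot.lower() in line.lower():
            m = status_re.search(line)
            if m:
                tally[(bot, m.group(1))] += 1

for (bot, status), count in sorted(tally.items()):
    print(f"{bot}: {count} request(s) with status {status}")
```

Any bot still showing status 200 after you enable server-level blocking is a sign your rules are not firing, or that the bot is using a user-agent string your patterns do not cover.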

For ongoing monitoring without manually digging through log files, Copper Analytics's crawler tracking automates this process. It shows you exactly which bots are visiting, how often, and which pages they target — all in a visual dashboard.

Keep in Mind

Not all AI companies respect robots.txt. Some crawlers use rotating user agent strings or disguise themselves as regular browsers. Monitoring with analytics is the only way to verify compliance and catch bots that slip through your rules.

The Trade-Off: Blocking vs Allowing

Before you block everything, consider whether a blanket block is the right strategy for your site. There are legitimate reasons to allow some AI crawler access:

  • AI search visibility: Tools like Perplexity, ChatGPT with browsing, and Google's AI Overviews cite sources. If your content is blocked, you will not appear in AI-generated answers — and your competitors will. For some businesses, AI visibility is becoming as important as traditional SEO.
  • Selective blocking: You might want to block training bots (GPTBot, ClaudeBot) while allowing retrieval bots (ChatGPT-User, PerplexityBot) that cite your content with a link back. This gives you exposure without contributing to model training.
  • The visibility tradeoff: Every blocked bot is potential traffic you will never see. As AI-powered search grows, being absent from AI results could mean declining organic reach — similar to how ignoring mobile SEO a decade ago cost sites significant traffic.
  • Content type matters: A news site with original reporting has stronger reasons to block than a SaaS company whose product pages benefit from broad exposure. Match your blocking strategy to your content's value proposition.

The most pragmatic approach is selective: block training-focused crawlers like GPTBot and Bytespider, but allow retrieval bots that drive referral traffic. Review your AI crawler bandwidth impact regularly and adjust as the landscape evolves.
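A selective policy like the one described above might look like this in robots.txt. Which bots you block is a judgment call for your site; anything not listed remains allowed by default:

```
# Block training-focused crawlers
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

# Retrieval bots that cite sources (e.g. ChatGPT-User, PerplexityBot)
# are deliberately not listed, so they remain allowed by default.
```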

Monitor Before You Block with Copper Analytics

The smartest approach to AI crawlers is not to block blindly — it is to understand what is happening first. Before you add a single rule to your robots.txt, find out which bots are actually visiting your site, how much bandwidth they consume, and which pages they target most.

Copper Analytics's crawler tracking dashboard identifies AI bots by user agent, tracks their request volume over time, and shows you exactly which pages they are accessing. You can monitor AI crawler activity alongside your regular human traffic — all in one place, without digging through raw server logs.

Once you have data, you can make informed decisions: block the high-volume training bots eating your bandwidth, allow the retrieval bots sending you referral traffic, and verify that your blocking rules are actually working. Data-driven blocking beats guesswork every time.

See pricing plans or start your free account to get crawler tracking set up in under two minutes.

Know Which Bots Are Crawling Your Site

Block what you can. Track what you can't. Copper Analytics shows you every AI crawler hitting your site — in real time.

Get Started Free