Mar 16, 2026 · 9 min read
Technical

How to Block AI Crawlers (GPTBot, ClaudeBot, and Others)

AI training bots are crawling your website right now. Here's how to control which ones get access to your content — and which ones get blocked at the door.

[Illustration: a server blocking AI crawler bots, with a digital shield]

Why You Might Want to Block AI Crawlers

AI companies are crawling the web at unprecedented scale. OpenAI's GPTBot, Anthropic's ClaudeBot, ByteDance's Bytespider, and dozens of others are scraping website content to train large language models. For site owners, this raises three immediate concerns: server load, uncompensated use of content, and inconsistent compliance with opt-out rules.

13+ known AI crawlers · 100s of requests/min (worst offenders) · Zero attribution given · 4 blocking methods

AI crawlers are aggressive. Unlike Googlebot, which is designed to crawl politely with rate limits, some AI bots hammer your server with hundreds of requests per minute. Bytespider in particular has earned a reputation for excessive crawling that can spike bandwidth costs and slow down your site for real visitors. If you're on a metered hosting plan, these bots are literally costing you money.

When an AI model trains on your content, your words become part of its output — without attribution, compensation, or consent. Publishers, journalists, and creators have legitimate reasons to block AI crawlers from accessing their work. Whether you view this as a copyright issue or a business decision, the right to opt out is yours.

What AI companies say

Most major AI companies now publish their crawler user agents and commit to respecting <strong>robots.txt disallow rules</strong>. OpenAI, Anthropic, Google, and others provide official opt-out mechanisms.

What actually happens

Not all crawlers respect robots.txt — some use <strong>rotating user agents</strong> or disguise themselves as regular browsers. The EU AI Act reinforces opt-out rights, but enforcement remains inconsistent.

Complete List of AI Crawler User Agents

Before you start blocking, you need to know exactly which bots to target. Here is a comprehensive list of known AI crawler user agents as of early 2026, organized by company:

| User Agent | Company | Purpose |
| --- | --- | --- |
| GPTBot | OpenAI | Model training data collection |
| ChatGPT-User | OpenAI | Real-time browsing for ChatGPT users |
| OAI-SearchBot | OpenAI | SearchGPT web search indexing |
| ClaudeBot | Anthropic | Model training data collection |
| Bytespider | ByteDance | Training data for TikTok/Doubao AI |
| CCBot | Common Crawl | Open dataset used by many AI projects |
| Google-Extended | Google | Gemini and AI Overviews training |
| Amazonbot | Amazon | Alexa AI and product recommendations |
| FacebookBot | Meta | Llama model training |
| PerplexityBot | Perplexity AI | AI search engine indexing |
| cohere-ai | Cohere | Enterprise AI model training |
| Diffbot | Diffbot | Knowledge graph and structured data extraction |
| Applebot-Extended | Apple | Apple Intelligence training |

Important Distinction

<strong>GPTBot</strong> collects training data. <strong>ChatGPT-User</strong> is the real-time browsing agent used when ChatGPT users click “Browse the web.” Similarly, <strong>Google-Extended</strong> is for AI training only — blocking it does <strong>NOT</strong> affect your regular Google Search ranking. These are separate crawlers with separate purposes.

Blocking Methods at a Glance

There are four primary ways to block AI crawlers from your website. Each operates at a different layer, and the strongest protection combines multiple methods:

robots.txt rules

The simplest opt-out. Add <code>Disallow: /</code> directives per bot — respected by compliant crawlers but not enforceable.

.htaccess / Nginx blocking

Server-level enforcement. Returns <code>403 Forbidden</code> before any content is served — bots receive nothing regardless of compliance.

Cloudflare WAF rules

CDN-level blocking. Requests never reach your origin server, saving bandwidth and compute. Cloudflare auto-updates the known bot list.

HTML meta tags (noai, noimageai)

Page-level opt-out signal. Fine-grained control over which pages are excluded from AI training — not yet universally recognized.

Best Practice

Use robots.txt as your first line of defense, add server-level or CDN blocking for enforcement, and include meta tags for page-level control. Layered protection catches crawlers that slip through any single method.

Method 1: robots.txt Rules

The simplest way to block AI bots from your website is through your robots.txt file. This file lives at the root of your domain (e.g., yoursite.com/robots.txt) and tells crawlers which areas they can and cannot access. Add the following rules to block all major AI crawlers:
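A full opt-out based on the user-agent table above might look like this — a sketch, so trim or extend the bot list to match your own policy:

```
# Block AI training and scraping bots
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```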

Tip

Paste these rules at the end of your existing robots.txt file. Keep your existing rules for Googlebot, Bingbot, and other search engine crawlers intact — those are separate and unaffected. If you only want to protect certain directories (e.g., your blog), replace <code>/</code> with a specific path like <code>/blog/</code>.

Each <code>User-agent</code> directive targets a specific bot by name, and <code>Disallow: /</code> blocks access to your entire site. The limitation of robots.txt is that it is a request, not enforcement. Well-behaved bots respect it, but nothing technically prevents a rogue crawler from ignoring it entirely. That is why server-level blocking matters.

Method 2: .htaccess and Nginx Server Rules

For stronger protection, block AI crawlers at the server level. This returns a <code>403 Forbidden</code> response before any content is served — the bot receives nothing at all, regardless of whether it respects robots.txt.

If you run Apache, add these rewrite rules to your <code>.htaccess</code> file:
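A sketch of such a rule set — the user-agent patterns are abridged, so extend them with any bots from the table above:

```apache
RewriteEngine On
# Match known AI crawler user agents, case-insensitively ([NC]).
# Every condition except the last carries [OR], so the chain reads as "any of these".
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|OAI-SearchBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (ClaudeBot|Bytespider|CCBot|Google-Extended) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Amazonbot|FacebookBot|PerplexityBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (cohere-ai|Diffbot|Applebot-Extended) [NC]
# Any match gets 403 Forbidden; [L] stops further rule processing.
RewriteRule .* - [F,L]
```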

The <code>[NC]</code> flag makes the match case-insensitive, and <code>[F,L]</code> returns a 403 Forbidden and stops processing further rules. The last condition in the chain omits the <code>[OR]</code> flag to close the rule.

For Nginx servers, use a <code>map</code> directive — declared in the <code>http</code> context of your configuration — with a check in your server block. This is more efficient than chaining multiple <code>if</code> statements because Nginx evaluates the map once per request:
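A sketch of that configuration — note that the <code>map</code> itself must live in the <code>http {}</code> context, while the 403 check goes inside your server block (bot list abridged; extend as needed):

```nginx
# In the http {} context: map the user agent to a flag, evaluated once per request
map $http_user_agent $block_ai_bot {
    default               0;
    ~*gptbot              1;
    ~*chatgpt-user        1;
    ~*oai-searchbot       1;
    ~*claudebot           1;
    ~*bytespider          1;
    ~*ccbot               1;
    ~*google-extended     1;
    ~*amazonbot           1;
    ~*facebookbot         1;
    ~*perplexitybot       1;
    ~*cohere-ai           1;
    ~*diffbot             1;
    ~*applebot-extended   1;
}

server {
    # ... your existing listen/server_name/location config ...
    if ($block_ai_bot) {
        return 403;
    }
}
```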

The <code>~*</code> prefix makes each pattern case-insensitive. After adding these rules, check the configuration with <code>sudo nginx -t</code>, then reload with <code>sudo nginx -s reload</code>. The advantage of server-level blocking is that the bot receives a 403 immediately — no content is ever transferred.


Method 3: Cloudflare WAF Rules

If your site is behind Cloudflare, you have access to built-in AI bot blocking without touching server configs. Cloudflare added a dedicated toggle in 2024 specifically for AI crawlers.

  1. Go to your Cloudflare dashboard and select your domain.
  2. Navigate to <strong>Security → Bots</strong>.
  3. Enable <strong>“Block AI Scrapers and Crawlers.”</strong> This single toggle blocks all known AI training bots, and Cloudflare updates the list automatically as new crawlers appear.

For more granular control, create a custom WAF rule. Go to <strong>Security → WAF → Custom rules</strong> and create a rule with an expression that matches AI crawler user agents.
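For the custom WAF rule, an expression along these lines matches the major AI crawlers — an abridged sketch, so add a clause per bot you want to block:

```
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Google-Extended") or
(http.user_agent contains "PerplexityBot")
```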

Set the action to <strong>Block</strong>. This gives you the flexibility to allow some AI bots while blocking others — just remove their clause from the expression. The advantage of CDN-level blocking is performance: the request never reaches your origin server, saving bandwidth and compute resources.

Method 4: Meta Tags (noai, noimageai)

In addition to robots.txt and server rules, you can add HTML meta tags that signal to AI crawlers that your content should not be used for training. These tags work at the page level, giving you fine-grained control over which pages opt out.

Add this meta tag to the <code>&lt;head&gt;</code> section of any page you want to protect from AI training:
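A minimal example using the <code>robots</code> meta name with both directives:

```html
<head>
  <!-- Opt this page out of AI text and image training -->
  <meta name="robots" content="noai, noimageai">
</head>
```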

The <code>noai</code> directive tells AI crawlers not to use the page's text content for training. The <code>noimageai</code> directive specifically protects images from being used in image generation training datasets. These directives began as a community convention rather than an official standard, and some AI companies have begun recognizing them.

You can also use bot-specific meta tags to target individual crawlers. Setting <code>noindex, nofollow</code> for a specific bot name prevents that bot from indexing the page or following its links.
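For instance, addressed to a single crawler by name (GPTBot here is an illustrative target — substitute any bot from the table above):

```html
<!-- Per-bot opt-out: only GPTBot is addressed by this tag -->
<meta name="GPTBot" content="noindex, nofollow">
```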

Adoption Status

The<code>noai</code>and<code>noimageai</code>directives are not yet a universal standard. Google and a few other companies recognize them, but compliance varies. Use meta tags as an additional layer alongside robots.txt and server-level rules — not as your only defense.

Setup Comparison: Which Method Is Right for You?

Each blocking method has different setup complexity, enforcement strength, and maintenance requirements. Choose based on your technical comfort level and how aggressively you need to block:

robots.txt

Edit a single text file. No server restart needed. Works on any hosting platform including shared hosting and static sites. Setup: ~2 minutes · Enforcement: advisory

.htaccess / Nginx

Edit server config files. Requires server access and a reload. Returns<code>403 Forbidden</code>before any content is served. Setup: ~10 minutes · Enforcement: hard block

Cloudflare WAF

Toggle a switch or create a custom rule in the Cloudflare dashboard. Blocks at the CDN edge — requests never reach your server. Setup: ~5 minutes · Enforcement: hard block

Meta tags

Add <code>&lt;meta&gt;</code> tags to individual pages. Per-page control over AI training opt-out signals. Setup: ~5 minutes · Enforcement: advisory

Verdict

For most sites,<strong>robots.txt plus Cloudflare WAF</strong>gives the best protection-to-effort ratio. If you run your own Apache or Nginx server and don't use Cloudflare, server-level rules are essential since robots.txt alone is not enforceable.

How to Verify Your Blocks Are Working

You've added robots.txt rules, server-level blocks, and meta tags. But how do you know they're actually working? Verification requires checking two things: that your rules are syntactically correct, and that blocked bots are actually receiving 403 responses.

Use the robots.txt report in Google Search Console to validate syntax. You can also verify manually by visiting yoursite.com/robots.txt in a browser and confirming your rules appear correctly.

Your web server access logs record every request, including the user agent string. Search for AI crawler activity and check the response codes:
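For example, with a combined-format access log you can tally the status codes served to known AI bots. The sample log lines and filename below are hypothetical stand-ins — in production, point the grep at your real log (e.g. /var/log/nginx/access.log):

```shell
# Hypothetical sample of combined-format access-log lines; in production,
# grep your real access log instead.
cat > sample_access.log <<'EOF'
20.171.206.1 - - [16/Mar/2026:10:01:00 +0000] "GET /blog HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"
20.171.206.2 - - [16/Mar/2026:10:02:00 +0000] "GET /about HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; claudebot@anthropic.com)"
66.249.66.1 - - [16/Mar/2026:10:03:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
EOF

# Tally response codes served to AI crawlers
# ($9 is the status field in the combined log format)
grep -iE 'GPTBot|ClaudeBot|Bytespider|CCBot|PerplexityBot' sample_access.log \
  | awk '{print $9}' | sort | uniq -c
```

A tally dominated by 403s confirms your server-level block is firing; 200s for bots you intended to block mean the rules need fixing.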

If blocked bots still appear with 200 status codes, your server-level rules are not working correctly. You should see 403 responses if .htaccess or Nginx blocking is active. If you only have robots.txt rules (no server-level blocking), compliant bots simply will not request your pages — they will not appear in logs at all.

For ongoing monitoring without manually digging through log files, Copper Analytics' crawler tracking automates this process. It shows you exactly which bots are visiting, how often, and which pages they target — all in a visual dashboard.

Keep in Mind

Not all AI companies respect robots.txt. Some crawlers use rotating user agent strings or disguise themselves as regular browsers. Monitoring with analytics is the only way to verify compliance and catch bots that slip through your rules.

The Trade-Off: Blocking vs Allowing

Before you block everything, consider whether a blanket block is the right strategy for your site. There are legitimate reasons to allow some AI crawler access:

AI search visibility

Tools like Perplexity, ChatGPT with browsing, and Google's AI Overviews cite sources. If your content is blocked, you will not appear in AI-generated answers — and your competitors will.

Selective blocking

Block training bots (GPTBot, ClaudeBot) while allowing retrieval bots (ChatGPT-User, PerplexityBot) that cite your content with a link back. Exposure without contributing to model training.

The visibility tradeoff

Every blocked bot is potential traffic you will never see. As AI-powered search grows, being absent from AI results could mean declining organic reach — similar to how ignoring mobile SEO cost sites significant traffic a decade ago.

Content type matters

A news site with original reporting has stronger reasons to block than a SaaS company whose product pages benefit from broad exposure. Match your blocking strategy to your content's value proposition.

Verdict

The most pragmatic approach is<strong>selective blocking</strong>: block training-focused crawlers like GPTBot and Bytespider, but allow retrieval bots that drive referral traffic. Review your <a href="/blog/ai-crawlers-website-impact">AI crawler bandwidth impact</a> regularly and adjust as the landscape evolves.

Monitor Before You Block with Copper Analytics

The smartest approach to AI crawlers is not to block blindly — it is to understand what is happening first. Before you add a single rule to your robots.txt, find out which bots are actually visiting your site, how much bandwidth they consume, and which pages they target most.

Copper Analytics' crawler tracking dashboard identifies AI bots by user agent, tracks their request volume over time, and shows you exactly which pages they are accessing. You can monitor AI crawler activity alongside your regular human traffic — all in one place, without digging through raw server logs.

Bot identification

See every AI crawler by name — GPTBot, ClaudeBot, Bytespider, and more — with request counts and page targets.

Block verification

Confirm your robots.txt and server rules are working. If a blocked bot still appears, you know your rules need fixing.

Bandwidth insights

Quantify how much traffic AI crawlers consume so you can make data-driven blocking decisions.

Free tier included

Get started without a credit card. Crawler tracking is available on all plans, including the permanent free tier.

Once you have data, you can make informed decisions: block the high-volume training bots eating your bandwidth, allow the retrieval bots sending you referral traffic, and verify that your blocking rules are actually working. Data-driven blocking beats guesswork every time.

Block with confidence

Use robots.txt for compliant crawlers, server-level rules or Cloudflare WAF for enforcement, and meta tags for page-level control. Layer all four methods for the strongest protection against unauthorized AI training.

Monitor with Copper Analytics

See <a href="/pricing">pricing plans</a> or <a href="/register">start your free account</a> to get crawler tracking set up in under two minutes. Know exactly which bots are hitting your site — and whether your blocks are working.

Know Which Bots Are Crawling Your Site

Block what you can. Track what you can't. Copper Analytics shows you every AI crawler hitting your site — in real time.

What to Do Next

The right stack depends on how much visibility, workflow control, and reporting depth you need. If you want a simpler way to centralize site reporting and operational data, compare plans on the pricing page and start with a free Copper Analytics account.

You can also keep exploring related guides from the Copper Analytics blog to compare tools, setup patterns, and reporting workflows before making a decision.