Sep 17, 2024 · 10 min read
AI Crawlers

AI Crawler Robots.txt Guide: Block GPTBot, ClaudeBot, and More

A practical reference for using robots.txt to control AI crawler access — with specific directives for every major bot and the real-world limitations you need to know.

Robots.txt is your first line of defense against AI crawlers — but only if you write it correctly

Specific directives for GPTBot, ClaudeBot, Bytespider, and 6 more bots — plus what robots.txt cannot do

How Robots.txt Works for AI Crawlers

The robots.txt file has been the standard mechanism for telling web crawlers which pages they may and may not access since 1994. It sits at the root of your domain — <code>https://example.com/robots.txt</code> — and every well-behaved bot is expected to read it before crawling further. AI crawlers from OpenAI, Anthropic, Google, Meta, and others follow this same convention, at least in principle.

When an AI crawler like GPTBot arrives at your site, it first requests <code>/robots.txt</code> and looks for a matching <code>User-agent</code> directive. If it finds one with a <code>Disallow</code> rule covering the page it wants to fetch, it should back off. If no matching directive exists, the bot treats the entire site as open for crawling.
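For illustration, a minimal robots.txt with a single matching group might look like this (the <code>/drafts/</code> path is a placeholder):

```text
User-agent: GPTBot
Disallow: /drafts/
```

Here GPTBot should skip anything under <code>/drafts/</code> but may fetch everything else; any bot not named in the file sees no restriction at all.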

The critical word here is "should." Robots.txt is a voluntary protocol, not an access control mechanism. It contains no authentication, no encryption, and no enforcement. Reputable AI companies honor it because ignoring it would create legal and reputational risk — but not every crawler on the internet is reputable.

  • <strong>Voluntary compliance</strong> — robots.txt is an advisory standard, not a security boundary
  • <strong>Per-bot targeting</strong> — you can write separate rules for each AI crawler by its User-agent name
  • <strong>No partial access</strong> — you can allow or disallow paths, but you cannot restrict what the bot does with the content it fetches
  • <strong>Delayed effect</strong> — changes are not instant: most major AI crawlers cache robots.txt and re-check it roughly every 24-48 hours

Specific Robots.txt Directives for Every Major AI Crawler

Each AI company uses one or more named crawlers. If you want to block a specific bot, you need to know its exact User-agent string — a generic wildcard rule may not be enough, and it could also block search engine crawlers you want to keep. Below are the directives for the nine most common AI crawlers as of late 2024.

To block all of these bots from your entire site, add the following blocks to your <code>robots.txt</code> file. Each block targets one crawler and disallows access to all paths.

If you only want to restrict AI crawlers from certain sections — for example, protecting premium content while leaving your blog open — replace <code>Disallow: /</code> with the specific paths you want to protect, such as <code>Disallow: /premium/</code> or <code>Disallow: /members/</code>.
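As a sketch, site-wide blocks for three of the nine crawlers look like this (repeat the pattern for the rest):

```text
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /
```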

  • <strong>GPTBot</strong> (OpenAI) — <code>User-agent: GPTBot</code> — Used for training and browsing in ChatGPT
  • <strong>ClaudeBot</strong> (Anthropic) — <code>User-agent: ClaudeBot</code> — Used for training Claude models
  • <strong>Bytespider</strong> (ByteDance) — <code>User-agent: Bytespider</code> — Used for TikTok and Doubao AI training
  • <strong>Google-Extended</strong> (Google) — <code>User-agent: Google-Extended</code> — Used for Gemini training, separate from Googlebot search indexing
  • <strong>PerplexityBot</strong> (Perplexity AI) — <code>User-agent: PerplexityBot</code> — Used for real-time answer generation
  • <strong>CCBot</strong> (Common Crawl) — <code>User-agent: CCBot</code> — Open dataset used by many AI labs for training
  • <strong>Meta-ExternalAgent</strong> (Meta) — <code>User-agent: Meta-ExternalAgent</code> — Used for Meta AI training data
  • <strong>Amazonbot</strong> (Amazon) — <code>User-agent: Amazonbot</code> — Used for Alexa and Amazon AI features
  • <strong>Applebot-Extended</strong> (Apple) — <code>User-agent: Applebot-Extended</code> — Used for Apple Intelligence training, separate from standard Applebot

Case Sensitivity Matters

The robots.txt standard (RFC 9309) specifies case-insensitive matching for user-agent tokens, but not every crawler implements the spec faithfully, and "GPTBot" is not guaranteed to match as "gptbot" or "GPTBOT" everywhere. Always use the exact capitalization published in each company's documentation so your rules match even under a strict parser.

A Complete AI Crawler Robots.txt Template

Below is a ready-to-use robots.txt template that blocks all nine major AI crawlers while keeping your site fully accessible to search engines. Copy it into your <code>robots.txt</code> file and adjust the paths if you want to allow partial access.

This template uses one <code>User-agent</code> / <code>Disallow</code> pair per bot. Some webmasters try to combine multiple bots under a single wildcard rule, but this is risky — a broad wildcard like <code>User-agent: *</code> with <code>Disallow: /</code> will also block Googlebot, Bingbot, and every other crawler.

Keep your search engine rules separate from your AI crawler rules. The safest approach is explicit: one block per bot, with clear <code>Allow</code> and <code>Disallow</code> lines for each.
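Putting that together, a full template might look like this (replace the sitemap URL with your own):

```text
# Block major AI crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Search engines remain fully allowed
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

Sitemap: https://example.com/sitemap.xml
```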

  1. Open your existing <code>robots.txt</code> file (usually at <code>public/robots.txt</code> in a Next.js project or the web root of your server)
  2. Add a <code>User-agent</code> block for each AI crawler you want to restrict, followed by <code>Disallow: /</code> to block the entire site
  3. Keep your existing <code>User-agent: Googlebot</code> and <code>User-agent: Bingbot</code> rules unchanged — these are your search engine crawlers
  4. Add a <code>Sitemap:</code> directive at the bottom pointing to your sitemap URL so search engines can still discover all pages
  5. Deploy the updated file and verify it is accessible at <code>https://yourdomain.com/robots.txt</code>
  6. Monitor your server logs or analytics dashboard over the next 48 hours to confirm AI crawlers have picked up the new rules
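Before deploying, you can sanity-check your rules locally with Python's built-in robots.txt parser. This is a sketch: the rules and URLs below are illustrative, and you can also point the parser at your live file with <code>set_url()</code> plus <code>read()</code>.

```python
from urllib.robotparser import RobotFileParser

# Paste your robots.txt content here (or fetch the live file instead)
RULES = """
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(RULES)

# GPTBot should be blocked everywhere; Googlebot should remain allowed
print(parser.can_fetch("GPTBot", "https://example.com/premium/article"))    # False
print(parser.can_fetch("Googlebot", "https://example.com/premium/article")) # True
```

Note that Python's parser is lenient about capitalization, so this check verifies your paths and group structure rather than exact User-agent casing.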


Common Mistakes with AI Crawler Robots.txt Rules

The robots.txt syntax is deceptively simple, and small errors can leave your site wide open to AI crawlers — or accidentally block search engines. Here are the most frequent mistakes webmasters make when writing AI bot robots.txt rules.

The first and most common error is relying on a blanket <code>User-agent: *</code> / <code>Disallow: /</code> rule to cover all AI bots. Under the standard, a compliant crawler that finds no group naming it falls back to the wildcard group. In practice, some AI crawlers have been reported to honor only rules that name them explicitly, and a wildcard <code>Disallow: /</code> also blocks Googlebot, Bingbot, and every other search engine crawler you want to keep. Always write explicit rules for each bot.

Another frequent issue is forgetting to update your robots.txt when new AI crawlers appear. The AI landscape changes fast — new bots launch every few months, and existing ones change their User-agent strings. A robots.txt file written a year or two ago may already be missing several bots that have launched since.

  • <strong>Wildcard-only rules</strong> — relying on <code>User-agent: *</code> instead of naming each AI bot explicitly
  • <strong>Typos in User-agent names</strong> — "ClaudBot" instead of "ClaudeBot," or "GPTbot" instead of "GPTBot"
  • <strong>Using Crawl-delay</strong> — this directive is not part of the official robots.txt standard and is ignored by most AI crawlers
  • <strong>Prefix-matching surprises</strong> — rules are prefix matches, so <code>Disallow: /premium</code> also blocks <code>/premium-offer</code>, while <code>Disallow: /premium/</code> does not block the bare <code>/premium</code> page
  • <strong>Stale rules</strong> — not reviewing and updating your robots.txt as new AI crawlers emerge
  • <strong>Blocking your own tools</strong> — overly broad rules that accidentally block your site monitoring or analytics bots

Audit Regularly

Set a quarterly reminder to review your robots.txt file. Check for new AI crawler User-agent strings, verify your existing rules are still correct, and confirm that search engine crawlers are not accidentally blocked.

What Robots.txt Cannot Do: Real Limitations

Understanding what robots.txt cannot do is just as important as knowing how to write the rules. Too many site owners treat robots.txt as a security mechanism and are surprised when content still appears in AI training datasets or generated answers.

Robots.txt does not prevent access — it requests restraint. Any crawler can technically fetch any URL on your site regardless of what your robots.txt says. There is no authentication, no IP blocking, and no technical enforcement built into the protocol. A bot that ignores your robots.txt will face no technical barrier.

Robots.txt also cannot retroactively remove content that has already been crawled. If GPTBot crawled your site before you added a <code>Disallow</code> rule, that content may already be in OpenAI's training data. The new rule only prevents future crawling — it does not delete previously collected data. Most AI companies offer separate opt-out forms for removing content from existing training sets.

Finally, robots.txt provides no visibility into compliance. You publish the rules, but you have no built-in way to know whether bots are following them. This is where server log analysis and real-time crawler monitoring become essential — without them, your robots.txt is a policy you cannot verify.

Beyond Robots.txt: Additional AI Crawler Defenses

Because robots.txt relies on voluntary compliance, a robust AI crawler strategy uses multiple layers of defense. These complementary approaches work alongside your robots.txt rules to give you stronger control over how AI bots interact with your content.

HTTP response headers offer a more granular mechanism. The <code>X-Robots-Tag</code> header can be set per page or per content type, and emerging (not yet standardized) directives like <code>noai</code> and <code>noimageai</code> are intended to signal that the content should not be used for training. Unlike robots.txt, these headers travel with the response itself, so the bot receives the instruction at the moment it fetches the page.
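As a sketch of how this might look in an Nginx server block (the <code>/premium/</code> path is a placeholder, and honoring <code>noai</code>/<code>noimageai</code> remains voluntary on the crawler's side):

```nginx
# Attach AI opt-out directives to every response under a protected section
location /premium/ {
    add_header X-Robots-Tag "noai, noimageai" always;
}
```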

Rate limiting is another effective tool. Most AI crawlers send requests at a much higher rate than human visitors. By configuring your web server or CDN to throttle requests from known AI crawler IP ranges, you can reduce the volume of content they collect even if they ignore your robots.txt. Cloudflare, AWS WAF, and Nginx all support rate limiting rules that can target specific User-agent patterns.
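For example, an Nginx rate-limiting sketch that throttles requests whose User-agent matches known AI crawlers (the bot names and the 1 request/second rate are illustrative; this goes inside the <code>http {}</code> context):

```nginx
# Key the limit on client IP only when the User-agent looks like an AI crawler.
# An empty key means the request is not rate-limited at all.
map $http_user_agent $ai_crawler_key {
    default                                "";
    ~*(GPTBot|ClaudeBot|Bytespider|CCBot)  $binary_remote_addr;
}

limit_req_zone $ai_crawler_key zone=ai_bots:10m rate=1r/s;

server {
    location / {
        limit_req zone=ai_bots burst=5 nodelay;
        # ... your usual root / proxy_pass configuration ...
    }
}
```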

For sites that need the strongest protection, consider requiring JavaScript rendering for content access. Many AI crawlers do not execute JavaScript — they fetch raw HTML only. If your content is rendered client-side, simple fetch-based crawlers will see an empty page. This approach has trade-offs for SEO and accessibility, so it should be used selectively.

  • <strong>X-Robots-Tag headers</strong> — per-response directives like <code>noai</code> and <code>noimageai</code>
  • <strong>Rate limiting</strong> — throttle high-frequency requests from AI crawler User-agents or IP ranges
  • <strong>JavaScript rendering</strong> — serve content that requires client-side execution to display
  • <strong>IP blocklists</strong> — block known AI crawler IP ranges at the firewall or CDN level
  • <strong>TDM (Text and Data Mining) headers</strong> — the emerging <code>TDM-Reservation</code> protocol signals machine-readable licensing terms

The Layered Approach Wins

No single technique stops all AI crawlers. The most effective strategy combines robots.txt, HTTP headers, rate limiting, and real-time monitoring. Each layer catches bots that slip past the others.

Monitoring Whether AI Crawlers Respect Your Robots.txt Rules

Writing a robots.txt file is only half the job. The other half is verifying that AI crawlers actually follow it. Without monitoring, you are publishing rules into a void — you have no idea which bots are visiting, how often they come, or whether they respect your directives.

Server log analysis is the most direct approach. Every request to your site is recorded with the User-agent string, IP address, timestamp, and requested path. By filtering logs for known AI crawler User-agents — GPTBot, ClaudeBot, Bytespider, and others — you can see exactly what they are fetching and whether they accessed paths you disallowed.
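A minimal sketch of that filtering in Python, assuming combined-format access log lines; the bot list, disallowed prefixes, and sample log entries are all assumptions to adapt to your own setup:

```python
import re
from collections import Counter

# User-agent substrings for the AI crawlers discussed above
AI_BOTS = ["GPTBot", "ClaudeBot", "Bytespider", "PerplexityBot", "CCBot"]

# Paths your robots.txt disallows for these bots (hypothetical examples)
DISALLOWED_PREFIXES = ["/premium/", "/members/"]

# Combined log format: ... "GET /path HTTP/1.1" ... "user agent"
LOG_RE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*".*"(?P<ua>[^"]*)"$')

def audit(log_lines):
    """Count AI-crawler hits and flag fetches of disallowed paths."""
    hits, violations = Counter(), []
    for line in log_lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        ua, path = m.group("ua"), m.group("path")
        bot = next((b for b in AI_BOTS if b in ua), None)
        if bot is None:
            continue
        hits[bot] += 1
        if any(path.startswith(p) for p in DISALLOWED_PREFIXES):
            violations.append((bot, path))
    return hits, violations

sample = [
    '1.2.3.4 - - [17/Sep/2024:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '5.6.7.8 - - [17/Sep/2024:10:00:05 +0000] "GET /premium/report HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
]
hits, violations = audit(sample)
print(hits)        # one hit each for GPTBot and ClaudeBot
print(violations)  # [('ClaudeBot', '/premium/report')]
```

In production you would stream real log files through <code>audit()</code> and cross-reference the violations list against your robots.txt rules.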

However, raw server logs are difficult to work with at scale. They require parsing, deduplication, and cross-referencing against your robots.txt rules. This is where purpose-built monitoring tools become valuable. Copper Analytics provides a dedicated AI crawler dashboard that identifies every AI bot visiting your site, tracks their request patterns over time, and flags any crawler that accesses disallowed paths.

With Copper Analytics, you can set up alerts that notify you when a new AI crawler appears for the first time, when an existing crawler suddenly increases its request volume, or when any bot fetches a URL that your robots.txt explicitly disallows. This turns your robots.txt from a static text file into a monitored, enforceable policy.

What to Do Next

The right stack depends on how much visibility, workflow control, and reporting depth you need. If you want a simpler way to centralize site reporting and operational data, compare plans on the pricing page and start with a free Copper Analytics account.

You can also keep exploring related guides from the Copper Analytics blog to compare tools, setup patterns, and reporting workflows before making a decision.