Aug 27, 2024 · 11 min read
AI Crawlers

AI Crawler Rate Limiting: Control Bot Traffic Without Blocking

Rate limiting gives you a smarter alternative to outright blocking. Throttle GPTBot, Bytespider, ClaudeBot, and other AI crawlers to protect your server while keeping the door open for AI-driven discovery.


Why Rate Limit AI Crawlers Instead of Blocking Them

The instinct when AI crawlers hammer your server is to block them outright. Add a Disallow in robots.txt, drop the user-agent at the firewall, and move on. But blocking is a binary decision with real trade-offs that many site owners overlook.

AI crawlers that index your content feed it into language models. When users ask those models questions related to your domain, your content may influence — or even be cited in — the response. This emerging channel, called Generative Engine Optimization (GEO), is becoming a meaningful traffic source for SaaS companies, publishers, and documentation sites.

Rate limiting gives you a middle path. You allow AI crawlers to access your content, but you control the pace. Instead of thousands of requests per minute overwhelming your origin server, you cap them at a sustainable rate — say, 2 requests per second. The crawler still gets your content, your server stays healthy, and you preserve your GEO presence.
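The throttling idea itself is simple. As an illustrative sketch (not tied to any particular server), a token bucket that refills at 2 tokens per second implements exactly this "cap the pace" behavior:

```python
import time

class TokenBucket:
    """Allow `rate` requests/second, with short bursts up to `burst` extra."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.capacity = burst + 1        # steady rate plus burst headroom
        self.tokens = float(self.capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, never exceeding capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True                  # serve the request
        return False                     # server would answer 429 Too Many Requests

bucket = TokenBucket(rate=2, burst=5)
# Ten requests arriving at effectively the same instant:
results = [bucket.allow() for _ in range(10)]
```

With these numbers, the first six requests (one at-rate slot plus five burst slots) pass and the rest are refused until the bucket refills — the crawler is slowed, not shut out.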

The practical difference is significant. A site that blocks GPTBot entirely disappears from ChatGPT responses. A site that rate limits GPTBot to 1 request per second still gets fully indexed — it just takes the crawler longer to finish. Your content still enters the training pipeline and the real-time retrieval index.

Blocking vs. Rate Limiting

Blocking removes you from AI models entirely. Rate limiting keeps you visible while protecting your infrastructure. For most sites, rate limiting is the smarter default.

Which AI Crawlers Need the Tightest Rate Limits

Not all AI crawlers behave the same way. Some are polite and self-throttle. Others will saturate your connection if you let them. Knowing which bots are aggressive helps you set proportional limits instead of applying a single blanket rule.

Bytespider, operated by ByteDance for TikTok and Doubao, is consistently the most aggressive AI crawler. It often ignores Crawl-delay directives in robots.txt and can send 10 or more requests per second to a single site. Many webmasters report Bytespider consuming more bandwidth than all other AI crawlers combined.

GPTBot from OpenAI is moderately aggressive. It generally respects robots.txt but does not honor Crawl-delay. Without rate limiting, GPTBot can send 3-5 requests per second during active crawl sessions, which is manageable for large sites but problematic for smaller ones on shared hosting.

ClaudeBot from Anthropic is generally well-behaved. It self-throttles, respects robots.txt, and tends to crawl at 1-2 requests per second or slower. Most sites do not need to rate limit ClaudeBot specifically, though including it in your rules as a safety net is reasonable.

Crawler              Company      Aggressiveness   Recommended Limit
Bytespider           ByteDance    Very High        1-2 req/sec
GPTBot               OpenAI       Moderate-High    2-3 req/sec
PerplexityBot        Perplexity   Moderate         2-3 req/sec
Meta-ExternalAgent   Meta         Variable         2 req/sec
ClaudeBot            Anthropic    Low              3-5 req/sec
Googlebot-Extended   Google       Low-Moderate     3-5 req/sec

AI Crawler Rate Limiting with Nginx

Nginx is the most popular reverse proxy for rate limiting AI crawlers because its limit_req_zone module is fast, lightweight, and does not require additional software. The key is mapping the user-agent string to a rate limit zone so that AI bots get throttled while human visitors are unaffected.

nginx.conf
# Map AI crawler user-agents to a rate limit key
map $http_user_agent $ai_bot_limit_key {
    default           "";
    "~*bytespider"    $binary_remote_addr;
    "~*gptbot"        $binary_remote_addr;
    "~*claudebot"     $binary_remote_addr;
    "~*perplexitybot" $binary_remote_addr;
    "~*meta-externalagent" $binary_remote_addr;
}

# Create rate limit zone: 2 requests/second for AI crawlers
limit_req_zone $ai_bot_limit_key zone=ai_bots:10m rate=2r/s;

server {
    listen 443 ssl;
    server_name yoursite.com;

    # Apply rate limit with burst tolerance
    limit_req zone=ai_bots burst=5 nodelay;
    limit_req_status 429;

    # ... rest of your server config
}

The configuration uses a map block to detect AI crawler user-agent strings and assign them to a named variable. A limit_req_zone then creates a shared memory zone that tracks request rates per crawler. When a bot exceeds the defined rate, Nginx returns a 429 Too Many Requests response.

This approach is efficient because Nginx evaluates the map at request time with minimal overhead. The 10m shared memory zone can track roughly 160,000 unique keys, which is far more than you will ever need for bot rate limiting. Adjust the rate value to match your server capacity — start conservative and loosen as needed.

Implementation Steps

  1. Add a map block in the http context to detect AI crawler user-agents and assign a rate limit key
  2. Define a limit_req_zone with a shared memory zone (10m) and your desired rate (e.g., 2r/s)
  3. Apply limit_req inside the server or location block with burst and nodelay parameters
  4. Set a custom 429 error page so bots receive a clear rate limit response
  5. Test with curl using a spoofed user-agent: curl -A "GPTBot/1.0" -I https://yoursite.com
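The custom 429 response from step 4 can be served from a named location; a minimal sketch (the message and Retry-After value are illustrative — match Retry-After to your configured rate):

```nginx
# Illustrative 429 handler for rate-limited bots
error_page 429 = @ratelimited;

location @ratelimited {
    add_header Retry-After 1 always;
    return 429 "Rate limited. Please slow down.\n";
}
```

Sending a Retry-After header is a polite hint; well-behaved crawlers use it to back off instead of retrying immediately.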

Burst Parameter

Set burst=5 nodelay to allow short bursts above the rate limit before throttling kicks in. This prevents false positives when a crawler legitimately requests a page and its linked resources in quick succession.
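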


Rate Limiting AI Crawlers on Apache, Cloudflare, and HAProxy

Not every infrastructure runs Nginx. Here are equivalent rate limiting configurations for the three other major platforms where AI crawler throttling is commonly needed.

.htaccess or httpd.conf
# Apache: Throttle AI crawlers with mod_ratelimit + mod_setenvif
SetEnvIfNoCase User-Agent "bytespider" ai_crawler
SetEnvIfNoCase User-Agent "gptbot" ai_crawler
SetEnvIfNoCase User-Agent "claudebot" ai_crawler
SetEnvIfNoCase User-Agent "perplexitybot" ai_crawler

<If "reqenv('ai_crawler') == '1'">
    # Limit bandwidth to 50KB/s for AI crawlers
    SetOutputFilter RATE_LIMIT
    SetEnv rate-limit 50
</If>

Apache uses mod_ratelimit combined with mod_setenvif to detect user-agent strings and apply per-connection bandwidth throttling. While Apache does not have a built-in request rate limiter as elegant as Nginx's, combining mod_ratelimit with mod_evasive gives you both bandwidth throttling and request rate protection.
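The mod_evasive side of that pairing is configured separately. A typical starting point might look like the following — the thresholds here are illustrative and should be tuned against your real traffic:

```apache
# mod_evasive: request-rate protection to complement mod_ratelimit
<IfModule mod_evasive24.c>
    DOSHashTableSize    3097
    DOSPageCount        10     # max requests for the same URI per page interval
    DOSPageInterval     1      # page interval in seconds
    DOSSiteCount        50     # max requests site-wide per site interval
    DOSSiteInterval     1
    DOSBlockingPeriod   10     # seconds an offending IP stays blocked
</IfModule>
```

Note that mod_evasive keys on IP address rather than user-agent, so it acts as a backstop for any overly aggressive client, not just known AI crawlers.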

haproxy.cfg
# HAProxy: Rate limit AI crawlers using stick-tables
frontend http_front
    bind *:443 ssl crt /etc/haproxy/certs/

    # Identify AI crawlers
    acl is_ai_bot hdr_sub(User-Agent) -i bytespider gptbot claudebot perplexitybot

    # Track request rate in a stick-table
    stick-table type ip size 100k expire 60s store http_req_rate(10s)
    http-request track-sc0 src if is_ai_bot

    # Deny if AI bot exceeds 20 requests per 10 seconds
    http-request deny deny_status 429 if is_ai_bot { sc_http_req_rate(0) gt 20 }

Cloudflare WAF rate limiting rules are the easiest to configure if you already use Cloudflare as your CDN. You can create a custom rule that matches AI crawler user-agents and limits them to a specific number of requests per time window. The advantage is that rate limiting happens at the edge before traffic reaches your origin server.
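The matching expression for such a Cloudflare rule could look like the sketch below; the request rate, time window, and action are set in the rule's own fields rather than in the expression (the `lower()` transformation is used here to make the match case-insensitive):

```
(lower(http.user_agent) contains "gptbot") or
(lower(http.user_agent) contains "bytespider") or
(lower(http.user_agent) contains "claudebot") or
(lower(http.user_agent) contains "perplexitybot")
```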

HAProxy uses stick-tables to track request rates per source IP, with the ACL restricting tracking and enforcement to requests carrying AI crawler user-agents. This is particularly powerful in load-balanced environments where you need consistent rate limiting across multiple backend servers. The stick-table approach also gives you real-time visibility into which bots are hitting their limits.

Choosing the Right AI Crawler Rate Limits for Your Site

Set rate limits too tight and you effectively block crawlers, which defeats the point of choosing rate limiting over blocking. Set them too loose and crawlers consume server resources unchecked. The right number depends on your server capacity, content volume, and how much crawler traffic you actually receive.

Start by measuring your current AI crawler request rates. If you do not know how many requests per second Bytespider or GPTBot sends to your site, you are guessing at limits rather than setting informed ones. Copper Analytics shows per-bot request rates in real time, making it straightforward to see your actual traffic patterns before configuring limits.

  • 2 req/s — recommended starting rate limit
  • burst=5 — burst allowance for short spikes
  • ~8 min — time to crawl 1,000 pages at 2 req/s

A general starting point for most sites: set AI crawlers to 2 requests per second with a burst allowance of 5. This means a bot can briefly spike to 7 requests in a second but must average 2 or fewer over time. For a 1,000-page site, a crawler at 2 requests per second will finish indexing in about 8 minutes — fast enough to keep your content fresh in AI models.
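The crawl-time arithmetic is worth keeping handy when tuning these numbers. A trivial helper (illustrative only, not part of any tool):

```python
def crawl_minutes(pages: int, rate_per_sec: float) -> float:
    """Minutes for a crawler to fetch `pages` at a sustained request rate."""
    return pages / rate_per_sec / 60

# 1,000 pages at 2 req/s -> ~8.3 minutes
print(round(crawl_minutes(1000, 2), 1))
```

Even at a conservative 1 req/s, a 1,000-page site is fully crawled in under 17 minutes, which is why throttling rarely hurts freshness in practice.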

For smaller sites on shared hosting or budget VPS instances, consider dropping to 1 request per second. For large sites on dedicated infrastructure, you can afford to be more generous at 5-10 requests per second. The key is matching the limit to what your server can comfortably handle during peak human traffic hours.

Do Not Guess

Setting rate limits without first measuring your actual AI crawler traffic is like setting a speed limit without knowing the road. Use Copper Analytics or server logs to measure current request rates before configuring limits.

Monitoring Whether Your AI Crawler Rate Limits Are Working

Deploying rate limits is only half the job. You need to verify they are actually working and adjust them over time as AI crawler behavior changes. A rate limit that was effective three months ago may be too loose today if a crawler has ramped up its request volume.

The first thing to check after deploying rate limits is your 429 response rate. If you are seeing zero 429 responses from AI crawlers, either your limits are too generous or the crawlers have not visited since you deployed. If you are seeing a high percentage of 429 responses — say more than 50% of AI crawler requests — your limits may be too tight and you are effectively blocking rather than throttling.

Monitor your server resource utilization alongside crawler traffic. The goal of rate limiting is to keep CPU, memory, and bandwidth within comfortable margins during crawler activity. If your server still struggles during heavy AI crawl sessions despite rate limits, tighten the limits further or consider per-bot limits instead of a single blanket rule.
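The 429-rate check can be scripted against your access logs. A sketch assuming combined-format log lines (the sample lines and bot list below are illustrative):

```python
import re
from collections import Counter

# Sample combined-format access log lines (illustrative)
LOG_LINES = [
    '203.0.113.5 - - [27/Aug/2024:10:00:01 +0000] "GET /docs HTTP/1.1" 200 5120 "-" "Mozilla/5.0; compatible; GPTBot/1.0"',
    '203.0.113.5 - - [27/Aug/2024:10:00:01 +0000] "GET /blog HTTP/1.1" 429 0 "-" "Mozilla/5.0; compatible; GPTBot/1.0"',
    '198.51.100.7 - - [27/Aug/2024:10:00:02 +0000] "GET / HTTP/1.1" 429 0 "-" "Bytespider"',
    '198.51.100.7 - - [27/Aug/2024:10:00:03 +0000] "GET /a HTTP/1.1" 200 1024 "-" "Bytespider"',
]

BOTS = ["gptbot", "bytespider", "claudebot", "perplexitybot"]
STATUS_RE = re.compile(r'" (\d{3}) ')  # status code right after the request string

totals, limited = Counter(), Counter()
for line in LOG_LINES:
    bot = next((b for b in BOTS if b in line.lower()), None)
    m = STATUS_RE.search(line)
    if bot and m:
        totals[bot] += 1
        if m.group(1) == "429":
            limited[bot] += 1

for bot in totals:
    pct = 100 * limited[bot] / totals[bot]
    print(f"{bot}: {limited[bot]}/{totals[bot]} rate-limited ({pct:.0f}%)")
```

Run against a real log file instead of the sample list, this gives you the per-bot 429 percentage to compare against the 5-20% target below.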

Monitoring Checklist

  • Check your 429 response rate — aim for 5-20% of AI crawler requests being rate-limited
  • Monitor server CPU and memory during active crawl sessions to ensure limits are effective
  • Compare before and after bandwidth consumption from AI crawlers using Copper Analytics
  • Review rate limit logs weekly for the first month, then monthly after tuning is stable
  • Watch for new AI crawler user-agents that bypass your existing rules

See Your AI Crawler Request Rates in Real Time

Copper Analytics tracks per-bot request rates so you can set informed rate limits. Free tier includes full AI crawler monitoring.

Edge Cases and Advanced Rate Limiting Strategies

Basic user-agent rate limiting covers most scenarios, but AI crawlers are evolving. Some advanced situations require more nuanced approaches to keep your rate limiting effective without creating new problems.

Some AI crawlers rotate IP addresses across large pools, making IP-based rate limiting less effective. If you notice a bot spreading requests across many IPs to circumvent per-IP limits, switch to user-agent-based rate limiting or combine both dimensions. Nginx and HAProxy both support composite rate limit keys that factor in both the user-agent and the source IP.
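With Nginx, one way to sidestep IP rotation is to key the limit on the matched bot identity rather than the source IP, so every IP a bot uses draws from the same budget. A sketch building on the earlier map approach:

```nginx
# Key by bot name, not IP: rotating IPs still share one rate budget
map $http_user_agent $ai_bot_name {
    default            "";
    "~*bytespider"     "bytespider";
    "~*gptbot"         "gptbot";
    "~*perplexitybot"  "perplexitybot";
}

# Empty key means non-bot traffic is never rate limited
limit_req_zone $ai_bot_name zone=ai_bots_global:1m rate=2r/s;
```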

Advanced Strategies

Tiered Rate Limits

Assign different rate limits per bot. Give ClaudeBot 5 req/s and Bytespider 1 req/s. Reward compliant crawlers and penalize aggressive ones.

Composite Rate Keys

Combine user-agent and IP in your rate limit key to prevent evasion through IP rotation while still targeting specific bots.

Behavioral Detection

Use request pattern analysis to catch AI crawlers that hide behind generic user-agent strings. Look for sequential page access and no JS execution.

Another edge case is AI crawlers that do not identify themselves. Some bots use generic user-agent strings like "Mozilla/5.0" to blend in with human traffic. Rate limiting these requires behavioral analysis — looking at request patterns, session duration, and crawl paths that distinguish bots from humans. This is where a dedicated tool like Copper Analytics helps, because it uses multiple signals beyond just the user-agent string.

Finally, consider implementing tiered rate limits. Give well-behaved crawlers like ClaudeBot a higher limit (5 req/s) and aggressive ones like Bytespider a lower limit (1 req/s). This rewards compliance and penalizes bad behavior, which is fairer and more effective than a one-size-fits-all approach.
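In Nginx terms, tiering means separate zones with different rates. A sketch extending the earlier configuration (rates here follow the recommendations above):

```nginx
# Tiered limits: aggressive bots get a tighter zone than polite ones
map $http_user_agent $aggressive_bot {
    default          "";
    "~*bytespider"   $binary_remote_addr;
}
map $http_user_agent $polite_bot {
    default          "";
    "~*claudebot"    $binary_remote_addr;
}

limit_req_zone $aggressive_bot zone=slow_ai:10m rate=1r/s;
limit_req_zone $polite_bot     zone=fast_ai:10m rate=5r/s;

server {
    # Both directives can coexist; each applies only when its key is non-empty
    limit_req zone=slow_ai burst=3 nodelay;
    limit_req zone=fast_ai burst=10 nodelay;
    limit_req_status 429;
}
```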

What to Do Next

The right stack depends on how much visibility, workflow control, and reporting depth you need. If you want a simpler way to centralize site reporting and operational data, compare plans on the pricing page and start with a free Copper Analytics account.

You can also keep exploring related guides from the Copper Analytics blog to compare tools, setup patterns, and reporting workflows before making a decision.