Sep 18, 2024 · 12 min read
AI Crawlers

AI Bot User Agent List: Every Major Crawler String You Need to Know in 2026

A definitive, bookmark-worthy reference of every AI crawler user agent string, who operates it, what it does, and whether it respects robots.txt.

Why AI Bot User Agents Matter for Every Website Owner

Every HTTP request that hits your server includes a user agent string — a short identifier that tells you what software sent the request. When an AI crawler visits your website, its user agent is the single most reliable way to identify it in your access logs. Without a comprehensive AI bot user agent list, you are flying blind.

The stakes have never been higher. AI companies are sending crawlers to scrape content at scale for model training, retrieval-augmented generation, and search indexing. Some of these bots are well-behaved and respect your robots.txt directives. Others crawl aggressively, ignore rate limits, and never identify themselves honestly.

Knowing the exact user agent string for each AI crawler lets you make informed decisions. You can selectively block bots you do not want, allow bots that drive traffic back to your site, and monitor bandwidth consumption by each crawler individually. This article gives you every string you need.

  • Identify AI crawlers in Nginx, Apache, Cloudflare, and CDN logs by their exact user agent
  • Decide which bots to block, rate-limit, or allow based on their operator and purpose
  • Monitor bandwidth and crawl frequency per bot to spot unexpected spikes
  • Build server-side rules (robots.txt, firewall, WAF) using verified user agent tokens

Complete AI Bot User Agent Reference Table

Below is the definitive AI bot user agent list for 2026, covering every major AI crawler actively operating on the public web. Each entry includes the exact user agent string (or token) as it appears in your server logs, the operating company, the bot's primary purpose, and whether it respects robots.txt directives.

User agent strings can appear in different forms. Some crawlers send a short token like <code>GPTBot</code> as part of a longer string, while others send a full identifier like <code>Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)</code>. When writing blocking rules, match on the unique token portion rather than the full string to catch all variants.
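That difference matters when you write a matching rule. A quick shell sketch of token-based matching, using invented sample strings for illustration:

```shell
# Two user agent strings as they might appear in logs (illustrative samples,
# not captured from a real server).
ua_full='Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)'
ua_short='GPTBot/1.1'

# Match on the unique token with a case-insensitive grep, so version bumps
# and wrapper text around the token do not break the rule.
for ua in "$ua_full" "$ua_short"; do
  if printf '%s\n' "$ua" | grep -qiE 'gptbot|claudebot'; then
    echo "AI crawler: $ua"
  else
    echo "other: $ua"
  fi
done
```

Both strings match on the token even though only one carries the full `Mozilla/5.0` wrapper, which is exactly why token matching is the safer default.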

This reference is organized by operator. Bookmark this page — we update it as new crawlers appear and existing ones change their user agent strings.

  • <strong>GPTBot/1.0</strong> — OpenAI — Crawls pages for ChatGPT training data and retrieval. Respects robots.txt. Token: <code>GPTBot</code>
  • <strong>ChatGPT-User</strong> — OpenAI — Fetches pages in real time when ChatGPT users browse the web. Respects robots.txt. Token: <code>ChatGPT-User</code>
  • <strong>OAI-SearchBot/1.0</strong> — OpenAI — Powers OpenAI's search feature results. Respects robots.txt. Token: <code>OAI-SearchBot</code>
  • <strong>ClaudeBot/1.0</strong> — Anthropic — Collects training data for Claude models. Respects robots.txt. Token: <code>ClaudeBot</code>
  • <strong>anthropic-ai</strong> — Anthropic — Older Anthropic crawler token, still occasionally seen. Respects robots.txt. Token: <code>anthropic-ai</code>
  • <strong>Bytespider</strong> — ByteDance — Aggressive crawler for TikTok and Douyin AI features. Partial robots.txt compliance. Token: <code>Bytespider</code>
  • <strong>Google-Extended</strong> — Google — robots.txt control token for Gemini AI training; the crawling itself is done by Googlebot, so this token does not appear as a request user agent. Respects robots.txt. Token: <code>Google-Extended</code>
  • <strong>PerplexityBot/1.0</strong> — Perplexity AI — Fetches pages for Perplexity's AI-powered search engine. Respects robots.txt. Token: <code>PerplexityBot</code>
  • <strong>CCBot/2.0</strong> — Common Crawl — Non-profit open web crawl used by many AI training pipelines. Respects robots.txt. Token: <code>CCBot</code>
  • <strong>Meta-ExternalAgent/1.0</strong> — Meta — Crawls content for Meta AI training and features. Respects robots.txt. Token: <code>Meta-ExternalAgent</code>
  • <strong>FacebookBot</strong> — Meta — Fetches pages for link previews and AI features on Facebook and Instagram. Respects robots.txt. Token: <code>FacebookBot</code>
  • <strong>amazonbot/0.1</strong> — Amazon — Crawls for Alexa answers and Amazon AI services. Respects robots.txt. Token: <code>amazonbot</code>
  • <strong>Applebot-Extended</strong> — Apple — robots.txt control token governing whether Applebot-crawled content is used for Apple Intelligence and Siri AI features; it does not crawl on its own. Respects robots.txt. Token: <code>Applebot-Extended</code>
  • <strong>Diffbot</strong> — Diffbot — Structured data extraction for enterprise AI applications. Respects robots.txt. Token: <code>Diffbot</code>
  • <strong>ImagesiftBot</strong> — Imagesift — Crawls images for AI training datasets. Partial robots.txt compliance. Token: <code>ImagesiftBot</code>
  • <strong>Timpibot/0.9</strong> — Timpi — Decentralized search engine crawler. Respects robots.txt. Token: <code>Timpibot</code>

Important

User agent strings are self-reported. A malicious bot can spoof any user agent, including Googlebot. Always cross-reference user agent identification with reverse DNS lookups and IP range verification for critical blocking decisions.
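A sketch of that cross-referencing flow. The <code>verify_suffix</code> helper is our own illustrative function, and the live DNS lookups are shown commented out because the real check depends on network calls:

```shell
# Forward-confirmed reverse DNS (sketch): resolve the client IP to a
# hostname, check the hostname belongs to the operator's domain, then
# resolve the hostname forward and confirm it maps back to the same IP.
verify_suffix() {
  # usage: verify_suffix hostname domain-suffix -> exit 0 if it matches
  case "$1" in
    *"$2") return 0 ;;
    *)     return 1 ;;
  esac
}

# Live version (network calls, illustrative):
#   ptr=$(dig +short -x "$ip")    # reverse lookup of the client IP
#   fwd=$(dig +short "$ptr")      # forward lookup of the PTR hostname
#   verify_suffix "$ptr" ".googlebot.com." && [ "$fwd" = "$ip" ]

# Offline demonstration of the suffix check with a plausible PTR name:
verify_suffix "crawl-66-249-66-1.googlebot.com." ".googlebot.com." \
  && echo "hostname suffix OK"
```

A spoofed bot can claim any user agent, but it cannot make your DNS resolver return the real operator's hostnames for its IP, which is what makes this check worth the extra lookups.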

How to Identify AI Bots in Your Server Logs

Having the AI bot user agent list is only useful if you know how to find these strings in your actual traffic data. The process varies depending on your web server, CDN, and hosting setup, but the fundamentals are the same: search your access logs for the user agent tokens listed above.

For Nginx and Apache servers, your access logs typically live in <code>/var/log/nginx/access.log</code> or <code>/var/log/apache2/access.log</code>. Each line contains the user agent string in the final quoted field. A simple grep command can surface every AI crawler visit instantly.

If you use Cloudflare, Vercel, Netlify, or another edge provider, check their analytics dashboards or log drain features. Many CDNs now offer bot categorization, but their AI bot detection often lags behind newly deployed crawlers by weeks or months.

  1. Open your server access log file or connect to your log aggregation service (ELK, Datadog, Splunk, etc.)
  2. Search for each AI bot token: <code>grep -i "GPTBot\|ClaudeBot\|Bytespider\|Google-Extended\|PerplexityBot\|CCBot\|Meta-ExternalAgent\|amazonbot\|Applebot-Extended\|Diffbot\|FacebookBot\|ImagesiftBot\|Timpibot" access.log</code>
  3. Count requests per bot: pipe the results through <code>awk</code> to extract the user agent field, then <code>sort | uniq -c | sort -rn</code>
  4. Calculate bandwidth per bot by summing the bytes-transferred column for each matching user agent
  5. Set up recurring monitoring — AI crawler activity changes weekly as operators adjust crawl rates and schedules
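Steps 2 through 4 above can be collapsed into a single awk pass. The sketch below runs against a tiny invented sample in combined log format (real logs live in <code>/var/log/nginx/access.log</code>) and tracks only a subset of the tokens:

```shell
# Invented three-line sample in combined log format, for demonstration only.
cat > /tmp/sample_access.log <<'EOF'
1.2.3.4 - - [18/Sep/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"
5.6.7.8 - - [18/Sep/2024:10:00:05 +0000] "GET /a HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
1.2.3.4 - - [18/Sep/2024:10:00:09 +0000] "GET /b HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"
EOF

# Requests and total bytes per AI bot token: find the field carrying the
# token, strip it down to the bare name, and sum the bytes-transferred
# column (field 10 in combined format).
awk '
  /GPTBot|ClaudeBot|Bytespider|PerplexityBot|CCBot/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /GPTBot|ClaudeBot|Bytespider|PerplexityBot|CCBot/) {
        gsub(/[^A-Za-z-]/, "", $i)   # reduce "GPTBot/1.1;" to "GPTBot"
        hits[$i]++; bytes[$i] += $10
        break
      }
  }
  END { for (b in hits) printf "%-12s %d requests  %d bytes\n", b, hits[b], bytes[b] }
' /tmp/sample_access.log
```

On the sample above this reports two GPTBot requests totaling 6144 bytes and one ClaudeBot request at 2048 bytes; point the same script at your real log path to get live numbers.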

Robots.txt Compliance: Which AI Crawlers Actually Listen

One of the most important questions in the AI crawler user agent landscape is whether a given bot respects your robots.txt file. The answer is not a simple yes or no — it varies dramatically by operator, and compliance can change over time.

Most major AI companies now honor robots.txt directives after significant public pressure in 2023 and 2024. OpenAI's GPTBot, Anthropic's ClaudeBot, Google-Extended, and Meta-ExternalAgent all officially respect <code>Disallow</code> rules. Amazon's amazonbot and Apple's Applebot-Extended also comply.

The problem cases are well known. ByteDance's Bytespider has a history of aggressive crawling with inconsistent robots.txt compliance. ImagesiftBot has been flagged by multiple webmasters for ignoring directives. And CCBot, while operated by the non-profit Common Crawl, powers downstream AI training pipelines that the original website owner may not have consented to.

Even among compliant bots, there is a critical distinction: robots.txt only blocks future crawls. It does not retroactively remove your content from an AI model's training data. If GPTBot already scraped your site before you added a Disallow rule, that data is already in OpenAI's pipeline.

Pro Tip

Use specific User-agent directives for each AI crawler rather than a blanket Disallow. For example, you might block Bytespider and CCBot while allowing OAI-SearchBot and ChatGPT-User, the OpenAI agents that can send referral traffic back to your site when users search or browse. Granular control maximizes both protection and traffic opportunity.

How to Block AI Bots by User Agent String

Once you have identified which AI crawlers you want to block using this AI bot user agent list, implementing the blocks requires configuration at one or more layers of your stack. The most common approach is a combination of robots.txt for compliant bots and server-level rules for everything else.

For robots.txt, add a <code>User-agent</code> directive for each bot token you want to block, followed by <code>Disallow: /</code>. This works for GPTBot, ClaudeBot, Google-Extended, PerplexityBot, Meta-ExternalAgent, amazonbot, and Applebot-Extended — all of which have confirmed robots.txt compliance.
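A minimal robots.txt sketch along those lines. Which bots you block versus allow is your call; this selection is purely illustrative:

```txt
# Block training crawlers outright
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow a search bot that can send referral traffic
User-agent: OAI-SearchBot
Disallow:
```

An empty <code>Disallow:</code> line permits everything for that agent, which is how you carve out exceptions while blocking the rest.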

For bots that do not reliably respect robots.txt, you need server-side enforcement. In Nginx, use an <code>if ($http_user_agent ~* "Bytespider|ImagesiftBot")</code> block that returns a 403 status. In Apache, use <code>RewriteCond %{HTTP_USER_AGENT}</code> directives. Cloudflare users can create WAF rules that match on the user agent field.
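Here is one way the Nginx rule could look. A <code>map</code> block is slightly more idiomatic than a bare <code>if</code> chain for multi-token matching; the server block details are placeholders:

```nginx
# Flag non-compliant AI crawlers by user agent token (case-insensitive).
map $http_user_agent $block_ai_bot {
    default                        0;
    "~*(Bytespider|ImagesiftBot)"  1;
}

server {
    listen      80;
    server_name example.com;  # placeholder

    # Return 403 Forbidden to any flagged crawler before serving content.
    if ($block_ai_bot) {
        return 403;
    }
}
```

The <code>map</code> is evaluated once per request and keeps the token list in a single place, so adding a new bot later is a one-line change.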

  • <strong>robots.txt</strong> — Best for compliant bots. Zero server overhead. No guarantee of enforcement.
  • <strong>Nginx/Apache rules</strong> — Server-side blocking. Works regardless of bot compliance. Adds minimal processing overhead.
  • <strong>Cloudflare WAF</strong> — Edge-level blocking before requests reach your origin. Supports complex matching rules and rate limiting.
  • <strong>IP range blocking</strong> — Block entire CIDR ranges published by AI companies. Most effective but requires regular updates as IP ranges change.
  • <strong>.htaccess rules</strong> — Available on shared hosting. Same approach as Apache rewrite rules but scoped to a directory.
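For the .htaccess route, a sketch of the rewrite-based block (requires mod_rewrite; the token list is illustrative):

```apache
RewriteEngine On
# Return 403 Forbidden for the listed AI bot tokens, matched case-insensitively
RewriteCond %{HTTP_USER_AGENT} (Bytespider|ImagesiftBot) [NC]
RewriteRule ^ - [F]
```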

Emerging AI Crawlers and User Agents to Watch in 2026

The AI bot user agent list is a moving target. New AI companies launch crawlers regularly, and existing operators update their user agent strings, add new bot variants, or deploy entirely new crawlers for different purposes. Staying current requires ongoing vigilance.

In 2025 and early 2026, several new crawlers have gained significant traction. Cohere's <code>cohere-ai</code> crawler is appearing in more logs as the company scales its enterprise AI platform. Mistral AI has deployed crawlers for its European-focused models. AI search startups like You.com (<code>YouBot</code>) and Brave (<code>BraveBot</code>) are expanding their crawl footprints.

The most concerning trend is the rise of unlabeled or disguised crawlers. Some AI companies send requests with generic browser user agents — <code>Mozilla/5.0</code> strings that look identical to a real Chrome user. These are nearly impossible to detect by user agent alone and require behavioral analysis, TLS fingerprinting, or IP reputation checks to identify.
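As a rough illustration of what behavioral analysis means in practice, the sketch below flags IPs that hammer a site under a browser-like user agent. The log sample, threshold, and format are all invented for the example:

```shell
# Invented sample: three rapid hits from one IP, one hit from another,
# all claiming to be ordinary browsers.
cat > /tmp/ua_sample.log <<'EOF'
9.9.9.9 - - [18/Sep/2024:10:00:00 +0000] "GET /p1 HTTP/1.1" 200 100 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"
9.9.9.9 - - [18/Sep/2024:10:00:01 +0000] "GET /p2 HTTP/1.1" 200 100 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"
9.9.9.9 - - [18/Sep/2024:10:00:02 +0000] "GET /p3 HTTP/1.1" 200 100 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"
8.8.4.4 - - [18/Sep/2024:10:05:00 +0000] "GET / HTTP/1.1" 200 100 "-" "Mozilla/5.0 (Macintosh) Safari/605.1"
EOF

# Count requests per IP among browser-looking user agents; anything over
# the threshold deserves TLS-fingerprint or IP-reputation follow-up.
threshold=3
awk '/Mozilla/ { print $1 }' /tmp/ua_sample.log |
  sort | uniq -c | sort -rn |
  awk -v t="$threshold" '$1 >= t { print "suspicious:", $2, "("$1" hits)" }'
```

This only surfaces candidates; confirming that a flagged IP is an undisclosed crawler still requires the reverse DNS, fingerprinting, or reputation checks mentioned above.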

  • <strong>cohere-ai</strong> — Cohere — Enterprise AI platform training data collection
  • <strong>YouBot</strong> — You.com — AI-powered search engine indexing
  • <strong>Kangaroo Bot</strong> — Various — Generic label for unidentified AI-like crawlers
  • <strong>PetalBot</strong> — Huawei — Search and AI features for Huawei devices
  • <strong>Seekr</strong> — Seekr — AI content evaluation and search

Stay Updated

The Dark Visitors project (darkvisitors.com) maintains a community-sourced database of AI crawler user agents. Cross-referencing their list with your own server logs is a good way to catch crawlers that are not yet widely documented.

Automate AI Bot Identification With Copper Analytics

Manually grepping logs and maintaining a spreadsheet of AI bot user agent strings works for a single site, but it does not scale. If you manage multiple domains, handle significant traffic, or need real-time visibility into AI crawler activity, you need an automated solution.

Copper Analytics identifies every AI bot user agent automatically. When a request hits your site, Copper matches the user agent against a continuously updated database of known AI crawlers — including every bot listed in this article and dozens more. You see which bots are visiting, how often, which pages they target, and how much bandwidth they consume, all without touching a log file.

The platform also detects behavioral patterns that suggest undisclosed AI crawling — bots using generic user agents but exhibiting machine-like access patterns. This catches the disguised crawlers that pure user-agent matching misses entirely.

Whether you want to monitor, block, or selectively allow AI crawlers, having accurate identification is the foundation. Copper gives you that foundation out of the box, updated in real time as the AI crawler landscape evolves.

What to Do Next

The right stack depends on how much visibility, workflow control, and reporting depth you need. If you want a simpler way to centralize site reporting and operational data, compare plans on the pricing page and start with a free Copper Analytics account.

You can also keep exploring related guides from the Copper Analytics blog to compare tools, setup patterns, and reporting workflows before making a decision.