Jan 14, 2025 · 14 min read
AI Crawlers

AI Web Crawler List 2026: Every Known Bot, Who Runs It, and What It Does

The definitive annual directory of every AI web crawler active in 2026 — organized by company, with user agents, purpose, compliance behavior, and current status.

Every AI web crawler active in 2026 — identified, cataloged, and rated for compliance

Which companies run them, what they collect, and whether they actually respect your robots.txt

Why a Complete AI Crawler List Matters in 2026

AI web crawlers have become one of the largest sources of non-human traffic on the internet. By early 2026, studies show that AI bot requests account for 30 to 50 percent of total traffic on many content-heavy websites. Without a comprehensive reference list, site owners have no way to distinguish legitimate search engine bots from aggressive AI training crawlers draining their bandwidth.

The challenge is that the AI crawler landscape changes constantly. New bots appear every quarter, existing ones rebrand or change behavior, and some that were once well-behaved start ignoring robots.txt after corporate policy shifts. A yearly snapshot — maintained and verified — is the only reliable way to keep track.

This article is that snapshot for 2026. We catalog every known AI web crawler by the company that operates it, what it does, how aggressively it crawls, whether it respects your directives, and when it was first observed in the wild. Bookmark this page and check back — it is the reference list that site owners, SEOs, and security teams search for every year.

  • AI crawlers now generate 30-50% of all bot traffic on content-heavy sites
  • New crawlers appear quarterly, making static blocklists obsolete within months
  • Compliance behavior varies widely — some bots respect robots.txt, others ignore it entirely
  • A maintained annual reference list is the fastest way to identify unknown user agents in your logs

OpenAI Crawlers: GPTBot, ChatGPT-User, and OAI-SearchBot

OpenAI operates three distinct crawlers in 2026, each serving a different purpose. GPTBot is the primary training data crawler — it systematically indexes web content to feed OpenAI's large language models. ChatGPT-User is a real-time fetcher that retrieves pages when a ChatGPT user asks for current information via browsing mode. OAI-SearchBot is the newest addition, powering OpenAI's search product that competes directly with Google.

GPTBot, first observed in August 2023, remains one of the most discussed AI crawlers on the web. OpenAI provides official documentation for blocking it via robots.txt, and in testing GPTBot does respect Disallow directives, though it can crawl aggressively when not rate-limited. Its user agent string is <code>GPTBot/1.0</code> with a link to OpenAI's bot documentation page.

ChatGPT-User appeared in late 2023 and behaves differently from GPTBot. It makes targeted, on-demand requests rather than systematic crawls. It respects robots.txt but uses a separate user agent string (<code>ChatGPT-User</code>), meaning you need to block it independently if you want to prevent your content from appearing in ChatGPT browsing results. OAI-SearchBot, first seen in mid-2024, follows a similar compliance model.

  • GPTBot — Training data collection — User agent: GPTBot/1.0 — First seen: August 2023 — Respects robots.txt: Yes
  • ChatGPT-User — Real-time page fetching for ChatGPT browsing — User agent: ChatGPT-User — First seen: Late 2023 — Respects robots.txt: Yes
  • OAI-SearchBot — Search index for OpenAI search product — User agent: OAI-SearchBot/1.0 — First seen: Mid-2024 — Respects robots.txt: Yes

Pro Tip

To block OpenAI training but still appear in ChatGPT browsing and OpenAI search results, block only GPTBot in your robots.txt while allowing ChatGPT-User and OAI-SearchBot.
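The tip above translates into a short robots.txt. This is an illustrative sketch; the user agent tokens match OpenAI's documented names, but adapt the paths to your own site:

```text
# Block OpenAI model training, but stay visible in
# ChatGPT browsing and OpenAI search results
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /
```

Because each OpenAI crawler reads only the group addressed to its own token, the three groups act independently.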

Anthropic, Google, and Apple AI Crawlers

Anthropic operates two known crawlers: ClaudeBot and anthropic-ai. ClaudeBot is the primary crawler responsible for collecting training data for the Claude model family. It was first observed in early 2024 and uses the user agent string <code>ClaudeBot/1.0</code>. Anthropic publishes clear robots.txt guidance and ClaudeBot generally respects directives, though some webmasters have reported higher-than-expected crawl rates during training data collection periods.

The <code>anthropic-ai</code> user agent appears in logs less frequently and seems to be used for research-related fetching rather than bulk training data collection. Both crawlers can be blocked independently via robots.txt, giving site owners granular control over what Anthropic accesses.

Google's AI-specific crawler is Google-Extended, introduced in September 2023 specifically to let site owners opt out of Gemini training data collection without affecting regular Google Search indexing. Its user agent is <code>Google-Extended</code>, and blocking it in robots.txt prevents your content from training Gemini while keeping your pages in Google Search results. Apple entered the AI crawling space with Applebot-Extended, which collects data for Apple Intelligence features. It respects robots.txt and uses the <code>Applebot-Extended</code> user agent.

  • ClaudeBot — Training data for Claude models — User agent: ClaudeBot/1.0 — First seen: Early 2024 — Respects robots.txt: Yes
  • anthropic-ai — Research fetching — User agent: anthropic-ai — First seen: 2023 — Respects robots.txt: Yes
  • Google-Extended — Gemini training data — User agent: Google-Extended — First seen: September 2023 — Respects robots.txt: Yes
  • Applebot-Extended — Apple Intelligence training — User agent: Applebot-Extended — First seen: Mid-2024 — Respects robots.txt: Yes
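To opt out of Gemini and Apple Intelligence training without affecting regular search indexing, the robots.txt directives look like this (an illustrative sketch using the documented user agent tokens):

```text
# Opt out of AI training only; Googlebot and Applebot
# continue to crawl the site for search
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```

Note that Google-Extended and Applebot-Extended are control tokens, not separate fetchers: blocking them changes how already-crawled content may be used, while search crawling continues under the regular Googlebot and Applebot agents.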


ByteDance, Meta, and Amazon AI Crawlers

Bytespider is ByteDance's notorious AI crawler and consistently ranks as one of the most aggressive bots on the internet. It collects training data for ByteDance's large language models and recommendation systems. First observed in 2023, Bytespider uses the user agent string <code>Bytespider</code> and has been documented crawling at extremely high rates — sometimes tens of thousands of requests per hour on a single site. While ByteDance claims Bytespider respects robots.txt, many webmasters report that it ignores or partially obeys directives.

Meta operates two crawlers relevant to AI: Meta-ExternalAgent and FacebookBot. Meta-ExternalAgent is the dedicated AI training crawler, first seen in mid-2024, using the user agent <code>Meta-ExternalAgent/1.0</code>. It collects data for Meta's Llama models and AI features across Facebook, Instagram, and WhatsApp. FacebookBot is an older crawler primarily used for link previews and Open Graph rendering, but Meta has expanded its scope to support AI features. Both technically respect robots.txt, though Meta-ExternalAgent can be aggressive in crawl volume.

Amazon's Amazonbot has been active since 2022, originally serving Alexa's knowledge graph. In 2024, its role expanded to support Amazon's generative AI initiatives including Rufus, the shopping assistant. It uses the user agent <code>Amazonbot/0.1</code> and respects robots.txt directives. Compared to Bytespider or GPTBot, Amazonbot tends to be less aggressive in crawl frequency, typically focusing on product-related and factual content.

  • Bytespider — ByteDance LLM and recommendation training — User agent: Bytespider — First seen: 2023 — Respects robots.txt: Partially (disputed)
  • Meta-ExternalAgent — Llama model training — User agent: Meta-ExternalAgent/1.0 — First seen: Mid-2024 — Respects robots.txt: Yes
  • FacebookBot — Link previews and AI features — User agent: FacebookBot — First seen: 2019 (expanded scope 2024) — Respects robots.txt: Yes
  • Amazonbot — Alexa knowledge + Rufus AI — User agent: Amazonbot/0.1 — First seen: 2022 — Respects robots.txt: Yes

Watch Out

Bytespider is widely reported to ignore robots.txt Disallow rules despite ByteDance's official claims of compliance. If you need to block it, supplement robots.txt with user-agent-based server rules or WAF blocks.
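One way to enforce such a block at the server level is a user agent match in your web server config. This is a minimal nginx sketch (adapt the pattern for Apache, Caddy, or your WAF of choice):

```nginx
# nginx: refuse any request whose User-Agent contains "bytespider",
# regardless of what robots.txt says (case-insensitive match)
if ($http_user_agent ~* "bytespider") {
    return 403;
}
```

Unlike robots.txt, this does not depend on the crawler's cooperation: the request is rejected before any content is served.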

Independent and Specialized AI Crawlers: The Full Roster

Beyond the big tech companies, a growing ecosystem of independent AI crawlers operates across the web. PerplexityBot powers Perplexity AI's search engine, making real-time requests to answer user queries. First seen in 2024, it uses the user agent <code>PerplexityBot</code> and officially respects robots.txt, though it has faced controversy over aggressive crawl behavior and alleged directive violations reported by multiple publishers.

CCBot is the crawler operated by Common Crawl, a nonprofit that has been archiving the web since 2011. Its dataset is one of the most widely used training corpora for AI models — including early GPT models. CCBot uses the user agent <code>CCBot/2.0</code> and respects robots.txt. While Common Crawl itself is a nonprofit, blocking CCBot effectively prevents your content from entering dozens of downstream AI training pipelines.

Cohere operates <code>cohere-ai</code> for its enterprise-focused language models. Diffbot uses <code>Diffbot</code> to extract structured data and build its Knowledge Graph. YouBot (<code>YouBot</code>) supports You.com's AI search engine. PetalBot (<code>PetalBot</code>) is operated by Huawei for its Petal Search product. ImagesiftBot (<code>ImagesiftBot</code>) focuses on image content analysis, while Timpibot (<code>Timpibot</code>) supports the Timpi decentralized search engine.

Each of these crawlers has different crawl patterns, frequencies, and compliance behaviors. Some are well-documented with clear opt-out mechanisms, while others provide minimal transparency. The table below summarizes their key attributes.

  • PerplexityBot — AI search engine — User agent: PerplexityBot — First seen: 2024 — Respects robots.txt: Claimed (disputed)
  • CCBot — Common Crawl archive (used by many AI trainers) — User agent: CCBot/2.0 — First seen: 2011 — Respects robots.txt: Yes
  • cohere-ai — Enterprise LLM training — User agent: cohere-ai — First seen: 2023 — Respects robots.txt: Yes
  • Diffbot — Structured data extraction and Knowledge Graph — User agent: Diffbot — First seen: 2012 — Respects robots.txt: Yes
  • YouBot — You.com AI search — User agent: YouBot — First seen: 2023 — Respects robots.txt: Yes
  • PetalBot — Huawei Petal Search — User agent: PetalBot — First seen: 2020 — Respects robots.txt: Yes
  • ImagesiftBot — Image content analysis — User agent: ImagesiftBot — First seen: 2024 — Respects robots.txt: Yes
  • Timpibot — Decentralized search — User agent: Timpibot — First seen: 2024 — Respects robots.txt: Yes
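For the roster above, a single robots.txt group can address several bots at once, since the Robots Exclusion Protocol allows multiple User-agent lines per group. This sketch blocks all of them; keep or drop entries to match your own policy:

```text
# Block the independent AI crawlers listed above as one group
User-agent: PerplexityBot
User-agent: CCBot
User-agent: cohere-ai
User-agent: Diffbot
User-agent: YouBot
User-agent: PetalBot
User-agent: ImagesiftBot
User-agent: Timpibot
Disallow: /
```

Remember that blocking CCBot removes your content from the Common Crawl dataset, which also affects every downstream model trained on it.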

Compliance, Aggressiveness, and What to Watch For

Robots.txt compliance among AI crawlers falls into three tiers. The first tier — full compliance — includes crawlers that consistently honor both Disallow and Crawl-delay directives. Google-Extended, Applebot-Extended, CCBot, and Amazonbot fall into this category. They follow the rules, respect rate limits, and provide documented opt-out paths.

The second tier is nominal compliance: crawlers that officially respect robots.txt Disallow rules but ignore Crawl-delay directives or crawl at rates that overwhelm smaller servers. GPTBot, ClaudeBot, and Meta-ExternalAgent fall here. They will stop crawling pages you block, but they may hammer your server with requests for pages you have not blocked, consuming significant bandwidth and CPU.

The third tier is disputed compliance. Bytespider and PerplexityBot have both faced public accusations of ignoring robots.txt directives entirely. Multiple site operators have published evidence of these crawlers accessing blocked paths. If you rely on robots.txt alone to protect content from these crawlers, you may want to verify with server logs that your directives are actually being honored.

Aggressiveness varies dramatically. Bytespider can generate tens of thousands of requests per hour. GPTBot and ClaudeBot typically range from hundreds to low thousands per hour depending on site size. Amazonbot and Applebot-Extended tend to be the most polite, with modest request rates that rarely cause performance issues.

Key Insight

Robots.txt is a voluntary standard — no crawler is technically required to obey it. For critical content protection, always verify compliance by checking your server logs against your robots.txt rules. Copper Analytics automates this verification for you.
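The verification described above can also be done by hand with a few lines of Python. This sketch is a minimal example under stated assumptions: the log lines, the blocked-path policy, and the bot list are all hypothetical, and the parser handles only combined-log-format entries. It flags requests where a crawler's user agent appears on a path your robots.txt disallows:

```python
import re

# Hypothetical robots.txt policy: path prefixes each bot is disallowed from
BLOCKED = {
    "GPTBot": ["/"],             # fully blocked
    "Bytespider": ["/private/"], # partially blocked
}

# Minimal combined-log-format parser: extract request path and user agent
LOG_RE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def find_violations(log_lines):
    """Return (bot, path) pairs where a bot fetched a path it was told not to."""
    violations = []
    for line in log_lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        path, ua = m.group("path"), m.group("ua")
        for bot, prefixes in BLOCKED.items():
            if bot in ua and any(path.startswith(p) for p in prefixes):
                violations.append((bot, path))
    return violations

# Hypothetical access-log entries for demonstration
logs = [
    '1.2.3.4 - - [14/Jan/2026:10:00:00 +0000] "GET /private/report HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"',
    '1.2.3.5 - - [14/Jan/2026:10:00:01 +0000] "GET /blog/post HTTP/1.1" 200 4096 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
]
print(find_violations(logs))
```

Any pair this prints is a crawler that fetched a path your policy disallows, which is exactly the evidence you need before escalating from robots.txt to server-level blocks.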

Track Every AI Crawler on This List With Copper Analytics

Maintaining a blocklist manually is a losing game. New crawlers appear, user agent strings change, and compliance behavior shifts without notice. Copper Analytics solves this by automatically detecting and categorizing every AI crawler on this list — and new ones as they emerge — in real time across your entire site.

The Copper dashboard shows you exactly which AI crawlers are hitting your site, how many requests each one makes, which pages they target most, and whether they are respecting your robots.txt directives. You get a live feed of bot activity alongside historical trend data so you can see how AI crawler traffic has changed over weeks and months.

When a new crawler appears in your traffic that is not yet on any public list, Copper flags it immediately. You get alerts for unknown bot user agents, sudden traffic spikes from known crawlers, and robots.txt violations — all without parsing a single log file yourself.

If you manage multiple sites or work at an agency, Copper aggregates AI crawler data across all your properties in one view. You can export reports, set up automated weekly digests, and share crawler intelligence with your team. Every bot on this 2026 list is detected out of the box, and the detection library updates continuously as the crawler landscape evolves.

  • Automatic detection of every crawler on this list — no manual configuration needed
  • Real-time alerts for new unknown bots, traffic spikes, and robots.txt violations
  • Historical trend data to track how AI crawler activity changes over time
  • Multi-site aggregation for agencies and teams managing multiple properties

What to Do Next

The right stack depends on how much visibility, workflow control, and reporting depth you need. If you want a simpler way to centralize site reporting and operational data, compare plans on the pricing page and start with a free Copper Analytics account.

You can also keep exploring related guides from the Copper Analytics blog to compare tools, setup patterns, and reporting workflows before making a decision.