Feb 11, 2025 · 10 min read
AI Crawlers

New AI Crawlers to Block: 9 Bots Your Robots.txt Is Missing in 2026

The latest wave of AI crawlers is here, and most blocklists haven't caught up yet

At least 9 new AI crawlers have emerged since mid-2025 that your robots.txt probably doesn't cover yet

Stay ahead of emerging AI bots before they scrape your entire site without your knowledge

Why Your New AI Crawlers to Block List Is Already Outdated

If you updated your robots.txt to block GPTBot and ClaudeBot last year, you probably felt like you had the AI crawler problem handled. Unfortunately, the landscape has shifted dramatically. New AI crawlers worth blocking appear every few months, and most webmasters have no idea their sites are already being scraped by bots they have never heard of.

The AI arms race has moved beyond the big players. Smaller companies, research labs, and niche search engines are all deploying their own crawlers to build training datasets and power AI-driven search results. Each one uses a different user-agent string, and many do not respect robots.txt at all unless you explicitly name them.

By the time a new AI crawler makes it onto a popular blocklist, it may have already indexed your entire site. The gap between a crawler's first appearance and widespread awareness can be weeks or even months. That window is exactly what these bots exploit.

The 9 New AI Crawlers You Need to Block Right Now

Here are the recently discovered AI bots that have been showing up in server logs since mid-2025. Each one serves a different purpose, but they all have one thing in common: they are harvesting your content without explicit permission.

These emerging AI crawlers represent companies ranging from well-funded startups to established tech giants launching secondary crawling programs. Adding them to your blocklist today closes gaps that most sites still have wide open.

  • OAI-SearchBot — OpenAI's newest crawler, built specifically for SearchGPT. Unlike GPTBot, which feeds training data, OAI-SearchBot indexes content for real-time AI search results. Blocking GPTBot alone does not stop this one.
  • cohere-ai — Cohere's web crawler collects training data for their enterprise LLM products. It has been quietly active since late 2025 and ramps up crawl rates on content-heavy sites.
  • YouBot — You.com's crawler powers their AI search engine. It crawls aggressively to keep their search index fresh, often hitting pages multiple times per day.
  • PetalBot — Operated by Huawei's Aspiegel subsidiary, PetalBot feeds data into Huawei's Petal Search and AI services. It has a global footprint and crawls from IP ranges that are not always easy to identify.
  • Timpibot — Timpi's decentralized search engine crawler. As a newer entrant, it flies under the radar of most blocklists but has been steadily increasing its crawl volume.
  • ImagesiftBot — Specifically targets images for AI training datasets. If your site has original photography, illustrations, or infographics, this bot is likely already cataloging them.
  • Applebot-Extended — Apple's secondary crawler that feeds Apple Intelligence features. Blocking the standard Applebot does not affect this one, since it uses a separate user-agent string.
  • anthropic-ai — Anthropic's secondary agent crawler, distinct from ClaudeBot. It handles specialized data collection tasks and operates independently from the primary crawler.
  • Diffbot — Diffbot's crawler powers their structured data extraction API. It is used by numerous downstream services, meaning your content could end up in multiple AI products through a single crawl.

Watch Out

Several of these crawlers operate independently from their parent company's primary bot. Blocking GPTBot does not block OAI-SearchBot, and blocking ClaudeBot does not block anthropic-ai. You must add each user-agent string separately.

How to Update Your Robots.txt for New AI Bots

Adding new AI crawlers to your robots.txt blocklist is straightforward, but there are nuances that trip people up. A simple disallow directive works for crawlers that actually respect the protocol, though not all of them do.

Here is the process for updating your robots.txt to cover these latest AI crawlers. Apply these steps every time you discover a new bot in your server logs.

  1. Open your robots.txt file at the root of your domain (typically /robots.txt or the public directory in your framework)
  2. Add a new User-agent block for each crawler: User-agent: OAI-SearchBot followed by Disallow: / on the next line
  3. Repeat for each bot: cohere-ai, YouBot, PetalBot, Timpibot, ImagesiftBot, Applebot-Extended, anthropic-ai, and Diffbot
  4. Deploy the updated robots.txt and verify it is accessible at yourdomain.com/robots.txt
  5. Check your server logs 48 hours later to confirm the bots that respect robots.txt have stopped crawling
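Applied to all nine bots from steps 2 and 3, the resulting robots.txt additions look like this:

```
User-agent: OAI-SearchBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: YouBot
Disallow: /

User-agent: PetalBot
Disallow: /

User-agent: Timpibot
Disallow: /

User-agent: ImagesiftBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Diffbot
Disallow: /
```

Each User-agent line must be paired with its own Disallow directive; a single shared group also works, but separate blocks make it easier to remove one bot later without touching the rest.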


Beyond Robots.txt: Server-Side Blocking for New AI Web Crawlers

Robots.txt is a voluntary protocol. Any crawler can choose to ignore it, and some of the newer bots do exactly that. For reliable protection against new AI web crawlers, you need server-side enforcement that actively blocks requests based on user-agent strings and IP ranges.

Firewall rules at the CDN or server level give you hard blocks that crawlers cannot bypass by simply ignoring your robots.txt. Services like Cloudflare, Vercel, and Nginx all support user-agent-based blocking rules.
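As one sketch of what this looks like in practice, here is a minimal Nginx configuration fragment using the `map` module to match the user-agent tokens listed above (adapt the variable name and server block to your own config layout):

```nginx
# http context: match AI crawler user-agents case-insensitively.
# "map" keeps the comparison out of per-location "if" blocks.
map $http_user_agent $block_ai_crawler {
    default                 0;
    "~*OAI-SearchBot"       1;
    "~*cohere-ai"           1;
    "~*YouBot"              1;
    "~*PetalBot"            1;
    "~*Timpibot"            1;
    "~*ImagesiftBot"        1;
    "~*Applebot-Extended"   1;
    "~*anthropic-ai"        1;
    "~*Diffbot"             1;
}

server {
    # ...existing server configuration...

    # Hard-block matched crawlers regardless of robots.txt compliance.
    if ($block_ai_crawler) {
        return 403;
    }
}
```

Unlike robots.txt, a 403 at the server level is enforced on every request, so even a crawler that ignores the protocol entirely gets nothing back.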

The challenge is keeping your server-side rules updated as new crawlers appear. A rule you wrote six months ago will not catch a bot that launched last week. This is where automated detection becomes essential.

Pro Tip

Combine robots.txt with server-side user-agent checks and IP-based blocking for defense in depth. Relying on robots.txt alone is like putting up a "please do not enter" sign with no lock on the door.

How New AI Crawlers Slip Through Undetected

The reason emerging AI crawlers cause so much trouble is that they exploit the gap between their launch and your awareness of them. A new bot can crawl your entire site in a matter of hours, and if you are not actively monitoring your traffic for unknown user agents, you will never know it happened.

Many of these bots also rotate user-agent strings or use generic identifiers that blend in with legitimate traffic. Some disguise themselves as standard browsers, making them invisible to basic log analysis. Others crawl from residential IP ranges that do not match known bot networks.

The real risk is cumulative. Over the course of a year, a dozen unblocked bots could collectively scrape your site hundreds of times, feeding your original content into training datasets and AI search indexes you never consented to.

Building a System to Catch New AI Crawlers Automatically

Manually checking server logs for new AI bots is not sustainable. You need an automated system that flags unfamiliar user agents the moment they appear. This is the difference between reactive blocking, where you find out months later, and proactive detection, where you catch crawlers on their first visit.

A good detection system monitors every request hitting your site, compares user-agent strings against a known bot database, and alerts you when something new shows up. It should also track crawl patterns like unusually high request rates or systematic path traversal that indicate automated scraping even when the user agent looks normal.
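To illustrate the user-agent comparison step, here is a minimal Python sketch. The `KNOWN_AI_BOTS` set and the `classify_user_agent` helper are illustrative names for this example, not part of any particular product, and the keyword heuristic is deliberately rough:

```python
import re

# Hypothetical known-bot list; in practice you would sync this with a
# community source such as the ai.robots.txt project or Dark Visitors.
KNOWN_AI_BOTS = {
    "GPTBot", "ClaudeBot", "OAI-SearchBot", "cohere-ai", "YouBot",
    "PetalBot", "Timpibot", "ImagesiftBot", "Applebot-Extended",
    "anthropic-ai", "Diffbot",
}

# Rough heuristic: anything that self-identifies as automated.
BOT_HINT = re.compile(r"bot|crawl|spider", re.IGNORECASE)

def classify_user_agent(ua: str) -> str:
    """Return 'known-ai-bot', 'unknown-bot', or 'other' for a UA string."""
    ua_lower = ua.lower()
    if any(name.lower() in ua_lower for name in KNOWN_AI_BOTS):
        return "known-ai-bot"
    if BOT_HINT.search(ua):
        return "unknown-bot"  # looks automated but not cataloged -> alert
    return "other"
```

In a real pipeline this classifier would run over every log line, and any `unknown-bot` result would trigger an alert for manual review. It will not catch bots that disguise themselves as browsers; those require the request-pattern analysis described above.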

Copper Analytics includes built-in AI crawler detection that automatically identifies and categorizes every bot visiting your site. When a new AI crawler appears that is not in any public blocklist yet, Copper flags it immediately so you can decide whether to block it before it finishes its first crawl cycle.

Instead of waiting for someone to publish an updated blocklist, you get real-time visibility into exactly which bots are accessing your content. The dashboard shows crawler frequency, pages targeted, and whether each bot respects your robots.txt directives.

Real-Time Detection

Copper Analytics users discovered OAI-SearchBot and Timpibot in their traffic weeks before these crawlers appeared on any major blocklist. Early detection is the only way to stay ahead.

Keeping Your New AI Crawler Blocklist Current: A Quarterly Routine

Blocking AI crawlers is not a set-it-and-forget-it task. The pace of new AI bots launching in 2026 means your blocklist needs regular maintenance. A quarterly review is the bare minimum to stay reasonably protected.

During each review, audit your server logs or analytics dashboard for any user agents you do not recognize. Cross-reference against updated community blocklists like the ai.robots.txt project and Dark Visitors database. Add any new entries to both your robots.txt and your server-side firewall rules.

Keep a changelog of when you added each crawler to your blocklist. This helps you track how quickly the landscape is evolving and makes it easier to brief your team on the latest threats. It also serves as documentation if you ever need to demonstrate due diligence around content protection.

  • Review server logs or analytics for unknown user agents every 90 days at minimum
  • Cross-reference your blocklist against ai.robots.txt and Dark Visitors for newly cataloged bots
  • Update both robots.txt and server-side firewall rules simultaneously
  • Document every addition with the date discovered and the crawler's known purpose
  • Set calendar reminders for quarterly reviews so the task does not slip through the cracks
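The cross-referencing step can be partly automated. As a sketch (the helper names are illustrative, and the parser only recognizes full-site `Disallow: /` rules), here is Python that diffs the user-agents your robots.txt already blocks against a reference bot list:

```python
def blocked_user_agents(robots_txt: str) -> set:
    """Collect user-agents that a robots.txt fully disallows ('Disallow: /')."""
    blocked, current = set(), []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            current.append(value)       # groups may name several agents
        elif field == "disallow":
            if value == "/":
                blocked.update(current)
            current = []                # a Disallow line closes the group
    return blocked

def missing_bots(robots_txt: str, reference: set) -> set:
    """Bots in the reference list that robots.txt does not yet block."""
    return reference - blocked_user_agents(robots_txt)
```

Feed it a reference set pulled from a community blocklist during each quarterly review, and the returned names are exactly the entries to add to both robots.txt and your firewall rules.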

What to Do Next

The right stack depends on how much visibility, workflow control, and reporting depth you need. If you want a simpler way to centralize site reporting and operational data, compare plans on the pricing page and start with a free Copper Analytics account.

You can also keep exploring related guides from the Copper Analytics blog to compare tools, setup patterns, and reporting workflows before making a decision.