Web Crawler Abuse
Also known as: Scraper abuse, Bad bot traffic
Aggressive, unauthorized, or deceptive automated crawling of a website — scraping content, harvesting data, ignoring robots.txt, or overwhelming the server with request volume.
What is web crawler abuse?
Legitimate crawlers (Googlebot, Bingbot, Applebot, trusted academic archivers) announce themselves honestly, respect robots.txt, rate-limit their requests, and publish the IP ranges they crawl from. Web crawler abuse is everything at the other end of that spectrum: bots that scrape content without permission, harvest data for resale, ignore crawl directives, spoof legitimate User-Agents, or simply hit the site so hard that they degrade service.
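Because User-Agent strings are trivial to spoof, a claimed "Googlebot" is worth confirming by DNS. The sketch below shows one common approach, forward-confirmed reverse DNS: look up the hostname for the IP, check that it ends in a trusted suffix, then resolve that hostname and confirm it maps back to the same IP. The function name, the injectable `reverse`/`forward` resolvers, and the suffix tuple are illustrative assumptions; check each crawler operator's own verification guidance for the suffixes it uses.

```python
import socket

# Assumption: hostname suffixes commonly documented for Googlebot.
# Other crawlers use different suffixes; adjust per operator guidance.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, suffixes=GOOGLEBOT_SUFFIXES,
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check (hypothetical helper).

    Returns True only if the PTR hostname for `ip` ends in a trusted
    suffix AND that hostname resolves back to `ip`.
    Note: gethostbyname_ex only resolves IPv4 addresses.
    """
    try:
        hostname = reverse(ip)[0]          # (hostname, aliases, addresses)
    except OSError:                        # no PTR record or lookup failure
        return False

    if not hostname.lower().rstrip(".").endswith(suffixes):
        return False

    try:
        addresses = forward(hostname)[2]   # forward lookup of the PTR name
    except OSError:
        return False

    return ip in addresses
```

Checking the suffix with a leading dot matters: a hostname such as `googlebot.com.evil.example` would pass a naive substring test but fails `endswith`. DNS lookups are slow, so in practice the result would usually be cached per IP.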
Common abusive crawler patterns
- Content scraping — republishing another site's articles, reviews, or product catalog on a knockoff domain or aggregator
- Price scraping — e-commerce competitors pulling pricing in real time to undercut
- Inventory scraping — sneaker, concert-ticket, and collectible bots monitoring stock for resale arbitrage
- Email harvesting — extracting contact addresses for spam and phishing
- LLM training data scraping — pulling entire sites into training corpora without consent
- Credential-stuffing reconnaissance — enumerating valid usernames or user IDs from profile URLs
- DoS-by-accident — an overly aggressive scraper that takes a small site down even without malicious intent
How to spot it
Signs in server logs: a single IP or ASN pulling tens of thousands of pages per hour, missing or obviously forged User-Agents, no session cookies, sequential or numeric URL patterns (/product/1, /product/2, …), requests for paths that the site's robots.txt disallows, and an odd geographic distribution (for example, all traffic coming from one datacenter range even though the site serves a specific region).
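Several of these signals can be checked with a short pass over parsed access-log records. The following is a minimal sketch, assuming the log has already been parsed into `(ip, timestamp, path, user_agent)` tuples; the function names and thresholds are hypothetical and would need tuning against a site's normal traffic.

```python
import re
from collections import defaultdict
from datetime import datetime
from typing import Iterable, Tuple

# Assumed record shape: (client IP, request time, URL path, User-Agent).
Record = Tuple[str, datetime, str, str]

# Matches paths ending in a numeric ID, e.g. /product/42
NUMERIC_TAIL = re.compile(r"^(.*/)(\d+)/?$")


def longest_run(ids):
    """Length of the longest run of consecutive integers in a set."""
    best = 0
    for n in ids:
        if n - 1 not in ids:               # n is the start of a run
            length = 1
            while n + length in ids:
                length += 1
            best = max(best, length)
    return best


def flag_suspects(records: Iterable[Record],
                  max_per_hour=10_000, min_sequential_run=20):
    """Return {ip: set of reasons} for IPs showing abusive-crawler signs."""
    per_hour = defaultdict(int)            # (ip, hour bucket) -> request count
    numeric_ids = defaultdict(set)         # (ip, path prefix) -> IDs requested
    reasons = defaultdict(set)

    for ip, ts, path, user_agent in records:
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        per_hour[(ip, bucket)] += 1
        if not user_agent.strip():
            reasons[ip].add("missing user-agent")
        match = NUMERIC_TAIL.match(path)
        if match:
            numeric_ids[(ip, match.group(1))].add(int(match.group(2)))

    for (ip, _), count in per_hour.items():
        if count > max_per_hour:
            reasons[ip].add("high request rate")

    for (ip, _), ids in numeric_ids.items():
        if longest_run(ids) >= min_sequential_run:
            reasons[ip].add("sequential URL enumeration")

    return dict(reasons)
```

This covers only request rate, missing User-Agents, and sequential URLs. Grouping by ASN, checking for session cookies, or comparing against robots.txt would need extra fields in each record.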
Defense
Layered defense works better than any single rule: rate-limit per IP and per ASN at the CDN, block requests that fail a User-Agent reputation check, challenge anomalous traffic with a JavaScript or CAPTCHA check (a capability offered by bot-management products such as Cloudflare Bot Management, DataDome, and PerimeterX), and publish a robots.txt that names the crawlers you expect, so that bots ignoring it stand out in the logs. Cross-referencing unfamiliar crawler IPs against an IP abuse report checker helps flag known bad-bot infrastructure.
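To illustrate the rate-limiting layer, here is a minimal in-process token-bucket sketch keyed by an arbitrary string such as an IP address or ASN. It is an assumption-laden illustration of the general technique, not how any particular CDN implements it; the class name, parameters, and injectable `clock` are hypothetical.

```python
import time


class TokenBucketLimiter:
    """Per-key token bucket: each key may burst up to `burst` requests,
    then is held to `rate_per_sec` sustained requests per second."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self._state = {}                   # key -> (tokens, last refill time)

    def allow(self, key):
        now = self.clock()
        tokens, last = self._state.get(key, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._state[key] = (tokens - 1, now)
            return True
        self._state[key] = (tokens, now)
        return False


# Usage: allow bursts of 20 requests, then 5 requests/second per client IP.
limiter = TokenBucketLimiter(rate_per_sec=5, burst=20)
if not limiter.allow("198.51.100.7"):
    pass  # e.g. respond with HTTP 429 Too Many Requests
```

Running a second limiter keyed by ASN alongside the per-IP one catches scrapers that rotate through many addresses in the same network. A production version would also need to evict idle keys and share state across servers.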