Web crawlers

A system for downloading, storing and analysing web pages
Use cases
- Search engine indexing
- Web archiving
- Web monitoring for copyright or trademark violation
Steps

Given a list of seed URLs → Visit each URL → store the web page → Extract URLs from the current page → Append the URLs to the list of URLs to visit → repeat
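
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler. The names `crawl`, `LinkExtractor`, `fetch` and `max_pages` are made up for this example, and `fetch` is assumed to be any function that returns a page's HTML as a string, or `None` on failure. A `seen` set is added so the same URL is never queued twice.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` is an assumed callable returning HTML or None."""
    frontier = deque(seed_urls)  # URLs still to visit
    seen = set(seed_urls)        # URLs already queued, so nothing is visited twice
    store = {}                   # url -> stored page content
    while frontier and len(store) < max_pages:
        url = frontier.popleft()          # visit the next URL
        html = fetch(url)
        if html is None:                  # download failed, skip it
            continue
        store[url] = html                 # store the web page
        parser = LinkExtractor()
        parser.feed(html)                 # extract URLs from the current page
        for href in parser.links:
            link = urljoin(url, href)     # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)     # append to the list of URLs to visit
    return store
```

Passing `fetch` in as a parameter keeps the loop independent of how pages are downloaded, so it can be tried against a dictionary of fake pages before pointing it at a real HTTP client.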
Characteristics of a good web crawler
- Should be scalable
- Should be robust enough to handle poorly formatted HTML, malicious sites, crashes, etc.
- Should be polite: avoid sending too many requests to one website in a short time, as this can overload the server and be treated as a denial-of-service (DoS) attack
- Should be extensible for future changes
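
One common way to limit the request rate per website is to enforce a minimum delay between requests to the same host. The sketch below is only an illustration of that idea. The class name `PolitenessGate`, the `min_delay` default of one second, and the injectable `clock` and `sleep` parameters are all assumptions made for this example.

```python
import time
from urllib.parse import urlparse


class PolitenessGate:
    """Enforces a minimum delay between requests to the same host (hypothetical helper)."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time the next request may be sent

    def wait(self, url):
        """Block until it is acceptable to send a request for `url`."""
        host = urlparse(url).netloc
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        if ready_at > now:
            self._sleep(ready_at - now)  # too soon for this host, so wait
            now = ready_at
        self._next_allowed[host] = now + self.min_delay
```

Calling `gate.wait(url)` just before each download delays only requests to a host that was contacted recently; requests to other hosts go through immediately.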

Overview

Cache

Data Storage

Filters out URLs that have already been visited
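Bloom filters are commonly used (Why? They are far more space efficient than hash tables, at the cost of occasional false positives: a URL that was never visited may be reported as already visited and so gets skipped. A URL that was visited is never reported as unvisited.)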
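
To make the trade-off concrete, here is a small Bloom filter sketch. It is an illustration under assumptions, not a tuned implementation. The class name `BloomFilter`, the default sizes, and the choice to derive the bit positions from a single SHA-256 digest (double hashing) are all choices made for this example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; membership tests can give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # about 128 KiB for the default size

    def _positions(self, item):
        # Derive several bit positions from one digest (double hashing; an assumed scheme).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

The filter stores only bits, never the URLs themselves, which is where the space saving over a hash table comes from. A crawler would check `url in bloom` before queueing a URL and call `bloom.add(url)` once it is queued.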

URL Storage