• A system for downloading, storing and analysing web pages

  • Use cases

    • Search engine indexing
    • Web archiving
    • Web monitoring for copyright or trademark violation
  • Steps

    Given a list of seed URLs → Visit each URL → Store the web page → Extract URLs from the current page → Append the URLs to the list of URLs to visit → Repeat

  • Characteristics of a good web crawler

    • Should be scalable
    • Should be robust enough to handle poorly formatted HTML, malicious sites, crashes, etc.
    • Should avoid making too many requests to a website in a very short time, as the burst can overwhelm the server and resemble a denial-of-service attack
    • Should be extensible for future changes
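
The crawl steps listed above can be sketched as a breadth-first loop. This is only a minimal illustration, not a production design: the names `crawl`, `fetch`, `extract_urls` and `max_pages` are assumptions made for the example, and the fetcher and extractor are passed in as plain functions so the loop can be tried without network access.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], str],
    extract_urls: Callable[[str, str], List[str]],
    max_pages: int = 100,
) -> Dict[str, str]:
    """Breadth-first crawl; returns a mapping of URL -> stored page body."""
    frontier = deque(seeds)     # URLs waiting to be visited
    seen = set(frontier)        # URLs already queued or visited
    store: Dict[str, str] = {}  # stands in for the data storage component

    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        try:
            page = fetch(url)   # visit the URL
        except Exception:
            continue            # robustness: skip pages that fail to download
        store[url] = page       # store the web page
        for link in extract_urls(url, page):  # extract URLs from the page
            if link not in seen:
                seen.add(link)
                frontier.append(link)         # append to the list to visit
    return store
```

Passing a dictionary lookup as `fetch` and `lambda url, page: page.split()` as `extract_urls` is enough to exercise the loop on a small in-memory "web". The `max_pages` limit is one simple way to keep the sketch from running forever.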

Overview

Seed URLs

  • Initial links used as the starting point of the crawling process
  • Choosing good seed URLs affects how many web pages the crawler can reach

URL Frontier

  • A queue data structure that holds the URLs that have to be fetched and analysed

HTML Fetcher

  • Downloads the web page pointed to by the URL supplied by the URL frontier

DNS Resolver

  • Translates the hostname in the URL to the IP address of the server hosting the web page
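
A rough sketch of this step, using only the Python standard library. The class name `CachingResolver` and the in-memory cache are assumptions added for illustration (the notes do not say the resolver caches anything), and the sketch ignores DNS record expiry.

```python
import socket
from typing import Callable, Dict
from urllib.parse import urlsplit


class CachingResolver:
    """Resolves the hostname of a URL to an IP address (hypothetical helper)."""

    def __init__(self, lookup: Callable[[str], str] = socket.gethostbyname):
        self._lookup = lookup             # swappable, e.g. for testing
        self._cache: Dict[str, str] = {}  # hostname -> IP; never expires here

    def resolve(self, url: str) -> str:
        host = urlsplit(url).hostname     # lower-cased hostname, or None
        if host is None:
            raise ValueError(f"no hostname in URL: {url!r}")
        if host not in self._cache:
            self._cache[host] = self._lookup(host)  # the actual DNS query
        return self._cache[host]
```

With the default `lookup`, `CachingResolver().resolve("https://example.com/page")` performs a real DNS query the first time and answers from the cache for later URLs on the same host.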

HTML Parser

  • Checks the integrity of a web page’s data
  • Checks for poorly formatted HTML and malware

Duplicate detection

  • Storing duplicate pages wastes storage space and slows down the system
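
One common way to detect duplicates is to compare content fingerprints instead of whole pages. The sketch below assumes exact duplicates (after whitespace normalisation) are the target; the class name `ContentDeduplicator` is made up for the example, and near-duplicate detection would need a different technique.

```python
import hashlib


class ContentDeduplicator:
    """Remembers a hash of every page seen so far (illustrative only)."""

    def __init__(self) -> None:
        self._fingerprints = set()

    def is_duplicate(self, html: str) -> bool:
        # Collapse runs of whitespace so trivial formatting differences
        # do not produce different fingerprints.
        normalized = " ".join(html.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in self._fingerprints:
            return True
        self._fingerprints.add(digest)
        return False
```

Storing a fixed-size digest per page keeps the memory cost independent of page size.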

Cache

  • Improves the web crawler’s efficiency
  • Stores most recently crawled URLs
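
A small least-recently-used cache is one plausible way to keep only the most recently crawled URLs. The notes do not specify an eviction policy, so LRU, the class name `RecentUrlCache` and the default capacity are all assumptions.

```python
from collections import OrderedDict
from typing import Optional


class RecentUrlCache:
    """Keeps the pages of the most recently used URLs (LRU eviction)."""

    def __init__(self, capacity: int = 1000) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    def get(self, url: str) -> Optional[str]:
        if url not in self._entries:
            return None
        self._entries.move_to_end(url)  # mark as most recently used
        return self._entries[url]

    def put(self, url: str, page: str) -> None:
        self._entries[url] = page
        self._entries.move_to_end(url)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used
```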

Data Storage

  • The web page’s data is stored in a storage system

URL Extractor

  • Extracts URLs from the current HTML page
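
As an illustration, link extraction can be done with Python's built-in `html.parser`, which tolerates poorly formatted HTML. The function name `extract_urls` is an assumption, and the sketch only looks at `<a href="...">` links; it resolves relative links against the page URL and drops `#fragment` parts.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urldefrag, urljoin


class _LinkCollector(HTMLParser):
    """Collects the raw href values of all <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_urls(base_url: str, html: str) -> List[str]:
    """Returns absolute, fragment-free URLs in order of first appearance."""
    collector = _LinkCollector()
    collector.feed(html)
    urls: List[str] = []
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in urls:
            urls.append(absolute)
    return urls
```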

URL Filter

  • Filters out faulty or malicious URLs
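
A filter of this kind might look roughly like the following. The specific rules (allowed schemes, a length limit, a host blocklist) and the names `is_crawlable`, `ALLOWED_SCHEMES`, `BLOCKED_HOSTS` and `MAX_URL_LENGTH` are assumptions chosen for the example; a real crawler would use its own criteria.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}
BLOCKED_HOSTS = {"malware.example"}  # hypothetical blocklist entry
MAX_URL_LENGTH = 2000                # arbitrary limit for this sketch


def is_crawlable(url: str) -> bool:
    """Returns True if the URL passes the basic sanity checks above."""
    if len(url) > MAX_URL_LENGTH:
        return False
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:               # e.g. a malformed IPv6 literal
        return False
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    if not host:
        return False
    return host not in BLOCKED_HOSTS
```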

URL Loader or Detector

  • Filters out URLs that have already been visited
  • Bloom filters are commonly used (Why? They are extremely space efficient compared to hash tables, at the cost of occasional false positives: a few unvisited URLs are wrongly reported as visited and therefore skipped)
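
To make the trade-off concrete, here is a minimal Bloom filter sketch. The bit-array size, the number of hash functions and the class name `BloomFilter` are assumptions picked for illustration; the bit positions are derived from a single SHA-256 digest using double hashing, which is one of several ways to obtain multiple hash values.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Set-like structure: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self._num_bits = num_bits
        self._num_hashes = num_hashes
        self._bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        for i in range(self._num_hashes):
            yield (h1 + i * h2) % self._num_bits      # double hashing

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self._bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

With these defaults the filter occupies 1 KiB no matter how long the URLs are. A URL that was added is always reported as present; a URL that was never added is usually reported as absent, but can occasionally collide with bits set by other URLs.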

URL Storage

  • Keeps track of all the visited URLs

References

  1. Varna’s presentation