Summary
• Crawling and basic crawler architecture
• Politeness and frontier
• Handling updates
• Deep web
• Sitemap
• Document feeds
• Distributed crawlers
• Conversion
• Storing documents and BigTable
• Near-duplicate detection
• Removing noise
J. Pei: Information Retrieval and Web Search -- Web Crawling
To-Do List
• Read Chapter 3
• Between the shingling method and the simhash method, which one is more accurate? Why?
• Web pages often contain ads. How can we detect web pages that contain duplicate content but different ads?