J. Pei: Information Retrieval and Web Search -- Web Crawling 3 Features of Crawlers • Must-have features of a crawler – Robustness: should not fall into spider ...

Published 2016-07-01 23:42:03

Web Crawling - cs.sfu.ca

Summary

•  Crawling and basic crawler architecture
•  Politeness and frontier
•  Handling updates
•  Deep web
•  Sitemap
•  Document feeds
•  Distributed crawlers
•  Conversion
•  Storing documents and BigTable
•  Near-duplicate detection
•  Removing noise

To-Do List

•  Read Chapter 3
•  Between the shingling method and the simhash method, which one is more accurate? Why?
•  Web pages often contain ads. How can we detect web pages containing duplicate content but different ads?
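
To make the two methods named in the to-do list concrete, here is a minimal sketch of each. It is an illustration under stated assumptions, not the course's implementation: the function names, the shingle size of 3 words, the use of MD5 as the hash function, and the 64-bit fingerprint width are all choices made for this example.

```python
import hashlib
import re

BITS = 64  # assumed fingerprint width for this sketch


def tokens(text):
    """Split text into lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(text, k=3):
    """Return the set of overlapping k-word shingles (k=3 is an assumption)."""
    words = tokens(text)
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def simhash(text):
    """Compute a 64-bit simhash; each token occurrence casts one vote per bit."""
    counts = [0] * BITS
    for word in tokens(text):
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 where the votes for that bit are positive.
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def hamming(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


if __name__ == "__main__":
    # Hypothetical pages: same main text, different ads.
    page_a = "Web crawlers must be robust and polite. Buy cheap shoes today!"
    page_b = "Web crawlers must be robust and polite. Visit our new car showroom!"
    print("Jaccard similarity:", jaccard(shingles(page_a), shingles(page_b)))
    print("Simhash Hamming distance:", hamming(simhash(page_a), simhash(page_b)))
```

Shingling compares the documents' shingle sets directly, while simhash compresses each document into one fixed-size fingerprint and compares fingerprints by Hamming distance. For pages that differ only in ads, one option is to apply either method to the text that remains after noise removal.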
