J. Pei: Information Retrieval and Web Search -- Web Crawling

Features of Crawlers

•  Must-have features of a crawler
   –  Robustness: should not fall into spider ...


Published: 2016-07-01 23:42:03

Web Crawling - cs.sfu.ca

Summary

•  Crawling and basic crawler architecture
•  Politeness and frontier
•  Handling updates
•  Deep web
•  Sitemap
•  Document feeds
•  Distributed crawlers
•  Conversion
•  Storing documents and BigTable
•  Near-duplicate detection
•  Removing noise


To-Do List

•  Read Chapter 3
•  Between the shingling method and the simhash method, which one is more accurate? Why?
•  Web pages often contain ads. How can we detect web pages containing duplicate content but different ads?
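
As a starting point for experimenting with the two questions above, here is a minimal sketch of both near-duplicate measures. It is not taken from the slides: the function names, the shingle size `k=3`, the use of MD5 as the token hash, and the unweighted word-level simhash are all assumptions chosen for brevity.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of text (k=3 is an assumed default)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash64(token):
    # Assumption: any well-mixed 64-bit hash would do; MD5 is used for convenience.
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(text, bits=64):
    """Unweighted simhash fingerprint over the words of text (a simplification)."""
    counts = [0] * bits
    for word in text.lower().split():
        h = _hash64(word)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")
```

Comparing `jaccard(shingles(a), shingles(b))` with `hamming(simhash(a), simhash(b))` on page pairs that differ only in a short block of text (such as an ad) is one way to explore how each measure reacts to small edits.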

