
What Is Web Scraping? – Semalt
Explains The Role Of BeautifulSoup In
Web Scraping

Published by vah71511, 2018-08-06 14:49:19



23.05.2018


Web pages are written in text-based markup languages such as HTML and XHTML. They contain a wealth of
information in the form of images, videos, and text. Web pages are designed for human readers, so their content is
not structured for automated bots to consume directly. Companies like Google and Amazon AWS provide various web
scraping services, software, techniques and tools to ease your work. Some of these tools are free, while others are
priced from $20 to $2000.

What is web scraping?

Web scraping is the practice of extracting data from different websites, and web crawling is one of its main
components. Once the data is fetched, it may be parsed or reformatted as per your requirements. Web scraping
tools copy the data into spreadsheets or download it to your hard drive for offline use.
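
As a minimal sketch of that last step, the Python snippet below writes already-scraped records to a CSV file that a spreadsheet program can open. The field names (`title`, `price`), the sample values, and the function name `save_rows` are assumptions made for illustration; they do not come from any particular scraping tool.

```python
import csv


def save_rows(rows, path):
    """Write a list of dicts to a CSV file and return the number of rows written."""
    if not rows:
        return 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        # Use the keys of the first record as the column headers.
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Hypothetical records, standing in for data a scraper has already collected.
sample_rows = [
    {"title": "Example item A", "price": "19.99"},
    {"title": "Example item B", "price": "24.50"},
]
```

Calling `save_rows(sample_rows, "items.csv")` would produce a file with a header row followed by one line per record.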

https://rankexperience.com/articles/article2109.html 1/2


The role of BeautifulSoup in web scraping:

Some companies use Python-based libraries to scrape data. These tools locate the relevant web pages, collect the
useful data, extract it, and download it to a hard drive. Many web scrapers rely on techniques and libraries such as
DOM parsing, BeautifulSoup, Scrapy and lxml. In many cases the information you want can be accessed and scraped
with ordinary techniques and tools. In such circumstances, BeautifulSoup is the right library for you.

The major components of a web page:

Before we scrape data using BeautifulSoup, let us check out the different
components of a web page. There are four main components of a web
page: HTML, CSS, JS and images. HTML contains the main content of a
page. CSS adds styles to a page and controls how it looks. JS, or
JavaScript, adds interactivity to a web page. Images make a page look
lively; the most common image formats are PNG and JPG.
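
To make the four components concrete, here is a small sketch that lists them for a tiny page using BeautifulSoup (installed as the `beautifulsoup4` package). The page markup and the file names in it (`style.css`, `app.js`, `logo.png`, `photo.jpg`) are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical page that contains all four components.
PAGE = """<html><head>
<link rel="stylesheet" href="style.css">
<script src="app.js"></script>
</head><body>
<p>Main content lives in the HTML.</p>
<img src="logo.png"><img src="photo.jpg">
</body></html>"""

soup = BeautifulSoup(PAGE, "html.parser")

# CSS: <link> tags that point at a stylesheet file.
stylesheets = [tag["href"] for tag in soup.find_all("link", href=True)]
# JS: <script> tags that load an external file.
scripts = [tag["src"] for tag in soup.find_all("script", src=True)]
# Images: the src attribute of every <img> tag.
images = [tag["src"] for tag in soup.find_all("img")]
# HTML content: the text of the first paragraph.
text = soup.p.get_text()
```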

Extract data from HTML documents with BeautifulSoup:

It is possible to extract data from HTML and XML documents with BeautifulSoup; it does not read PDF files. HTML
(HyperText Markup Language) is the standard language used to create web pages. Unlike Python, which is a
programming language, HTML is a markup language that tells the browser how to lay out the web content. HTML lets
you create elements such as paragraphs and headings that give structure to your text. Once the data is extracted,
you can save it in different formats.

1. The Requests library:

First, download the web page using the Requests library, which makes it easy to fetch HTML text and images over
HTTP.
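
A minimal sketch of this step, assuming the third-party `requests` package is installed. The helper names `fetch_html` and `fetch_bytes` and the 10-second timeout are choices made for this example, not part of the library.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def fetch_bytes(url, timeout=10):
    """Download a binary resource such as an image and return its raw bytes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content
```

For example, `fetch_html("https://example.com/")` would return the page source as a string, ready to be handed to a parser.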

2. Parse the page with BeautifulSoup:

You can now use the BeautifulSoup library to parse your HTML text and web documents. BeautifulSoup is a Python
package that builds a parse tree from a document and is used to extract data from HTML documents. Older releases
supported both Python 2 and Python 3; current releases of Beautiful Soup 4 support Python 3 only.
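
The following sketch shows what parsing looks like in practice. It uses an inline HTML string instead of a downloaded page so that it runs without network access; the markup, the `item` class name, and the example.com link are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for text downloaded in the previous step.
HTML = """<html><head><title>Sample page</title></head>
<body>
<h1>Products</h1>
<p class="item">First product</p>
<p class="item">Second product</p>
<a href="https://example.com/more">More</a>
</body></html>"""

# Build the parse tree with Python's built-in HTML parser.
soup = BeautifulSoup(HTML, "html.parser")

page_title = soup.title.string                                  # text of <title>
items = [p.get_text() for p in soup.find_all("p", class_="item")]  # all matching <p> tags
link = soup.find("a")["href"]                                   # attribute of the first <a>
```

`find_all` returns every matching tag, while `find` returns only the first match (or `None` if there is none), so check the result of `find` before indexing into it on real pages.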

Different tags you should know about:

The tag relationships you should know for web scraping are child, parent and sibling. A child is a tag nested
directly inside another tag, which is its parent. A parent is a tag that wraps one or more child tags. Siblings are
tags nested inside the same parent, at the same level as one another.
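
These relationships map onto BeautifulSoup's navigation attributes and methods. The sketch below uses a made-up three-item list to show one way of moving between a parent, its children, and their siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one parent <ul> with three child <li> tags.
HTML = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(HTML, "html.parser")

first = soup.find("li")

# Parent: the tag wrapped around the first <li>.
parent_name = first.parent.name

# Children: the <li> tags nested directly inside the <ul>.
children = [li.get_text() for li in soup.ul.find_all("li", recursive=False)]

# Siblings: tags that share the same parent as the first <li>.
next_sibling = first.find_next_sibling("li").get_text()
later_siblings = [li.get_text() for li in first.find_next_siblings("li")]
```

Using `find_next_sibling` with a tag name, rather than the `.next_sibling` attribute, skips over any whitespace text nodes that real pages usually contain between tags.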
