Web Scraping using Python and BeautifulSoup

In this article, we’ll see how to perform web scraping using Python and the BeautifulSoup library.

We need Python and BeautifulSoup installed.

Beautiful Soup 4 is published through PyPI, so if you can’t install it with the system packager, you can install it with easy_install or pip.

The package name is beautifulsoup4, and the same package works on Python 2 and Python 3.

easy_install beautifulsoup4

pip install beautifulsoup4
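To confirm that the installation worked, a quick check along these lines may help. It only imports the package and prints its version string; the exact version printed depends on your environment.

```python
import bs4

# __version__ holds the installed Beautiful Soup release as a string.
print(bs4.__version__)
```

If the import fails with ModuleNotFoundError, the package was most likely installed for a different Python interpreter than the one running the script.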

Consider an example HTML file with the following content:

<!DOCTYPE html>
<html>
    <head>
       <title>BeautifulSoup Tutorial</title>
    </head>
    <body>
        <p class="para1">
            This is the first paragraph.
        </p>
    </body>
</html>

Let us assume the URL of this file is "https://www.example.com/example.html".

Python Requests Library:

We can download the page using the Python requests library as shown below:

import requests

page = requests.get("https://www.example.com/example.html")

If the request is successful, we will get a status code of 200.

page.status_code

page.content will have the HTML content of the file we downloaded.

page.content
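In a real script it is usually safer not to assume the request succeeded. The following is a minimal sketch, not part of the original example: fetch_html is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper)."""
    # A timeout keeps the call from hanging on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses,
    # so the caller never parses an error page by accident.
    response.raise_for_status()
    return response.content
```

Calling fetch_html("https://www.example.com/example.html") would return the same bytes as page.content when the status code is 200, and raise an exception otherwise.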

BeautifulSoup:

Now that we have the HTML page downloaded with requests, let's see how to use the BeautifulSoup library to explore its content.

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:

print(soup.prettify())


<!DOCTYPE html>
<html>
    <head>
       <title>BeautifulSoup Tutorial</title>
    </head>
    <body>
        <p class="para1">
            This is the first paragraph.
        </p>
    </body>
</html>
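Because the URL used in this article is only an assumed one, it can be handy to follow along without any download. BeautifulSoup also accepts a plain string, so the sketch below inlines the example document (the html_doc variable name is an assumption) and parses it directly.

```python
from bs4 import BeautifulSoup

# The example document from this article, inlined as a string.
html_doc = """<!DOCTYPE html>
<html>
    <head>
       <title>BeautifulSoup Tutorial</title>
    </head>
    <body>
        <p class="para1">
            This is the first paragraph.
        </p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Tags can be reached by name as attributes of the soup object.
title_text = soup.title.string
print(title_text)
```

The rest of the examples work the same way whether soup was built from page.content or from a string like this one.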

The top-level elements of the page can be selected using the children property of soup. children returns an iterator rather than a list, so we need to call the list function on it:

list(soup.children)

['html', '\n', <html> <head> <title>BeautifulSoup Tutorial</title> </head> <body> <p class="para1"> This is the first paragraph. </p> </body> </html>]

If we check the type of the third element, we see that <html> is a Tag object.

type(list(soup.children)[2])

bs4.element.Tag

This is the object we use to explore the HTML content and get the data from the HTML file.

html = list(soup.children)[2]

Note that html is a Tag object, which supports the same navigation as the BeautifulSoup object. So, we can use the children property on the html object as well.

list(html.children)

['\n', <head> <title>BeautifulSoup Tutorial</title> </head>, '\n', <body> <p class="para1">This is the first paragraph.</p> </body>, '\n']

So, in order to get the text in the <p></p> tags, we can use the html object as shown below:

html = list(soup.children)[2]
body = list(html.children)[3]
p = list(body.children)[1]
p.get_text()

This is the first paragraph.
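Indexing into the children list by position is fragile, because the whitespace between tags shows up as '\n' strings and shifts the indexes. One possible way around this, sketched below, is to keep only the children that are Tag objects. The tag_children helper is a hypothetical name, and the document is inlined so the snippet runs on its own.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """<!DOCTYPE html>
<html>
<head><title>BeautifulSoup Tutorial</title></head>
<body>
<p class="para1">This is the first paragraph.</p>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")


def tag_children(node):
    """Return only the Tag children of node, skipping text and doctype nodes."""
    return [child for child in node.children if isinstance(child, Tag)]


html = tag_children(soup)[0]     # the doctype and newline are skipped
head, body = tag_children(html)  # only <head> and <body> remain
p = tag_children(body)[0]

# strip=True trims the whitespace around the text.
paragraph_text = p.get_text(strip=True)
print(paragraph_text)
```

With the whitespace nodes filtered out, the indexes no longer depend on how the HTML source happens to be indented.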

We can also find all instances of a tag at once instead of going level by level as we did above. To find all instances of a tag, we can use the find_all method of the BeautifulSoup object.

We already have the soup object. Let's use the find_all method to get all instances of the <p> tag.

soup.find_all('p')

[<p class="para1">This is the first paragraph.</p>]

Note that the find_all method returns a list. To get only the first instance of the tag, we can use the find method.

soup.find('p')
soup.find('p').get_text()

<p class="para1">This is the first paragraph.</p>
This is the first paragraph.
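Two details are worth keeping in mind here: find_all results can be looped over like any list, and find returns None when nothing matches, so calling get_text() on the result directly can fail. The sketch below illustrates both. The second paragraph and the first_text helper are assumptions added for illustration and are not part of the example file.

```python
from bs4 import BeautifulSoup

# A small document with a hypothetical second paragraph added.
html_doc = """
<body>
    <p class="para1">This is the first paragraph.</p>
    <p class="para2">This is a second paragraph.</p>
</body>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Collect the text of every <p> tag in document order.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]


def first_text(soup, name, default=""):
    """Text of the first tag called name, or default if there is none."""
    tag = soup.find(name)
    # find() returns None when no tag matches, so check before using it.
    return tag.get_text(strip=True) if tag is not None else default


print(paragraphs)
print(first_text(soup, "p"))
print(repr(first_text(soup, "table")))
```

Checking for None up front avoids an AttributeError on pages where the expected tag is missing.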

We can also search by using the class or id of the tag.

soup.find_all('p', class_='para1')

[<p class="para1">This is the first paragraph.</p>]

soup.find_all('p', id='id of tag')

This returns an empty list because no tag in our example HTML has an id attribute.
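To see an id search that does match, here is a small sketch on a modified document. The id "intro" and the second paragraph are assumptions made up for this illustration. It also shows the select method, which takes a CSS selector and is an alternative to find_all.

```python
from bs4 import BeautifulSoup

# The id "intro" and the second paragraph are hypothetical additions.
html_doc = """
<body>
    <p class="para1" id="intro">This is the first paragraph.</p>
    <p class="para2">This is a second paragraph.</p>
</body>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_id = soup.find_all("p", id="intro")         # one match
by_class = soup.find_all("p", class_="para2")  # one match
no_match = soup.find_all("p", id="id of tag")  # empty result

# select() takes a CSS selector; "p#intro" means a <p> tag with id "intro".
by_css = soup.select("p#intro")

print(by_id[0].get_text())
print(by_class[0].get_text())
print(len(no_match))
```

Which style to use is mostly a matter of taste: keyword arguments read well for a single condition, while CSS selectors can be more compact for nested or combined conditions.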
