Web Scraping using Python and BeautifulSoup

Posted by: ganesh 6 years, 9 months ago

In this article, we’ll see how to perform web scraping using Python and the BeautifulSoup library.

We need Python and BeautifulSoup installed.

Beautiful Soup 4 is published through PyPI, so if you can’t install it with the system packager, you can install it with easy_install or pip. Prefer pip where it is available, since easy_install is deprecated.

The package name is beautifulsoup4. Older releases of this package work on both Python 2 and Python 3; current releases support Python 3 only.

easy_install beautifulsoup4

pip install beautifulsoup4
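
A quick way to confirm the installation worked is to import the package and print its version. This is only a sanity check; the version string you see depends on what your installer picked.

```python
# The beautifulsoup4 package is imported under the name "bs4".
import bs4

# Prints the installed version string, which varies by environment.
print(bs4.__version__)
```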

Consider an example HTML file with the following code:

<!DOCTYPE html>
<html>
    <head>
       <title>BeautifulSoup Tutorial</title>
    </head>
    <body>
        <p class="para1">
            This is the first paragraph.
        </p>
    </body>
</html>

Let us assume the URL of this file is "https://www.example.com/example.html".

Python Requests Library:

We can download the page using the Python requests library as shown below:

import requests

page = requests.get("https://www.example.com/example.html")

If the request is successful, we will get a status code of 200.

page.status_code

page.content holds the HTML of the file we downloaded, as raw bytes.

page.content

b'<!DOCTYPE html>\n<html>\n <head>\n <title>BeautifulSoup Tutorial</title>\n </head>\n <body>\n <p class="para1">\n This is the first paragraph.\n </p>\n </body>\n</html>'
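
The download steps can be wrapped in a small helper. The sketch below is one possible shape, not the only one: the function name fetch_html and the 10-second timeout are assumptions made for illustration. It uses raise_for_status so that a non-2xx response raises an exception instead of being parsed by mistake.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its body as raw bytes.

    fetch_html is a hypothetical helper name; the timeout default is an
    arbitrary choice for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.content
```

Calling fetch_html("https://www.example.com/example.html") would return the same bytes as page.content above, assuming the file really is served at that URL.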

BeautifulSoup:

Now that we have downloaded the HTML page with requests, let's see how to use the BeautifulSoup library to explore its content.

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
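
If the example file is not actually hosted anywhere, the same markup can be parsed from a string, which makes it easy to follow along offline. In this sketch the variable name EXAMPLE_HTML is an assumption, and the string is modeled on the example file above.

```python
from bs4 import BeautifulSoup

# Markup modeled on the example file, held in a string (no network needed).
EXAMPLE_HTML = """<!DOCTYPE html>
<html>
    <head>
       <title>BeautifulSoup Tutorial</title>
    </head>
    <body>
        <p class="para1">
            This is the first paragraph.
        </p>
    </body>
</html>"""

# BeautifulSoup accepts either bytes (like page.content) or str.
soup = BeautifulSoup(EXAMPLE_HTML, "html.parser")

# Tags can be reached by name as attributes of the soup object.
print(soup.title.string)  # BeautifulSoup Tutorial
```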

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:

print(soup.prettify())


<!DOCTYPE html>
<html>
    <head>
       <title>BeautifulSoup Tutorial</title>
    </head>
    <body>
        <p class="para1">
            This is the first paragraph.
        </p>
    </body>
</html>

The top-level elements of the page can be selected using the children property of soup. children returns an iterator rather than a list, so we need to call the list function on it:

list(soup.children)

['html', '\n', <html> <head> <title>BeautifulSoup Tutorial</title> </head> <body> <p class="para1"> This is the first paragraph. </p> </body> </html>]

If we check the type of the third element, we see that <html> is a Tag object.

type(list(soup.children)[2])

bs4.element.Tag

This is the object we use to explore the HTML content and get data out of the HTML file.

html = list(soup.children)[2]

Note that html is a Tag object, which supports the same navigation as the BeautifulSoup object. So, we can use the children property on the html object as well.

list(html.children)

['\n', <head> <title>BeautifulSoup Tutorial</title> </head>, '\n', <body> <p class="para1"> This is the first paragraph. </p> </body>, '\n']

So, in order to get the text inside the <p> tag, we can use the html object as shown below:

html = list(soup.children)[2]
body = list(html.children)[3]
p = list(body.children)[1]
p.get_text()

This is the first paragraph.
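
Indexing into children by position is fragile, because the whitespace between tags shows up as text nodes and shifts the indexes. A minimal sketch of one alternative is to keep only the Tag children. The helper name child_tags is an assumption made for illustration, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """<!DOCTYPE html>
<html>
    <head><title>BeautifulSoup Tutorial</title></head>
    <body>
        <p class="para1">
            This is the first paragraph.
        </p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")


def child_tags(node):
    """Return only the Tag children of node, skipping text and doctype nodes."""
    return [child for child in node.children if isinstance(child, Tag)]


html = child_tags(soup)[0]       # the doctype and newline are skipped
head, body = child_tags(html)    # whitespace text nodes are skipped
p = child_tags(body)[0]

# strip=True removes the leading and trailing whitespace around the text.
print(p.get_text(strip=True))  # This is the first paragraph.
```

Without strip=True, get_text() also returns the newlines and indentation that surround the sentence in the source file.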

We can also find all instances of a tag at once instead of going level by level as we did above. To find all instances of a tag, we can use the find_all method of the BeautifulSoup object.

We already have the soup object. Let's use the find_all method to get all instances of the <p> tag.

soup.find_all('p')

[<p class="para1"> This is the first paragraph. </p>]

Note that the find_all method returns a list. To get only the first instance of the tag, we can use the find method.

soup.find('p')
soup.find('p').get_text()

<p class="para1"> This is the first paragraph. </p>
This is the first paragraph.
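
In practice a page has more than one paragraph, and the tag you look for may be missing. The sketch below uses a hypothetical two-paragraph page (the second paragraph is not part of the original example) to show looping over find_all results and guarding against find returning None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second paragraph is added for illustration only.
html_doc = """<html><body>
<p class="para1">This is the first paragraph.</p>
<p class="para2">This is a second paragraph.</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every match, so we can loop over the results.
paragraph_texts = [p.get_text(strip=True) for p in soup.find_all("p")]
print(paragraph_texts)

# find returns None when nothing matches, so check before calling methods.
missing = soup.find("table")
table_text = missing.get_text() if missing is not None else ""
```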

We can also search by using the class or id of the tag.

soup.find_all('p', class_='para1')

[<p class="para1"> This is the first paragraph. </p>]

soup.find_all('p', id='id of tag')

This returns an empty list because no tag in our example HTML has an id attribute.
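
To see an id search that does match, the sketch below uses a hypothetical variant of the page in which the first paragraph carries id="intro" (that id and the second paragraph are assumptions, not part of the original file). It also shows the select and select_one methods, which take CSS selectors and rely on the soupsieve package that pip normally installs alongside beautifulsoup4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the id attribute and second paragraph are illustrative.
html_doc = """<html><body>
<p class="para1" id="intro">This is the first paragraph.</p>
<p class="para2">This is a second paragraph.</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Keyword filters: id matches the id attribute, class_ matches a CSS class.
by_id = soup.find_all("p", id="intro")
by_class = soup.find_all("p", class_="para2")

# CSS selectors: "p.para1" is a <p> with class para1, "#intro" is an id.
by_css = soup.select("p.para1")
one = soup.select_one("#intro")

# An id that does not exist gives an empty result, as in the article.
no_match = soup.find_all("p", id="missing")

print(by_id[0].get_text(), len(no_match))
```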
