In this article, we’ll see how to perform web scraping using Python and the BeautifulSoup library.
We need Python and BeautifulSoup installed.
Beautiful Soup 4 is published on PyPI, so if you can’t install it with the system package manager, you can install it with pip, or with easy_install on older setups.
The package name is beautifulsoup4. Releases up to 4.9.3 work on both Python 2 and Python 3; later releases require Python 3.
easy_install beautifulsoup4
pip install beautifulsoup4
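To confirm the installation worked, one option is to import the package and print its version. This is a minimal check, assuming the package was installed into the same Python environment you run the snippet from.

```python
# Quick sanity check that the beautifulsoup4 package is importable.
# Note that the import name is bs4, not beautifulsoup4.
import bs4

print(bs4.__version__)  # the installed version string
```

If the import fails with ModuleNotFoundError, the package most likely went into a different environment than the one running the script.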
Consider an example HTML file with the following code:
<!DOCTYPE html>
<html>
<head>
<title>BeautifulSoup Tutorial</title>
</head>
<body>
<p class="para1">This is the first paragraph.</p>
</body>
</html>
Let us assume the URL of this file is "https://www.example.com/example.html".
Python Requests Library:
We can download the page using the Python requests library as shown below:
import requests
page = requests.get("https://www.example.com/example.html")
If the request is successful, we will get a status code of 200.
page.status_code
page.content holds the raw HTML of the downloaded file as bytes.
page.content
BeautifulSoup:
Now we have the HTML page that we downloaded using Python requests. Let's see how to use the BeautifulSoup library to explore the HTML content.
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:
print(soup.prettify())
<!DOCTYPE html>
<html>
 <head>
  <title>
   BeautifulSoup Tutorial
  </title>
 </head>
 <body>
  <p class="para1">
   This is the first paragraph.
  </p>
 </body>
</html>
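Since the example URL is only a placeholder, it can help to parse the same markup from a string so the steps run without a network call. The sketch below assumes a local copy of the sample page stored in a variable named SAMPLE_HTML (a name chosen here for illustration).

```python
from bs4 import BeautifulSoup

# A local copy of the sample page, so the example runs offline.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<title>BeautifulSoup Tutorial</title>
</head>
<body>
<p class="para1">This is the first paragraph.</p>
</body>
</html>"""

# BeautifulSoup accepts a str as well as the bytes returned by page.content.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# soup.title is shorthand for the first <title> tag in the document.
print(soup.title.get_text())  # BeautifulSoup Tutorial
print(soup.prettify())
```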
The top-level elements of the page can be selected using the children property of soup. children returns an iterator rather than a list, so we need to call the list function on it:
list(soup.children)
['html', '\n', <html> <head> <title>BeautifulSoup Tutorial</title> </head> <body> <p class="para1"> This is the first paragraph. </p> </body> </html>]
If we check the type of each element, we will see that the <html> element is a Tag object.
type(list(soup.children)[2])
bs4.element.Tag
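To see what the other top-level items are, one option is to print the class of every child. In this sketch, which assumes the sample page is parsed from a local string with html.parser, the doctype should show up as a Doctype, the newline as a NavigableString, and only <html> as a Tag.

```python
from bs4 import BeautifulSoup

html_doc = """<!DOCTYPE html>
<html>
<head><title>BeautifulSoup Tutorial</title></head>
<body><p class="para1">This is the first paragraph.</p></body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect and print the class name of each top-level node,
# next to a short preview of its content.
type_names = []
for child in soup.children:
    type_names.append(type(child).__name__)
    print(type(child).__name__, repr(str(child))[:40])
```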
This is the object we use to explore the HTML content and get the data from the HTML file.
html = list(soup.children)[2]
Note that html is a Tag object, which supports the same navigation as the BeautifulSoup object. So, we can use the children property on the html object as well.
list(html.children)
['\n', <head> <title>BeautifulSoup Tutorial</title> </head>, '\n', <body> <p class="para1">This is the first paragraph.</p> </body>, '\n']
So in order to get the text in the <p></p> tags, we can use the html object as shown below:
html = list(soup.children)[2]
body = list(html.children)[3]
p = list(body.children)[1]
p.get_text()
This is the first paragraph.
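The index-based navigation above depends on where the whitespace strings fall between tags. A slightly more robust variant keeps only the Tag children at each level. This is a sketch, and tag_children is a hypothetical helper name rather than a BeautifulSoup function.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<title>BeautifulSoup Tutorial</title>
</head>
<body>
<p class="para1"> This is the first paragraph. </p>
</body>
</html>"""


def tag_children(node):
    """Return only the Tag children of node, skipping strings and the doctype."""
    return [child for child in node.children if isinstance(child, Tag)]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
html = tag_children(soup)[0]      # the <html> tag
head, body = tag_children(html)   # the <head> and <body> tags
p = tag_children(body)[0]         # the <p class="para1"> tag

# strip=True trims the whitespace around the paragraph text.
print(p.get_text(strip=True))  # This is the first paragraph.
```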
We can also find all instances of a tag at once instead of going level by level as we did above. To find all instances of a tag, we can use the find_all method of the BeautifulSoup object.
We already have the soup object. Let's use the find_all method to get all instances of the <p> tag.
soup.find_all('p')
[<p class="para1">This is the first paragraph.</p>]
Note that the find_all method returns a list. To get only the first instance of the tag, we can use the find method.
soup.find('p')
<p class="para1">This is the first paragraph.</p>
soup.find('p').get_text()
This is the first paragraph.
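One difference between the two methods matters in practice: find_all returns an empty list when nothing matches, while find returns None. A minimal sketch, using a one-line HTML string as a stand-in for the sample page:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="para1">This is the first paragraph.</p>', "html.parser"
)

# find_all returns every match; loop over the result to read each tag's text.
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)

# find returns None when nothing matches, so check before calling get_text().
heading = soup.find("h1")
if heading is None:
    print("no <h1> found")
else:
    print(heading.get_text())
```

Calling get_text() directly on the result of a failed find would raise AttributeError, which is why the None check comes first.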
We can also search by using the class or id of the tag.
soup.find_all('p', class_='para1')
[<p class="para1">This is the first paragraph.</p>]
soup.find_all('p', id='id of tag')
This returns an empty list because no tag in our example HTML has an id attribute.
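To see an id search that does match, the sketch below uses a hypothetical variant of the sample page in which a second paragraph carries id="second". It also shows select, which takes a CSS selector and is available in recent Beautiful Soup 4 releases.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the sample page: the id "second" is made up
# for this illustration and is not part of the original example file.
html_doc = """
<p class="para1">This is the first paragraph.</p>
<p class="para1" id="second">This is the second paragraph.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_class = soup.find_all("p", class_="para1")   # both paragraphs
by_id = soup.find_all("p", id="second")         # only the second paragraph
missing = soup.find_all("p", id="no-such-id")   # no match, so an empty list

# The same id lookup written as a CSS selector.
via_css = soup.select("p#second")

print(len(by_class), len(by_id), len(missing), len(via_css))
```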