Web scraping using Python and BeautifulSoup
Intro
In the era of data science, it is common to collect data from websites for analytics purposes.
Python is one of the most commonly used programming languages for data science projects, and using Python with BeautifulSoup makes web scraping even easier. Knowing how to scrape web pages will save you time and money.
Prerequisites
- Basics of Python programming (Python 3.x).
- Basics of HTML tags.
Installing required modules
First things first: assuming Python 3.x is already installed on your system, you need to install the requests HTTP library and the beautifulsoup4 module.
Install requests and beautifulsoup4
$ pip install requests
$ pip install beautifulsoup4
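To confirm both packages installed correctly, you can import them and print their versions (a quick sanity check; both libraries expose a __version__ attribute):
import requests
import bs4

# Print the installed versions to confirm the packages are available
print(requests.__version__)
print(bs4.__version__)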
Collecting web page data
Now we are ready to go. In this tutorial, our goal is to get the list of Presidents of the United States from this Wikipedia page.
Go to the link, right-click on the table containing all the information about the United States presidents, and then click Inspect to inspect the page (I am using Chrome; other browsers have a similar option).
The table content is inside a table tag with the class wikitable. We will need this information to extract the data of interest.
Import the installed modules
import requests
from bs4 import BeautifulSoup
To get the data from the web page, we will use the requests library's get() method
url = "https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
page = requests.get(url)
It is always good to check the HTTP response status code
print(page.status_code) # This should print 200
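Alternatively, you can have requests raise an exception on error responses instead of checking the code by hand (a small sketch using the library's raise_for_status() method):
# Raises requests.exceptions.HTTPError if the status code is 4xx or 5xx
page.raise_for_status()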
Now that we have collected the data from the web page, let's see what we got
print(page.content)
The above code will display the HTTP response body.
The above data can be viewed in a prettier format using BeautifulSoup's prettify() method. For this, we will create a bs4 object and call its prettify() method
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
This will print the data in a format similar to what we saw when we inspected the web page.
<table class="wikitable" style="text-align:center;">
<tbody>
<tr>
<th colspan="9">
<span style="margin:0; font-size:90%; white-space:nowrap;">
<span class="legend-text" style="border:1px solid #AAAAAA; padding:1px .6em; background-color:#DDDDDD; color:black; font-size:95%; line-height:1.25; text-align:center;">
</span>
<a href="/wiki/Independent_politician" title="Independent politician">
Unaffiliated
</a>
(2)
</span>
<span style="margin:0; font-size:90%; white-space:nowrap;">
...
...
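Before digging into the table, a quick way to confirm the parse worked is to look at a single element such as the page title (a small sketch; soup.title is standard BeautifulSoup):
# Print the <title> element to confirm the document parsed correctly
print(soup.title.get_text())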
We now know that our table is in a table tag with the class wikitable. So, first we will extract the content of the table tag using the find() method of the bs4 object. This method returns another bs4 object
tb = soup.find('table', class_='wikitable')
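Note that find() returns None when nothing matches, so it is worth confirming that the table was actually found before going further (a small defensive sketch, not part of the original tutorial flow):
# find() returns None if no matching tag exists, so check before using the result
if tb is None:
    raise ValueError("No table with class 'wikitable' found on the page")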
This tag has many nested tags, but we only need the text inside the a tag of each b tag (the b tags are descendants of the table tag). For that, we find all the b tags under the table tag and then find the a tag inside each one. We will use the find_all() method and iterate over each b tag to get its a tag
for link in tb.find_all('b'):
    name = link.find('a')
    print(name)
This will print all of the a tags found
<a href="/wiki/George_Washington" title="George Washington">George Washington</a>
<a href="/wiki/John_Adams" title="John Adams">John Adams</a>
<a href="/wiki/Thomas_Jefferson" title="Thomas Jefferson">Thomas Jefferson</a>
<a href="/wiki/James_Madison" title="James Madison">James Madison</a>
<a href="/wiki/James_Monroe" title="James Monroe">James Monroe</a>
...
...
<a href="/wiki/Barack_Obama" title="Barack Obama">Barack Obama</a>
<a href="/wiki/Donald_Trump" title="Donald Trump">Donald Trump</a>
The name text can be extracted from each a tag using the get_text() method. So, modifying the above code snippet
for link in tb.find_all('b'):
    name = link.find('a')
    print(name.get_text())
And here is the desired result
George Washington
John Adams
Thomas Jefferson
James Madison
James Monroe
...
...
Barack Obama
Donald Trump
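If you specifically want the title attribute rather than the link text (for these links they hold the same string), you can read the attribute directly. A quick sketch using the first entry:
first = tb.find('b').find('a')
# For these links the title attribute holds the same string as the link text
print(first['title'])  # George Washington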
Putting it all together
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tb = soup.find('table', class_='wikitable')
for link in tb.find_all('b'):
    name = link.find('a')
    print(name.get_text())
We have successfully scraped a web page in fewer than 10 lines of Python code! Bingo!
Leave feedback in the comment box. Let me know if you have any questions or run into any difficulty with this tutorial.

I am getting the desired output, but it is followed by an AttributeError: 'NoneType' object has no attribute 'get_text'. Any ideas?
It means that at some point in the code link.find('a') returns None (meaning there was no <a> tag in that link object), so you can't call .get_text() on something that doesn't exist.
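A small sketch of one way to guard against that case (the check simply skips any b tag that does not contain an a tag):
for link in tb.find_all('b'):
    name = link.find('a')
    # Skip b tags without a nested a tag so we never call get_text() on None
    if name is not None:
        print(name.get_text())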
I'd be more than happy to help you out with this and any other web scraping questions. Also, by the way, it would be far easier to use pandas for the example given above, as it specifically parses <table> tags (using BeautifulSoup under the hood).
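For reference, the pandas approach could look something like this (a rough sketch, assuming pandas and a parser such as lxml are installed; read_html() returns a list of DataFrames, one per table on the page, so you may need to inspect the list to find the presidents table):
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
# read_html() parses every <table> on the page into a DataFrame
tables = pd.read_html(url)
print(tables[0].head())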