HTML Parsing using BeautifulSoup4 library of Python


Parsing HTML in Python is straightforward, and it lets you fetch exactly the data you need from a website.

Beautiful Soup is a library for parsing HTML and XML, and it provides several ways to filter the data you extract from a page.

Today I want to show some examples of this library. It is especially useful when your project needs to pull a lot of data repeatedly from a large website.
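
Before going to a live site, here is a minimal sketch of the two filtering styles on a small HTML string. The markup, the "langs" id and the "lang" and "count" class names are made up for illustration; only BeautifulSoup, select() and find() are real library calls.

```python
import bs4

# Made-up markup for illustration; a real page would come from requests.get(...).text
html = """
<ul id="langs">
  <li class="lang">python <span class="count">3</span></li>
  <li class="lang">html <span class="count">5</span></li>
</ul>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns a list of matching elements
counts = [span.text.strip() for span in soup.select('#langs .lang > span.count')]

# find() returns only the first match; class_ filters on the class attribute
first_item = soup.find('li', class_='lang')

print(counts)
print(first_item.contents[0].strip())
```

select() is the call used in every script of this post, so it is worth trying a few selectors on a small string like this before pointing them at a real page.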

Now let's move on to the examples. They all run against https://stackoverflow.com/ and fetch the data directly from a script, without opening the site in a browser.

First, install the BeautifulSoup4 library from https://pypi.python.org/pypi/beautifulsoup4 and the requests library from https://pypi.python.org/pypi/requests:

pip install beautifulsoup4
pip install requests

Now let's write the code to scrape data from a website:


Get latest Questions of StackOverflow:

import bs4
import requests

res = requests.get('https://stackoverflow.com')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
latest_questions = soup.select('.question-summary > div.summary > h3 > a')
users = soup.select('.question-summary > div.summary > div.started > a:nth-of-type(2)')
status = soup.select('.question-summary > div.summary > div.started > a.started-link')
# The three lists are parallel: entry i of each one belongs to the same question
for index, question in enumerate(latest_questions):
    print("Question {0}: {1}. {2} by {3}".format(index + 1, question.text.strip(), status[index].text.strip(), users[index].text.strip()))

 

Output:

Question 1: Windows Authentication not working on jQuery Ajax calls. asked 43 secs ago by Kevin Purnama
Question 2: How to configure a web app that uses IdentityServer4 via “Connect to an existing user store in the cloud”. asked 1 min ago by 001
Question 3: cannot install R tseries, quadprog ,xts packages in Linux. answered 1 min ago by Javier SG
Question 4: Collections.sort(list, comparator) not reflecting the updated list when used with Jack and Java 1.8 on Android Studio.. asked 1 min ago by Chhiring
Question 5: How to replace div content with canvas using html2canvas?. answered 1 min ago by akbansa
Question 6: Does there exist a gem to parse human numbers?. asked 1 min ago by Chloe
Question 7: Apache Server add client IP in a custom header before forwarding request. asked 1 min ago by prat8789
Question 8: gulp babel+browserfy not compiling the functions. asked 1 min ago by Dilakshan Sooriyanathan
Question 9: Using Java Robot to capture a region of a window. asked 1 min ago by squishy3000
Question 10: unable to POST data after reading file uploaded by user on a web page. asked 2 mins ago by user3359184
Question 11: How to add an integer and an array element in ruby?. asked 2 mins ago by scratch_pad
Question 12: I can't change ios status bar color in my xamarin forms project. modified 2 mins ago by Steve Chadbourne
Question 13: Images and videos in product slider on Shopify site. asked 2 mins ago by pedz
Question 14: Rotated image coordinates after scipy.ndimage.interpolation.rotate?. modified 3 mins ago by quantumflash
Question 15: How to put these labels inside a div border?. asked 3 mins ago by Vahn
Question 16: Running “JAVA” command including all jars in current folder. asked 3 mins ago by Fundi
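
The loop above indexes three separate lists with the same index, so it raises an IndexError if one selector matches fewer elements than the others. Here is a small sketch of a safer pairing with itertools.zip_longest. The function name format_rows and the sample strings are hypothetical stand-ins for the element.text.strip() values from the script.

```python
from itertools import zip_longest


def format_rows(titles, statuses, users):
    """Pair up three parallel lists without risking an IndexError."""
    rows = []
    combined = zip_longest(titles, statuses, users, fillvalue="(missing)")
    for number, (title, status, user) in enumerate(combined, start=1):
        rows.append("Question {0}: {1}. {2} by {3}".format(number, title, status, user))
    return rows


# Stand-in strings; in the script these would come from element.text.strip()
titles = ["How to parse HTML?", "What is a CSS selector?"]
statuses = ["asked 1 min ago", "answered 2 mins ago"]
users = ["alice"]  # one entry short on purpose

for row in format_rows(titles, statuses, users):
    print(row)
```

This only prevents the crash. If an element is missing in the middle of a list, the later rows are still shifted by one, so a more robust approach would be to select each question container first and look up the title, status and user inside it.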


Get Popular Tags of StackOverflow:

import bs4
import requests

res = requests.get('https://stackoverflow.com/tags')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
tags = soup.select('#tags-browser .post-tag')
counts = soup.select('#tags-browser span.item-multiplier-count')
for index, tag in enumerate(tags):
    print("Tag {0}: {1} X {2}".format(index + 1, tag.text.strip(), counts[index].text.strip()))

 

Output:


Tag 1: javascript X 1485639
Tag 2: java X 1322695
Tag 3: c# X 1143847
Tag 4: php X 1129614
Tag 5: android X 1035863
Tag 6: jquery X 873735
Tag 7: python X 830159
Tag 8: html X 695707
Tag 9: c++ X 536604
Tag 10: ios X 532948
Tag 11: css X 498233
Tag 12: mysql X 487734


Get Top Users of StackOverflow:

import bs4
import requests

res = requests.get('https://stackoverflow.com/users?tab=Reputation&filter=all')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
users = soup.select('#user-browser .user-details > a')
reputation = soup.select('#user-browser .user-details .reputation-score')
for index, user in enumerate(users):
    print("User {0}: {1} - {2}".format(index + 1, user.text.strip(), reputation[index].text.strip()))

 

Output:


User 1: Jon Skeet - 979k
User 2: Darin Dimitrov - 759k
User 3: BalusC - 756k
User 4: Hans Passant - 719k
User 5: VonC - 716k
User 6: Marc Gravell - 693k
User 7: CommonsWare - 677k
User 8: SLaks - 604k
User 9: Gordon Linoff - 594k
User 10: Martijn Pieters - 591k
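
The reputation values come back as display strings such as 979k, so they cannot be sorted or summed directly. Here is a rough sketch of a converter. The name parse_reputation is my own, and it assumes the only formats are plain digits, digits with commas, and a trailing "k" for thousands, which is what the outputs in this post show.

```python
def parse_reputation(text):
    """Convert a display string such as '979k' or '20.5k' to an approximate integer."""
    cleaned = text.strip().lower().replace(",", "")
    if cleaned.endswith("k"):
        # '20.5k' -> 20.5 * 1000; round() absorbs small floating point error
        return int(round(float(cleaned[:-1]) * 1000))
    return int(cleaned)


for sample in ["979k", "20.5k", "910"]:
    print(sample, "->", parse_reputation(sample))
```

The result is only as precise as the abbreviated text on the page, so 979k becomes 979000 even if the real score is a little higher.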


Get Top Users of a Tag on StackOverflow:


import bs4
import requests

tag = input('Enter tag name ')
res = requests.get('https://stackoverflow.com/tags/' + tag + '/topusers')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# The script reads block 1 of '#questions div.fl' as the top answerers table
print("Top Answerers")
users = soup.select('#questions div.fl')[1].select('.user-details a')
point = soup.select('#questions div.fl')[1].select('tr')
for index, user in enumerate(users):
    print("User {0}: {1} - {2}".format(index + 1, user.text.strip(),
                                       point[index].select('td:nth-of-type(1)')[0].text.strip()))
# Block 3 is read as the top askers table
print("\n\nTop Askers")
users = soup.select('#questions div.fl')[3].select('.user-details a')
point = soup.select('#questions div.fl')[3].select('tr')
for index, user in enumerate(users):
    print("User {0}: {1} - {2}".format(index + 1, user.text.strip(),
                                       point[index].select('td:nth-of-type(1)')[0].text.strip()))
        

 

Output:


Enter tag name django
Top Answerers
User 1: Daniel Roseman - 20.5k
User 2: Alasdair - 9.3k
User 3: Yuji 'Tomita' Tomita - 5.9k
User 4: Chris Pratt - 5.4k
User 5: Ignacio Vazquez-Abrams - 5.3k
User 6: S.Lott - 3k

Top Askers
User 1: MikeN - 1.4k
User 2: mpen - 1.1k
User 3: TIMEX - 1.1k
User 4: David542 - 910
User 5: Hellnar - 887
User 6: Roee Adler - 822
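
Building the URL by plain string concatenation works for a tag like django, but a tag such as c# contains a character with a special meaning in URLs: everything after "#" is treated as a fragment and is not sent to the server. Here is a small sketch that escapes the tag first with urllib.parse.quote from the standard library. The helper name top_users_url is hypothetical, and I am assuming the site accepts the percent-encoded form of the tag.

```python
from urllib.parse import quote


def top_users_url(tag):
    """Build the top users URL for a tag, escaping reserved characters."""
    # safe="" makes quote() escape every reserved character, including "#", "+" and "/"
    return "https://stackoverflow.com/tags/" + quote(tag.strip(), safe="") + "/topusers"


print(top_users_url("django"))
print(top_users_url("c#"))
print(top_users_url("c++"))
```

The returned string can be passed to requests.get() in place of the concatenated URL.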

With the BeautifulSoup4 library, we can extract data from HTML content in just a few lines of Python. Python has many libraries like this one that have helped me many times.
You can download all of the code above as a single file from GitHub: https://gist.github.com/sainipray/594ebbe94f2518caf2afdb57bfb8691e