Parsing HTML in Python is straightforward, and it lets you fetch data from a website to suit your requirements.
Beautiful Soup is a library for parsing HTML and XML, and it provides several ways to filter the data you pull from a page.
Today I want to show some examples of this library. It is particularly useful when a project needs to pull a lot of data from a large website on an ongoing basis.
BeautifulSoup4
Now let’s move on to an example. All my examples run against https://stackoverflow.com/ and read the data directly, without going through a browser. First, install the BeautifulSoup4 library from https://pypi.python.org/pypi/beautifulsoup4 and requests from https://pypi.python.org/pypi/requests:

    pip install beautifulsoup4
    pip install requests

Now let’s write the code to scrape data from a website.
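
Before pointing the library at a live page, it can help to try the selectors on a small HTML string. The following is a minimal sketch, not part of the original scripts: the markup in `SAMPLE_HTML` and the variable names are invented for illustration, and only `BeautifulSoup`, `select`, and `find_all` come from the library itself.

```python
import bs4

# Hypothetical sample markup, invented for illustration only.
SAMPLE_HTML = """
<ul id="books">
  <li class="book"><a href="/b/1">Dive Into Python</a> <span class="price">10</span></li>
  <li class="book"><a href="/b/2">Fluent Python</a> <span class="price">25</span></li>
  <li class="magazine"><a href="/m/1">Some Magazine</a></li>
</ul>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')

# CSS selector: every link directly inside an <li class="book"> under #books.
book_links = soup.select('#books > li.book > a')
titles = [a.text.strip() for a in book_links]
hrefs = [a['href'] for a in book_links]

# find_all filters by tag name and attributes instead of a CSS selector.
prices = [int(span.text) for span in soup.find_all('span', class_='price')]

print(titles)
print(hrefs)
print(prices)
```

The same `select` calls are what the scripts in this post apply to the pages fetched with requests.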
Get latest Questions of StackOverflow:

    import bs4
    import requests

    res = requests.get('http://stackoverflow.com')
    res.raise_for_status()  # stop here if the request failed
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Each selector returns a list; the loop assumes that entries at the
    # same index belong to the same question.
    latest_questions = soup.select('.question-summary > div.summary > h3 > a')
    users = soup.select('.question-summary > div.summary > div.started > a:nth-of-type(2)')
    status = soup.select('.question-summary > div.summary > div.started > a.started-link')

    for index, question in enumerate(latest_questions):
        print("Question {0}: {1}. {2} by {3}".format(
            index + 1, question.text.strip(),
            status[index].text.strip(), users[index].text.strip()))

Output:

    Question 1: Windows Authentication not working on jQuery Ajax calls. asked 43 secs ago by Kevin Purnama
    Question 2: How to configure a web app that uses IdentityServer4 via “Connect to an existing user store in the cloud”. asked 1 min ago by 001
    Question 3: cannot install R tseries, quadprog ,xts packages in Linux. answered 1 min ago by Javier SG
    Question 4: Collections.sort(list, comparator) not reflecting the updated list when used with Jack and Java 1.8 on Android Studio.. asked 1 min ago by Chhiring
    Question 5: How to replace div content with canvas using html2canvas?. answered 1 min ago by akbansa
    Question 6: Does there exist a gem to parse human numbers?. asked 1 min ago by Chloe
    Question 7: Apache Server add client IP in a custom header before forwarding request. asked 1 min ago by prat8789
    Question 8: gulp babel+browserfy not compiling the functions. asked 1 min ago by Dilakshan Sooriyanathan
    Question 9: Using Java Robot to capture a region of a window. asked 1 min ago by squishy3000
    Question 10: unable to POST data after reading file uploaded by user on a web page. asked 2 mins ago by user3359184
    Question 11: How to add an integer and an array element in ruby?. asked 2 mins ago by scratch_pad
    Question 12: I can't change ios status bar color in my xamarin forms project. modified 2 mins ago by Steve Chadbourne
    Question 13: Images and videos in product slider on Shopify site. asked 2 mins ago by pedz
    Question 14: Rotated image coordinates after scipy.ndimage.interpolation.rotate?. modified 3 mins ago by quantumflash
    Question 15: How to put these labels inside a div border?. asked 3 mins ago by Vahn
    Question 16: Running “JAVA” command including all jars in current folder. asked 3 mins ago by Fundi

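
The script above pairs three separate lists by index, which assumes every question yields exactly one match in each list. One possible alternative is to loop over each `.question-summary` container and look up the fields inside it, so a missing element cannot shift the others. The sketch below runs on invented sample markup. `SAMPLE_HTML`, `extract_questions`, and `text_or_default` are hypothetical names, and the real page structure may differ.

```python
import bs4

# Hypothetical markup that mimics the structure the selectors above expect.
# The second question deliberately has no user link.
SAMPLE_HTML = """
<div class="question-summary">
  <div class="summary">
    <h3><a href="/q/1">First question</a></h3>
    <div class="started">
      <a class="started-link" href="/q/1">asked 1 min ago</a>
      <a href="/users/1">alice</a>
    </div>
  </div>
</div>
<div class="question-summary">
  <div class="summary">
    <h3><a href="/q/2">Second question</a></h3>
    <div class="started">
      <a class="started-link" href="/q/2">answered 2 mins ago</a>
    </div>
  </div>
</div>
"""


def text_or_default(node, default='unknown'):
    """Return the stripped text of a tag, or a default when the tag is missing."""
    return node.text.strip() if node is not None else default


def extract_questions(html):
    """Return (title, status, user) tuples, one per question container."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    rows = []
    for summary in soup.select('.question-summary'):
        # select_one returns the first match inside this container, or None.
        title = summary.select_one('div.summary > h3 > a')
        status = summary.select_one('div.started > a.started-link')
        user = summary.select_one('div.started > a:nth-of-type(2)')
        rows.append((text_or_default(title),
                     text_or_default(status),
                     text_or_default(user)))
    return rows


questions = extract_questions(SAMPLE_HTML)
for index, (title, status, user) in enumerate(questions, start=1):
    print("Question {0}: {1}. {2} by {3}".format(index, title, status, user))
```

With the index-based approach, the missing user link in the second sample question would leave the `users` list one entry short. Here it only turns into the `'unknown'` placeholder.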
Get Popular Tags of StackOverflow:

    import bs4
    import requests

    res = requests.get('https://stackoverflow.com/tags')
    res.raise_for_status()  # stop here if the request failed
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Tag names and their usage counts, matched up by position.
    tags = soup.select('#tags-browser .post-tag')
    counts = soup.select('#tags-browser span.item-multiplier-count')

    for index, tag in enumerate(tags):
        print("Tag {0}: {1} X {2}".format(
            index + 1, tag.text.strip(), counts[index].text.strip()))

Output:

    Tag 1: javascript X 1485639
    Tag 2: java X 1322695
    Tag 3: c# X 1143847
    Tag 4: php X 1129614
    Tag 5: android X 1035863
    Tag 6: jquery X 873735
    Tag 7: python X 830159
    Tag 8: html X 695707
    Tag 9: c++ X 536604
    Tag 10: ios X 532948
    Tag 11: css X 498233
    Tag 12: mysql X 487734

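
The counts come back as text, so they need converting before you can sort or sum them in a project. Here is a small sketch, using a few values copied from the output above in place of a live request. The names `tag_names`, `count_texts`, and `tag_counts` are made up for this example, and stripping commas is only a precaution in case a page formats numbers with separators.

```python
# Sample values copied from the output above; in the scraper these would be
# tag.text.strip() and counts[index].text.strip().
tag_names = ['javascript', 'java', 'c#', 'php']
count_texts = ['1485639', '1322695', '1143847', '1129614']

# Pair names with counts and convert the text to integers.
tag_counts = {
    name: int(text.replace(',', ''))
    for name, text in zip(tag_names, count_texts)
}

total = sum(tag_counts.values())
most_used = max(tag_counts, key=tag_counts.get)

print("Most used tag: {0} ({1} questions)".format(most_used, tag_counts[most_used]))
print("Total across sampled tags: {0}".format(total))
```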
Get Top Users of StackOverflow:

    import bs4
    import requests

    res = requests.get('https://stackoverflow.com/users?tab=Reputation&filter=all')
    res.raise_for_status()  # stop here if the request failed
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # User names and their reputation scores, matched up by position.
    users = soup.select('#user-browser .user-details > a')
    reputation = soup.select('#user-browser .user-details .reputation-score')

    for index, user in enumerate(users):
        print("User {0}: {1} - {2}".format(
            index + 1, user.text.strip(), reputation[index].text.strip()))

Output:

    User 1: Jon Skeet - 979k
    User 2: Darin Dimitrov - 759k
    User 3: BalusC - 756k
    User 4: Hans Passant - 719k
    User 5: VonC - 716k
    User 6: Marc Gravell - 693k
    User 7: CommonsWare - 677k
    User 8: SLaks - 604k
    User 9: Gordon Linoff - 594k
    User 10: Martijn Pieters - 591k

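
Reputation is displayed in abbreviated form such as `979k`, which is awkward to compare numerically. The helper below is one way to turn those strings into approximate integers. `parse_reputation` is a hypothetical name, the result is only as precise as the displayed text, and the `m` suffix is an assumption that does not appear in the output above.

```python
def parse_reputation(text):
    """Convert a display string such as '979k' or '1,234' to an approximate int."""
    cleaned = text.strip().lower().replace(',', '')
    # 'k' appears in the output above; 'm' is an assumed extension.
    multipliers = {'k': 1000, 'm': 1000000}
    if cleaned and cleaned[-1] in multipliers:
        return int(round(float(cleaned[:-1]) * multipliers[cleaned[-1]]))
    return int(cleaned)


# Values copied from the outputs in this post.
for sample in ['979k', '20.5k', '910']:
    print("{0} -> {1}".format(sample, parse_reputation(sample)))
```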
Get Top Users of Tag in StackOverflow:

    import bs4
    import requests

    tag = input('Enter tag name ')  # on Python 2, use raw_input instead
    res = requests.get('https://stackoverflow.com/tags/' + tag + '/topusers')
    res.raise_for_status()  # stop here if the request failed
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # The script reads the top answerers from the second 'div.fl' block.
    print("Top Answerers")
    users = soup.select('#questions div.fl')[1].select('.user-details a')
    point = soup.select('#questions div.fl')[1].select('tr')
    for index, user in enumerate(users):
        print("User {0}: {1} - {2}".format(
            index + 1, user.text.strip(),
            point[index].select('td:nth-of-type(1)')[0].text.strip()))

    # The top askers come from the fourth 'div.fl' block.
    print("\n\nTop Askers")
    users = soup.select('#questions div.fl')[3].select('.user-details a')
    point = soup.select('#questions div.fl')[3].select('tr')
    for index, user in enumerate(users):
        print("User {0}: {1} - {2}".format(
            index + 1, user.text.strip(),
            point[index].select('td:nth-of-type(1)')[0].text.strip()))

Output:

    Enter tag name django
    Top Answerers
    User 1: Daniel Roseman - 20.5k
    User 2: Alasdair - 9.3k
    User 3: Yuji 'Tomita' Tomita - 5.9k
    User 4: Chris Pratt - 5.4k
    User 5: Ignacio Vazquez-Abrams - 5.3k
    User 6: S.Lott - 3k


    Top Askers
    User 1: MikeN - 1.4k
    User 2: mpen - 1.1k
    User 3: TIMEX - 1.1k
    User 4: David542 - 910
    User 5: Hellnar - 887
    User 6: Roee Adler - 822

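
One thing to watch when building the URL from user input: tag names such as `c#` or `c++` contain characters with a special meaning in URLs, so plain string concatenation may not request the page you expect. A possible fix, sketched here for Python 3, is to percent-encode the tag with the standard library's `urllib.parse.quote`. The helper name `top_users_url` is invented for this example.

```python
from urllib.parse import quote


def top_users_url(tag):
    """Build the top-users URL for a tag, percent-encoding special characters."""
    # safe='' also escapes '/', so the tag cannot add extra path segments.
    return 'https://stackoverflow.com/tags/' + quote(tag.strip(), safe='') + '/topusers'


for name in ['django', 'c#', 'c++']:
    print(top_users_url(name))
```

The resulting string can be passed to `requests.get` in place of the concatenated URL.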
As these examples show, we can extract data from HTML content using the Python BeautifulSoup4 library. Python has many libraries like this, and they have helped me many times.
Download the above code as a single file from GitHub: https://gist.github.com/sainipray/594ebbe94f2518caf2afdb57bfb8691e