How to Scrape Bing Search Results with Python


03 July 2017



I will start this post with a necessary disclaimer. Scraping data from a search engine results page almost always breaks the search provider's Terms of Service; at least, I have yet to hear of a search engine that does not explicitly forbid the practice. In reality, though, such terms probably exist mainly to deter those who would use the data to build a competing service that could undermine the value of the scraped engine. If you want the data for some other kind of endeavour and you don't abuse the request rate, you probably won't infuriate the provider. Nonetheless, I do warn you that if you run the code we share below, you do so entirely at your own risk.

With all that said, today we are going to write a short Python script that sends search requests to Bing using HTTPS GET requests. The script then parses the HTML response and prints the title and website description of each result on the results page.

To begin, we need an HTML parser. For this tutorial we will use the "BeautifulSoup" package. We can install the package with the command given below.


[user]~$ pip install BeautifulSoup

While we are on the subject of dependencies, let's import the "urllib" and "urllib2" packages, along with the parser package, at the top of our Python script. Both ship with Python 2, and we need them to build and send our HTTPS request.


from BeautifulSoup import BeautifulSoup
import urllib,urllib2

Let's now commence writing our scraping function by URL encoding our search query and concatenating it with the search engine domain.


def search(query):
    address = "https://www.bing.com/search?q=%s" % (urllib.quote_plus(query))
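
For readers on Python 3, where "quote_plus" lives in "urllib.parse", the encoding step could be sketched as follows. The helper name "build_search_url" is hypothetical and only used for illustration.

```python
from urllib.parse import quote_plus

def build_search_url(query):
    # quote_plus turns spaces into "+" and percent-encodes reserved
    # characters such as "&", so the query cannot break the URL.
    return "https://www.bing.com/search?q=%s" % quote_plus(query)

print(build_search_url("William Shakespeare & co"))
# https://www.bing.com/search?q=William+Shakespeare+%26+co
```

Without the encoding, the "&" in the query would be read as the start of a second URL parameter.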

Now, search engines tend to deny search requests that do not appear to come from a browser, so we need to add a "User-Agent" header to our GET request as we define it. For this tutorial we decided to emulate Firefox 54 on Windows 10.


    getRequest = urllib2.Request(address, None, {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0'})
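
As a rough Python 3 equivalent, assuming "urllib.request" in place of "urllib2", the request object can be built and inspected without sending anything. The names "USER_AGENT" and "build_request" are hypothetical.

```python
from urllib.request import Request

USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) "
              "Gecko/20100101 Firefox/54.0")

def build_request(address):
    # Leaving data as None keeps this a GET request rather than a POST.
    return Request(address, data=None, headers={"User-Agent": USER_AGENT})

request = build_request("https://www.bing.com/search?q=Shakespeare")
print(request.get_method())
# Request stores header names capitalized as "User-agent",
# so that is the spelling to use when reading the header back.
print(request.get_header("User-agent"))
```

Nothing is sent at this point; the request only goes out once it is passed to "urlopen".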

We may now execute our GET request with the following lines of code.


    urlfile = urllib2.urlopen(getRequest)
    htmlResult = urlfile.read(200000)
    urlfile.close()
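
On Python 3 the same fetch step might look like the sketch below. The function name "fetch_html", the "max_bytes" parameter and the ten second timeout are our own assumptions, and the "data:" URL is only a stand-in so that the example runs without touching the network.

```python
from urllib.request import Request, urlopen

def fetch_html(request, max_bytes=200000):
    # The with-block closes the response even if read() raises.
    with urlopen(request, timeout=10) as response:
        raw = response.read(max_bytes)  # read at most max_bytes bytes
    # The byte cap can cut a multi-byte character in half,
    # so undecodable bytes are replaced instead of raising.
    return raw.decode("utf-8", errors="replace")

# Stand-in for a real search request (assumption: offline demo only).
demo = Request("data:text/plain,hello%20world")
print(fetch_html(demo))
print(fetch_html(demo, max_bytes=5))
```

The cap works the same way as the "read(200000)" call above: anything beyond the limit is simply never read.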

The next step is to parse the response string into a BeautifulSoup object.


    soup = BeautifulSoup(htmlResult)

To carry on with our goal we need to know the structure of the HTML so we can home in on the elements of interest to us. If we run "print htmlResult" after receiving the response and carefully sift through the output, we will notice that, at the time of publishing this post, each individual search result has a structure like the example given below. We have removed some irrelevant attributes for brevity.


<li class="b_algo">
   <h2><a href="https://en.wikipedia.org/wiki/William_Shakespeare"><strong>William Shakespeare</strong> - <strong>Wikipedia</strong></a></h2>
   <div class="b_caption">
      <div class="b_snippet">
         <div class="b_attribution"><cite>https://<strong>en.wikipedia.org</strong>/wiki/<strong>William_Shakespeare</strong></cite><span class="c_tlbxTrg"><span class="c_tlbxH"></span></span></div>
         <p>Early life. <strong>William Shakespeare</strong> was the son of John <strong>Shakespeare</strong>, an alderman and a successful glover originally from Snitterfield, and Mary Arden, the ...</p>
      </div>
      <div Class="sa_uc">
         <ul class="b_vList">
            <li class="b_annooverride">
               <div class="b_factrow">Irrelevant Stuff Here</div>
            </li>
         </ul>
      </div>
   </div>
</li>

We are only after the "title" and "description" of each result, so we can strip out the tags that are of no interest to us. From the output above we can see that the information we want sits inside the "h2" and "p" tags. We first call "extract()" on every "span" tag, which removes the tag together with its contents. We then use the "replaceWithChildren()" method on the "a", "strong" and "cite" tags to remove them without deleting the text inside.


    for s in soup('span'):
        s.extract()
    unwantedTags = ['a', 'strong', 'cite']
    for tag in unwantedTags:
        for match in soup.findAll(tag):
            match.replaceWithChildren()

Finally, we can achieve our goal by looping through each result snippet and selecting the text inside the "h2" and "p" tags.


    results = soup.findAll('li', { "class" : "b_algo" })
    for result in results:
        # Swap non-breaking space entities for plain spaces before printing
        print "# TITLE: " + str(result.find('h2')).replace("&nbsp;", " ") + "\n#"
        print "# DESCRIPTION: " + str(result.find('p')).replace("&nbsp;", " ")
        print "# ___________________________________________________________\n#"
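
To see the extraction logic in isolation, here is a minimal sketch that swaps BeautifulSoup for the "html.parser" module from the Python 3 standard library and runs on a trimmed copy of the sample markup above. The class name "ResultParser" and the "SAMPLE" string are our own, and this is an illustration of the idea rather than a drop-in replacement for the script.

```python
from html.parser import HTMLParser

# Trimmed copy of the sample result shown earlier in this post.
SAMPLE = """
<li class="b_algo">
   <h2><a href="https://en.wikipedia.org/wiki/William_Shakespeare"><strong>William Shakespeare</strong> - <strong>Wikipedia</strong></a></h2>
   <div class="b_caption">
      <p>Early life. <strong>William Shakespeare</strong> was the son of John <strong>Shakespeare</strong>, an alderman and a successful glover originally from Snitterfield, and Mary Arden, the ...</p>
      <ul class="b_vList">
         <li class="b_annooverride"><div class="b_factrow">Irrelevant Stuff Here</div></li>
      </ul>
   </div>
</li>
"""

class ResultParser(HTMLParser):
    """Collects the text of the first <h2> and <p> inside each li.b_algo."""

    def __init__(self):
        super().__init__()
        self.results = []   # one {"title", "description"} dict per result
        self.li_depth = 0   # greater than 0 while inside a b_algo result
        self.field = None   # "title", "description" or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li":
            if self.li_depth:
                self.li_depth += 1  # nested <li>, still the same result
            elif "b_algo" in classes:
                self.li_depth = 1
                self.results.append({"title": "", "description": ""})
        elif self.li_depth and tag == "h2" and not self.results[-1]["title"]:
            self.field = "title"
        elif self.li_depth and tag == "p" and not self.results[-1]["description"]:
            self.field = "description"

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth:
            self.li_depth -= 1
        elif tag in ("h2", "p"):
            self.field = None

    def handle_data(self, data):
        # Text nested in <a> or <strong> also arrives here, which has
        # the same effect as unwrapping those tags.
        if self.field:
            self.results[-1][self.field] += data

parser = ResultParser()
parser.feed(SAMPLE)
for result in parser.results:
    print("# TITLE: " + result["title"])
    print("# DESCRIPTION: " + result["description"])
```

Because only text that falls inside an "h2" or "p" tag is recorded, the nested list item with the irrelevant fact row is skipped without any explicit clean-up pass.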

Here is the code in full with a sample output shown below.


#!/usr/bin/env python

from BeautifulSoup import BeautifulSoup
import urllib,urllib2

def search(query):
    address = "https://www.bing.com/search?q=%s" % (urllib.quote_plus(query))

    getRequest = urllib2.Request(address, None, {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0'})

    urlfile = urllib2.urlopen(getRequest)
    htmlResult = urlfile.read(200000)
    urlfile.close()

    soup = BeautifulSoup(htmlResult)

    for s in soup('span'):
        s.extract()
    unwantedTags = ['a', 'strong', 'cite']
    for tag in unwantedTags:
        for match in soup.findAll(tag):
            match.replaceWithChildren()

    results = soup.findAll('li', { "class" : "b_algo" })
    for result in results:
        # Swap non-breaking space entities for plain spaces before printing
        print "# TITLE: " + str(result.find('h2')).replace("&nbsp;", " ") + "\n#"
        print "# DESCRIPTION: " + str(result.find('p')).replace("&nbsp;", " ")
        print "# ___________________________________________________________\n#"

    return results

if __name__=='__main__':
    links = search('Shakespeare')

## SAMPLE OUTPUT SHOWN BELOW ##

# TITLE: <h2>William Shakespeare - Wikipedia</h2>
#
# DESCRIPTION: <p>Early life. William Shakespeare was the son of John Shakespeare, an alderman and a successful glover originally from Snitterfield, and Mary Arden, the ...</p>
# ___________________________________________________________
#
# TITLE: <h2>William Shakespeare - Poet, Playwright - Biography.com</h2>
#
# DESCRIPTION: <p> Watch video William Shakespeare's works are known throughout the world, but his personal life is shrouded in mystery. Learn more at Biography.com.</p>
# ___________________________________________________________
#
# TITLE: <h2>Shakespeare</h2>
#
# DESCRIPTION: <p>Welcome to the Shakespeare Australia website. Browse the catalogue of fishing rods, reels, combos and accessories. Shakespeare’s reputation for quality and value is ...</p>
# ___________________________________________________________
#
# TITLE: <h2>The Complete Works of William Shakespeare</h2>
#
# DESCRIPTION: <p>Welcome to the Web's first edition of the Complete Works of William Shakespeare. This site has offered Shakespeare's plays and poetry to the Internet community …</p>
# ___________________________________________________________
#
# TITLE: <h2>Shakespeare Birthplace Trust - Official Site</h2>
#
# DESCRIPTION: <p>Caring for Shakespeare’s family homes in Stratford-upon-Avon, and celebrating his life & works through collections and educational programs.</p>
# ___________________________________________________________
#
# TITLE: <h2>Shakespeare's plays - Wikipedia</h2>
#
# DESCRIPTION: <p>Shakespeare's writing (especially his plays) also feature extensive wordplay in which double entendres and rhetorical flourishes are repeatedly used.</p>
# ___________________________________________________________
#
# TITLE: <h2>Absolute Shakespeare - plays, quotes, summaries, essays...</h2>
#
# DESCRIPTION: <p>Absolute Shakespeare, the essential resource for for William Shakespeare's plays, sonnets, poems, quotes, biography and the legendary Globe Theatre.</p>
# ___________________________________________________________
#
# TITLE: <h2>BBC - iWonder - William Shakespeare: The life and …</h2>
#
# DESCRIPTION: <p> Shakespeare's plays are known for their universal themes and insight into the human condition. Yet much about the playwright is a mystery.</p>
# ___________________________________________________________
#
# TITLE: <h2>William Shakespeare - British History - HISTORY.com</h2>
#
# DESCRIPTION: <p> Find out more about the history of William Shakespeare, including videos, interesting articles, pictures, historical features and more. Get all the facts on HISTORY.com</p>
# ___________________________________________________________
#
# TITLE: <h2>Shakespeare</h2>
#
# DESCRIPTION: <p>shakespeare.com - also known as The Shakespeare Web - has returned to its original home, after an absence of several years. It is undergoing a sea change, into ...</p>
# ___________________________________________________________
#

Thanks for reading.

Always,

Ruby Devices





Ruby Devices do not in any way condone the practice of illegal activities in relation to hacking. All teachings with regards to malware and other exploits are discussed for educational purposes only and are not written with the intention of malicious application.