03 July 2017
I will start this post with a necessary disclaimer. Scraping data from a search engine results page will almost always break the Terms of Service of the search provider; at least, I have yet to hear of a search engine which does not explicitly forbid the practice. In practice, though, such Terms of Service probably exist mainly to deter those who wish to use the data to create a competing service which could undermine the value of the scraped engine. If you wish to use the data for some other kind of endeavour and you don't abuse the request rate, then doing so probably won't infuriate the provider. Nonetheless, I do warn you that if you run the code we share below you are doing so entirely at your own risk.
With all that said, today we are going to write a short Python script that sends a search query to Bing as an HTTPS GET request. The script will then parse the HTML response and print the title and website description of each result on the results page.
To begin, we need an HTML parser. For this tutorial we will use the "BeautifulSoup" package. We can install the package with the command given below.
While we are on the subject of dependencies, let's import the "urllib" and "urllib2" packages, in addition to the parser package, at the top of our Python script. We will need these two packages for our HTTPS request.
Let's now begin writing our scraping function by URL-encoding our search query and appending it to the search engine's URL.
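A minimal sketch of this step, assuming Python 3 (where the URL-encoding helpers live in "urllib.parse" rather than "urllib"). The endpoint "https://www.bing.com/search", the "q" parameter and the name "build_search_url" are our assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed search endpoint and query parameter name (not taken from the original code).
BING_SEARCH_URL = "https://www.bing.com/search"


def build_search_url(query):
    """Return the search URL with the query URL-encoded as the `q` parameter."""
    # urlencode escapes spaces, ampersands and other reserved characters.
    return BING_SEARCH_URL + "?" + urlencode({"q": query})


print(build_search_url("ruby devices"))
# prints: https://www.bing.com/search?q=ruby+devices
```

Encoding the query first means characters such as "&" or "#" in the search terms cannot be mistaken for parts of the URL itself.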
Now, search engines will often deny search requests which do not appear to come from a browser, so we need to add the "User-agent" header to our GET request as we define it. For this tutorial we decided to present ourselves as a Mozilla browser.
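As a rough sketch of how the header can be attached, assuming Python 3 (where the "urllib2" request classes live in "urllib.request"). The user-agent string and the name "build_request" are hypothetical, and any realistic browser string could be substituted.

```python
from urllib.request import Request

# Hypothetical browser-like identifier used purely for illustration.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"


def build_request(url):
    """Create a GET request object carrying a browser-like User-Agent header."""
    # No data argument is given, so urllib treats this as a GET request.
    return Request(url, headers={"User-Agent": USER_AGENT})


# Building the request does not contact the server yet.
request = build_request("https://www.bing.com/search?q=ruby+devices")
print(request.get_method())
print(request.get_header("User-agent"))
```

Note that "urllib" stores header names in capitalised form, which is why the lookup uses "User-agent".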
We may now execute our GET request with the following lines of code.
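One way this step might look, assuming Python 3 and "urllib.request" in place of "urllib2". The function name "fetch_html" and the "timeout" parameter are our own additions, and the sketch does no error handling, so a failed request will raise an exception from "urlopen".

```python
from urllib.request import Request, urlopen


def fetch_html(url, user_agent, timeout=10):
    """Send a GET request and return the response body decoded as text."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling the function with the search URL and a browser-like user-agent string would return the page's HTML as a single string, ready to be handed to the parser.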
The next step is to parse the response string into a BeautifulSoup object.
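A small sketch of the parsing step, assuming the BeautifulSoup 4 package ("bs4") and Python's built-in "html.parser" backend. The HTML string here is a made-up stand-in for the real response.

```python
from bs4 import BeautifulSoup

# Stand-in for the response string returned by the GET request.
html_result = "<html><body><h2>Example title</h2><p>Example description</p></body></html>"

# Naming the parser explicitly avoids depending on whichever backend is installed.
soup = BeautifulSoup(html_result, "html.parser")

print(soup.h2.get_text())
print(soup.p.get_text())
```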
To carry on, we need to know the structure of the HTML so that we can home in on the elements of interest to us. If we run "print htmlResult" after receiving the response and carefully sift through the output, we will notice that, at the time of publishing this post, each individual search result has a structure like the example given below. We have removed some irrelevant attributes for brevity.
We are only after the "title" and "description" of each result, so we may strip out tags that are of no interest to us. From the output above we can see that the information we want is embedded inside the "h2" and "p" tags. We can use the "replaceWithChildren()" method (named "unwrap()" in BeautifulSoup 4) to remove unwanted tags without deleting the text inside them.
Finally, we can achieve our goal by looping through each result snippet and selecting the text inside its "h2" and "p" tags.
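To illustrate the idea, here is a short sketch assuming BeautifulSoup 4, where "unwrap()" is the current name for "replaceWithChildren()". The snippet and the choice of "a" and "strong" as unwanted tags are assumptions for the example, not Bing's actual markup.

```python
from bs4 import BeautifulSoup

# Made-up snippet: a title wrapped in a link with some emphasis inside.
snippet = '<h2><a href="https://example.com">Example <strong>title</strong></a></h2>'
soup = BeautifulSoup(snippet, "html.parser")

# Remove the wrapping tags but keep whatever text they contain.
for tag in soup.find_all(["a", "strong"]):
    tag.unwrap()

print(soup.h2.get_text())
```

After the loop only the "h2" element remains, and its text is unchanged.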
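The loop could be sketched as follows, assuming BeautifulSoup 4. The sample markup, the "li.result" selector and the name "extract_results" are hypothetical, so the selector would need to be adjusted to whatever wraps each result in the real response.

```python
from bs4 import BeautifulSoup

# Illustrative markup only; the real results page will differ.
SAMPLE_HTML = """
<ol>
  <li class="result"><h2><a href="https://example.com/a">First title</a></h2><p>First description.</p></li>
  <li class="result"><h2><a href="https://example.com/b">Second title</a></h2><p>Second description.</p></li>
  <li class="ad"><h2>No description here</h2></li>
</ol>
"""


def extract_results(html, result_selector="li.result"):
    """Return a list of (title, description) pairs, one per search result."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.select(result_selector):
        title = item.find("h2")
        description = item.find("p")
        # Skip snippets that lack either piece of information.
        if title is None or description is None:
            continue
        # Collapse runs of whitespace so each field prints on one line.
        results.append((" ".join(title.get_text().split()),
                        " ".join(description.get_text().split())))
    return results


for title, description in extract_results(SAMPLE_HTML):
    print(title)
    print(description)
    print()
```

Because "get_text()" returns the text of nested tags as well, the titles come out clean here even without unwrapping the links first.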
Here is the code in full with a sample output shown below.
Thanks for reading.
Always,
Ruby Devices
Ruby Devices do not in any way condone the practice of illegal activities in relation to hacking. All teachings with regards to malware and other exploits are discussed for educational purposes only and are not written with the intention of malicious application.