For professional reasons, I was asked to fetch a lot of data from different webpages, so a crawler was the best way to do this. I was told about Scrapy, a good tool for the job.

My problem was that I found a lot of tutorials about Scrapy, but none that explained, step by step, how to use it to fetch a list of links from a page and then crawl information from each of those links.

You can find some other open-source crawlers on the Web crawler Wikipedia page. Scrapy is written in Python, a language I don’t know (yet), and uses XPath.
The first part of this tutorial will only cover a simple crawler, crawling a group of single pages. That means it will not follow links found inside the crawled pages.

Installation (Ubuntu 12.04)

  • Start by adding the Scrapy repository to your sources list:
$ sudo nano /etc/apt/sources.list
## at the end of the file, add:
deb http://archive.scrapy.org/ubuntu precise main
## or replace "precise" with your distribution's codename
## save and quit. Then, run:
$ sudo apt-get update
  • Then add the public GPG key using:
$ curl -s http://archive.scrapy.org/ubuntu/archive.key | sudo apt-key add -
  • Install Scrapy:
$ sudo apt-get install scrapy-0.17
## replace the version with the latest available one.
## "sudo apt-get install scrapy" should display all available versions.

If needed, Scrapy will install Python too.
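
To check that the installation went well, you can ask Scrapy for its version:

$ scrapy version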

Discover the pattern

Obviously, you need to know what you want to crawl. You need to find a “pattern” in the website you want to crawl; that means you need to understand how the website can provide a “way” for the crawler to fetch the wanted data.
Most of the time, it will be a page listing a collection of URLs you want Scrapy to crawl. So the first step will be to fetch this listing.

Sometimes you will be lucky with this listing, sometimes you won’t. For this first test, we will use an easy one: Deloitte.

The URL

Most of the time, the listing page’s URL will be useful to fetch all your links. The listing we need for Deloitte fits on a single page, so we will not have to manage this, but it could have looked like this:

http://www.deloitte.com/view/fr_FR/fr/technology-fast-50/palmares/palmares-national/p1/index.htm

Even if we can already guess what some parameters are used for, the best way is to compare the first page’s URL to the next one: click to go to the next page and watch the URL:

http://www.deloitte.com/view/fr_FR/fr/technology-fast-50/palmares/palmares-national/p2/index.htm

Obviously, for this example, the only thing that changes is the “p” value; “p” means “page” here. An “L” could have represented the limit of results to display per page. A good exercise is to change these values and see what happens.

With this, we can take some control over the data we are looking for.
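
If the listing had been split over several pages, this pattern would let us build the list of pages to crawl. A minimal sketch in Python, assuming (for the example only) that there are five pages:

# build the paginated listing URLs; the page count (5) is an assumption for this example
base_url = "http://www.deloitte.com/view/fr_FR/fr/technology-fast-50/palmares/palmares-national/p%d/index.htm"
page_urls = [base_url % page for page in range(1, 6)]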

Ajax

The listing’s data could also be loaded using Ajax. In that case, it is much easier to use the file called by Ajax directly. To find it, use the developer tools provided by Chrome or Firefox when the F12 key is pressed:

  • press F12, then go to “Network” and, below, select “XHR”. These are the JavaScript calls.
  • reload the page (using F5) and wait. A lot of lines will be added to the Network panel, filtered on XHR calls.
  • after the page has loaded, you will have to find which file is the right one.
  • right-click on it and “open in a new tab”. If it is a JSON file, this is even better than an HTML one: you just have to retrieve the column you want, using an online tool like http://jsonviewer.stack.hu/ or a few lines of Python (see the sketch below).
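
For the record, if the Ajax endpoint returned JSON, a minimal sketch of grabbing one column with plain Python could look like this (the URL and the "name" field are hypothetical):

import json
import urllib2

# sketch only: a hypothetical Ajax endpoint returning a JSON list of rows
response = urllib2.urlopen("http://example.com/ajax/listing.json")
rows = json.loads(response.read())
names = [row["name"] for row in rows]   # "name" is an assumed field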

But this is a special case, so I will continue this tutorial as if the listing page were HTML.

The data

Websites’ SEO needs will really help us crawl them: to be search-engine friendly, websites have to display data following a pattern, and that is exactly the pattern Scrapy needs.

Defining a new project

Now that we have our listing, let’s start with Scrapy: you need to create a new project. A project, for Scrapy, is a crawler. I make one project for each website to crawl.

Note: you may want to run the following command in a particular directory.

$ scrapy startproject deloitte_listing

This will automatically create a folder named “deloitte_listing” with some files in it.
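
The generated layout should look roughly like this (it can vary slightly between Scrapy versions):

deloitte_listing/
    scrapy.cfg
    deloitte_listing/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py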

Items

Items are the objects – or pieces of information – to catch. They are defined in deloitte_listing/deloitte_listing/items.py.
For this tutorial, we will need three pieces of information: the name, the URL and the revenue:

$ cd deloitte_listing
$ nano deloitte_listing/items.py
from scrapy.item import Item, Field

class DeloitteListingItem(Item):
    # define the fields for your item here like:
    url = Field()
    name = Field()
    ca = Field()  # "ca" stands for "chiffre d'affaires", i.e. the revenue

That’s all. This is the way we declare our items. The name of each field (url, name…) is important and should describe what will be stored in it.
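
Items behave much like Python dictionaries; a quick sketch in a Python shell (the value is made up):

>>> from deloitte_listing.items import DeloitteListingItem
>>> item = DeloitteListingItem(name="Some company")
>>> item['name']
'Some company'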

The spider

Spiders are the bots that will crawl your pages. In a spider, you define which URL(s) it will crawl and how to parse the content of those pages to extract the defined items. You have to create the file:

$ nano deloitte_listing/spiders/deloitte_listing_spider.py

Then, add your code in it:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from deloitte_listing.items import DeloitteListingItem

class DeloitteListingSpider(BaseSpider):

    name = "deloitte_listing"
    allowed_domains = ["deloitte.com"]

    start_urls = [
        "http://www.deloitte.com/view/fr_FR/fr/technology-fast-50/palmares/palmares-national/index.htm",
    ]

    def parse(self, response):

        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//table[@class="custom_table"]/tr')

        items = []

        for site in sites:

            #print site
            item = DeloitteListingItem()

            name = ''.join(site.select('./td/a/text()').extract())
            url = ''.join(site.select('./td/a/@href').extract())
            ca = ''.join(site.select('./td[4]/text()').extract())

            item['name'] = name
            item['url'] = url
            item['ca'] = ca

            items.append(item)

        return items

Explanation

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

These are the Python imports the spider needs. They should not be changed.

from deloitte_listing.items import DeloitteListingItem

This imports our item, as we declared it in our items.py file.

Next, the class. This defines what the spider will do:

name = "deloitte_listing"
allowed_domains = ["deloitte.com"]

The name of the spider and a list of domains where the crawler is allowed to go. This is important for spiders that follow links to crawl more pages, to prevent them from getting lost in a link of a link of a link of a link… In our present case, it is not so important, but keep it like that.

start_urls = [
        "http://www.deloitte.com/view/fr_FR/fr/technology-fast-50/palmares/palmares-national/index.htm",
    ]

This is a list of URLs the spider will read. It could have been made of several URLs, for example if the listing were broken into several pages.

XPath

sites = hxs.select('//table[@class="custom_table"]/tr')

This is where it starts to be fun.
The hxs variable contains the whole HTML of the page. Inside it, we only need to look into the table displaying our targeted data.
Scrapy uses XPath to define what to catch. You can easily get the XPath of what you want using the developer tools in Chrome or Firefox. Right-click on the element you want, then “Inspect”. In the window that appears, right-click on the HTML element you want and “Copy XPath”. It will display something like: //*[@id="userstable-display"]/li[1]/div[3]/span/strong/a

Then, the sites array will contain all the lines matching the XPath; in this case, the content of all the <tr> tags.

For each table line, we will look for our data, like:

name = ''.join(site.select('./td/a/text()').extract())

We tell Scrapy to select, inside our ‘site’, the td (HTML table cell), then, inside it, the a (the HTML tag for links), and with ‘text()’ that what we want is what lies between the <a> and </a> tags. Then, extract() returns the matched text as a list of strings, which we join into a single string: in this case, the name of the company.

The Scrapy shell can be used to test your XPath expressions. Run the command:

$ scrapy shell http://www.deloitte.com/view/fr_FR/fr/technology-fast-50/palmares/palmares-national/index.htm

Wait a second or two, then Scrapy will be waiting for you. Try a simple command, just to be sure the crawler got the right page:

>>> hxs.select('//title/text()').extract()

It should return the title of the page, displayed as something like [u'...']. This is normal: it is a list containing a unicode string. Then, try some of your XPath expressions to make sure they work.
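
For example, assuming the page structure is the one described above, this should return the list of company names from the table:

>>> hxs.select('//table[@class="custom_table"]/tr/td/a/text()').extract()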

Be careful: browsers like Chrome may add HTML tags to the code displayed by the developer tools (F12). Be sure to look at the “real” source code (Ctrl+U) to see the “real” succession of tags.

Running the spider

$ scrapy crawl deloitte_listing -o deloitte_result.csv -t csv

This will create a deloitte_result.csv file in the current directory, containing the results from your spider.
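
You can have a quick look at the result from the command line:

$ head deloitte_result.csv
## should show a header line with the item fields (ca, name, url), then one company per line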

Note that if your URL listing is in a text file, you can use it directly, replacing:

start_urls = [
                "http://www.deloitte.com/view/fr_FR/fr/technology-fast-50/palmares/palmares-national/index.htm",
        ]

by:

f = open("/home/scrapy/path/to/the/file/listing_total.txt")
start_urls = [url.strip() for url in f.readlines()]
f.close()
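
An alternative sketch, reading the file lazily when the crawl starts instead of at class-definition time, using Scrapy's standard Request object:

# at the top of the spider file
from scrapy.http import Request

# inside the spider class, instead of start_urls
def start_requests(self):
    with open("/home/scrapy/path/to/the/file/listing_total.txt") as f:
        for url in f:
            # one URL per line in the text file
            yield Request(url.strip(), callback=self.parse)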

This should be enough for now. Soon, a follow-up tutorial will show how to make Scrapy follow links.