Image from http://umreviews.com/what-is-search-engine-crawler/

Scrapy : How to crawl pages from a listing page

For professional reasons, I was asked to fetch a lot of data from different webpages, and a crawler was the best way to do it. I was pointed to Scrapy, a good tool for the job.

My problem was that I found a lot of tutorials about Scrapy, but none that explained step by step how to use it to fetch a list of links from a page and then crawl information from each link.

You can find some other open-source crawlers on the Crawler Wikipedia page. Scrapy is written in Python, a language I don’t know (yet), and uses XPath.
The first part of this tutorial covers only a simple crawler that crawls a group of single pages, meaning it will not follow links found inside the crawled pages.

Installation (Ubuntu 12.04)

  • Start by adding the Scrapy package repository to your distribution’s sources list

  • Then add the repository’s public GPG key

  • Finally, install Scrapy (the three commands are sketched after this list)

Installing Scrapy will also install Python if needed.
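A sketch of those three commands, assuming the package repository given in the Scrapy documentation of the time (the repository URL is an assumption; adapt it if the repository has moved):

    # add the Scrapy repository to the sources list (repository URL is an assumption)
    echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
    # add the repository's public GPG key
    curl -s http://archive.scrapy.org/ubuntu/archive.key | sudo apt-key add -
    # update the package index and install Scrapy
    sudo apt-get update && sudo apt-get install scrapy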

Discover the pattern

Obviously, you need to know what you want to crawl. You need to find a “pattern” in the website you want to crawl; that means you need to understand how the website provides a “way” for the crawler to fetch the wanted data.
Most of the time, it will be a page listing a collection of URLs you want Scrapy to crawl. So the first step will be to fetch this listing.

Sometimes you will be lucky with this listing, sometimes you won’t. For this first test, we will use an easy one: Deloitte.

The URL

Most of the time, the listing page’s URL will help you fetch all your links. The listing we need for Deloitte fits on a single page, so we will not have to manage pagination here, but on a paginated site it would look like the URLs below.

Even if we can already guess the meaning of some parameters, the best way is to compare the first page’s URL with the next one: click to go to the next page and watch how the URL changes:
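For instance, a paginated listing URL often looks something like this (hypothetical URLs, not the real Deloitte ones):

    http://www.example.com/listing?p=1&L=50
    http://www.example.com/listing?p=2&L=50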

Obviously, for this example the only thing that changes is the “p” value; “p” means “page” here. An “L” parameter could have represented the number of results to display per page. A good way to check is simply to try changing them.

With this, we gain some control over the data we are looking for.

Ajax

The listing data could also be loaded using Ajax. In that case, it is much easier to use the file called by Ajax directly. To find it, we will use the developer tools provided by Chrome or Firefox when the F12 key is pressed:

  • press F12, go to “Network” and, below, select “XHR”. These are the JavaScript calls.
  • reload the page (using F5) and wait. A lot of lines will be added to the Network panel, filtered on XHR calls.
  • once the page is loaded, you will have to find which file is the one you need.
  • right-click on it and choose “Open in a new tab”. If it’s a JSON file, that’s even better than an HTML one: you just have to retrieve the column you want, using an online tool like http://jsonviewer.stack.hu/ (or directly in Python, as sketched just below this list).
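If you do end up with a JSON endpoint, a minimal Python 2 sketch like this can read it directly (the URL and the “results”/“name” keys are hypothetical):

    # minimal sketch: read a hypothetical JSON endpoint found in the XHR tab
    import json
    import urllib2  # Python 2, matching the Scrapy version used in this tutorial

    response = urllib2.urlopen('http://www.example.com/listing.json')  # hypothetical endpoint
    data = json.loads(response.read())
    for row in data['results']:  # 'results' and 'name' are assumed key names
        print row['name']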

But this is a special case, so I will continue this tutorial as if the listing page were plain HTML.

The data

SEO needs will really help us here: to be search-engine friendly, websites have to display data following a pattern, and that is exactly the pattern Scrapy will need.

Defining a new project

Now that we have our listing, let’s start with Scrapy: you need to create a new project. A project, for Scrapy, is a crawler. I make one project for each website to crawl.

Note: you may want to run the following in a particular directory.
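The project is created with Scrapy’s startproject command (the project name matches the folder mentioned just below):

    scrapy startproject deloitte_listing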

This will automatically create a folder named “deloitte_listing” with some files in it.

Items

Items are the objects, or pieces of information, to catch. They are defined in deloitte_listing/deloitte_listing/items.py
For this tutorial, we will need three pieces of information: the name, the URL and the revenue:
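A minimal sketch of that items.py, using the Item/Field API of the Scrapy version current at the time (the class name is my own choice):

    # deloitte_listing/deloitte_listing/items.py
    from scrapy.item import Item, Field

    class DeloitteItem(Item):
        name = Field()     # company name
        url = Field()      # link to the company page
        revenue = Field()  # revenue shown in the listing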

That’s all. This is the way we declare our items. The name of the item (url, name…) is important and should describe what will be stored in it.

The spider

Spiders are the bots that will crawl your pages. In a spider, you define which URL(s) it will crawl and how to parse the contents of those pages to extract the defined items. You have to create the spider file in the project’s spiders directory.

Then, add your code to it:
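Here is a sketch of what the spider can look like, written against the legacy Scrapy API (BaseSpider / HtmlXPathSelector) that was current when this tutorial was written; the file name, item class, start URL and XPath expressions are assumptions to adapt to the real page:

    # deloitte_listing/deloitte_listing/spiders/deloitte_spider.py (file name is my choice)
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from deloitte_listing.items import DeloitteItem

    class DeloitteSpider(BaseSpider):
        name = "deloitte"
        allowed_domains = ["deloitte.com"]
        start_urls = [
            "http://www.deloitte.com/listing",  # placeholder: put the real listing-page URL here
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # every row of the listing table (placeholder XPath)
            sites = hxs.select('//table//tr')
            for site in sites:
                item = DeloitteItem()
                item['name'] = site.select('td/a/text()').extract()
                item['url'] = site.select('td/a/@href').extract()
                item['revenue'] = site.select('td[3]/text()').extract()
                yield item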

Explanation

This is the Python definition of the class; it should not be changed.

This defines our items, as we declared them in our items.py file.

Next, the class. This defines what the spider will do:

The name of the spider and a list of domains where the crawler is allowed to go. This is important for spiders that follow links to crawl more pages, to prevent them from getting lost in a link of a link of a link… In our present case it is not that important, but keep it anyway.

This is the list of URLs the spider will read. It could have contained several URLs, for example if the listing were broken into several pages.
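For a paginated listing, for instance, the same attribute could list every page (hypothetical URLs):

    start_urls = [
        "http://www.example.com/listing?p=1",
        "http://www.example.com/listing?p=2",
        "http://www.example.com/listing?p=3",
    ]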

XPath

This is starting to be fun.
The hxs variable contains the whole HTML of the page. Inside it, we only need to look at the table displaying our targeted data.
Scrapy uses XPath to define what to catch. You can easily get the XPath of what you want using the developer tools in Chrome or Firefox. Right-click on the element you want, then “Inspect”. In the panel that appears, right-click on the HTML element you want and choose “Copy XPath”. It will display something like: //*[@id="userstable-display"]/li[1]/div[3]/span/strong/a

Then the sites array will contain all the lines matching the XPath. In this case: the content of every <tr>.

For each table line, we will look for our data, like this:
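For instance (the XPath here is an assumption about the page’s markup):

    item['name'] = site.select('td/a/text()').extract()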

We tell Scrapy to select, inside our ‘site’, the td (HTML table cell), then the a (HTML tag for links), and with ‘text()’ we tell it that what we want is what sits between the <a> and </a> tags. Then extract() cleans the data to keep only the string, in this case the name of the company.

Scrapy shell can be used to test your XPaths. Run the command:
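For example (any page URL will do; this one is a placeholder for the listing page):

    scrapy shell "http://www.example.com/listing"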

Wait a second or two, and Scrapy will be waiting for you. Try a simple command, just to be sure the crawler got the right page:
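For instance, with the legacy shell, which exposes an hxs selector:

    hxs.select('//title/text()').extract()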

This should return the title of the page, with something like [u' before and '] after. This is normal and indicates it is a (unicode) string. Then try some of your XPaths to make sure they work.

Be careful: browsers like Chrome may add HTML tags to the code displayed by the developer tools (F12). Be sure to look at the “real” source code (Ctrl+U) to see the “real” succession of tags.

Running the spider
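A sketch of the command, using the CSV exporter and the spider name defined above:

    scrapy crawl deloitte -o deloitte_result.csv -t csv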

This will create a deloitte_result.csv file in the current directory, containing the results from your spider.

Note that if your URL listing is in a text file, you can use it directly, replacing:
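For instance (placeholder URL):

    start_urls = [
        "http://www.example.com/listing",
    ]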

by:
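something like this, assuming the URLs are stored one per line in a file named url_list.txt (the file name is arbitrary):

    start_urls = [url.strip() for url in open("url_list.txt").readlines()]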

This should be enough. A follow-up tutorial on making Scrapy follow links will come soon.

Published by

Constantin Guay

Currently Data Project Manager and Scrum Evangelist at NetMediaEurope, a European leader in IT B2B news sites (more than 12 million unique visitors per month). Passionate about user experience and data science.

11 thoughts on “Scrapy : How to crawl pages from a listing page”

    1. Hi Jòn,

It was to be sure to have a string as the returned value. I am absolutely not fluent in Python (it was the first project where I used it), so there may be another, better, way to do it.

      1. Right, same here. I'm just curious what your dict looked like after you assigned the scraped values. I've been stuck for days with this problem:

        {
          'name': [u'person1', u'person2', u'person3'],
          'id': [1, 2, 3]
        }

        It loads the scraped data as a list and assigns the values in the list as one Item. Instead of:

        {
          [name: 'person1', id: 1],
          [name: 'person2', id: 2]
        }

          1. Yes, well, my problem has nothing to do with your code actually, although mine is similar. I just haven't found a solution to this issue anywhere, and all the tutorials I have read do it more or less the way we are doing it.

          2. Hi Constantin,

            This tutorial is really good for the basics.

            Can you explain to me how the JavaScript calls work, I mean the steps in which they are called?

            I have a site where I have crawled all the pages using POST requests and headers, but at one point we need to get the description, which is available in PDF or DOCX format. When I tried with Inspect Element I saw that it is calling a JavaScript function, and I am not sure where it leads. Could you please help me with this?

            I would also like to know whether that PDF can be read online and whether XPaths can be applied to it?

          3. Hello and thank you.

            I’ve never tried to crawl into a PDF file, but I’m sure Scrapy can do that, maybe with some help from a plugin. I won’t have much time to look into that right now, but if you find something on your side, please share it here.

          4. Yes, doing that.

            I think I have to loop through the list and assign the item (title and post date) in each loop. I haven't tried it yet, I'm kind of busy with another project right now. But I'll post my results.
