Build Jekyll on Local Machine

Since I started running a GitHub Pages site via Jekyll, I have wanted to build it on my local machine before pushing changes to the repo.

  1. See this page for steps to run a Jekyll project locally.
  2. See warnings you may encounter while following the steps:
    • ‘json’ native gem requires installed build tools & Solution
    • GitHub Metadata Warning & Solution
    • OpenSSL default certificate error & Solution

Crawl Spider with Scrapy

The basic spider was discussed in a previous post; it simply scrapes information from the pages specified in start_urls. However, there are cases where internal links should be followed and certain URLs should be filtered.

This can be achieved with the help of CrawlSpider. Let’s take a look at the steps to develop one.

  1. Open command prompt
  2. Navigate to the directory where you want to create scrapy project e.g. cd c:/users/yohan/documents/python27/projects
  3. Input scrapy startproject <project_name>, which generates the default project structure
  4. Now create a spider class; it will reside in a new file ./<project_name>/spiders/crawlspider.py
  5. Let’s say I am scraping links from a Wikipedia page; the attributes for the CrawlSpider can then be as below
    • name is wikipedia
    • allowed_domains is wikipedia.org
    • start_urls is the Wikipedia page to start from - https://en.wikipedia.org/wiki/Mathematics
  6. In contrast to BaseSpider, CrawlSpider allows us to define rules by which internal links are followed. See the options for LinkExtractor below
    • allow - a regular expression that URLs must match to be extracted
    • restrict_xpaths - an XPath that defines the regions of the response from which links should be extracted

As I am targeting all links within the main body of the wiki page, the rule can be as below

Rule(LinkExtractor(allow="https://en.wikipedia.org/wiki/", restrict_xpaths="//div[@class='mw-body']//a"), callback='parse_page', follow=False)
  7. See the entire CrawlSpider below
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from <project_name>.items import WikiItem  # item class defined in items.py; replace <project_name> with your project package

class WikiSpider(CrawlSpider):

    name = 'wikipedia'
    allowed_domains = ['wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/Mathmetics",]

    rules = (
        Rule(LinkExtractor(allow="https://en.wikipedia.org/wiki/", restrict_xpaths="//div[@class='mw-body']//a"), callback='parse_page', follow=False),
    )

    def parse_page(self, response):                
        item = WikiItem()
        item["name"] = response.xpath('//h1[@class="firstHeading"]/text()').extract()
        return item
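The WikiItem class used in parse_page is not defined in this post; a minimal sketch of what it could look like in ./<project_name>/items.py, assuming only the single name field populated above:

from scrapy.item import Item, Field

class WikiItem(Item):
    # title of the scraped wiki page, filled in parse_page
    name = Field()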
  8. Input scrapy crawl wikipedia (the spider name) -o out.json; Scrapy will call the start URL, follow the links extracted by the rule, and parse each response into out.json
[
{"name": ["Wikipedia:Protection policy"]},
{"name": ["Giuseppe Peano"]},
{"name": ["Euclid's "]},
{"name": ["Greek mathematics"]}
...
]
Application: a crawl spider can be used to scrape information at different page depths with several consistent rules.

See source code

Basic Spider with Scrapy

This post explains how to create a new scrapy project using the command prompt and a text editor such as Notepad++. Note that it assumes you have already installed the scrapy package on your machine; I am using Scrapy version 1.1.2. See the steps below.

  1. Open command prompt
  2. Navigate to the directory where you want to create scrapy project e.g. cd c:/users/yohan/documents/python27/projects
  3. Input scrapy startproject <project_name>, which generates the following structure
  • scrapy.cfg
  • <project_name>
    • __init__.py
    • items.py
    • pipelines.py
    • settings.py
    • spiders
      • __init__.py
  4. After that, if you look in the ./<project_name> directory you will find items.py. This file contains the class that holds the scraped information. As an example, I would like to scrape the job title and link from seek.com (a job listing site), so the item class in items.py looks as below
from scrapy.item import Item, Field
class JobItem(Item):
    title = Field()
    link = Field()
  5. Go ahead and create a new file called spider.py under the ./<project_name>/spiders folder.
    • I am going to create a spider in this file to crawl web pages and scrape the info. The spider defines the initial URL, e.g. https://www.seek.com.au/jobs?keywords=software+engineer. It also defines how to follow links and pagination, and how to extract and parse the fields. A spider must define three attributes: a name, the start URLs, and a parsing method. spider.py contains a spider class with the aforementioned attributes; its parsing method takes the response of the page call and parses the info using xpath().
from scrapy import Spider
from quick_scrapy.items import JobItem

class JobSpider(Spider):
    name = 'jobs'
    allowed_domains = ['seek.com.au']
    start_urls = [
        'https://www.seek.com.au/jobs?keywords=software+engineer',
    ]

    def parse(self, response):
        # each job listing on the results page is an <article> element
        titles = response.xpath('//article')
        items = []

        for each in titles:
            item = JobItem()
            item["title"] = each.xpath('@aria-label').extract()
            item["link"] = each.xpath('h1/a/@href').extract()
            print item['title'], item['link']
            items.append(item)

        return items
  • When / is used at the beginning of a path, e.g. /a, it defines an absolute path to node ‘a’ starting from the root. When // is used at the beginning of a path, e.g. //a, it defines a path to node ‘a’ anywhere in the XML response.
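As an illustrative sketch (not from the original post), the difference can be seen by running scrapy’s Selector over a made-up HTML snippet:

from scrapy.selector import Selector

html = '<html><body><div><a href="/jobs/1">first</a></div><a href="/jobs/2">second</a></body></html>'
sel = Selector(text=html)

# absolute path: only <a> nodes that are direct children of <body>
print sel.xpath('/html/body/a/@href').extract()   # ['/jobs/2']

# // path: <a> nodes anywhere in the document
print sel.xpath('//a/@href').extract()            # ['/jobs/1', '/jobs/2']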
  6. Input scrapy crawl jobs (the spider name) -o out.json; Scrapy will call the URL and parse the response into out.json
[
{
    "link": ["/job/32678977?type=promoted&tier=no_tier&pos=1&whereid=3000&userqueryid=36d7d8b41696017af4c442da6bbf62e8-2435637&ref=beta"], 
    "title": ["Senior Software Engineer as Tester"]
},
{
    "link": ["/job/32635906?type=promoted&tier=no_tier&pos=2&whereid=3000&userqueryid=36d7d8b41696017af4c442da6bbf62e8-2435637&ref=beta"], 
    "title": ["Software Designer"]
},
{
    "link": ["/job/32691890?type=standard&tier=no_tier&pos=1&whereid=3000&userqueryid=36d7d8b41696017af4c442da6bbf62e8-2435637&ref=beta"], 
    "title": ["Mid-Level Software Developer (.Net)"]
},
...]
  • You can easily find the XPath of an HTML element in your browser’s developer tools, as shown below

(screenshot placeholder)

Application: a basic spider with Scrapy is a good candidate for scraping target information under a certain element path, e.g. h1/a/@href, without any exceptions or rules, and for retrieving information without following internal links, i.e. page depth = 1.

See source code