Crawl Spider with Scrapy
29 Jan 2017
The basic spider was discussed in a previous post, which allows us to simply scrape information from the pages specified in start_urls. However, there are cases where internal links should be followed and certain URLs should be filtered.
This can be achieved with the help of CrawlSpider. Let’s take a look at steps to develop it.
- Open a command prompt
- Navigate to the directory where you want to create the Scrapy project, e.g. cd c:/users/yohan/documents/python27/projects
- Input
scrapy startproject <project_name>
which generates the standard project structure.
- Now I need to create a spider class; it will reside in a new file, ./<project_name>/spiders/crawlspider.py (see the layout sketched below)
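For reference, the layout after startproject plus the new spider file typically looks like this (the exact template files vary slightly by Scrapy version):
scrapy.cfg
<project_name>/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        crawlspider.py   # the new CrawlSpider module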
- Let’s say I am scraping links on a Wikipedia page; then the attributes for the CrawlSpider can be as below
name is wikipedia
allowed_domains is wikipedia.org
start_urls is the Wikipedia page to start from - https://en.wikipedia.org/wiki/Mathematics
- In contrast to BaseSpider, CrawlSpider allows us to define rules so that internal links are followed. See the LinkExtractor options below
allow is a regular expression that URLs must match to be extracted
restrict_xpaths is an XPath which defines regions inside the response where links should be extracted from
As I am targeting all links within the main body of the wiki page, the rule can be as below
Rule(LinkExtractor(allow="https://en.wikipedia.org/wiki/", restrict_xpaths="//div[@class='mw-body']//a"), callback='parse_page', follow=False)
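Before wiring the rule into the spider, the extractor can be sanity-checked interactively; a quick sketch using scrapy shell (the exact links returned depend on the live page):
scrapy shell "https://en.wikipedia.org/wiki/Mathematics"
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(allow="https://en.wikipedia.org/wiki/",
...                    restrict_xpaths="//div[@class='mw-body']//a")
>>> [link.url for link in le.extract_links(response)][:3]  # peek at the first few matches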
- See the entire CrawlSpider below
from scrapy.spiders import CrawlSpider, Rule    # scrapy.contrib.* import paths are deprecated
from scrapy.linkextractors import LinkExtractor
from <project_name>.items import WikiItem       # replace <project_name> with your project package

class WikiSpider(CrawlSpider):
    name = 'wikipedia'
    allowed_domains = ['wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/Mathematics"]

    rules = (
        # Follow only /wiki/ links found in the main body of the page
        Rule(LinkExtractor(allow="https://en.wikipedia.org/wiki/",
                           restrict_xpaths="//div[@class='mw-body']//a"),
             callback='parse_page', follow=False),
    )

    def parse_page(self, response):
        # Extract the page title from the first heading
        item = WikiItem()
        item["name"] = response.xpath('//h1[@class="firstHeading"]/text()').extract()
        return item
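The spider returns a WikiItem, which is not defined in the snippet above; a minimal sketch of ./<project_name>/items.py, matching the single field populated in parse_page:
import scrapy

class WikiItem(scrapy.Item):
    name = scrapy.Field()  # page title taken from the firstHeading h1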
- Input
scrapy crawl wikipedia -o out.json
where wikipedia is the spider name; Scrapy will request the start URL, follow the extracted links, and write each parsed item to out.json
[
{"name": ["Wikipedia:Protection policy"]},
{"name": ["Giuseppe Peano"]},
{"name": ["Euclid's "]},
{"name": ["Greek mathematics"]}
...
]
See source code