2018-11-15

Scrapy Tips and Tricks

1. Use response.follow
2. Use selectors with regular expressions (.re())
3. When querying by class, use CSS
4. Use a link extractor to filter image URLs

Use response.follow

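With scrapy.Request you have to build an absolute URL from the relative path yourself. response.follow accepts the relative path directly, so the urljoin() call is no longer needed. Note that response.follow returns a Request instance, which you can yield directly.
So this code: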

if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

can be simplified as follows:

import scrapy
class QuotesSpider(scrapy.Spider):
   name = "quotes"
   start_urls = [
       'http://quotes.toscrape.com/page/1/',
   ]
   def parse(self, response):
       for quote in response.css('div.quote'):
           yield {
               'text': quote.css('span.text::text').extract_first(),
               'author': quote.css('span small::text').extract_first(),
               'tags': quote.css('div.tags a.tag::text').extract(),
           }
       next_page = response.css('li.next a::attr(href)').extract_first()
       if next_page is not None:
           yield response.follow(next_page, callback=self.parse)

In addition, when response.follow is given an <a> element, it uses the element's href attribute automatically, so the code above can be shortened further to:

# Pass the <a> selector itself, not its extracted HTML string.
for a in response.css('li.next a'):
   yield response.follow(a, callback=self.parse)

The selector therefore no longer needs to spell out the href attribute of the <a> element.

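When crawling an entire site, many quotes share the same author, so the crawl keeps running into the same author pages. Fetching them again adds load and produces a lot of redundant data. For that reason, Scrapy by default filters out requests for URLs it has already visited. You can change this behavior through the DUPEFILTER_CLASS setting.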

If you only want the first matching element, call extract_first() directly:

>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '

If no matching element is found, it returns None:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first() is None
True

You can pass a default value to be returned instead of None:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first(default='not-found')
'not-found'

Use selectors with regular expressions (.re())

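Selector has a .re() method for extracting data with regular expressions. Unlike .xpath() and .css(), however, .re() returns a list of unicode strings, so you cannot chain further selector calls onto its result the way you can with nested selectors.
The following example extracts the image names from the HTML snippet used above: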

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
u'My image 2',
u'My image 3',
u'My image 4',
u'My image 5']

If you are using an ItemLoader, pass the pattern through the re argument:

item_loader.add_xpath('img', './div[@class="mask-banner"]/@style',
                      re=r'background-image:url\((.*)\);')

When querying by class, use CSS

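An element can carry several CSS classes, which makes matching by class in XPath quite cumbersome:
*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]
If you use @class='someclass', you may miss elements that also have other classes; if you use only contains(@class, 'someclass'), you may get extra elements you do not want, because it also matches class names that merely contain that string.
However, Scrapy selectors can be chained, so most of the time you can select by class with CSS first and switch to XPath when you need it: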

>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').extract()
[u'2014-07-23 19:00']

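This is much simpler than doing everything in XPath. Remember to start the XPath expression with . so that it is evaluated relative to the previously matched element, here <div class="hero shout">.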

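Use a link extractor to filter image URLs

le = LinkExtractor(restrict_xpaths='//div[contains(@class, "r_img")]',
                   tags=('a', 'area', 'img'), attrs=('href', 'src'),
                   deny_extensions='')

Note that deny_extensions must be set; otherwise image links are not extracted, because the default list of ignored extensions includes common image formats.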

Excerpted from: 爬虫 Scrapy 学习系列
