Scrapy: storing multiple Item classes separately in pipelines

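A single spider often has to write to several different tables. Each table then gets its own class in items.py, with its own fields. How do you handle this in Scrapy?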

1. Defining the items in items.py

Define one class per table in items.py. The fields of each class correspond to the columns of its table.

from scrapy import Item, Field


class JsTotalItem(Item):
    # One field per column of the category table
    cate_url = Field()
    cate_id = Field()
    title_nums = Field()
    title_req_page = Field()
    fans_nums = Field()
    fans_req_page = Field()
    crawl_time = Field()


class AuthorItem(Item):
    # One field per column of the author table
    author = Field()
    author_id = Field()
    sign = Field()
    focus_num = Field()
    fans_num = Field()
    article_num = Field()
    word_num = Field()
    like_num = Field()
    crawl_time = Field()

2. Storing each item type separately in the pipeline

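The spider imports the two classes from items.py, instantiates them, fills in their fields, and yields each item so that it is handed to the pipeline. How does the pipeline store them separately? Checking the type with if isinstance(item, <class from items.py>) is all it takes.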
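Yielded items only reach process_item if the pipeline is enabled in settings.py. The fragment below is a minimal sketch of that setting. It assumes the pipeline class is named JianshutotalPipeline and lives in the default jianshutotal/pipelines.py module, and the priority 300 is simply the value Scrapy's project template suggests.

```python
# settings.py (fragment)
# Assumption: the pipeline class is jianshutotal/pipelines.py:JianshutotalPipeline.
# The number is the run order (0-1000); lower values run earlier.
ITEM_PIPELINES = {
    "jianshutotal.pipelines.JianshutotalPipeline": 300,
}
```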

CXYMysql is my own class for database operations.

from jianshutotal.class_mysql import CXYMysql
from jianshutotal.items import JsTotalItem, AuthorItem

# Column names of each table, in insert order
cate_total_list = ['cate_url', 'cate_id', 'title_nums', 'title_req_page', 'fans_nums', 'fans_req_page',
                   'crawl_time']
author_total_list = ['author', 'author_id', 'sign', 'focus_num', 'fans_num', 'article_num',
                     'word_num', 'like_num', 'crawl_time']


class JianshutotalPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, JsTotalItem):
            # Category items go to the cate_total table
            cate_total = CXYMysql('cate_total', cate_total_list, len(cate_total_list))
            items = dict(item)
            print(items)
            # Values keyed by 1-based column position
            item_cate = {}
            item_cate[1] = items['cate_url']
            item_cate[2] = items['cate_id']
            item_cate[3] = items['title_nums']
            item_cate[4] = items['title_req_page']
            item_cate[5] = items['fans_nums']
            item_cate[6] = items['fans_req_page']
            item_cate[7] = items['crawl_time']
            cate_total.insert(item_cate, db='local_db')
        elif isinstance(item, AuthorItem):
            # Author items go to the author_total table
            author_total = CXYMysql('author_total', author_total_list, len(author_total_list))
            items = dict(item)
            print(items)
            item_author_total = {}
            item_author_total[1] = items['author']
            item_author_total[2] = items['author_id']
            item_author_total[3] = items['sign']
            item_author_total[4] = items['focus_num']
            item_author_total[5] = items['fans_num']
            item_author_total[6] = items['article_num']
            item_author_total[7] = items['word_num']
            item_author_total[8] = items['like_num']
            item_author_total[9] = items['crawl_time']
            author_total.insert(item_author_total, db='local_db')
        # Return the item so that any later pipeline still receives it
        return item
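As the number of item types grows, the if/elif chain can be replaced by a lookup table from item class to table name and columns. The following is only a sketch of that idea, not the code used in this project. It swaps CXYMysql for the standard-library sqlite3 module with an in-memory database, uses plain dict subclasses as stand-ins for the Scrapy Item classes so that it runs without Scrapy, and the names TABLES and SqliteRoutingPipeline are made up for illustration.

```python
import sqlite3


# Stand-ins for the Scrapy Item classes so the sketch runs on its own;
# in the project they would be imported from jianshutotal.items.
class JsTotalItem(dict):
    pass


class AuthorItem(dict):
    pass


# Item class -> (table name, column names). Supporting another item type
# only needs one more entry here.
TABLES = {
    JsTotalItem: ("cate_total", ["cate_url", "cate_id", "title_nums", "title_req_page",
                                 "fans_nums", "fans_req_page", "crawl_time"]),
    AuthorItem: ("author_total", ["author", "author_id", "sign", "focus_num", "fans_num",
                                  "article_num", "word_num", "like_num", "crawl_time"]),
}


class SqliteRoutingPipeline:
    """Hypothetical pipeline that picks the target table from the item's class."""

    def open_spider(self, spider):
        # Open one connection per crawl instead of one per item
        self.conn = sqlite3.connect(":memory:")
        for table, columns in TABLES.values():
            self.conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        for item_cls, (table, columns) in TABLES.items():
            if isinstance(item, item_cls):
                placeholders = ", ".join("?" for _ in columns)
                sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
                # Fields that were never set are stored as NULL
                self.conn.execute(sql, [item.get(col) for col in columns])
                break
        return item


# Usage: feed one item of each type through the pipeline by hand
pipeline = SqliteRoutingPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(AuthorItem(author="someone", author_id="1"), spider=None)
pipeline.process_item(JsTotalItem(cate_url="https://example.com/c/1", cate_id="1"), spider=None)
author_rows = pipeline.conn.execute("SELECT author, author_id FROM author_total").fetchall()
cate_count = pipeline.conn.execute("SELECT COUNT(*) FROM cate_total").fetchone()[0]
```

The isinstance check is the same as in the pipeline above. The difference is that the table name and column list live in data rather than in branches, and the connection is opened once in open_spider rather than for every item.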