Persistent Storage via Terminal Commands
Put the data to be stored into the return value of the parse method in the spider file. The export format is limited to text types such as json, csv, and xml.
scrapy crawl spider_name -o output_path
This runs the spider spider_name and writes its output to output_path.
```python
import scrapy


class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['www.bilibili.com']
    start_urls = ['https://www.bilibili.com/v/popular/rank/all']

    def parse(self, response):
        # Collect the href of every entry on the Bilibili popular ranking page
        selector_list = response.xpath('//li//div[@class="info"]/a/@href')
        data = selector_list.extract()
        # Returning a dict lets `scrapy crawl ... -o` serialize it directly
        return {"data": data}
```
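For example, to run the spider above and export its return value as JSON (rank.json is a hypothetical output path):
scrapy crawl spider1 -o rank.json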
Persistent Storage via Item Pipelines
Prerequisite: create the project project2 and, inside it, the spider file spider2.
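Both can be generated with Scrapy's scaffolding commands, for example:
scrapy startproject project2
cd project2
scrapy genspider spider2 www.bilibili.com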
```python
import scrapy

from project2.items import Project2Item


class Spider2Spider(scrapy.Spider):
    name = 'spider2'
    allowed_domains = ['www.bilibili.com']
    start_urls = ['https://www.bilibili.com/v/popular/rank/all']

    def parse(self, response):
        selector_list = response.xpath('//li//div[@class="info"]/a/@href')
        data = selector_list.extract()
        # Pack the whole ranking list into one item and hand it to the pipeline
        item = Project2Item()
        item["rank_list"] = ' '.join(data)
        yield item
```
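Every item the spider yields is handed to the enabled item pipelines in priority order; the item class and pipeline used here are defined below.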
```python
import scrapy


class Project2Item(scrapy.Item):
    # A single field holding the joined list of ranking-page links
    rank_list = scrapy.Field()
```
Prerequisite: uncomment ITEM_PIPELINES in settings.py.
```python
ITEM_PIPELINES = {
    # 300 is the pipeline's priority (0-1000); lower values run first
    'project2.pipelines.Project2Pipeline': 300,
}
```
```python
from itemadapter import ItemAdapter


class Project2Pipeline:
    fp = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        print("Spider started")
        self.fp = open('./bilibili_rank.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item the spider yields
        data = item["rank_list"]
        self.fp.write(data)
        # Return the item so any lower-priority pipeline can process it too
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file
        print("Spider finished")
        self.fp.close()
```
scrapy crawl spider2
Running this command starts crawling the data.
Compared with the terminal-command approach, pipeline-based storage is more flexible: data can be written to files of any type or to a database.
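As an illustration of the database case, here is a minimal sketch of an alternative pipeline that stores the same item in SQLite (SqlitePipeline and its table layout are hypothetical; register it in ITEM_PIPELINES just like Project2Pipeline):
```python
import sqlite3


class SqlitePipeline:
    # Hypothetical pipeline: writes each item's rank_list into a SQLite table

    def open_spider(self, spider):
        self.conn = sqlite3.connect('bilibili_rank.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS rank (rank_list TEXT)')

    def process_item(self, item, spider):
        self.conn.execute('INSERT INTO rank (rank_list) VALUES (?)',
                          (item["rank_list"],))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```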