Preparation
XPath syntax reference: https://www.w3school.com.cn/xpath/xpath_syntax.asp
Prerequisite: the lxml module (install it beforehand)
Goal: crawl several pages of comments for a given book on Douban and save the comment data as a JSON file.
JSON format:
```json
{
    "userID": "rivocky",
    "itemID": "1",
    "rate": "4",
    "comment": "我读加缪第一本书,上来他就讨论自杀,让我觉得不是很high,然而在我看过了第二本,反与正之后,我改变了对他的看法——是要怀着对世界多大的爱才能勇敢的讨论这个问题,加缪是一个从心底到表面都善良的货",
    "timestamp": "2015-09-24"
}
```
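As a quick sanity check of this format, a record can be serialized with the standard `json` module; passing `ensure_ascii=False` keeps Chinese text readable instead of escaping it as `\uXXXX`. (The field values below are placeholders for illustration.)

```python
import json

# One record in the target format (placeholder values)
record = {
    "userID": "rivocky",
    "itemID": "1",
    "rate": "4",
    "comment": "example comment text",
    "timestamp": "2015-09-24",
}

# ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX escapes
print(json.dumps(record, ensure_ascii=False))
```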
Example
Import the module, build an etree from the HTML text, and use the tree's xpath() method to parse out the data. The etree abstracts the HTML text into a tree of nodes; an XPath expression pins down the target nodes by specifying their path, attributes, and so on.
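Before the full crawler, here is a minimal sketch of the parse step on a tiny hand-written HTML snippet (the markup below is made up, but mirrors the structure of a Douban comment list). Note that an XPath starting with `//` searches from the document root, while one starting with `.` searches relative to the current node:

```python
from lxml import etree

# A tiny HTML snippet standing in for a real Douban comment page (made up for illustration)
doc = """
<ul>
  <li class="comment-item">
    <span class="comment-info"><a href="https://www.douban.com/people/alice/">alice</a></span>
    <span class="short">Great book.</span>
  </li>
</ul>
"""

tree = etree.HTML(doc)                               # parse the text into an element tree
items = tree.xpath('//li[@class="comment-item"]')    # absolute path from the root
for li in items:
    # relative paths (starting with .) search only under the current <li> node
    user = li.xpath('.//span[@class="comment-info"]/a/@href')[0].split('/')[-2]
    text = li.xpath('.//span[@class="short"]/text()')[0]
    print(user, text)
```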
```python
import requests
import json
from lxml import html

if __name__ == "__main__":
    etree = html.etree
    # {:d} is filled in per page: start = 0, 20, 40, ...
    url = "https://book.douban.com/subject/24257403/comments/?start={:d}&limit=20&status=P&sort=new_score"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    }
    start = 0
    page_num = 3
    # 'w' rather than 'a': appending a second top-level JSON object on a rerun
    # would make the file invalid JSON
    file = open('output/comments.json', 'w', encoding="utf-8")
    dic = {"all": []}
    for i in range(page_num):
        url_temp = url.format(start)
        # don't name this variable `html`, or it would shadow the lxml module imported above
        page_text = requests.get(url=url_temp, headers=headers).text
        tree = etree.HTML(page_text)
        li_list = tree.xpath('//li[@class="comment-item"]')
        for li in li_list:
            try:
                # the user ID is the second-to-last segment of the profile URL
                userID = li.xpath('.//span[@class="comment-info"]/a/@href')[0].split('/')[-2]
                itemID = '1'
                # the star span's class looks like "allstar40 rating"; pull out the "4"
                rate = li.xpath('.//span[@title]/@class')[0].split()[-2][-2]
                comment = li.xpath('.//span[@class="short"]/text()')[0]
                date = li.xpath('.//span[@class="comment-time"]/text()')[0]
            except IndexError:
                # skip comments missing any field (e.g. no star rating)
                continue
            else:
                dic["all"].append({
                    "userID": userID,
                    "itemID": itemID,
                    "rate": rate,
                    "comment": comment,
                    "timestamp": date
                })
        start += 20
    json.dump(dic, fp=file, ensure_ascii=False)
    file.close()
    print("Crawl finished!")
```
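The rating extraction above is the most cryptic line: the star span's `class` attribute looks like `"allstar40 rating"`, and the digit is pulled out by two index operations. A small helper (`parse_rate` is a hypothetical name, not part of the script) makes the same slicing explicit:

```python
def parse_rate(class_attr: str) -> str:
    # class_attr looks like "allstar40 rating"
    # .split()    -> ["allstar40", "rating"]
    # [-2]        -> "allstar40"
    # [-2] again  -> the second-to-last character, "4"
    return class_attr.split()[-2][-2]

print(parse_rate("allstar40 rating"))  # -> 4
print(parse_rate("allstar50 rating"))  # -> 5
```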