Python Web Scraping: Strategies for Handling Anti-Scraping Measures in Scrapy

Posted 2021-05-02 | Updated 2022-05-12 | Programming, Python, Web Scraping


User-Agent Spoofing and Cookie Settings

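Set request.headers['User-Agent'] and request.cookies in the process_request method of a DownloaderMiddleware.
Configure this in settings.py (a custom middleware takes effect only once it is listed in DOWNLOADER_MIDDLEWARES).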
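
As a minimal sketch of the middleware approach described above: the class name RandomUserAgentMiddleware, the USER_AGENTS pool, the cookie values, and the myproject module path are all hypothetical and not taken from the original post. The class relies only on the process_request hook that Scrapy calls on downloader middlewares, so it needs no Scrapy import of its own.

```python
import random

# Hypothetical pool of User-Agent strings; replace with values you want to present.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a User-Agent and cookies on each request."""

    def __init__(self, user_agents=USER_AGENTS, cookies=None):
        self.user_agents = list(user_agents)
        self.cookies = dict(cookies or {})  # assumed static cookies, e.g. a session id

    def process_request(self, request, spider):
        # Pick a User-Agent at random for every outgoing request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Attach the configured cookies, if any were given.
        if self.cookies:
            request.cookies = dict(self.cookies)
        # Returning None tells Scrapy to keep processing the request normally.
        return None


# settings.py (assumed project layout and priority value):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RandomUserAgentMiddleware": 543,
# }
```

If a single fixed User-Agent is enough, the built-in USER_AGENT setting in settings.py is likely the simpler option; a middleware is mainly useful when the value should vary from request to request.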

Proxy IP Settings

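Set request.meta['proxy'] in the process_request and process_exception methods of a DownloaderMiddleware.
Configure this in settings.py (a custom middleware takes effect only once it is listed in DOWNLOADER_MIDDLEWARES).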
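
The proxy handling described above could look roughly like the following sketch. The class name RandomProxyMiddleware, the PROXIES list (placeholder addresses, not real proxies), the max_retries parameter, and the "proxy_retries" meta key are all assumptions made for illustration. The idea is to assign a proxy in process_request and, when a download fails, to switch to a different proxy in process_exception and hand the request back to Scrapy for rescheduling.

```python
import random

# Hypothetical proxy pool; these addresses are placeholders.
PROXIES = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]


class RandomProxyMiddleware:
    """Downloader middleware that assigns a proxy and rotates it on errors."""

    def __init__(self, proxies=PROXIES, max_retries=3):
        self.proxies = list(proxies)
        self.max_retries = max_retries

    def _pick(self, exclude=None):
        # Prefer a proxy other than the one that just failed, if one exists.
        candidates = [p for p in self.proxies if p != exclude] or self.proxies
        return random.choice(candidates)

    def process_request(self, request, spider):
        # Keep a proxy that was already assigned (e.g. by process_exception).
        if "proxy" not in request.meta:
            request.meta["proxy"] = self._pick()
        return None

    def process_exception(self, request, exception, spider):
        retries = request.meta.get("proxy_retries", 0)
        if retries >= self.max_retries:
            return None  # give up and let other middlewares handle the error
        request.meta["proxy_retries"] = retries + 1
        request.meta["proxy"] = self._pick(exclude=request.meta.get("proxy"))
        request.dont_filter = True  # avoid the duplicate filter dropping the retry
        return request  # a returned request is rescheduled by Scrapy
```

Like any custom downloader middleware, this class would be registered in DOWNLOADER_MIDDLEWARES in settings.py. Scrapy's built-in RetryMiddleware may also retry failed requests, so the order value chosen for this middleware affects which one sees the exception first.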

Download Delay

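Do not crawl the target site too quickly, or your IP can easily get banned. A download delay should therefore be set.

Set DOWNLOAD_DELAY in settings.py (at runtime, Scrapy by default waits a random interval between 0.5 × DOWNLOAD_DELAY and 1.5 × DOWNLOAD_DELAY). Also enable AUTOTHROTTLE_ENABLED so that the download speed is adjusted dynamically according to the load on the site.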

AutoThrottle extension Design goals:

be nicer to sites instead of using default download delay of zero

automatically adjust Scrapy to the optimum crawling speed, so the user doesn’t have to tune the download delays to find the optimum one. The user only needs to specify the maximum concurrent requests it allows, and the extension does the rest.
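
A settings.py fragment along these lines would enable both the fixed delay and AutoThrottle. The setting names are Scrapy's own, but the specific values are only illustrative assumptions and should be tuned for the target site.

```python
# settings.py (fragment); the values below are illustrative, not recommendations.

# Base delay in seconds between requests to the same site.
DOWNLOAD_DELAY = 2

# Enabled by default: the actual wait is a random value between
# 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY (here, 1 to 3 seconds).
RANDOMIZE_DOWNLOAD_DELAY = True

# Let the AutoThrottle extension adapt the delay to the site's response latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # upper bound when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
```

With AutoThrottle enabled, DOWNLOAD_DELAY acts as a lower bound: the extension does not set a delay below it.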

Author: Safe Animal

Article link: https://safeanimal.github.io/2021/05/02/Python/爬虫/2021-05-02-Python爬虫-scrap常用反爬策略实现/

Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. When reposting, please credit Silent Wittgenstein!
