User-Agent spoofing and Cookie settings

  1. In a DownloaderMiddleware's process_request method, set request.headers['User-Agent'] and request.cookies
  2. Set them in settings.py
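The middleware approach in item 1 can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name `RandomUserAgentMiddleware`, the `USER_AGENTS` pool and the `DEFAULT_COOKIES` values are all hypothetical placeholders, and the class relies only on the `process_request(request, spider)` hook that Scrapy calls on downloader middlewares.

```python
import random

# Hypothetical pool of User-Agent strings; replace with values you trust.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

# Hypothetical cookies; real names and values depend on the target site.
DEFAULT_COOKIES = {"sessionid": "example-session-id"}


class RandomUserAgentMiddleware:
    """Downloader middleware that sets the User-Agent and default cookies."""

    def process_request(self, request, spider):
        # Pick a different User-Agent for each outgoing request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Only fill in cookies when the request does not carry its own.
        if not request.cookies:
            request.cookies = dict(DEFAULT_COOKIES)
        # Returning None tells Scrapy to continue processing the request.
        return None
```

To take effect, the class has to be listed in DOWNLOADER_MIDDLEWARES in settings.py under the project's own module path. For cookies set this way to be sent, the middleware presumably needs an order number below that of Scrapy's built-in cookies middleware so that it runs first.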

Proxy IP settings

  1. In a DownloaderMiddleware's process_request or process_exception method, set request.meta['proxy']
  2. Set it in settings.py
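A rough sketch of item 1: assign a proxy in process_request, and switch to another proxy in process_exception when the download fails. The class name `RotatingProxyMiddleware`, the `PROXIES` list, the `MAX_PROXY_RETRIES` limit and the `proxy_retries` meta key are all hypothetical names introduced for illustration. The sketch assumes the request object offers Scrapy's `Request.replace()` and a `meta` dict.

```python
import random

# Hypothetical proxy endpoints; replace with proxies you actually control.
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]
# Hypothetical cap so a dead proxy pool cannot cause endless retries.
MAX_PROXY_RETRIES = 3


class RotatingProxyMiddleware:
    """Downloader middleware that assigns and rotates request.meta['proxy']."""

    def process_request(self, request, spider):
        # Keep a proxy that was already chosen; otherwise pick one at random.
        request.meta.setdefault("proxy", random.choice(PROXIES))
        return None

    def process_exception(self, request, exception, spider):
        retries = request.meta.get("proxy_retries", 0)
        if retries >= MAX_PROXY_RETRIES:
            # Give up and let other middlewares handle the exception.
            return None
        failed = request.meta.get("proxy")
        # Prefer a proxy other than the one that just failed.
        candidates = [p for p in PROXIES if p != failed] or PROXIES
        # dont_filter=True keeps the duplicate filter from dropping the retry.
        retry = request.replace(dont_filter=True)
        retry.meta["proxy"] = random.choice(candidates)
        retry.meta["proxy_retries"] = retries + 1
        # Returning a request asks Scrapy to schedule it again.
        return retry
```

Returning a new request from process_exception reschedules the download, while returning None passes the exception on. As with any downloader middleware, the class must be enabled through DOWNLOADER_MIDDLEWARES in settings.py.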

Download delay

Do not crawl the target site too fast, or your IP is likely to be banned. A download delay should therefore be configured.

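  1. In settings.py, set DOWNLOAD_DELAY (at runtime Scrapy by default waits a random value between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY). Also enable AUTOTHROTTLE_ENABLED so that the download speed is adjusted dynamically according to the site's load.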
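As an illustration, the relevant part of settings.py might look like the fragment below. The setting names are Scrapy's own. The numeric values are assumptions chosen for the example and should be tuned per site.

```python
# settings.py (fragment); the values are illustrative, not recommendations.

# Base delay in seconds between requests to the same site.
DOWNLOAD_DELAY = 2.0

# Scrapy's default: wait a random 0.5x to 1.5x of DOWNLOAD_DELAY,
# so with 2.0 the actual wait falls between 1.0 and 3.0 seconds.
RANDOMIZE_DOWNLOAD_DELAY = True

# Let the AutoThrottle extension adapt the delay to observed latency.
AUTOTHROTTLE_ENABLED = True
# Initial delay used before any latency has been measured.
AUTOTHROTTLE_START_DELAY = 5.0
# Upper bound on the delay when the site responds slowly.
AUTOTHROTTLE_MAX_DELAY = 60.0
# Average number of parallel requests to aim for per remote site.
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

With AutoThrottle enabled, DOWNLOAD_DELAY still acts as a lower bound: the extension does not set a delay below it.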

AutoThrottle extension design goals:

  1. be nicer to sites instead of using default download delay of zero
  2. automatically adjust Scrapy to the optimum crawling speed, so the user doesn’t have to tune the download delays to find the optimum one. The user only needs to specify the maximum concurrent requests it allows, and the extension does the rest.