Does Scrapy work with HTTP proxies? Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.
The simplest way to use a proxy is to set the http_proxy environment variable. How to do this depends on your shell.
C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port
If you want to use an HTTPS proxy and crawl HTTPS websites, set the https_proxy environment variable instead:
C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port
Single proxy
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
}

# in a spider callback
request = Request(url="http://example.com")
request.meta['proxy'] = "http://host:port"
yield request
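If the proxy requires authentication, the credentials can be embedded in the proxy URL and HttpProxyMiddleware will turn them into a Proxy-Authorization header. A minimal sketch, assuming placeholder values for user, password, host and port:

# user, password, host and port are placeholders for your proxy's credentials
request = Request(url="http://example.com")
request.meta['proxy'] = "http://user:password@host:port"
yield request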
Multiple proxies
import random

from scrapy import Request, Spider


class MySpider(Spider):
    name = "my_spider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']

    def parse(self, response):
        # ...parse code...
        if something:
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            # attach a randomly chosen proxy to this request
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req
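An alternative to picking the proxy inside the spider is a small custom downloader middleware that rotates proxies for every outgoing request. This is only a sketch, not part of Scrapy itself; RandomProxyMiddleware and the PROXY_POOL setting are assumed names you would define yourself:

import random

class RandomProxyMiddleware:
    """Hypothetical downloader middleware that assigns a random proxy to each request."""

    def __init__(self, proxy_pool):
        self.proxy_pool = proxy_pool

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_POOL is an assumed custom setting holding a list of proxy URLs
        return cls(crawler.settings.getlist('PROXY_POOL'))

    def process_request(self, request, spider):
        if self.proxy_pool and 'proxy' not in request.meta:
            request.meta['proxy'] = random.choice(self.proxy_pool)

Like HttpProxyMiddleware above, it would be enabled with its own entry in DOWNLOADER_MIDDLEWARES.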