(我将其发布到scrapy-users
邮件列表中,但根据Paul的建议,我将其发布在这里,因为它通过shell
命令交互来补充答案。)
通常,使用第三方服务来呈现某些数据可视化效果的网站(地图,表格等)必须以某种方式发送数据,并且在大多数情况下,可以从浏览器访问此数据。
对于这种情况,检查(即浏览浏览器发出的请求)显示数据已从POST请求加载到https://www.mcdonalds.com.sg/wp- admin/admin-ajax.PHP
因此,基本上,您已经准备好要使用的所有JSON格式的所需数据。
Scrapyshell
在编写蜘蛛程序之前,通过网站提供了非常方便思想者的命令:
$ scrapy shell https://www.mcdonalds.com.sg/locate-us/
2013-09-27 00:44:14-0400 [scrapy] INFO: Scrapy 0.16.5 started (bot: scrapybot)
...
In [1]: from scrapy.http import FormRequest
In [2]: url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.PHP'
In [3]: payload = {'action': 'ws_search_store_location', 'store_name':'0', 'store_area':'0', 'store_type':'0'}
In [4]: req = FormRequest(url, formdata=payload)
In [5]: fetch(req)
2013-09-27 00:45:13-0400 [default] DEBUG: Crawled (200) <POST https://www.mcdonalds.com.sg/wp-admin/admin-ajax.PHP> (referer: None)
...
In [6]: import json
In [7]: data = json.loads(response.body)
In [8]: len(data['stores']['listing'])
Out[8]: 127
In [9]: data['stores']['listing'][0]
Out[9]:
{u'address': u'678A Woodlands Avenue 6<br/>#01-05<br/>Singapore 731678',
u'city': u'Singapore',
u'id': 78,
u'lat': u'1.440409',
u'lon': u'103.801489',
u'name': u"McDonald's Admiralty",
u'op_hours': u'24 hours<br>\r\nDessert Kiosk: 0900-0100',
u'phone': u'68940513',
u'region': u'north',
u'type': [u'24hrs', u'dessert_kiosk'],
u'zip': u'731678'}
简而言之:在您的Spider中,您必须返回FormRequest(...)
上面的内容,然后在回调中从中加载json对象response.body
,最后为列表中每个商店的数据data['stores']['listing']
创建一个具有所需值的项目。
像这样:
class McDonaldSpider(BaseSpider):
name = "mcdonalds"
allowed_domains = ["mcdonalds.com.sg"]
start_urls = ["https://www.mcdonalds.com.sg/locate-us/"]
def parse(self, response):
# This receives the response from the start url. But we don't do anything with it.
url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.PHP'
payload = {'action': 'ws_search_store_location', 'store_name':'0', 'store_area':'0', 'store_type':'0'}
return FormRequest(url, formdata=payload, callback=self.parse_stores)
def parse_stores(self, response):
data = json.loads(response.body)
for store in data['stores']['listing']:
yield McDonaldsItem(name=store['name'], address=store['address'])