Using proxies for crawling with Python's Scrapy framework
1. Create a file named "middlewares.py" in your Scrapy project
# Importing base64 library because we'll need it ONLY in case the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy
        # (encodestring appends a trailing newline, so strip it off)
        encoded_user_pass = base64.encodestring(proxy_user_pass).strip()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
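Note that base64.encodestring only exists on Python 2 (it was removed in Python 3.9). If you are running Python 3 with a newer Scrapy release, a roughly equivalent middleware might look like the sketch below; the proxy address and credentials are placeholders, and this is only a sketch under that version assumption, not part of the original recipe.

import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Placeholder proxy address
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Only needed if the proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # b64encode works on bytes and returns bytes, so encode/decode explicitly
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode('utf-8')).decode('ascii')
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass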
2. Add the following to the project settings file (./project_name/settings.py):
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
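The numbers are middleware order values: downloader middlewares with a lower value have their process_request called earlier, so our ProxyMiddleware (100) sets meta['proxy'] and the Proxy-Authorization header before the built-in HttpProxyMiddleware (110) runs. If you are on Scrapy 1.0 or later, the built-in middleware lives at 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware' rather than the scrapy.contrib path shown here.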
Those two steps are all it takes; requests now go out through the proxy. Let's test it ^_^
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request

class TestSpider(CrawlSpider):
    name = "test"
    domain_name = "whatismyip.com"
    # The following url is subject to change, you can get the last updated one from here:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://xujian.info"]

    def parse(self, response):
        open('test.html', 'wb').write(response.body)
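Run it with "scrapy crawl test" and open the saved test.html to confirm the page was fetched through the proxy. As an aside, if your proxy needs no authentication you can skip the custom middleware entirely: Scrapy routes any request whose meta contains a 'proxy' key through that proxy. The sketch below illustrates this per-request approach with the same old-style imports used above; the spider name, URL and proxy address are placeholders.

from scrapy.spider import BaseSpider
from scrapy.http import Request

class MetaProxySpider(BaseSpider):
    name = "meta_proxy_test"
    start_urls = ["http://xujian.info"]

    def start_requests(self):
        for url in self.start_urls:
            # Setting meta['proxy'] per request is enough for an unauthenticated proxy
            yield Request(url, meta={'proxy': "http://YOUR_PROXY_IP:PORT"})

    def parse(self, response):
        open('meta_test.html', 'wb').write(response.body)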
3. Using a random user-agent
By default Scrapy crawls with a single user-agent, which makes it easy for sites to block you. The code below picks a user-agent at random from a predefined list for each page it fetches.
Add the following to settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'Crawler.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
}
Note: "Crawler" is the name of your project, i.e. the top-level project directory; the dotted path points at the module that holds the middleware. The middleware code is as follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import random
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random user-agent from the list for this request
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list composes chrome, IE, firefox, Mozilla, opera, netscape
    # for more user agent strings, you can find it in http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
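If you are on Scrapy 1.0 or later, the scrapy.contrib paths above no longer exist. A sketch of the same rotating middleware under the newer import layout might look like the following; it adds a debug log line so the rotation can be verified, and uses a shortened placeholder list (reuse the full user_agent_list from above in practice). This assumes a 1.x-or-newer Scrapy and is not part of the original recipe.

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    # Shortened placeholder list; substitute the full list given above
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    ]

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            # Log which user-agent was chosen so the rotation can be verified
            spider.logger.debug("Using User-Agent: %s", ua)
            request.headers.setdefault('User-Agent', ua)

The matching settings entry would then disable 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware' instead of the scrapy.contrib path used in step 3.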