Python Web Scraping with the urllib Module: Detailed Usage and Worked Examples

2023-07-29 13:59:04

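Almost everything a crawler needs can be found in urllib. Learning this standard library also gives you a deeper understanding of the more convenient requests library that you will likely move on to afterwards.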

First, here is how the Python 2.x names map to Python 3.x:

import urllib2 in Python 2.x corresponds to import urllib.request and import urllib.error in Python 3.x

import urllib in Python 2.x corresponds to import urllib.request, urllib.error and urllib.parse in Python 3.x

import urlparse in Python 2.x corresponds to import urllib.parse in Python 3.x

urllib2.urlopen in Python 2.x corresponds to urllib.request.urlopen in Python 3.x

urllib.urlencode in Python 2.x corresponds to urllib.parse.urlencode in Python 3.x

urllib.quote in Python 2.x corresponds to urllib.parse.quote in Python 3.x

cookielib.CookieJar in Python 2.x corresponds to http.cookiejar.CookieJar in Python 3.x

urllib2.Request in Python 2.x corresponds to urllib.request.Request in Python 3.x

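urllib is part of Python's standard library, so it needs no installation and can be used directly.

The urllib package provides the following features:

Web page requests (urllib.request)
URL parsing (urllib.parse)
Proxy and cookie settings
Exception handling (urllib.error)
robots.txt parsing (urllib.robotparser)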

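The urllib.request module in the urllib package

1. urllib.request.urlopen

urlopen has three commonly used parameters:

r = urllib.request.urlopen(url, data, timeout)

url: the link, in the form scheme://host[:port]/path

data: extra data to send. It must be bytes, which you can produce with bytes() or str.encode(). When this argument is passed, the request is sent as POST instead of GET.

timeout: the timeout, in seconds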
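As a minimal sketch of how the data argument changes the request method, the snippet below builds two Request objects without sending anything. The URL and the form field are placeholders chosen for illustration, not taken from a real service.

```python
import urllib.parse
import urllib.request

# Placeholder URL for illustration; nothing is sent over the network here.
URL = "http://example.com/form"

# Without data, urllib prepares a GET request.
get_req = urllib.request.Request(URL)

# data must be bytes: urlencode() returns str, so encode it first.
payload = urllib.parse.urlencode({"q": "urllib"}).encode("utf-8")
post_req = urllib.request.Request(URL, data=payload)

print(get_req.get_method())
print(post_req.get_method())
print(type(payload).__name__)

# Sending the request would look like this (timeout is in seconds):
# with urllib.request.urlopen(post_req, timeout=10) as r:
#     body = r.read()
```

Inspecting get_method() is a cheap way to confirm which HTTP method will be used before any traffic leaves the machine.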

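GET request

import urllib.request

r = urllib.request.urlopen('https://www.nhooo.com/')
first_line = r.readline()  # read the first line of the HTML page
data = first_line + r.read()  # read the rest, so data holds the whole page
with open('./1.html', 'wb') as f:  # save the page locally
    f.write(data)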

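The object returned by urlopen provides these methods:

read(), readline(), readlines(), fileno(), close(): used exactly like the same methods on a file object
info(): returns an http.client.HTTPMessage object (httplib.HTTPMessage in Python 2) holding the headers sent back by the remote server
getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the URL was not found
geturl(): returns the URL that was actually retrieved, which can differ from the requested one after a redirect

urllib.parse.quote(url) and urllib.parse.quote_plus(url) (urllib.quote and urllib.quote_plus in Python 2) percent-encode keywords so that urlopen can handle them.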
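The difference between the two encoders is easiest to see side by side. The following sketch uses a made-up keyword; only the standard urllib.parse functions are involved.

```python
from urllib.parse import quote, quote_plus, unquote

# Made-up search keyword containing a space, non-ASCII text and "&".
keyword = "python 爬虫&urllib"

# quote() encodes a space as %20 and leaves "/" unescaped by default.
encoded = quote(keyword)
# quote_plus() encodes a space as "+" and also escapes "/".
encoded_plus = quote_plus(keyword)

print(encoded)
print(encoded_plus)
print(quote("a b/c"), quote_plus("a b/c"))

# unquote() reverses the percent-encoding.
print(unquote(encoded) == keyword)
```

quote() suits path segments, while quote_plus() matches the form-style encoding used in query strings.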

POST request

import urllib.parse
import urllib.request

url = 'https://passport.jb51.net/user/signin?'
post = {
    'username': 'xxx',
    'password': 'xxxx'
}
# urlencode() returns a str; encode it to bytes for the request body
postdata = urllib.parse.urlencode(post).encode('utf-8')
req = urllib.request.Request(url, postdata)
r = urllib.request.urlopen(req)

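Operations such as registering or logging in send their information through a POST form.

To reproduce this, analyze the page structure and build the form data as a dict (post above). urlencode() encodes it and returns a string, which is then encoded as 'utf-8', because POST data can only be bytes or a file object. Finally, pass postdata to a Request object and send the request with urlopen().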
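The encoding step can be checked on its own, without contacting any server. This sketch reuses the placeholder credentials from the example above and shows the intermediate string, the final bytes, and the reverse operation.

```python
import urllib.parse

# Placeholder form fields, as in the example above.
form = {"username": "xxx", "password": "xxxx"}

encoded_text = urllib.parse.urlencode(form)   # str: key=value pairs joined by "&"
post_data = encoded_text.encode("utf-8")      # bytes: what Request/urlopen expects

print(encoded_text)
print(post_data)

# parse_qs() decodes the string back into a dict whose values are lists.
decoded = urllib.parse.parse_qs(post_data.decode("utf-8"))
print(decoded)
```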

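2. urllib.request.Request

urlopen() can issue the most basic request, but its few simple parameters are not enough to build a complete one. When a request needs headers or similar information to imitate a browser, the more powerful Request class can be used to build it.

import urllib.parse
import urllib.request

url = 'https://passport.jb51.net/user/signin?'
post = {
    'username': 'xxx',
    'password': 'xxxx'
}
# Request headers that imitate a browser
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
postdata = urllib.parse.urlencode(post).encode('utf-8')
req = urllib.request.Request(url, postdata, headers)
r = urllib.request.urlopen(req)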

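3. urllib.request.BaseHandler

The steps above let us construct a Request, but how do we set up more advanced behaviour such as cookie handling or proxies?

This is where a more powerful tool, the Handler, comes in. The basic urlopen() function does not support authentication, cookies, proxies or other advanced HTTP features. To use them, create your own custom opener object with the build_opener() function.

urllib.request.BaseHandler is the parent class of all other handlers and provides the most basic handler methods.

HTTPDefaultErrorHandler handles HTTP error responses; every error is raised as an HTTPError exception.

HTTPRedirectHandler handles redirects.

HTTPCookieProcessor handles cookies.

ProxyHandler sets proxies. When created without arguments it reads the proxy settings from the environment.

HTTPPasswordMgr manages passwords; it maintains a table of user names and passwords.

HTTPBasicAuthHandler manages authentication. If opening a link requires authentication, this handler can take care of it.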
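To see how these handlers fit together, the sketch below builds an opener with one extra handler and lists the resulting chain. It relies on the handlers attribute of the opener object, which CPython's OpenerDirector keeps as a plain list; treat that attribute as an implementation detail assumed here for inspection only. No request is sent.

```python
import http.cookiejar
import urllib.request

jar = http.cookiejar.CookieJar()

# build_opener() adds the default handlers and appends the ones passed in.
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Collect the class names of the handlers now active in this opener.
handler_names = sorted(type(h).__name__ for h in opener.handlers)
for name in handler_names:
    print(name)
```

The exact list depends on the Python build and environment (for example, whether HTTPS support is compiled in or proxy variables are set), so no fixed output is shown.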

Proxy server settings

import urllib.request

def use_proxy(proxy_addr, url):
    # Build the proxy handler
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    # Build the opener object
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    # To install the opener globally and use urlopen() instead:
    # urllib.request.install_opener(opener)
    # data = urllib.request.urlopen(url).read().decode('utf8')
    response = opener.open(url)  # open directly through the opener
    return response.read().decode('utf8')

proxy_addr = '61.163.39.70:9999'
data = use_proxy(proxy_addr, 'http://www.nhooo.com')
print(len(data))
# Exception handling and log output are omitted here

The opener is normally the opener object created by build_opener().

install_opener(opener) installs that opener as the global opener used by urlopen().

Using cookies

Storing cookies in a variable

import http.cookiejar
import urllib.request

# Create a CookieJar object with http.cookiejar.CookieJar()
cookie = http.cookiejar.CookieJar()
# Create a cookie processor with HTTPCookieProcessor and build an opener from it
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
# Install the opener globally
urllib.request.install_opener(opener)
response = urllib.request.urlopen('https://www.nhooo.com')
# response = opener.open('https://www.nhooo.com')
for item in cookie:
    print('Name=' + item.name)
    print('Value=' + item.value)

First declare a CookieJar object, then use HTTPCookieProcessor to build a handler, and finally build the opener with build_opener and call open() (or install it and call urlopen()). The loop at the end prints the contents of the cookie jar.
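The same flow can be tried without depending on an external site. The following sketch starts a throwaway HTTP server on 127.0.0.1 that sets one cookie, then fetches it through an opener with a CookieJar. The server class, the cookie name session and its value abc123 are all hypothetical, introduced only for this demonstration.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieDemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local server that sets one cookie on every GET."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_cookies_from_local_server():
    # Port 0 lets the OS pick a free port.
    server = HTTPServer(("127.0.0.1", 0), CookieDemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),          # ignore any system proxy
            urllib.request.HTTPCookieProcessor(jar),  # store received cookies
        )
        url = "http://127.0.0.1:%d/" % server.server_port
        with opener.open(url, timeout=5) as response:
            response.read()
        return {c.name: c.value for c in jar}
    finally:
        server.shutdown()
        server.server_close()


cookies = fetch_cookies_from_local_server()
print(cookies)
```

Using opener.open() directly, rather than install_opener(), keeps the cookie handling local to this opener and leaves the global urlopen() untouched.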

Saving cookies to a local file

import http.cookiejar
import urllib.request

# File to store the cookies: cookie.txt in the current directory
filename = 'cookie.txt'
# Declare a MozillaCookieJar instance to hold the cookies and write them to the file later
cookie = http.cookiejar.MozillaCookieJar(filename)
# Create a cookie processor with urllib's HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib.request.build_opener(handler)
# Send a request; this works the same way as urlopen
response = opener.open("https://www.nhooo.com")
# Save the cookies to the file
cookie.save(ignore_discard=True, ignore_expires=True)

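Exception handling

The structure of exception handling is as follows:

try:
    # the code to run
    print(...)
except Exception:
    # what to do if the code in the try block raises an exception
    print(...)
else:
    # runs only if the try block did not raise an exception
    print(...)
finally:
    # the code in finally always runs, no matter what
    print(...)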

Causes of URLError:

1. No network connection (the machine cannot reach the Internet)

from urllib import request, error

try:
    r = request.urlopen('https://www.nhooo.com')
except error.URLError as e:
    print(e.reason)

2. The requested page does not exist (HTTPError)

When a client sends a request to a server and the requested resource is returned successfully, the status code is 200, meaning the response succeeded. If the requested resource does not exist, the server usually returns a 404 error.

from urllib import request, error

try:
    response = request.urlopen('https://www.nhooo.com')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
else:
    print('Request Successfully')

# Use hasattr to check which attributes the exception has before reading them
from urllib import request, error

try:
    response = request.urlopen('http://blog.jb51.net')
except error.URLError as e:
    if hasattr(e, 'code'):
        print("the server couldn't fulfill the request")
        print('Error code:', e.code)
    elif hasattr(e, 'reason'):
        print('we failed to reach a server')
        print('Reason:', e.reason)
else:
    print('no exception was raised')
    # everything is ok
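Because HTTPError is a subclass of URLError, the order of the except clauses matters. The sketch below demonstrates this against a throwaway local server that always answers 404; the server class and the describe_failure helper are hypothetical names used only for this illustration.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical local server that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404, "Not Found")

    def log_message(self, *args):
        pass  # keep the demo output quiet


def describe_failure(url):
    """Return a short description of how fetching url went."""
    # An empty ProxyHandler makes the request ignore any system proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=5) as response:
            return "ok %d" % response.status
    except urllib.error.HTTPError as e:   # the subclass must be caught first
        return "http error %d" % e.code
    except urllib.error.URLError as e:    # connection-level failures
        return "url error: %s" % e.reason


server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    result = describe_failure("http://127.0.0.1:%d/missing" % server.server_port)
finally:
    server.shutdown()
    server.server_close()

print(result)
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
```

If the URLError clause came first, it would also catch the 404 and the status code branch would never run.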

Below are several representative examples of the urllib module.

1. Importing the urllib module

import urllib.request

response = urllib.request.urlopen('http://jb51.net/')
html = response.read()

2. Using Request

import urllib.request

req = urllib.request.Request('http://jb51.net/')
response = urllib.request.urlopen(req)
the_page = response.read()

3. Sending data

#!/usr/bin/env python3
import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
values = {
    'act': 'login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456'
}
# The request body must be bytes, so encode the urlencoded string
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data)
req.add_header('Referer', 'https://www.nhooo.com/')
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))

4. Sending data and headers

#!/usr/bin/env python3
import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
    'act': 'login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456'
}
headers = {'User-Agent': user_agent}
# The request body must be bytes, so encode the urlencoded string
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))

5. HTTP errors

#!/usr/bin/env python3
import urllib.error
import urllib.request

req = urllib.request.Request('https://www.nhooo.com')
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.read().decode("utf8"))

6. Exception handling

#!/usr/bin/env python3
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request("https://www.nhooo.com/")
try:
    response = urlopen(req)
except HTTPError as e:
    print("The server couldn't fulfill the request.")
    print('Error code:', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason:', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))

7. Exception handling with a single except clause

from urllib.request import Request, urlopen
from urllib.error import URLError

req = Request("https://www.nhooo.com/")
try:
    response = urlopen(req)
except URLError as e:
    # Check 'code' first: in Python 3 an HTTPError also has a 'reason'
    if hasattr(e, 'code'):
        print("The server couldn't fulfill the request.")
        print('Error code:', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason:', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))

8. HTTP authentication

#!/usr/bin/env python3
import urllib.request

# Create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "https://www.nhooo.com/"
password_mgr.add_password(None, top_level_url, 'rekfan', 'xxxxxx')
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
# Create an "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)
# Use the opener to fetch a URL
a_url = "https://www.nhooo.com/"
x = opener.open(a_url)
print(x.read())
# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)
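How the password manager decides which credentials apply can be checked offline. This sketch registers the same placeholder credentials as above and looks them up for two URLs; the realm string and the example.org URL are made up for the demonstration.

```python
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Realm None acts as the default: it matches whatever realm the server names.
password_mgr.add_password(None, "https://www.nhooo.com/", "rekfan", "xxxxxx")

# A URL below the registered top-level URL gets the stored credentials.
inside = password_mgr.find_user_password(
    "some realm", "https://www.nhooo.com/a/b.html"
)
# A URL on another host gets nothing.
outside = password_mgr.find_user_password("some realm", "https://example.org/")

print(inside)
print(outside)
```

Registering the top-level URL is therefore enough to cover every page underneath it.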

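9. Using a proxy

#!/usr/bin/env python3
import urllib.request

# ProxyHandler keys are URL schemes (http, https); urllib has no built-in SOCKS5 support,
# so the proxy at localhost:1080 must be an HTTP proxy
proxy_support = urllib.request.ProxyHandler({'http': 'localhost:1080', 'https': 'localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
a = urllib.request.urlopen("https://www.nhooo.com").read().decode("utf8")
print(a)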

10. Timeouts

#!/usr/bin/env python3
import socket
import urllib.request

# Timeout in seconds
timeout = 2
socket.setdefaulttimeout(timeout)
# This call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('https://www.nhooo.com/')
a = urllib.request.urlopen(req).read()
print(a)

11. Creating your own opener with build_opener

import urllib.request

header = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')]
# Create the opener object
opener = urllib.request.build_opener()
opener.addheaders = header
# Install the opener as the global opener used by urlopen()
urllib.request.install_opener(opener)
response = urllib.request.urlopen('https://www.nhooo.com/')
buff = response.read()
html = buff.decode("utf8")
response.close()
print(html)

12. Remote download with urllib.request.urlretrieve

import urllib.request

header = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')]
# Create the opener object
opener = urllib.request.build_opener()
opener.addheaders = header
# Install the opener as the global opener used by urlopen()
urllib.request.install_opener(opener)
# Download the file into the current folder
urllib.request.urlretrieve('https://www.nhooo.com/', 'baidu.html')
# Clean up temporary files left behind by urlretrieve
urllib.request.urlcleanup()

13. POST request

import urllib.parse
import urllib.request

url = 'https://www.nhooo.com/mypost/'
# Encode the data with urlencode, then use encode() to convert it to utf-8 bytes
postdata = urllib.parse.urlencode({'name': '测试名', 'pass': "123456"}).encode('utf-8')
# urllib.parse.quote() accepts a string;
# urllib.parse.urlencode() accepts a dict or a list of 2-tuples [(a, b), (c, d)] and joins the key-value pairs with &
req = urllib.request.Request(url, postdata)
# urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
# url: a string containing the URL.
# data: the body of the HTTP request; if given, POST is sent instead of GET.
# headers: a dict.
# origin_req_host and unverifiable relate to third-party cookie handling; method sets the HTTP method explicitly.
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
data = urllib.request.urlopen(req).read()
# The data parameter of urlopen() defaults to None; when data is not None, urlopen() submits the request as POST.

14. Using cookies

1. Storing cookies in a variable

import http.cookiejar
import urllib.request

# Declare a CookieJar instance to hold the cookies
cookie = http.cookiejar.CookieJar()
# Create a cookie processor with urllib's HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib.request.build_opener(handler)
# Install the opener globally; opener.open() works like urllib.request.urlopen() and also accepts a Request
urllib.request.install_opener(opener)
# Use urlopen or urlretrieve to fetch the site and collect its cookies
urllib.request.urlretrieve('https://www.nhooo.com/', 'baidu.html')
# data = urllib.request.urlopen('https://www.nhooo.com/')

2. Saving cookies to a file

import http.cookiejar
import urllib.request

# File to store the cookies: cookie.txt in the current directory
filename = 'cookie.txt'
# Declare a MozillaCookieJar instance to hold the cookies and write them to the file later
cookie = http.cookiejar.MozillaCookieJar(filename)
# Create a cookie processor with urllib's HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib.request.build_opener(handler)
# Send a request; this works the same way as urllib's urlopen
response = opener.open("https://www.nhooo.com")
# Save the cookies to the file
cookie.save(ignore_discard=True, ignore_expires=True)

3. Loading cookies from a file and using them for a request

import http.cookiejar
import urllib.request

# Create a MozillaCookieJar instance
cookie = http.cookiejar.MozillaCookieJar()
# Read the cookie contents from the file into the variable
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
# Create the request
req = urllib.request.Request("https://www.nhooo.com")
# Create an opener with urllib's build_opener
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
response = opener.open(req)
print(response.read())
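The save and load steps can be exercised without any network access. In this sketch the cookie is built by hand, which a server would normally do; the make_cookie helper, the cookie name and value, and the example.com domain are hypothetical. It also shows why ignore_discard=True matters for session cookies.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (normally the server sets these)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# Write the cookie file into a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

saved = http.cookiejar.MozillaCookieJar(path)
saved.set_cookie(make_cookie("session", "abc123", "example.com"))
saved.save(ignore_discard=True, ignore_expires=True)

# Load it back, keeping session cookies.
loaded = http.cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
restored = {c.name: c.value for c in loaded}
print(restored)

# Without ignore_discard, session cookies are skipped on load.
strict = http.cookiejar.MozillaCookieJar()
strict.load(path)
print(len(strict))
```

A cookie without an expiry date is treated as a session cookie, so both save() and load() drop it unless ignore_discard=True is passed.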

15. Proxy server settings

import socket
import time
from urllib import request

# Set the socket connection timeout, which also determines the urlopen timeout
socket.setdefaulttimeout(1)
# Proxy server information; the address below is an HTTP proxy
start_time = time.time()
# Set the proxy for both http and https
proxy = request.ProxyHandler({'https': '175.155.25.91:808', 'http': '175.155.25.91:808'})
opener = request.build_opener(proxy)
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0'),
    # ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    # ("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3"),
    # ("Accept-Encoding", "gzip,deflate,br"),
    # ("Connection", "keep-alive"),
    # ("Pragma", "no-cache"),
    # ("Cache-Control", "no-cache")
]
request.install_opener(opener)
# data = request.urlopen('https://www.nhooo.com/find-ip-address').read()
data = request.urlopen('http://www.ipip.net/').read().decode('utf-8')
# data = gzip.decompress(data).decode('utf-8', 'ignore')
end_time = time.time()
delay = end_time - start_time  # elapsed time in seconds
print(data)

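Sometimes calling decode('utf-8') directly on the data returned by urlopen fails, and the body only becomes readable with gzip.decompress(data).decode('utf-8', 'ignore'). This happens when the server sends a gzip-compressed body, which is probably triggered by the request headers (for example an Accept-Encoding: gzip header); changing the headers sometimes makes the problem go away.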
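One way to handle both cases is to decompress only when the body is actually gzip data. The following is a minimal sketch under that assumption; decode_body is a hypothetical helper, and the two response bodies are simulated rather than fetched.

```python
import gzip


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Decode a response body, un-gzipping it first when needed."""
    # Trust the Content-Encoding header when given; otherwise fall back to
    # the gzip magic bytes 0x1f 0x8b at the start of the data.
    if content_encoding == "gzip" or raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode(charset, "ignore")


# Simulated response bodies: one plain, one gzip-compressed.
plain = "<html>plain</html>".encode("utf-8")
zipped = gzip.compress("<html>gzipped</html>".encode("utf-8"))

print(decode_body(plain))
print(decode_body(zipped, "gzip"))

# With a real response object r from urlopen():
# text = decode_body(r.read(), r.headers.get("Content-Encoding"))
```

Leaving Accept-Encoding out of the request headers usually avoids compressed responses in the first place, at the cost of larger transfers.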

This article covered the usage of Python's urllib crawler module in detail, together with worked examples.
