requests入门使用

March 24, 2020

1844 8 minutes and 22 seconds

[DeadPool爬虫]requests入门使用

在进行完Python的基础学习之后，如何提高Python的相关编程能力？个人觉得最快捷的手段还是从Python的项目实战开始，说道Python的项目实战，爬虫、Web开发和机器学习是当下应用最广泛的领域了，那么就正好借此机会整理一个爬虫的系列内容，一是通过授课和实战，解决代码编程量少的问题；二是锻炼自己在进行整体项目开发的时候代码组织能力；三是帮自己整理思路，把相关的爬虫知识进行查漏补缺。

requests 包

在学习整体的爬虫框架之前，也是先从相关的基础技术开始介绍起。首当其冲的就是requests包的使用，无论什么样的请求都要从最开始的HTTP请求开始。requests是Python的第三方库，但是在处理网络资源访问时，堪称神器~

这里特别要引用官方文档开头的一段话，十分让人心情愉悦~

Requests 唯一的一个非转基因的 Python HTTP 库，人类可以安全享用。

警告：非专业使用其他 HTTP 库会导致危险的副作用，包括：安全缺陷症、冗余代码症、重新发明轮子症、啃文档症、抑郁、头疼、甚至死亡。

那么 requests 的是否真的像它描述的这么神奇？可以参见官方给的对比例子~

这里是Python内置的urllib包进行访问

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2

gh_url = 'https://api.github.com'

req = urllib2.Request(gh_url)

password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, gh_url, 'user', 'pass')

auth_manager = urllib2.HTTPBasicAuthHandler(password_manager)
opener = urllib2.build_opener(auth_manager)

urllib2.install_opener(opener)

handler = urllib2.urlopen(req)

print handler.getcode()
print handler.headers.getheader('content-type')

# ------
# 200
# 'application/json'

这里是requests包进行访问

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests

r = requests.get('https://api.github.com', auth=('user', 'pass'))

print r.status_code
print r.headers['content-type']

# ------
# 200
# 'application/json'

虽然针对Python3的现在是否requests还是最优的选择，暂不讨论，但是还是可以窥见在那个时候其杀伤力多么强大~

requests 使用

先声明一下，这个课程的内容先只是简单入门教程，还不涉及到高级用法~

PS:个人意见，与其看我BB，不如直接把官方文档好好读一读~ requests官方文档

总结创建请求

虽然官方文档说了一小段如何进行requests创建请求，但是总结起来的时候可以发现结合现代风格的REST API更好记忆

# 进行HEAD请求
requests.head('https://httpbin.org/')

# 进行OPTIONS请求
requests.options('https://httpbin.org/')

# 进行GET请求
requests.get('https://httpbin.org/get')

# 进行POST请求
requests.post('https://httpbin.org/post', data = {'key':'value'})

# 进行PUT请求
requests.put('https://httpbin.org/put', data = {'key':'value'})

# 进行DELETE请求
requests.delete('https://httpbin.org/delete')

response类型

使用requests进行请求之后返回的response主要有以下几个类型：

# Regular Response 常规类型
r = requests.get('https://httpbin.org/')
r.text

# RETURN HTML BODY

# Json Response Json类型
r = requests.get('https://httpbin.org/')
r.json()

# RETURN JSON BODY

# Binary Response 二进制类型
r = requests.get('https://httpbin.org/')
r.content

# RETURN BINARY BODY


# 附加的原始内容
# Raw Response 原始内容
r = requests.get('https://httpbin.org/')
r.raw

# RETURN RAW BYTES
# PS: 这里是二进制的bytes流，你得自己对字节流进行手动处理了~

这里其实也就是常见的HTTP返回格式，有点前端开发基础的就很好记了~

requests&&response header

自定义 requests header

# 定义header内容，按照要求自定义就行了
headers = {'user-agent': 'my-app/0.0.1'}

r = requests.get('https://httpbin.org/get', headers=headers)

查看 response header

r = requests.get('https://httpbin.org/get')
r.headers

# 查看全部返回的headers内容
{
    'content-encoding': 'gzip',
    'transfer-encoding': 'chunked',
    'connection': 'close',
    'server': 'nginx/1.0.4',
    'x-runtime': '148ms',
    'etag': '"e1ca502697e5c9317743dc078f67693f"',
    'content-type': 'application/json'
}

# 获取指定字段
r.headers['Content-Type']

requests 参数

这里初级使用方式只介绍两种请求的时候带参数的方式：

GET请求带参数

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://httpbin.org/get', params=payload)

POST请求带参数

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("https://httpbin.org/post", data=payload)

剩下的待参数的方式就是属于 requests 的高级用法了~

requests Cookies

同requests的header一样，这里也是请求时带上cookies和返回时获取cookies

获取cookies


url = 'http://example.com/some/cookie/setting/url'
r = requests.get(url)

# 获取全部的cookies
r.cookies

# 获得指定cookies
r.cookies['example_cookie_name']

请求时带上cookies


url = 'https://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)

这里官网做了说明返回的cookies其实是一个对象RequestsCookieJar而非常规的字符串，所以可以设置复杂的cookies进行操作


jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
url = 'https://httpbin.org/cookies'
r = requests.get(url, cookies=jar)

requests Timeout

首先要声明一点requests的超时并不是想象中的请求整个动作的过程的超时，而是在对请求发出之后服务器应答这个过程进行事件判断，更精确的说，是在设置的timeout秒内没有从基础套接字上接受到任何字节的数据时，如果结合socket编程的话，就会更明显的有感受，但是这个不属于Python的基础开发了。

但是也建议在规模化请求资源的时候，加上这个参数，防止程序失去响应


requests.get('http://github.com', timeout=0.001)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)

实战练习

其实requests的使用并不是很难，结合一个实战的例子就很好掌握了，下面就是结合上面所说的实战练习

目标是：通过请求B站的舞曲链接地址，获得页面内容，然后从中匹配到舞蹈视频的链接，然后在请求每个舞蹈视频的地址，输出页面内容

代码链接来自我女朋友的Github-2020-03-24


#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@Time    : 2020/3/25 16:47
@Author  : WangHuan
@Contact : hi_chengzi@126.com
@File    : 2020-03-24.py
@Software: PyCharm
@description: 
"""


import re
import requests


seed_url = "https://www.bilibili.com/v/dance"

regex = r"\<a href=\"\/\/www\.bilibili\.com\/video\/[a-zA-Z0-9]+\""


def fetch_seed_context():
    r = requests.get(seed_url)
    with open("index.html", "w", encoding="utf-8") as index_obj:
        index_obj.write(r.text)


def extract_herf():
    with open("index.html", "r", encoding="utf-8") as index_obj:
        content = index_obj.read()
        matches = re.finditer(regex, content, re.MULTILINE)

        for matchNum, match in enumerate(matches, start=1):
            target_url = "https:{0}".format(match.group()[9:-1])
            print(target_url)

            next_r = requests.get(target_url)
            print(next_r)


if __name__ == '__main__':
    fetch_seed_context()
    extract_herf()