Published: 2024-10-10 23:29:56
This guide walks through how to use the urllib3 and requests libraries in Python. Both are powerful tools for sending HTTP requests, but they differ in a few ways. Let's work through their usage step by step.
1. urllib3
urllib3 is a powerful HTTP client library that provides features such as thread-safe connection pooling and multipart file uploads.
1.1 Installing urllib3
pip install urllib3
1.2 Basic usage
import urllib3
# Disable warnings (not recommended in production)
urllib3.disable_warnings()
# Create a PoolManager instance
http = urllib3.PoolManager()
# Send a GET request
response = http.request('GET', 'https://api.example.com/users')
# Print the response status and body
print(response.status)
print(response.data.decode('utf-8'))
1.3 Sending a POST request
import json
data = {'username': 'john', 'password': 'secret'}
encoded_data = json.dumps(data).encode('utf-8')
response = http.request(
    'POST',
    'https://api.example.com/login',
    body=encoded_data,
    headers={'Content-Type': 'application/json'}
)
print(response.status)
print(response.data.decode('utf-8'))
1.4 Timeouts and retries
from urllib3.util.retry import Retry
from urllib3.util.timeout import Timeout
retries = Retry(total=3, backoff_factor=0.1)
timeout = Timeout(connect=5.0, read=10.0)
http = urllib3.PoolManager(retries=retries, timeout=timeout)
response = http.request('GET', 'https://api.example.com/users')
1.5 Using a connection pool
# Create a connection pool for a single host
pool = urllib3.HTTPConnectionPool('api.example.com', maxsize=10)
# Send requests through the pool
response = pool.request('GET', '/users')
1.6 SSL certificate verification
import certifi
http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where()
)
response = http.request('GET', 'https://api.example.com/users')
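1.7 Multipart file uploads
The introduction mentions multipart file uploads as one of urllib3's features. Below is a minimal sketch of that, assuming a hypothetical upload endpoint; passing a (filename, data, content type) tuple through the fields parameter makes urllib3 encode the body as multipart/form-data.
import urllib3
http = urllib3.PoolManager()
# Read the file to upload
with open('document.pdf', 'rb') as f:
    file_data = f.read()
response = http.request(
    'POST',
    'https://api.example.com/upload',  # assumed endpoint, for illustration only
    fields={'file': ('document.pdf', file_data, 'application/pdf')}
)
print(response.status)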
2. requests
requests is a higher-level HTTP library that offers a simpler API and more built-in functionality.
2.1 Installing requests
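pip install requests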
2.2 Basic usage
import requests
# Send a GET request
response = requests.get('https://api.example.com/users')
# Print the response status and content
print(response.status_code)
print(response.text)
# Automatically parse the JSON response
data = response.json()
print(data)
2.3 Sending a POST request
data = {'username': 'john', 'password': 'secret'}
response = requests.post('https://api.example.com/login', json=data)
print(response.status_code)
print(response.json())
2.4 Custom request headers
headers = {'User-Agent': 'MyApp/1.0'}
response = requests.get('https://api.example.com/users', headers=headers)
2.5 Sessions and cookies
# Create a session
session = requests.Session()
# Log in
login_data = {'username': 'john', 'password': 'secret'}
session.post('https://api.example.com/login', json=login_data)
# Reuse the same session (and its cookies) for later requests
response = session.get('https://api.example.com/dashboard')
2.6 File uploads
with open('document.pdf', 'rb') as f:
    files = {'file': f}
    response = requests.post('https://api.example.com/upload', files=files)
2.7 Timeouts and retries
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
# Retry up to 3 times on common transient error codes
retry_strategy = Retry(
    total=3,
    backoff_factor=0.1,
    status_forcelist=[429, 500, 502, 503, 504]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session = requests.Session()
session.mount('https://', adapter)
session.mount('http://', adapter)
response = session.get('https://api.example.com/users', timeout=5)
2.8 Streaming requests
with requests.get('https://api.example.com/large-file', stream=True) as r:
    r.raise_for_status()
    with open('large-file.zip', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
3. A complete example: scraping a web page
import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an exception if the request was not successful
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract the title
        title = soup.title.string if soup.title else 'No title found'
        # Extract the text of every paragraph
        paragraphs = [p.text for p in soup.find_all('p')]
        # Extract all links
        links = [a['href'] for a in soup.find_all('a', href=True)]
        return {
            'title': title,
            'paragraphs': paragraphs,
            'links': links
        }
    except requests.RequestException as e:
        print(f"Request error: {e}")
        return None
# Example usage
result = scrape_website('https://example.com')
if result:
    print(f"Title: {result['title']}")
    print(f"Number of paragraphs: {len(result['paragraphs'])}")
    print(f"Number of links: {len(result['links'])}")
Which library to choose depends on your specific needs. For most cases, requests is more than sufficient and easier to use, but if you need lower-level control or are building a performance-critical application, urllib3 may be the better choice.
This guide has covered urllib3 and requests from basic to advanced usage. In real projects you will likely need to adapt this code to the specific sites and page structures you are scraping. Always respect each site's robots.txt rules and terms of service so that your scraping stays legal and ethical.
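As a minimal sketch of how such a robots.txt check might look, using the standard library's urllib.robotparser and reusing the scrape_website function defined above (the URLs and user agent string here are assumptions for illustration):
from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# Only crawl the page if the rules allow our user agent to fetch it
url = 'https://example.com/some-page'
if rp.can_fetch('MyApp/1.0', url):
    result = scrape_website(url)
else:
    print(f"Crawling disallowed by robots.txt: {url}")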