Python Web Scraper Types and the robots.txt Protocol: Detailed Steps - NJ0827.NET

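This article presents a fuller web scraper example, with detailed steps for handling the situations that come up most often. We will build a scraper for a news site, and the example touches on several aspects of scraping.

Let's go step by step:

Project setup

First, we need to set up the project environment.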

bash

mkdir news_scraper
cd news_scraper
python -m venv venv
source venv/bin/activate  # on Windows use: venv\Scripts\activate
pip install requests beautifulsoup4 pandas lxml selenium fake-useragent

Basic structure

Create the following files:

main.py: main program
scraper.py: scraper class
utils.py: utility functions

Writing the code

Let's start with utils.py:

python

# utils.py
import time
from urllib.parse import urlparse

import requests
from fake_useragent import UserAgent


def get_random_ua():
    """Return a random User-Agent string."""
    ua = UserAgent()
    return ua.random


def make_request(url, max_retries=3, delay=1):
    """GET a URL, retrying up to max_retries times; return None if every attempt fails."""
    headers = {'User-Agent': get_random_ua()}
    for i in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            print(f"Request failed (attempt {i+1}/{max_retries}): {e}")
            if i < max_retries - 1:
                time.sleep(delay)
    return None


def is_valid_url(url):
    """Return True if the URL has both a scheme and a network location."""
    try:
        result = urlparse(url)  # urlparse lives in urllib.parse, not in requests
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

Now, let's create scraper.py:

python

# scraper.py
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from utils import make_request, is_valid_url


class NewsScraper:
    def __init__(self, base_url):
        self.base_url = base_url
        self.articles = []

    def scrape_headlines(self):
        """Collect headline titles and URLs from the base page."""
        response = make_request(self.base_url)
        if not response:
            print("Could not fetch the page content")
            return

        soup = BeautifulSoup(response.content, 'lxml')
        headlines = soup.find_all('h2', class_='headline')

        for headline in headlines:
            link = headline.find('a')
            article = {
                'title': headline.text.strip(),
                # .get() avoids a KeyError when the <a> tag has no href
                'url': link.get('href') if link else None
            }
            # Turn a relative URL into an absolute one
            if article['url'] and not is_valid_url(article['url']):
                article['url'] = f"{self.base_url.rstrip('/')}/{article['url'].lstrip('/')}"
            self.articles.append(article)

    def scrape_article_content(self):
        """Load each article in headless Chrome and store its text."""
        chrome_options = Options()
        chrome_options.add_argument("--headless")

        with webdriver.Chrome(options=chrome_options) as driver:
            for article in self.articles:
                if not article['url']:
                    continue

                driver.get(article['url'])

                try:
                    # Wait up to 10 seconds for the article body to appear
                    content = WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located((By.CLASS_NAME, "article-content"))
                    )
                    article['content'] = content.text
                except Exception as e:
                    print(f"Could not fetch the article content: {e}")
                    article['content'] = None

                time.sleep(1)  # avoid sending requests too frequently

    def get_articles(self):
        return self.articles

Finally, create main.py:

python

# main.py
import pandas as pd

from scraper import NewsScraper


def main():
    base_url = "https://example-news-site.com"
    scraper = NewsScraper(base_url)

    print("Scraping news headlines...")
    scraper.scrape_headlines()

    print("Scraping article content...")
    scraper.scrape_article_content()

    articles = scraper.get_articles()

    if articles:
        df = pd.DataFrame(articles)
        df.to_csv('news_articles.csv', index=False)
        print(f"Saved {len(articles)} articles to news_articles.csv")
    else:
        print("No articles found")


if __name__ == "__main__":
    main()

Running the scraper

Run it from the command line:

bash

python main.py

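Detailed explanation of the steps

a. Environment setup:

Create a virtual environment so the project's dependencies stay isolated
Install the required libraries: requests for HTTP requests, beautifulsoup4 for HTML parsing, pandas for data handling, selenium for dynamic content, plus lxml as the parser backend and fake-useragent for random User-Agent strings

b. Utility functions (utils.py):

get_random_ua(): generates a random User-Agent, which helps avoid being detected as a scraper
make_request(): sends an HTTP request, with a retry mechanism and error handling
is_valid_url(): checks whether a URL is absolute (has a scheme and a host); used when handling relative URLs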
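
To make the URL check concrete, here is a minimal standalone sketch of is_valid_url() using only the standard library. The sample URLs are made up for illustration; the point is that only a URL with both a scheme and a host counts as valid, so relative links fall through to the "make it absolute" step.

```python
from urllib.parse import urlparse


def is_valid_url(url):
    """Return True only for absolute URLs that have both a scheme and a host."""
    try:
        result = urlparse(url)
    except ValueError:
        return False
    return bool(result.scheme and result.netloc)


# Hypothetical sample links of the kind found in an href attribute
samples = [
    "https://example.com/news/1",   # absolute
    "/news/1",                      # root-relative
    "news/1",                       # path-relative
    "//cdn.example.com/a.js",       # host but no scheme
]
for s in samples:
    print(f"{s!r}: {is_valid_url(s)}")
```

Only the first sample passes the check; the other three would need to be resolved against the base URL before they can be requested.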

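c. Scraper class (scraper.py):

scrape_headlines(): scrapes news headlines and their URLs
- Parses the HTML with BeautifulSoup
- Handles relative URLs so that every stored URL is complete
scrape_article_content(): scrapes the article content
- Uses Selenium to handle content that may be rendered dynamically
- Waits for a specific element to load, so content is not missed
- Adds a delay between pages to avoid sending requests too frequently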
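
The scraper above builds absolute URLs by joining strings, which works for root-relative links. As an alternative (not what the original code does), the standard library's urllib.parse.urljoin also resolves path-relative and "../" links. The helper name to_absolute and the sample links below are assumptions for illustration.

```python
from urllib.parse import urljoin


def to_absolute(base_url, href):
    """Resolve href against base_url; absolute hrefs are returned unchanged."""
    return urljoin(base_url, href)


# Hypothetical page URL and links found on it
page_url = "https://example-news-site.com/world/"
hrefs = [
    "/politics/story-1",                    # root-relative
    "story-2",                              # relative to the current directory
    "../sports/story-3",                    # parent directory
    "https://other.example.org/story-4",    # already absolute
]
for href in hrefs:
    print(to_absolute(page_url, href))
```

Note that urljoin resolves against the page the link was found on, so passing the page URL rather than only the site root matters for path-relative links.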

d. Main program (main.py):

Creates a scraper instance
Calls the methods that scrape headlines and content
Saves the results to a CSV file

Handling common problems

a. Anti-scraping measures:

Use a random User-Agent
Add a delay between requests
Use Selenium to mimic real browser behavior
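
The scraper above sleeps for a fixed one second between pages. One possible refinement, sketched here as an assumption rather than part of the original code, is to add a small random component so requests are not perfectly evenly spaced. The function name polite_sleep and its default values are hypothetical.

```python
import random
import time


def polite_sleep(base_delay=1.0, jitter=0.5):
    """Sleep for base_delay plus a random extra of up to `jitter` seconds.

    Returns the pause that was actually used, which is handy for logging.
    """
    pause = base_delay + random.uniform(0, jitter)
    time.sleep(pause)
    return pause


# Example: three short pauses (tiny values so the demo finishes quickly)
for _ in range(3):
    used = polite_sleep(base_delay=0.05, jitter=0.05)
    print(f"paused for {used:.3f}s")
```

In the scraper, a call like this could replace time.sleep(1) inside the article loop.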

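b. Dynamic content:

Use Selenium to load content rendered by JavaScript

c. Connection problems:

Implement a request retry mechanism
Use try-except to catch and handle network exceptions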
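
make_request() waits a fixed delay between attempts. A common variation is exponential backoff, where each wait is twice the previous one. The sketch below shows the idea on a generic callable so it runs without network access; the names retry and make_flaky are hypothetical, and the simulated failures stand in for real network errors.

```python
import time


def retry(func, max_retries=3, base_delay=1.0, exceptions=(Exception,)):
    """Call func() until it succeeds or max_retries attempts have failed.

    Waits base_delay, then 2x, then 4x ... between attempts.
    Returns func()'s result, or None if every attempt raised.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except exceptions as exc:
            print(f"Attempt {attempt + 1}/{max_retries} failed: {exc}")
            if attempt < max_retries - 1:
                time.sleep(base_delay * 2 ** attempt)
    return None


def make_flaky(failures):
    """Build a function that raises `failures` times, then returns 'ok'."""
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError(f"simulated failure {state['calls']}")
        return "ok"

    return flaky


# Fails twice, succeeds on the third attempt
print(retry(make_flaky(2), max_retries=3, base_delay=0.01))
```

With requests, the same wrapper could be given a lambda that calls requests.get() and raise_for_status(), with exceptions=(requests.RequestException,).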

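d. Data parsing:

Use the lxml parser for better performance
Handle possible null values and missing data

e. Data storage:

Use pandas to save the data in CSV format, which is convenient for later analysis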
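
If pandas is not otherwise needed, the standard library's csv module can write the same file. This is a swapped-in alternative to the pandas call in main.py, not the original approach; the function name save_articles and the sample rows are assumptions. It also shows one way to handle missing values: None is written as an empty cell.

```python
import csv

# Hypothetical scraped rows; the second one has missing fields
articles = [
    {"title": "Sample headline", "url": "https://example-news-site.com/a", "content": "Body text"},
    {"title": "No content", "url": None, "content": None},
]


def save_articles(rows, path, fieldnames=("title", "url", "content")):
    """Write rows to a CSV file, turning None or missing keys into empty cells."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            cleaned = {k: "" if row.get(k) is None else row[k] for k in fieldnames}
            writer.writerow(cleaned)
    return len(rows)


if __name__ == "__main__":
    count = save_articles(articles, "news_articles.csv")
    print(f"Saved {count} articles to news_articles.csv")
```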

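Further improvements

Add command-line arguments so the user can specify the target URL and the output file
Use multithreading or asynchronous scraping to improve throughput
Add logging to make debugging and monitoring easier
Rotate proxy IPs to further work around anti-scraping measures
Add more thorough data cleaning and preprocessing steps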
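
The first three improvements can be sketched with the standard library alone. Everything here is an assumption for illustration: the flag names (--url, --output, --workers), the logger name, and the fetch_all helper are not part of the original code, and the fetch function is passed in so the sketch runs without network access.

```python
import argparse
import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("news_scraper")  # hypothetical logger name


def parse_args(argv=None):
    """Parse command-line options; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="Scrape headlines from a news site.")
    parser.add_argument("--url", default="https://example-news-site.com", help="base URL to scrape")
    parser.add_argument("--output", default="news_articles.csv", help="CSV file to write")
    parser.add_argument("--workers", type=int, default=4, help="number of worker threads")
    return parser.parse_args(argv)


def fetch_all(urls, fetch, workers=4):
    """Apply fetch(url) to every URL in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    args = parse_args()
    logger.info("Scraping %s into %s with %d workers", args.url, args.output, args.workers)
    # A stand-in fetch function; a real one would call make_request(url)
    results = fetch_all(["a", "b", "c"], str.upper, workers=args.workers)
    logger.info("Fetched %d items", len(results))
```

Threads suit the requests-based headline fetch; sharing one Selenium driver across threads is a separate problem and is not addressed by this sketch.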

Respecting the robots.txt protocol

Add the following method to scraper.py:

python

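from urllib.robotparser import RobotFileParser


class NewsScraper:
    # ... other code ...

    def check_robots_txt(self):
        """Return True if robots.txt allows any user agent to fetch the base URL."""
        rp = RobotFileParser()
        rp.set_url(f"{self.base_url.rstrip('/')}/robots.txt")
        rp.read()

        if not rp.can_fetch("*", self.base_url):
            print("robots.txt does not allow scraping this site")
            return False
        return True

    def scrape_headlines(self):
        # Call the check at the start of scrape_headlines
        if not self.check_robots_txt():
            return
        # ... rest of the method ...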
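
The method above checks only the base URL. robots.txt rules are per path, so it can be worth checking each article URL as well. The sketch below feeds RobotFileParser a made-up rule set through parse(), so it runs offline; the rules, the site name, and the helper name allowed are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline instead of being downloaded
robots_lines = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".strip().splitlines()

rp = RobotFileParser()
rp.parse(robots_lines)  # parse() accepts an iterable of lines


def allowed(url, user_agent="*"):
    """Return True if the parsed rules let user_agent fetch url."""
    return rp.can_fetch(user_agent, url)


print(allowed("https://example-news-site.com/news/1"))      # path not disallowed
print(allowed("https://example-news-site.com/private/x"))   # under /private/
print(rp.crawl_delay("*"))                                   # Crawl-delay value, if any
```

When a Crawl-delay is present, it is a natural value to use for the pause between requests instead of a hard-coded one second.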

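This detailed example covers several aspects of Python web scraping: basic HTML parsing, dynamic content handling, anti-scraping strategies, error handling, and data storage.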

Courtesy of: 05互联
