News Center and Beginner Tutorials
Published: 2024-10-10 23:15:49
Here is a more complete scraper example, with detailed steps for handling a range of common situations. We will build a scraper for a news website, and the example covers several aspects of the task.
Let's go step by step.
First, set up the project environment:
mkdir news_scraper
cd news_scraper
python -m venv venv
source venv/bin/activate  # On Windows use venv\Scripts\activate
pip install requests beautifulsoup4 pandas lxml selenium fake-useragent
Create the following files:
main.py: the main program
scraper.py: the scraper class
utils.py: utility functions
Let's start with utils.py:
# utils.py
import time
from urllib.parse import urlparse

import requests
from fake_useragent import UserAgent


def get_random_ua():
    """Return a random User-Agent string to make requests look less bot-like."""
    ua = UserAgent()
    return ua.random


def make_request(url, max_retries=3, delay=1):
    """Send a GET request with retries, a timeout, and basic error handling."""
    headers = {'User-Agent': get_random_ua()}
    for i in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            print(f"Request failed (attempt {i+1}/{max_retries}): {e}")
            if i < max_retries - 1:
                time.sleep(delay)
    return None


def is_valid_url(url):
    """Return True if the URL is absolute (has both a scheme and a netloc)."""
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False
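A quick way to sanity-check utils.py from a Python shell; the URLs here are only placeholders:

from utils import make_request, is_valid_url

print(is_valid_url("https://example.com"))  # True: has a scheme and a netloc
print(is_valid_url("/news/article-1"))      # False: a relative path
response = make_request("https://example.com")
print(response.status_code if response else "request failed")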
Now, let's create scraper.py:
# scraper.py
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from utils import make_request, is_valid_url


class NewsScraper:
    def __init__(self, base_url):
        self.base_url = base_url
        self.articles = []

    def scrape_headlines(self):
        """Fetch the front page and collect headline titles and URLs."""
        response = make_request(self.base_url)
        if not response:
            print("Could not fetch the page")
            return
        soup = BeautifulSoup(response.content, 'lxml')
        headlines = soup.find_all('h2', class_='headline')
        for headline in headlines:
            article = {
                'title': headline.text.strip(),
                'url': headline.find('a')['href'] if headline.find('a') else None
            }
            # Turn relative URLs into absolute ones
            if article['url'] and not is_valid_url(article['url']):
                article['url'] = f"{self.base_url.rstrip('/')}/{article['url'].lstrip('/')}"
            self.articles.append(article)

    def scrape_article_content(self):
        """Visit each article URL with headless Chrome and extract the body text."""
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        with webdriver.Chrome(options=chrome_options) as driver:
            for article in self.articles:
                if not article['url']:
                    continue
                driver.get(article['url'])
                try:
                    content = WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located((By.CLASS_NAME, "article-content"))
                    )
                    article['content'] = content.text
                except Exception as e:
                    print(f"Could not fetch article content: {e}")
                    article['content'] = None
                time.sleep(1)  # Avoid sending requests too quickly

    def get_articles(self):
        return self.articles
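If you want to try the class on its own before writing main.py, a quick check from a Python shell could look like this; the URL is only a placeholder and must point at a site whose headlines use the h2.headline markup assumed above:

from scraper import NewsScraper

scraper = NewsScraper("https://example-news-site.com")  # placeholder URL
scraper.scrape_headlines()
print(scraper.get_articles()[:3])  # inspect the first few collected headlines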
Finally, create main.py:
# main.py
import pandas as pd

from scraper import NewsScraper


def main():
    base_url = "https://example-news-site.com"
    scraper = NewsScraper(base_url)

    print("Scraping headlines...")
    scraper.scrape_headlines()

    print("Scraping article content...")
    scraper.scrape_article_content()

    articles = scraper.get_articles()
    if articles:
        df = pd.DataFrame(articles)
        df.to_csv('news_articles.csv', index=False)
        print(f"Saved {len(articles)} articles to news_articles.csv")
    else:
        print("No articles found")


if __name__ == "__main__":
    main()
Run it from the command line:
python main.py
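To spot-check the output, you can read the CSV back with pandas; this is just a small verification step, not part of the original scripts:

import pandas as pd

df = pd.read_csv('news_articles.csv')
print(df.shape)   # (number of articles, number of columns)
print(df.head())  # first rows: title, url, content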
Code walkthrough:
a. Environment setup: a virtual environment plus the required packages (requests, beautifulsoup4, pandas, lxml, selenium, fake-useragent)
b. Utility functions (utils.py):
get_random_ua(): generates a random User-Agent, which helps avoid being detected as a scraper
make_request(): sends HTTP requests with a retry mechanism and error handling
is_valid_url(): checks whether a URL is valid, used for handling relative URLs
c. Scraper class (scraper.py):
scrape_headlines(): scrapes the news headlines and their URLs
scrape_article_content(): scrapes the article bodies
d. Main program (main.py): ties the steps together and saves the results
How the example handles common situations:
a. Anti-scraping measures: random User-Agent headers, delays between requests, and a robots.txt check (added below)
b. Dynamic content: headless Chrome via Selenium renders JavaScript-driven article pages before extraction
c. Connection issues: make_request() uses a timeout and retries failed requests (an exponential-backoff variant is sketched after this list)
d. Data parsing: BeautifulSoup with the lxml parser extracts headlines, and relative URLs are converted to absolute ones
e. Data storage: results are collected into a pandas DataFrame and written to CSV (a JSON alternative is sketched at the end)
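For connection issues (point c), make_request() above retries a fixed number of times with a constant delay. A common refinement is exponential backoff; the variant below is an optional sketch, not part of the original utils.py, and reuses get_random_ua() from that file:

# utils.py (optional variant with exponential backoff)
import time
import requests

def make_request_backoff(url, max_retries=3, base_delay=1):
    """Like make_request(), but the wait doubles after each failed attempt."""
    headers = {'User-Agent': get_random_ua()}
    for i in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            print(f"Request failed (attempt {i+1}/{max_retries}): {e}")
            if i < max_retries - 1:
                time.sleep(base_delay * (2 ** i))  # wait 1s, 2s, 4s, ...
    return None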
To also respect the site's robots.txt rules, add the following method in scraper.py:
from urllib.robotparser import RobotFileParser


class NewsScraper:
    # ... other code ...

    def check_robots_txt(self):
        """Return False if robots.txt disallows crawling the base URL."""
        rp = RobotFileParser()
        rp.set_url(f"{self.base_url}/robots.txt")
        rp.read()
        if not rp.can_fetch("*", self.base_url):
            print("Crawling this site is disallowed by robots.txt")
            return False
        return True

    # Call this at the start of scrape_headlines:
    # if not self.check_robots_txt():
    #     return
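On data storage (point e), main.py writes the results to CSV. If JSON is preferred, pandas can write that directly as well; a minimal alternative using the same DataFrame as in main.py:

import pandas as pd

df = pd.DataFrame(articles)  # 'articles' as returned by scraper.get_articles()
df.to_json('news_articles.json', orient='records', force_ascii=False)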
This detailed example covers several aspects of Python web scraping, including basic HTML parsing, handling dynamic content, anti-scraping strategies, error handling, and data storage.
Courtesy of 05互联.