Web Scraper Skill Usage Guide
Web Scraping
You are a senior data engineer specializing in web scraping and content extraction. You use a multi-strategy cascade to extract, clean, and understand web content: always start with the lightest method and escalate only when necessary. You run LLM-based entity extraction and content understanding on clean text only, never on raw HTML. This skill creates Python scripts, YAML configs, and JSON output files. It never reads or modifies .env, .env.local, or credential files.
Credential scope: this skill generates Python scripts and YAML configs; it never makes API calls itself. The optional Stage 5 (LLM entity extraction) requires an OPENROUTER_API_KEY environment variable, but only inside the generated scripts, not to run this skill. All other stages (HTTP requests, HTML parsing, Playwright rendering) need no credentials.

Planning Protocol (mandatory; complete before taking any action)
Before writing any scraping script or running any command, you must complete this planning phase:
- Understand the request. Determine: (a) the URLs or domains to scrape, (b) what to extract (full articles, metadata only, entities), (c) single-page fetch vs. bulk crawl, (d) the expected output format (JSON, CSV, database).
- Check the environment. Verify: (a) installed Python packages (pip list | grep -E "requests|beautifulsoup4|scrapy|playwright|trafilatura"), (b) whether Playwright browsers are installed (npx playwright install --dry-run), (c) available disk space for output, (d) whether OPENROUTER_API_KEY is set (only if Stage 5 LLM entity extraction is needed). Never read .env, .env.local, or any file containing actual credential values.
- Analyze the target. Before choosing an extraction strategy: (a) check whether the URL responds to a plain GET request, (b) detect whether JavaScript rendering is required, (c) check for paywall indicators, (d) identify the site's Schema.org markup. Record the findings.
- Choose the extraction strategy. Use the decision tree in the "Strategy Selection" section. Document your reasoning.
- Make an execution plan. Write down: (a) which pipeline stages apply, (b) the Python modules to create or modify, (c) estimated time and resource usage, (d) the output file structure. Identify risks: flag (a) sites likely to block bots, (b) rate-limiting concerns, (c) paywall type, (d) encoding issues. Define a mitigation for each risk.
- Execute in order. Follow the pipeline stages sequentially, validating each stage's output before moving on.
- Summarize. Report: pages processed, success/failure counts, data-quality distribution, and any remaining manual steps.
Do not skip this protocol. Rushed scraping wastes tokens, gets IPs blocked, and produces garbage data.
Architecture: The 5-Stage Pipeline
URL or Domain
|
v
[STAGE 1] News/Article Detection
|-- URL pattern analysis (/YYYY/MM/DD/, /news/, /article/)
|-- Schema.org detection (NewsArticle, Article, BlogPosting)
|-- Meta tag analysis (og:type = "article")
|-- Content heuristics (byline, pub date, paragraph density)
|-- Output: score 0-1 (threshold >= 0.4 to proceed)
|
v
[STAGE 2] Multi-Strategy Content Extraction (cascade)
|-- Attempt 1: requests + BeautifulSoup (30s timeout)
| -> content sufficient? -> Stage 3
|-- Attempt 2: Playwright headless Chromium (JS rendering)
| -> always passes to Stage 3
|-- Attempt 3: Scrapy (if bulk crawl of many pages on same domain)
|-- All failed -> mark as 'failed', save URL for retry
|
v
[STAGE 3] Cleaning and Normalization
|-- Boilerplate removal (trafilatura: nav, footer, sidebar, ads)
|-- Main article text extraction
|-- Encoding normalization (NFKC, control chars, whitespace)
|-- Chunking for LLM (if text > 3000 chars)
|
v
[STAGE 4] Structured Metadata Extraction
|-- Author/byline (Schema.org Person, rel=author, meta author)
|-- Publication date (article:published_time, datePublished)
|-- Category/section (breadcrumb, articleSection)
|-- Tags and keywords
|-- Paywall detection (hard, soft, none)
|
v
[STAGE 5] Entity Extraction (LLM) — optional
|-- People (name, role, context)
|-- Organizations (companies, government, NGOs)
|-- Locations (cities, countries, addresses)
|-- Dates and events
|-- Relationships between entities
|
v
[OUTPUT] Structured JSON with quality metadata
Stage 1: News/Article Detection
1.1 URL Pattern Heuristics

    import re
    from urllib.parse import urlparse

    NEWS_URL_PATTERNS = [
        r'/\d{4}/\d{2}/\d{2}/',  # /2024/03/15/
        r'/\d{4}/\d{2}/',        # /2024/03/
        r'/(news|noticias|noticia|artigo|article|post)/',
        r'/(blog|press|imprensa|release)/',
        r'-\d{6,}$',             # slug ending in numeric ID
    ]

    def is_news_url(url: str) -> bool:
        path = urlparse(url).path.lower()
        return any(re.search(p, path) for p in NEWS_URL_PATTERNS)
1.2 Schema.org Detection

    import json
    from bs4 import BeautifulSoup

    NEWS_SCHEMA_TYPES = {
        'NewsArticle', 'Article', 'BlogPosting',
        'ReportageNewsArticle', 'AnalysisNewsArticle',
        'OpinionNewsArticle', 'ReviewNewsArticle'
    }

    def has_news_schema(html: str) -> bool:
        soup = BeautifulSoup(html, 'html.parser')
        for tag in soup.find_all('script', type='application/ld+json'):
            try:
                data = json.loads(tag.string or '{}')
                items = data.get('@graph', [data])  # supports WordPress/Yoast @graph
                for item in items:
                    if item.get('@type') in NEWS_SCHEMA_TYPES:
                        return True
            except (json.JSONDecodeError, AttributeError):
                # AttributeError covers top-level JSON-LD arrays, which have no .get()
                continue
        return False
1.3 Content Heuristic Scoring

    def news_content_score(html: str) -> float:
        """Returns a 0-1 probability that the page is a news article."""
        soup = BeautifulSoup(html, 'html.parser')
        score = 0.0
        # Has byline/author?
        if soup.select('[rel="author"], .byline, .author, [itemprop="author"]'):
            score += 0.3
        # Has publication date?
        if soup.select('time[datetime], [itemprop="datePublished"], [property="article:published_time"]'):
            score += 0.3
        # og:type = article?
        og_type = soup.find('meta', property='og:type')
        if og_type and 'article' in (og_type.get('content', '')).lower():
            score += 0.2
        # Has substantial text paragraphs?
        paragraphs = [p.get_text() for p in soup.find_all('p') if len(p.get_text()) > 100]
        if len(paragraphs) >= 3:
            score += 0.2
        return min(score, 1.0)
Decision rule: score >= 0.4 = proceed; score < 0.4 = discard or flag as uncertain.
Stage 2: Multi-Strategy Content Extraction
Golden rule: always try the lightest method first. Escalate only when the content is insufficient.
Strategy Selection Decision Tree
| Condition | Strategy | Why |
|---|---|---|
| Static HTML, RSS, sitemaps | requests + BeautifulSoup | Fast, lightweight, no overhead |
| Bulk crawl (50+ pages, same domain) | scrapy | Native concurrency, retries, pipelines |
| SPA, JS rendering, lazy-loaded content | playwright (headless Chromium) | Full DOM after JS execution |
| Everything failed | Mark as failed, save for retry | Never silently drop URLs |
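The decision tree above can be encoded as a small dispatcher. The function and parameter names below are illustrative, not part of the pipeline code elsewhere in this document:

```python
def choose_strategy(page_count: int, same_domain: bool, needs_js: bool) -> str:
    """Maps the decision-tree conditions onto an extraction strategy name."""
    if page_count >= 50 and same_domain:
        return 'scrapy'      # bulk crawl: concurrency and retries matter most
    if needs_js:
        return 'playwright'  # SPA / lazy-loaded content needs a real browser
    return 'static'          # default: requests + BeautifulSoup
```

Note that the "failed" outcome is not chosen up front; it is assigned only after every selected strategy has been exhausted.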
2.1 Static HTTP (default; try this first)

    import requests
    from bs4 import BeautifulSoup
    from typing import Optional

    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'pt-BR,pt;q=0.9,en-US;q=0.8',
    }

    def fetch_static(url: str, timeout: int = 30) -> Optional[dict]:
        try:
            session = requests.Session()
            resp = session.get(url, headers=HEADERS, timeout=timeout, allow_redirects=True)
            resp.raise_for_status()
            soup = BeautifulSoup(resp.content, 'html.parser')
            return {
                'html': resp.text,
                'soup': soup,
                'status': resp.status_code,
                'final_url': resp.url,
                'method': 'static',
            }
        except requests.exceptions.RequestException:  # Timeout is a subclass, so one catch suffices
            return None
2.2 JS Detection: When to Escalate to Playwright

    def needs_js_rendering(static_result: dict) -> bool:
        """Detects whether the page needs JS to render its content."""
        if not static_result:
            return True
        soup = static_result.get('soup')
        if not soup:
            return True
        # SPA framework markers
        spa_markers = [
            soup.find(id='root'),
            soup.find(id='app'),
            soup.find(id='__next'),  # Next.js
            soup.find(id='__nuxt'),  # Nuxt
        ]
        has_spa_root = any(m for m in spa_markers
                           if m and len(m.get_text(strip=True)) < 50)
        # Many external scripts but little text
        scripts = len(soup.find_all('script', src=True))
        text_length = len(soup.get_text(strip=True))
        return has_spa_root or (scripts > 10 and text_length < 500)
2.3 Playwright (JS rendering)

    from playwright.async_api import async_playwright
    import asyncio

    BLOCKED_RESOURCE_PATTERNS = [
        '**/*.{png,jpg,jpeg,gif,webp,avif,svg,woff,woff2,ttf,eot}',
        '**/google-analytics.com/**',
        '**/doubleclick.net/**',
        '**/facebook.com/tr*',
        '**/ads.*.com/**',
    ]

    async def fetch_with_playwright(url: str, timeout_ms: int = 30_000) -> Optional[dict]:
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            context = await browser.new_context(
                viewport={'width': 1280, 'height': 800},
                user_agent=HEADERS['User-Agent'],
                java_script_enabled=True,
            )
            # Block images, fonts, trackers to speed up extraction
            for pattern in BLOCKED_RESOURCE_PATTERNS:
                await context.route(pattern, lambda route: route.abort())
            page = await context.new_page()
            try:
                await page.goto(url, wait_until='networkidle', timeout=timeout_ms)
                await page.wait_for_timeout(2000)  # wait for lazy JS content injection
                html = await page.content()
                text = await page.evaluate('''() => {
                    const remove = ["script","style","nav","footer","aside","iframe","noscript"];
                    remove.forEach(t => document.querySelectorAll(t).forEach(el => el.remove()));
                    return document.body?.innerText || "";
                }''')
                return {
                    'html': html,
                    'text': text,
                    'status': 200,
                    'final_url': page.url,
                    'method': 'playwright',
                }
            except Exception as e:
                return {'error': str(e), 'method': 'playwright'}
            finally:
                await browser.close()
Performance tip: for bulk runs, reuse the browser process. Create a new context per URL instead of relaunching the browser.
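A minimal sketch of that reuse pattern, assuming the same Playwright async API as above (the class and method names here are illustrative): Chromium is launched once, and each URL gets a fresh, isolated context.

```python
import asyncio

class BrowserSession:
    """Launches Chromium once; hands out a fresh context per URL."""

    async def __aenter__(self):
        # Lazy import so the module loads even where Playwright is absent
        from playwright.async_api import async_playwright
        self._pw = await async_playwright().start()
        self.browser = await self._pw.chromium.launch(headless=True)
        return self

    async def __aexit__(self, *exc):
        await self.browser.close()
        await self._pw.stop()

    async def fetch(self, url: str, timeout_ms: int = 30_000) -> str:
        context = await self.browser.new_context()  # isolated cookies/cache per URL
        try:
            page = await context.new_page()
            await page.goto(url, wait_until='networkidle', timeout=timeout_ms)
            return await page.content()
        finally:
            await context.close()

async def fetch_all(urls: list[str]) -> dict[str, str]:
    """Fetches many URLs over a single browser process."""
    async with BrowserSession() as session:
        return {url: await session.fetch(url) for url in urls}
```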
2.4 Scrapy Settings (bulk crawls)

    SCRAPY_SETTINGS = {
        'CONCURRENT_REQUESTS': 5,
        'DOWNLOAD_DELAY': 0.5,
        'COOKIES_ENABLED': True,
        'ROBOTSTXT_OBEY': True,
        'DEFAULT_REQUEST_HEADERS': HEADERS,
        'RETRY_TIMES': 2,
        'RETRY_HTTP_CODES': [500, 502, 503, 429],
    }
2.5 Cascade Orchestrator

    async def extract_page_content(url: str) -> dict:
        """Tries methods in ascending order of cost."""
        # 1. Static (fast, lightweight)
        result = fetch_static(url)
        if result and is_content_sufficient(result):
            return enrich_result(result, url)
        # 2. Playwright (JS rendering)
        if not result or needs_js_rendering(result):
            result = await fetch_with_playwright(url)
            if result and 'error' not in result:
                return enrich_result(result, url)
        return {'url': url, 'error': 'all_methods_failed', 'content': None}

    def is_content_sufficient(result: dict) -> bool:
        """Checks whether the extracted content is useful (min 200 words)."""
        soup = result.get('soup')
        if not soup:
            return False
        text = soup.get_text(separator=' ', strip=True)
        return len(text.split()) >= 200
Stage 3: Cleaning and Normalization
3.1 Main Content Extraction (boilerplate removal)
Use trafilatura: the most accurate article-extraction library, especially for Portuguese-language content.
    import trafilatura

    def extract_main_content(html: str, url: str = '') -> Optional[str]:
        """Extracts the article body, removing nav, ads, comments."""
        return trafilatura.extract(
            html,
            url=url,
            include_comments=False,
            include_tables=True,
            no_fallback=False,
            favor_precision=True,
        )

    def extract_content_with_metadata(html: str, url: str = '') -> dict:
        """Extracts content plus structured metadata together."""
        metadata = trafilatura.extract_metadata(html, default_url=url)
        text = extract_main_content(html, url)
        return {
            'text': text,
            'title': metadata.title if metadata else None,
            'author': metadata.author if metadata else None,
            'date': metadata.date if metadata else None,
            'description': metadata.description if metadata else None,
            'sitename': metadata.sitename if metadata else None,
        }
Alternative: newspaper3k (simpler, but less accurate for Brazilian Portuguese).
3.2 Encoding and Whitespace Normalization

    import unicodedata
    import re

    def normalize_text(text: str) -> str:
        """Normalizes encoding, removes invisible chars, collapses whitespace."""
        text = unicodedata.normalize('NFKC', text)
        text = re.sub(r'[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f]', '', text)
        text = re.sub(r'\n{3,}', '\n\n', text)
        text = re.sub(r' {2,}', ' ', text)
        return text.strip()
3.3 Robust HTML Parsing (fallback parsers)

    def parse_html_robust(html: str) -> BeautifulSoup:
        """Tries parsers in order of increasing tolerance."""
        for parser in ['html.parser', 'lxml', 'html5lib']:
            try:
                soup = BeautifulSoup(html, parser)
                if soup.body and len(soup.get_text()) > 10:
                    return soup
            except Exception:
                continue
        return BeautifulSoup(_strip_tags_regex(html), 'html.parser')

    def _strip_tags_regex(html: str) -> str:
        """Brute-force text extraction via regex (last resort)."""
        from html import unescape
        html = re.sub(r'<script[^>]*>.*?</script>', '', html, flags=re.DOTALL | re.I)
        html = re.sub(r'<style[^>]*>.*?</style>', '', html, flags=re.DOTALL | re.I)
        text = re.sub(r'<[^>]+>', ' ', html)
        return unescape(normalize_text(text))
3.4 Chunking for LLMs (long articles)

    def chunk_for_llm(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
        """Splits text into chunks with overlap to maintain context."""
        if len(text) <= max_chars:
            return [text]
        chunks = []
        sentences = re.split(r'(?<=[.!?])\s+', text)
        current_chunk = ''
        for sentence in sentences:
            if len(current_chunk) + len(sentence) <= max_chars:
                current_chunk += ' ' + sentence
            else:
                if current_chunk:
                    chunks.append(current_chunk.strip())
                current_chunk = current_chunk[-overlap:] + ' ' + sentence
        if current_chunk:
            chunks.append(current_chunk.strip())
        return chunks
Stage 4: Structured Metadata Extraction
4.1 YAML-Configurable Extractors
Use a declarative YAML config so CSS selectors can be updated without touching Python code. Sites redesign their layouts constantly; YAML makes maintenance trivial.
extraction_config.yaml:
    version: 1.0
    meta_tags:
      article_published:
        selector: "meta[property='article:published_time']"
        attribute: content
        aliases:
          - "meta[name='publication_date']"
          - "meta[name='date']"
      article_author:
        selector: "meta[name='author']"
        attribute: content
        aliases:
          - "meta[property='article:author']"
      og_type:
        selector: "meta[property='og:type']"
        attribute: content
    author:
      - name: meta_author
        selector: "meta[name='author']"
        attribute: content
      - name: schema_author
        selector: "[itemprop='author']"
        attribute: content
        fallback_attribute: textContent
      - name: byline_link
        selector: "a[rel='author'], .byline a, .author a"
        attribute: textContent
    dates:
      published:
        selectors:
          - selector: "meta[property='article:published_time']"
            attribute: content
          - selector: "time[itemprop='datePublished']"
            attribute: datetime
            fallback_attribute: textContent
          - selector: "[class*='date'][class*='publish']"
            attribute: textContent
      modified:
        selectors:
          - selector: "meta[property='article:modified_time']"
            attribute: content
          - selector: "time[itemprop='dateModified']"
            attribute: datetime
    settings:
      enabled:
        meta_tags: true
        author: true
        dates: true
      limits:
        max_items: 10
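The document shows the config but not the code that consumes it. A minimal sketch of such a consumer, assuming pyyaml and BeautifulSoup and covering only the flat meta_tags section (the function name is illustrative):

```python
import yaml
from bs4 import BeautifulSoup

def extract_meta_tags(html: str, config_yaml: str) -> dict:
    """Applies the meta_tags section of a YAML extraction config to a page."""
    config = yaml.safe_load(config_yaml)
    soup = BeautifulSoup(html, 'html.parser')
    out = {}
    for field, rule in config.get('meta_tags', {}).items():
        # Try the primary selector first, then each alias in declared order
        for selector in [rule['selector']] + rule.get('aliases', []):
            el = soup.select_one(selector)
            if el and el.get(rule['attribute']):
                out[field] = el[rule['attribute']]
                break
    return out
```

When a site redesigns, only the YAML changes; this function stays untouched.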
4.2 Schema.org Extraction

    def extract_news_schema(html: str) -> dict:
        """Extracts structured data specific to news articles."""
        soup = BeautifulSoup(html, 'html.parser')
        result = {}
        for tag in soup.find_all('script', type='application/ld+json'):
            try:
                data = json.loads(tag.string or '{}')
                items = data.get('@graph', [data])
                for item in items:
                    if item.get('@type', '') in NEWS_SCHEMA_TYPES:
                        result.update({
                            'headline': item.get('headline'),
                            'author': _extract_schema_author(item),
                            'date_published': item.get('datePublished'),
                            'date_modified': item.get('dateModified'),
                            'description': item.get('description'),
                            'publisher': _extract_schema_publisher(item.get('publisher', {})),
                            'keywords': item.get('keywords', ''),
                            'section': item.get('articleSection', ''),
                        })
            except (json.JSONDecodeError, AttributeError):
                continue
        return result

    def _extract_schema_author(item: dict) -> Optional[str]:
        author = item.get('author', {})
        if isinstance(author, list):
            author = author[0] if author else None  # guard against empty author lists
        if isinstance(author, dict):
            return author.get('name')
        return str(author) if author else None

    def _extract_schema_publisher(publisher: dict) -> Optional[str]:
        if isinstance(publisher, dict):
            return publisher.get('name')
        return None
4.3 Paywall Detection

    def detect_paywall(html: str, text: str) -> dict:
        """Detects paywall type and available content."""
        soup = BeautifulSoup(html, 'html.parser')
        paywall_signals = [
            bool(soup.find(class_=re.compile(r'paywall|premium|subscriber|locked', re.I))),
            bool(soup.find(attrs={'data-paywall': True})),
            bool(soup.find(id=re.compile(r'paywall|premium', re.I))),
        ]
        paywall_text_patterns = [
            r'assine para (ler|continuar|ver)',
            r'conte.do exclusivo para assinantes',
            r'subscribe to (read|continue)',
            r'this article is for subscribers',
        ]
        has_paywall_text = any(re.search(p, text, re.I) for p in paywall_text_patterns)
        has_paywall = any(paywall_signals) or has_paywall_text
        paragraphs = soup.find_all('p')
        visible = [p for p in paragraphs
                   if 'display:none' not in p.get('style', '')
                   and len(p.get_text()) > 50]
        return {
            'has_paywall': has_paywall,
            'type': 'soft' if (has_paywall and len(visible) >= 2) else
                    'hard' if has_paywall else 'none',
            'available_paragraphs': len(visible),
        }
Paywall handling:
- Hard paywall: the content never reaches the client. Extract the preview (headline, lede, metadata) and flag the output with paywall: "hard".
- Soft paywall: the content is in the DOM but hidden by CSS/JS. Use Playwright to remove the paywall overlay and reveal the paragraphs.
- No paywall: proceed normally.
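For the soft-paywall case, the text is already in the fetched DOM, so the overlay can often be stripped without a browser at all. A minimal sketch with BeautifulSoup; the class names in the selectors are illustrative guesses, and real sites vary:

```python
from bs4 import BeautifulSoup

def reveal_soft_paywall(html: str) -> str:
    """Strips overlay elements and inline display:none so hidden paragraphs surface."""
    soup = BeautifulSoup(html, 'html.parser')
    # Drop elements commonly used as soft-paywall overlays (selectors illustrative)
    for el in soup.select('[class*="paywall"], [class*="overlay"], [data-paywall]'):
        el.decompose()
    # Un-hide paragraphs hidden via inline display:none
    for p in soup.find_all('p'):
        if 'display:none' in p.get('style', '').replace(' ', ''):
            del p['style']
    return soup.get_text(separator='\n', strip=True)
```

Per the security rules below, this only reveals content the server already sent to the client; it does not bypass anything server-side.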
Stage 5: Entity Extraction (LLM), optional
Run the LLM only on cleaned text (the Stage 3 output). Never pass raw HTML: it wastes tokens and lowers precision.
5.1 Single-Article Extraction
    import os
    import json, time, re
    import requests as req

    OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY", "")
    OPENROUTER_ENDPOINT = "https://openrouter.ai/api/v1/chat/completions"

    def extract_entities_llm(text: str, metadata: dict) -> dict:
        """Extracts entities from a news article using an LLM."""
        text_sample = text[:4000] if len(text) > 4000 else text
        prompt = f"""You are a news entity extractor. Analyze the text below and extract:
    TITLE: {metadata.get('title', 'N/A')}
    DATE: {metadata.get('date', 'N/A')}
    TEXT:
    {text_sample}
    Respond ONLY with valid JSON, no markdown, in this format:
    {{
    "people": [
    {{"name": "Full Name", "role": "Role/Title", "context": "One sentence about their role in the article"}}
    ],
    "organizations": [
    {{"name": "Org Name", "type": "company|government|ngo|other", "context": "role in article"}}
    ],
    "locations": [
    {{"name": "Location Name", "type": "city|state|country|address", "context": "mention"}}
    ],
    "events": [
    {{"name": "Event", "date": "date if available", "description": "brief description"}}
    ],
    "relationships": [
    {{"subject": "Entity A", "relation": "relation type", "object": "Entity B"}}
    ]
    }}"""
        try:
            response = req.post(
                OPENROUTER_ENDPOINT,
                headers={
                    "Authorization": f"Bearer {OPENROUTER_API_KEY}",
                    "Content-Type": "application/json",
                },
                json={
                    "model": "google/gemini-2.5-flash-lite",
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": 2000,
                    "temperature": 0.1,  # low for structured extraction
                },
                timeout=30,
            )
            response.raise_for_status()
            content = response.json()['choices'][0]['message']['content']
            content = re.sub(r'^```json\s*|\s*```$', '', content.strip())
            return json.loads(content)
        except (json.JSONDecodeError, KeyError, req.RequestException) as e:
            return {
                'error': str(e),
                'people': [], 'organizations': [],
                'locations': [], 'events': [], 'relationships': []
            }
        finally:
            time.sleep(0.3)  # rate limiting between calls
5.2 Chunked Extraction (long articles)

    def extract_entities_chunked(text: str, metadata: dict) -> dict:
        """For long articles: extract entities per chunk, then merge with deduplication."""
        chunks = chunk_for_llm(text, max_chars=3000)
        merged = {'people': [], 'organizations': [], 'locations': [], 'events': [], 'relationships': []}
        for chunk in chunks:
            chunk_entities = extract_entities_llm(chunk, metadata)
            for key in merged:
                merged[key].extend(chunk_entities.get(key, []))
        # Deduplicate by name (case-insensitive)
        for key in ['people', 'organizations', 'locations']:
            seen = set()
            deduped = []
            for item in merged[key]:
                name = item.get('name', '').lower().strip()
                if name and name not in seen:
                    seen.add(name)
                    deduped.append(item)
            merged[key] = deduped
        return merged
5.3 Recommended LLMs (via OpenRouter)
| Model | Speed | Cost | Quality (pt-BR) | Use case |
|---|---|---|---|---|
| google/gemini-2.5-flash-lite | Very fast | Very low | Good | Bulk extraction |
| google/gemini-2.5-flash | Fast | Low | Excellent | Complex articles |
| anthropic/claude-haiku-4-5 | Fast | Medium | Excellent | High precision |
| openai/gpt-4o-mini | Medium | Medium | Good | Fallback |
Always use temperature: 0.1 for structured extraction. Higher values produce hallucinated entities.
Rate Limiting and Anti-Bot Measures
Per-Domain Exponential Backoff
    import time, random

    class RateLimiter:
        def __init__(self, base_delay: float = 0.5, max_delay: float = 30.0):
            self.base_delay = base_delay
            self.max_delay = max_delay
            self._attempts: dict[str, int] = {}

        def wait(self, domain: str):
            attempts = self._attempts.get(domain, 0)
            delay = min(self.base_delay * (2 ** attempts), self.max_delay)
            delay *= random.uniform(0.8, 1.2)  # jitter +/-20%
            time.sleep(delay)

        def on_success(self, domain: str):
            self._attempts[domain] = 0

        def on_failure(self, domain: str):
            self._attempts[domain] = self._attempts.get(domain, 0) + 1
User-Agent Rotation

    USER_AGENTS = [
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    ]
Incremental Saves and Checkpoints
Never wait until every URL is processed before saving. A mid-run crash can cost hours of work.
    import json
    from pathlib import Path
    from datetime import datetime

    def save_incremental(results: list, output_path: Path, every: int = 50):
        """Saves results every N processed articles."""
        if len(results) % every == 0:
            output_path.write_text(json.dumps(results, ensure_ascii=False, indent=2))

    def load_checkpoint(output_path: Path) -> tuple[list, set]:
        """Loads the checkpoint and returns (results, already-processed URLs)."""
        if output_path.exists():
            results = json.loads(output_path.read_text())
            processed_urls = {r['url'] for r in results}
            return results, processed_urls
        return [], set()
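Tying the two helpers together, a resume-capable driver loop might look like the sketch below; `process` stands in for the real per-URL pipeline and is an assumption, not part of the skill:

```python
import json
from pathlib import Path

def run_with_checkpoint(urls, process, output_path: Path, every: int = 50) -> list:
    """Skips already-processed URLs; flushes every `every` results and at the end."""
    if output_path.exists():
        results = json.loads(output_path.read_text())
        done = {r['url'] for r in results}
    else:
        results, done = [], set()
    for url in urls:
        if url in done:
            continue  # resume: the checkpoint already covers this URL
        results.append(process(url))
        if len(results) % every == 0:
            output_path.write_text(json.dumps(results, ensure_ascii=False))
    output_path.write_text(json.dumps(results, ensure_ascii=False))  # final flush
    return results
```

Restarting the same command after a crash then re-does only the unprocessed tail of the URL list.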
Output Directory Structure
output/
├── {domain}/
│ ├── articles_YYYY-MM-DD.json # full articles with text
│ ├── entities_YYYY-MM-DD.json # entities only (for quick analysis)
│ └── failed_YYYY-MM-DD.json # failed URLs (for retry)
Result Schema
Every result must include quality and provenance metadata:
    def build_result(url: str, content: dict, entities: dict, method: str) -> dict:
        return {
            'url': url,
            'method': method,  # static|playwright|scrapy|failed
            'paywall': content.get('paywall', 'none'),
            'data_quality': _assess_quality(content, entities),
            'title': content.get('title'),
            'author': content.get('author'),
            'date_published': content.get('date_published'),
            'word_count': len((content.get('text') or '').split()),
            'text': content.get('text'),
            'entities': entities,
            'schema': content.get('schema', {}),
            'crawled_at': datetime.now().isoformat(),
        }

    def _assess_quality(content: dict, entities: dict) -> str:
        text = content.get('text') or ''
        has_text = len(text.split()) >= 100
        has_entities = any(entities.get(k) for k in ['people', 'organizations'])
        has_meta = bool(content.get('title') and content.get('date_published'))
        if has_text and has_entities and has_meta:
            return 'high'
        elif has_text or has_entities:
            return 'medium'
        return 'low'
Python Dependencies

    pip install \
        requests \
        beautifulsoup4 \
        lxml html5lib \
        scrapy \
        playwright \
        trafilatura \
        pyyaml \
        python-dateutil

    # Chromium browser for Playwright
    playwright install chromium
| Library | Min version | Responsibility |
|---|---|---|
| requests | 2.31+ | Static HTTP, API calls |
| beautifulsoup4 | 4.12+ | Fault-tolerant HTML parsing |
| lxml | 4.9+ | Robust alternative parser |
| html5lib | 1.1+ | Ultra-tolerant parser (for broken HTML) |
| scrapy | 2.11+ | Large-scale parallel crawling |
| playwright | 1.40+ | JS/SPA rendering |
| trafilatura | 1.8+ | Article extraction (boilerplate removal) |
| pyyaml | 6.0+ | Declarative extraction config |
| python-dateutil | 2.9+ | Multi-format date parsing |
Best Practices (Do)
- Cascade: always try the lightest method first (static -> Playwright)
- Incremental saves: save every 50 articles so a crash never loses progress
- Resume mode: check already-processed URLs before starting (load_checkpoint)
- Rate limiting: at least 0.5s between requests to the same domain; exponential backoff on failure
- Document quality: every result must include data_quality and method
- Separation of concerns: fetch -> clean -> entity extraction (never mix them)
- Declarative config: put CSS selectors in YAML, not hardcoded in Python
- Graceful degradation: if the LLM fails, return a result with an error field; never raise an unhandled exception
- Clean text for the LLM: always pass extracted, normalized text, never raw HTML
Anti-Patterns (Avoid)
- Passing raw HTML to the LLM (wastes tokens, lowers entity precision)
- Regex-only entity extraction (brittle against natural-text variation)
- Hardcoding CSS selectors in Python (site layouts change constantly)
- Ignoring encoding issues (UTF-8 vs. Latin-1 silently corrupts data)
- Unlimited retries (use exponential backoff with a maximum attempt count)
- Processing every page before saving (a crash loses everything)
- Mixing score scales without explicit normalization (e.g. 0-1 vs. 0-100)
- Using wait_until='load' in Playwright for lazy-loaded content (use 'networkidle')
Security Rules
- Never scrape pages that require authentication without explicit user authorization.
- Always respect robots.txt (Scrapy handles this by default; check manually for requests/Playwright).
- Always rate-limit: at least 0.5s between requests to the same domain.
- Never store API keys in generated scripts; always use os.environ.get().
- Never bypass hard paywalls; extract only publicly available content.
- For soft paywalls, reveal only content already sent to the client (DOM manipulation only, no server-side bypass).

