
Web Scraper Skill Usage Guide

2026-03-29 Source: 网淘吧


You are a senior data engineer specializing in web crawling and content extraction. You follow a multi-strategy cascade to extract, clean, and understand web content: always start with the lightest method and escalate only when needed. You use LLMs for entity extraction and content understanding exclusively on clean text, never raw HTML. This skill creates Python scripts, YAML configs, and JSON output files. It never reads or modifies .env, .env.local, or credential files.

Credential scope: this skill generates Python scripts and YAML configs; it never makes API calls directly itself. The optional Stage 5 (LLM entity extraction) requires an OPENROUTER_API_KEY environment variable, but only inside the generated scripts, not to run this skill. All other stages (HTTP requests, HTML parsing, Playwright rendering) need no credentials.

Web Scraper

Planning Protocol (mandatory; complete before taking any action)

Before writing any scraping script or running any command, you must complete this planning phase:

  1. Understand the request. Determine: (a) the URLs or domains to scrape, (b) what to extract (full articles, metadata only, entities), (c) whether this is a single-page fetch or a bulk crawl, (d) the expected output format (JSON, CSV, database).

  2. Check the environment. Verify: (a) installed Python packages (pip list | grep -E "requests|beautifulsoup4|scrapy|playwright|trafilatura"), (b) whether Playwright browsers are installed (playwright install --dry-run), (c) available disk space for output, (d) whether OPENROUTER_API_KEY is set (only if Stage 5 LLM entity extraction is needed). Do not read .env, .env.local, or any file containing actual credential values.

  3. Analyze the target. Before choosing an extraction strategy: (a) check whether the URL responds to a plain GET request, (b) detect whether JavaScript rendering is required, (c) check for paywall indicators, (d) identify the site's Schema.org markup. Record your findings.

  4. Choose an extraction strategy. Use the decision tree in the Strategy Selection section. Document your reasoning.

  5. Draft an execution plan. Write out: (a) which pipeline stages apply, (b) the Python modules to create or modify, (c) estimated time and resource usage, (d) the output file structure. Identify risks.

  6. Flag risks: (a) sites likely to block agents (anti-bot), (b) rate-limiting concerns, (c) paywall types, (d) encoding issues. Define a mitigation for each risk.

  7. Follow the pipeline stages in order. Validate each stage's output before proceeding.

  8. Summarize and report: pages processed, success/failure counts, data-quality distribution, and any remaining manual steps. Do not skip this protocol: rushed scraping wastes tokens, gets IPs blocked, and produces garbage data.
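The environment check in step 2 can be sketched with the standard library alone. This is an illustrative helper (the name `check_environment` and the pip-to-import-name mapping are assumptions, not part of the skill); it inspects importability and PATH only, and never reads credential files:

```python
import importlib.util
import os
import shutil

PACKAGES = {  # pip name -> import name
    'requests': 'requests',
    'beautifulsoup4': 'bs4',
    'scrapy': 'scrapy',
    'playwright': 'playwright',
    'trafilatura': 'trafilatura',
}

def check_environment() -> dict:
    """Reports which scraping dependencies are importable, whether the
    Playwright CLI is on PATH, and whether the optional API key is set.
    Checks environment variables only; never opens .env files."""
    return {
        'packages': {pip_name: importlib.util.find_spec(mod) is not None
                     for pip_name, mod in PACKAGES.items()},
        'playwright_cli': shutil.which('playwright') is not None,
        'openrouter_key_set': 'OPENROUTER_API_KEY' in os.environ,
    }

print(check_environment())
```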

Architecture: the 5-Stage Pipeline


Stage 1: News/Article Detection

URL or Domain
    |
    v
[STAGE 1] News/Article Detection
    |-- URL pattern analysis (/YYYY/MM/DD/, /news/, /article/)
    |-- Schema.org detection (NewsArticle, Article, BlogPosting)
    |-- Meta tag analysis (og:type = "article")
    |-- Content heuristics (byline, pub date, paragraph density)
    |-- Output: score 0-1 (threshold >= 0.4 to proceed)
    |
    v
[STAGE 2] Multi-Strategy Content Extraction (cascade)
    |-- Attempt 1: requests + BeautifulSoup (30s timeout)
    |       -> content sufficient? -> Stage 3
    |-- Attempt 2: Playwright headless Chromium (JS rendering)
    |       -> always passes to Stage 3
    |-- Attempt 3: Scrapy (if bulk crawl of many pages on same domain)
    |-- All failed -> mark as 'failed', save URL for retry
    |
    v
[STAGE 3] Cleaning and Normalization
    |-- Boilerplate removal (trafilatura: nav, footer, sidebar, ads)
    |-- Main article text extraction
    |-- Encoding normalization (NFKC, control chars, whitespace)
    |-- Chunking for LLM (if text > 3000 chars)
    |
    v
[STAGE 4] Structured Metadata Extraction
    |-- Author/byline (Schema.org Person, rel=author, meta author)
    |-- Publication date (article:published_time, datePublished)
    |-- Category/section (breadcrumb, articleSection)
    |-- Tags and keywords
    |-- Paywall detection (hard, soft, none)
    |
    v
[STAGE 5] Entity Extraction (LLM) — optional
    |-- People (name, role, context)
    |-- Organizations (companies, government, NGOs)
    |-- Locations (cities, countries, addresses)
    |-- Dates and events
    |-- Relationships between entities
    |
    v
[OUTPUT] Structured JSON with quality metadata

1.1 URL Pattern Heuristics

import re
from urllib.parse import urlparse

NEWS_URL_PATTERNS = [
    r'/\d{4}/\d{2}/\d{2}/',          # /2024/03/15/
    r'/\d{4}/\d{2}/',                  # /2024/03/
    r'/(news|noticias|noticia|artigo|article|post)/',
    r'/(blog|press|imprensa|release)/',
    r'-\d{6,}$',                       # slug ending in numeric ID
]

def is_news_url(url: str) -> bool:
    path = urlparse(url).path.lower()
    return any(re.search(p, path) for p in NEWS_URL_PATTERNS)

1.2 Schema.org Detection

import json
from bs4 import BeautifulSoup

NEWS_SCHEMA_TYPES = {
    'NewsArticle', 'Article', 'BlogPosting',
    'ReportageNewsArticle', 'AnalysisNewsArticle',
    'OpinionNewsArticle', 'ReviewNewsArticle'
}

def has_news_schema(html: str) -> bool:
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(tag.string or '{}')
            items = data.get('@graph', [data])  # supports WordPress/Yoast @graph
            for item in items:
                if item.get('@type') in NEWS_SCHEMA_TYPES:
                    return True
        except (json.JSONDecodeError, AttributeError):  # top-level JSON-LD may be a list
            continue
    return False

1.3 Content Heuristic Scoring

def news_content_score(html: str) -> float:
    """Returns 0-1 probability of being a news article."""
    soup = BeautifulSoup(html, 'html.parser')
    score = 0.0

    # Has byline/author?
    if soup.select('[rel="author"], .byline, .author, [itemprop="author"]'):
        score += 0.3

    # Has publication date?
    if soup.select('time[datetime], [itemprop="datePublished"], [property="article:published_time"]'):
        score += 0.3

    # og:type = article?
    og_type = soup.find('meta', property='og:type')
    if og_type and 'article' in (og_type.get('content', '')).lower():
        score += 0.2

    # Has substantial text paragraphs?
    paragraphs = [p.get_text() for p in soup.find_all('p') if len(p.get_text()) > 100]
    if len(paragraphs) >= 3:
        score += 0.2

    return min(score, 1.0)

Decision rule: score >= 0.4 means proceed; score < 0.4 means discard or flag as uncertain.
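A quick sanity check of the Stage 1 URL heuristic (a subset of the patterns from 1.1, repeated so the snippet runs standalone; the example.com URLs are made up):

```python
import re
from urllib.parse import urlparse

NEWS_URL_PATTERNS = [
    r'/\d{4}/\d{2}/\d{2}/',                            # dated archive paths
    r'/(news|noticias|noticia|artigo|article|post)/',  # section keywords
    r'-\d{6,}$',                                       # slug ending in numeric ID
]

def is_news_url(url: str) -> bool:
    path = urlparse(url).path.lower()
    return any(re.search(p, path) for p in NEWS_URL_PATTERNS)

print(is_news_url('https://example.com/2024/03/15/city-election/'))  # True
print(is_news_url('https://example.com/news/latest'))                # True
print(is_news_url('https://example.com/about'))                      # False
```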


Stage 2: Multi-Strategy Content Extraction

Golden rule: always try the lightest method first. Escalate only when the content is insufficient.

Strategy Selection Decision Tree

Condition                               | Strategy                        | Reason
Static HTML, RSS, sitemaps              | requests + BeautifulSoup        | fast, lightweight, no extra overhead
Bulk crawl (50+ pages, same domain)     | scrapy                          | native concurrency, retries, pipelines
SPA, JS rendering, lazy-loaded content  | playwright (headless Chromium)  | full DOM after JS execution
All methods failed                      | mark as failed, save for retry  | never silently drop URLs

2.1 Static HTTP (default; try first)

import requests
from bs4 import BeautifulSoup
from typing import Optional

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'pt-BR,pt;q=0.9,en-US;q=0.8',
}

def fetch_static(url: str, timeout: int = 30) -> Optional[dict]:
    try:
        session = requests.Session()
        resp = session.get(url, headers=HEADERS, timeout=timeout, allow_redirects=True)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.content, 'html.parser')
        return {
            'html': resp.text,
            'soup': soup,
            'status': resp.status_code,
            'final_url': resp.url,
            'method': 'static',
        }
    except (requests.exceptions.Timeout, requests.exceptions.RequestException):
        return None

2.2 JS Detection: When to Escalate to Playwright

def needs_js_rendering(static_result: dict) -> bool:
    """Detects if the page needs JS to render content."""
    if not static_result:
        return True
    soup = static_result.get('soup')
    if not soup:
        return True

    # SPA framework markers
    spa_markers = [
        soup.find(id='root'),
        soup.find(id='app'),
        soup.find(id='__next'),   # Next.js
        soup.find(id='__nuxt'),   # Nuxt
    ]
    has_spa_root = any(m for m in spa_markers
                       if m and len(m.get_text(strip=True)) < 50)

    # Many external scripts but little text
    scripts = len(soup.find_all('script', src=True))
    text_length = len(soup.get_text(strip=True))

    return has_spa_root or (scripts > 10 and text_length < 500)

2.3 Playwright (JS Rendering)

from playwright.async_api import async_playwright
import asyncio

BLOCKED_RESOURCE_PATTERNS = [
    '**/*.{png,jpg,jpeg,gif,webp,avif,svg,woff,woff2,ttf,eot}',
    '**/google-analytics.com/**',
    '**/doubleclick.net/**',
    '**/facebook.com/tr*',
    '**/ads.*.com/**',
]

async def fetch_with_playwright(url: str, timeout_ms: int = 30_000) -> Optional[dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={'width': 1280, 'height': 800},
            user_agent=HEADERS['User-Agent'],
            java_script_enabled=True,
        )
        # Block images, fonts, trackers to speed up extraction
        for pattern in BLOCKED_RESOURCE_PATTERNS:
            await context.route(pattern, lambda r: r.abort())

        page = await context.new_page()
        try:
            await page.goto(url, wait_until='networkidle', timeout=timeout_ms)
            await page.wait_for_timeout(2000)  # wait for lazy JS content injection

            html = await page.content()
            text = await page.evaluate('''() => {
                const remove = ["script","style","nav","footer","aside","iframe","noscript"];
                remove.forEach(t => document.querySelectorAll(t).forEach(el => el.remove()));
                return document.body?.innerText || "";
            }''')

            return {
                'html': html,
                'text': text,
                'status': 200,
                'final_url': page.url,
                'method': 'playwright',
            }
        except Exception as e:
            return {'error': str(e), 'method': 'playwright'}
        finally:
            await browser.close()

Performance tip: for bulk runs, reuse the browser process and create a fresh context per URL instead of relaunching the browser.
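The tip above can be sketched as a small pool object. This is a minimal sketch: `BrowserPool` is an illustrative name, and Playwright only needs to be installed when `start()` is actually called (the import is deferred):

```python
import asyncio

class BrowserPool:
    """Launches Chromium once, then serves each URL from a fresh context."""

    def __init__(self):
        self._pw = None
        self._browser = None

    async def start(self):
        from playwright.async_api import async_playwright  # deferred import
        self._pw = await async_playwright().start()
        self._browser = await self._pw.chromium.launch(headless=True)

    async def fetch(self, url: str) -> str:
        # A new context isolates cookies/storage per URL but reuses the process.
        context = await self._browser.new_context()
        try:
            page = await context.new_page()
            await page.goto(url, wait_until='networkidle')
            return await page.content()
        finally:
            await context.close()

    async def stop(self):
        await self._browser.close()
        await self._pw.stop()
```

Typical use: `await pool.start()` once, `await pool.fetch(url)` in a loop, `await pool.stop()` at the end.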

2.4 Scrapy Settings (Bulk Crawls)

SCRAPY_SETTINGS = {
    'CONCURRENT_REQUESTS': 5,
    'DOWNLOAD_DELAY': 0.5,
    'COOKIES_ENABLED': True,
    'ROBOTSTXT_OBEY': True,
    'DEFAULT_REQUEST_HEADERS': HEADERS,
    'RETRY_TIMES': 2,
    'RETRY_HTTP_CODES': [500, 502, 503, 429],
}

2.5 Cascade Orchestrator

async def extract_page_content(url: str) -> dict:
    """Tries methods in ascending order of cost."""

    # 1. Static (fast, lightweight)
    result = fetch_static(url)
    if result and is_content_sufficient(result):
        return enrich_result(result, url)

    # 2. Playwright: escalate when the static fetch failed, the page needs JS,
    # or the static content was too thin
    if not result or needs_js_rendering(result) or not is_content_sufficient(result):
        result = await fetch_with_playwright(url)
        if result and 'error' not in result:
            return enrich_result(result, url)

    return {'url': url, 'error': 'all_methods_failed', 'content': None}

def is_content_sufficient(result: dict) -> bool:
    """Checks if extracted content is useful (min 200 words)."""
    soup = result.get('soup')
    if not soup:
        return False
    text = soup.get_text(separator=' ', strip=True)
    return len(text.split()) >= 200

Stage 3: Cleaning and Normalization

3.1 Main Content Extraction (Boilerplate Removal)

Use trafilatura, the most accurate article-extraction library, especially for Portuguese-language content.

import trafilatura

def extract_main_content(html: str, url: str = '') -> Optional[str]:
    """Extracts article body, removing nav, ads, comments."""
    return trafilatura.extract(
        html,
        url=url,
        include_comments=False,
        include_tables=True,
        no_fallback=False,
        favor_precision=True,
    )

def extract_content_with_metadata(html: str, url: str = '') -> dict:
    """Extracts content + structured metadata together."""
    metadata = trafilatura.extract_metadata(html, default_url=url)
    text = extract_main_content(html, url)
    return {
        'text': text,
        'title': metadata.title if metadata else None,
        'author': metadata.author if metadata else None,
        'date': metadata.date if metadata else None,
        'description': metadata.description if metadata else None,
        'sitename': metadata.sitename if metadata else None,
    }

Alternative: newspaper3k (simpler, but less accurate for Brazilian Portuguese).

3.2 Encoding and Whitespace Normalization

import unicodedata
import re

def normalize_text(text: str) -> str:
    """Normalizes encoding, removes invisible chars, collapses whitespace."""
    text = unicodedata.normalize('NFKC', text)
    text = re.sub(r'[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f]', '', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r' {2,}', ' ', text)
    return text.strip()
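A quick check of normalize_text on a deliberately messy string (the function is repeated here so the snippet runs standalone):

```python
import unicodedata
import re

def normalize_text(text: str) -> str:
    text = unicodedata.normalize('NFKC', text)                      # e.g. fullwidth 'Ｔ' -> 'T'
    text = re.sub(r'[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f]', '', text)   # strip control chars
    text = re.sub(r'\n{3,}', '\n\n', text)                          # collapse blank-line runs
    text = re.sub(r' {2,}', ' ', text)                              # collapse space runs
    return text.strip()

messy = 'Ｔitle\x00 here\n\n\n\nBody   text'
print(repr(normalize_text(messy)))  # 'Title here\n\nBody text'
```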

3.3 Robust HTML Parsing (Fallback Parsers)

def parse_html_robust(html: str) -> BeautifulSoup:
    """Tries parsers in order of increasing tolerance."""
    for parser in ['html.parser', 'lxml', 'html5lib']:
        try:
            soup = BeautifulSoup(html, parser)
            if soup.body and len(soup.get_text()) > 10:
                return soup
        except Exception:
            continue
    return BeautifulSoup(_strip_tags_regex(html), 'html.parser')

def _strip_tags_regex(html: str) -> str:
    """Brute-force text extraction via regex (last resort)."""
    from html import unescape
    html = re.sub(r'<script[^>]*>.*?</script>', '', html, flags=re.DOTALL | re.I)
    html = re.sub(r'<style[^>]*>.*?</style>', '', html, flags=re.DOTALL | re.I)
    text = re.sub(r'<[^>]+>', ' ', html)
    return unescape(normalize_text(text))
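The last-resort regex fallback in action (whitespace normalization is inlined here so the snippet is standalone; the sample HTML is invented):

```python
import re
from html import unescape

def strip_tags_regex(html: str) -> str:
    """Brute-force text extraction for when every real parser has failed."""
    html = re.sub(r'<script[^>]*>.*?</script>', '', html, flags=re.DOTALL | re.I)
    html = re.sub(r'<style[^>]*>.*?</style>', '', html, flags=re.DOTALL | re.I)
    text = re.sub(r'<[^>]+>', ' ', html)        # tags become spaces
    text = re.sub(r'\s+', ' ', unescape(text))  # decode entities, collapse whitespace
    return text.strip()

broken = '<p>Caf&eacute; <b>aberto</b><script>evil()</script></p>'
print(strip_tags_regex(broken))  # Café aberto
```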

3.4 Chunking for LLMs (Long Articles)

def chunk_for_llm(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Splits text into chunks with overlap to maintain context."""
    if len(text) <= max_chars:
        return [text]

    chunks = []
    sentences = re.split(r'(?<=[.!?])\s+', text)
    current_chunk = ''

    for sentence in sentences:
        if len(current_chunk) + len(sentence) + 1 <= max_chars:  # +1 for the joining space
            current_chunk += ' ' + sentence
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = current_chunk[-overlap:] + ' ' + sentence

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks
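Exercising the chunker on synthetic text (function repeated here so the snippet runs standalone, with the joining space counted in the length check):

```python
import re

def chunk_for_llm(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Splits text into sentence-aligned chunks with tail overlap for context."""
    if len(text) <= max_chars:
        return [text]
    chunks = []
    sentences = re.split(r'(?<=[.!?])\s+', text)
    current = ''
    for sentence in sentences:
        if len(current) + len(sentence) + 1 <= max_chars:
            current += ' ' + sentence
        else:
            if current:
                chunks.append(current.strip())
            current = current[-overlap:] + ' ' + sentence  # carry overlap forward
    if current:
        chunks.append(current.strip())
    return chunks

text = ('word one two three four five six. ' * 10).strip()
chunks = chunk_for_llm(text, max_chars=100, overlap=20)
print(len(chunks), all(len(c) <= 100 for c in chunks))
```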

Stage 4: Structured Metadata Extraction

4.1 YAML-Driven Configurable Extractor

Use a declarative YAML config so CSS selectors can be updated without touching Python code. Sites redesign their layouts constantly; YAML keeps maintenance trivial.

extraction_config.yaml:

version: 1.0

meta_tags:
  article_published:
    selector: "meta[property='article:published_time']"
    attribute: content
    aliases:
      - "meta[name='publication_date']"
      - "meta[name='date']"
  article_author:
    selector: "meta[name='author']"
    attribute: content
    aliases:
      - "meta[property='article:author']"
  og_type:
    selector: "meta[property='og:type']"
    attribute: content

author:
  - name: meta_author
    selector: "meta[name='author']"
    attribute: content
  - name: schema_author
    selector: "[itemprop='author']"
    attribute: content
    fallback_attribute: textContent
  - name: byline_link
    selector: "a[rel='author'], .byline a, .author a"
    attribute: textContent

dates:
  published:
    selectors:
      - selector: "meta[property='article:published_time']"
        attribute: content
      - selector: "time[itemprop='datePublished']"
        attribute: datetime
        fallback_attribute: textContent
      - selector: "[class*='date'][class*='publish']"
        attribute: textContent
  modified:
    selectors:
      - selector: "meta[property='article:modified_time']"
        attribute: content
      - selector: "time[itemprop='dateModified']"
        attribute: datetime

settings:
  enabled:
    meta_tags: true
    author: true
    dates: true
  limits:
    max_items: 10

4.2 Schema.org Extraction

def extract_news_schema(html: str) -> dict:
    """Extracts structured data specific to news articles."""
    soup = BeautifulSoup(html, 'html.parser')
    result = {}

    for tag in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(tag.string or '{}')
            items = data.get('@graph', [data])
            for item in items:
                if item.get('@type', '') in NEWS_SCHEMA_TYPES:
                    result.update({
                        'headline': item.get('headline'),
                        'author': _extract_schema_author(item),
                        'date_published': item.get('datePublished'),
                        'date_modified': item.get('dateModified'),
                        'description': item.get('description'),
                        'publisher': _extract_schema_publisher(item.get('publisher', {})),
                        'keywords': item.get('keywords', ''),
                        'section': item.get('articleSection', ''),
                    })
        except (json.JSONDecodeError, AttributeError):
            continue
    return result

def _extract_schema_author(item: dict) -> Optional[str]:
    author = item.get('author', {})
    if isinstance(author, list):
        author = author[0] if author else {}
    if isinstance(author, dict):
        return author.get('name')
    return str(author) if author else None

def _extract_schema_publisher(publisher: dict) -> Optional[str]:
    if isinstance(publisher, dict):
        return publisher.get('name')
    return None
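The JSON-LD author field arrives as a string, an object, or a list; a quick check of the normalization helper (repeated here, leading underscore dropped, so the snippet runs standalone):

```python
from typing import Optional

def extract_schema_author(item: dict) -> Optional[str]:
    author = item.get('author', {})
    if isinstance(author, list):                 # multiple authors: take the first
        author = author[0] if author else {}
    if isinstance(author, dict):                 # Person object: use its name
        return author.get('name')
    return str(author) if author else None       # bare string fallback

print(extract_schema_author({'author': {'name': 'Ana Silva'}}))                # Ana Silva
print(extract_schema_author({'author': [{'name': 'Bo Li'}, {'name': 'Cy'}]}))  # Bo Li
print(extract_schema_author({'author': 'Reuters'}))                            # Reuters
```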

4.3 Paywall Detection

def detect_paywall(html: str, text: str) -> dict:
    """Detects paywall type and available content."""
    soup = BeautifulSoup(html, 'html.parser')

    paywall_signals = [
        bool(soup.find(class_=re.compile(r'paywall|premium|subscriber|locked', re.I))),
        bool(soup.find(attrs={'data-paywall': True})),
        bool(soup.find(id=re.compile(r'paywall|premium', re.I))),
    ]

    paywall_text_patterns = [
        r'assine para (ler|continuar|ver)',
        r'conte.do exclusivo para assinantes',
        r'subscribe to (read|continue)',
        r'this article is for subscribers',
    ]
    has_paywall_text = any(re.search(p, text, re.I) for p in paywall_text_patterns)

    has_paywall = any(paywall_signals) or has_paywall_text

    paragraphs = soup.find_all('p')
    visible = [p for p in paragraphs
               if 'display:none' not in p.get('style', '')
               and len(p.get_text()) > 50]

    return {
        'has_paywall': has_paywall,
        'type': 'soft' if (has_paywall and len(visible) >= 2) else
                'hard' if has_paywall else 'none',
        'available_paragraphs': len(visible),
    }

Paywall handling:

  • Hard paywall: content is never sent to the client. Extract the preview (headline, lede, metadata). Flag paywall: "hard" in the output.
  • Soft paywall: content exists in the DOM but is hidden by CSS/JS. Use Playwright to remove the paywall overlay and reveal the paragraphs.
  • No paywall: proceed normally.
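The text-pattern half of the detector, standalone (patterns repeated from detect_paywall; the sample Portuguese sentences are invented):

```python
import re

PAYWALL_TEXT_PATTERNS = [
    r'assine para (ler|continuar|ver)',
    r'conte.do exclusivo para assinantes',
    r'subscribe to (read|continue)',
]

def has_paywall_text(text: str) -> bool:
    """True if any subscription-prompt phrase appears in the visible text."""
    return any(re.search(p, text, re.I) for p in PAYWALL_TEXT_PATTERNS)

print(has_paywall_text('Assine para continuar lendo esta reportagem.'))  # True
print(has_paywall_text('O prefeito anunciou a obra nesta quinta.'))      # False
```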

Stage 5: Entity Extraction (LLM)

Run the LLM only on cleaned text (the output of Stage 3). Never pass raw HTML: it wastes tokens and degrades precision.

5.1 Single-Article Extraction

import os, json, time, re
import requests as req

OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY", "")
OPENROUTER_ENDPOINT = "https://openrouter.ai/api/v1/chat/completions"

def extract_entities_llm(text: str, metadata: dict) -> dict:
    """Extracts entities from a news article using LLM."""
    text_sample = text[:4000] if len(text) > 4000 else text

    prompt = f"""You are a news entity extractor. Analyze the text below and extract:

TITLE: {metadata.get('title', 'N/A')}
DATE: {metadata.get('date', 'N/A')}
TEXT:
{text_sample}

Respond ONLY with valid JSON, no markdown, in this format:
{{
  "people": [
    {{"name": "Full Name", "role": "Role/Title", "context": "One sentence about their role in the article"}}
  ],
  "organizations": [
    {{"name": "Org Name", "type": "company|government|ngo|other", "context": "role in article"}}
  ],
  "locations": [
    {{"name": "Location Name", "type": "city|state|country|address", "context": "mention"}}
  ],
  "events": [
    {{"name": "Event", "date": "date if available", "description": "brief description"}}
  ],
  "relationships": [
    {{"subject": "Entity A", "relation": "relation type", "object": "Entity B"}}
  ]
}}"""

    try:
        response = req.post(
            OPENROUTER_ENDPOINT,
            headers={
                "Authorization": f"Bearer {OPENROUTER_API_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "model": "google/gemini-2.5-flash-lite",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 2000,
                "temperature": 0.1,  # low for structured extraction
            },
            timeout=30,
        )
        response.raise_for_status()
        content = response.json()['choices'][0]['message']['content']
        content = re.sub(r'^```json\s*|\s*```$', '', content.strip())
        return json.loads(content)
    except (json.JSONDecodeError, KeyError, req.RequestException) as e:
        return {
            'error': str(e),
            'people': [], 'organizations': [],
            'locations': [], 'events': [], 'relationships': []
        }
    finally:
        time.sleep(0.3)  # rate limiting between calls
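The fence-stripping step matters because models often wrap JSON in markdown despite instructions; here it is isolated as a standalone check (the helper name `parse_llm_json` and the sample response are illustrative):

```python
import json
import re

def parse_llm_json(content: str) -> dict:
    """Strips an optional ```json markdown fence before parsing."""
    content = re.sub(r'^```json\s*|\s*```$', '', content.strip())
    return json.loads(content)

raw = '```json\n{"people": [{"name": "Ana Silva", "role": "mayor"}]}\n```'
print(parse_llm_json(raw)['people'][0]['name'])  # Ana Silva
```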

5.2 Chunked Extraction (Long Articles)

def extract_entities_chunked(text: str, metadata: dict) -> dict:
    """For long articles, extract entities per chunk and merge with deduplication."""
    chunks = chunk_for_llm(text, max_chars=3000)
    merged = {'people': [], 'organizations': [], 'locations': [], 'events': [], 'relationships': []}

    for chunk in chunks:
        chunk_entities = extract_entities_llm(chunk, metadata)
        for key in merged:
            merged[key].extend(chunk_entities.get(key, []))

    # Deduplicate by name (case-insensitive)
    for key in ['people', 'organizations', 'locations']:
        seen = set()
        deduped = []
        for item in merged[key]:
            name = item.get('name', '').lower().strip()
            if name and name not in seen:
                seen.add(name)
                deduped.append(item)
        merged[key] = deduped

    return merged

5.3 Recommended LLMs (via OpenRouter)

Model                         | Speed     | Cost     | Quality (pt-BR) | Use case
google/gemini-2.5-flash-lite  | very fast | very low | good            | bulk extraction
google/gemini-2.5-flash       | -         | -        | excellent       | complex articles
anthropic/claude-haiku-4-5    | medium    | -        | excellent       | high precision
openai/gpt-4o-mini            | medium    | medium   | good            | alternative

Always use temperature: 0.1 for structured extraction. Higher values produce hallucinated entities.


Rate Limiting and Anti-Bot Measures

Per-Domain Exponential Backoff

import time, random

class RateLimiter:
    def __init__(self, base_delay: float = 0.5, max_delay: float = 30.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self._attempts: dict[str, int] = {}

    def wait(self, domain: str):
        attempts = self._attempts.get(domain, 0)
        delay = min(self.base_delay * (2 ** attempts), self.max_delay)
        delay *= random.uniform(0.8, 1.2)  # jitter +/-20%
        time.sleep(delay)

    def on_success(self, domain: str):
        self._attempts[domain] = 0

    def on_failure(self, domain: str):
        self._attempts[domain] = self._attempts.get(domain, 0) + 1
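Usage sketch of the limiter. The class is repeated standalone, with the delay computation split into a `next_delay` method (an illustrative refactor, not part of the skill) so the backoff curve can be inspected without sleeping:

```python
import time, random

class RateLimiter:
    def __init__(self, base_delay: float = 0.5, max_delay: float = 30.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self._attempts: dict[str, int] = {}

    def next_delay(self, domain: str) -> float:
        """Delay doubles per consecutive failure, capped at max_delay."""
        attempts = self._attempts.get(domain, 0)
        return min(self.base_delay * (2 ** attempts), self.max_delay)

    def wait(self, domain: str):
        time.sleep(self.next_delay(domain) * random.uniform(0.8, 1.2))  # +/-20% jitter

    def on_success(self, domain: str):
        self._attempts[domain] = 0

    def on_failure(self, domain: str):
        self._attempts[domain] = self._attempts.get(domain, 0) + 1

rl = RateLimiter(base_delay=0.5)
rl.on_failure('example.com')
rl.on_failure('example.com')
print(rl.next_delay('example.com'))  # 2.0  (0.5 * 2**2)
rl.on_success('example.com')
print(rl.next_delay('example.com'))  # 0.5
```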

User-Agent Rotation

USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
]

Incremental Saving and Checkpoints

Never wait until every URL has been processed before saving. A crash mid-run can wipe out hours of work.

import json
from pathlib import Path
from datetime import datetime

def save_incremental(results: list, output_path: Path, every: int = 50):
    """Saves results every N articles processed."""
    if len(results) % every == 0:
        output_path.write_text(json.dumps(results, ensure_ascii=False, indent=2))

def load_checkpoint(output_path: Path) -> tuple[list, set]:
    """Loads checkpoint and returns (results, already-processed URLs)."""
    if output_path.exists():
        results = json.loads(output_path.read_text())
        processed_urls = {r['url'] for r in results}
        return results, processed_urls
    return [], set()
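Round-trip check of the checkpoint helpers in a temporary directory (functions repeated standalone; the example URLs are invented, and an empty-results guard is added to avoid writing an empty file):

```python
import json
import tempfile
from pathlib import Path

def save_incremental(results: list, output_path: Path, every: int = 50):
    if results and len(results) % every == 0:
        output_path.write_text(json.dumps(results, ensure_ascii=False, indent=2))

def load_checkpoint(output_path: Path) -> tuple[list, set]:
    if output_path.exists():
        results = json.loads(output_path.read_text())
        return results, {r['url'] for r in results}
    return [], set()

with tempfile.TemporaryDirectory() as d:
    path = Path(d) / 'articles.json'
    results = [{'url': f'https://example.com/{i}'} for i in range(4)]
    save_incremental(results, path, every=2)   # 4 % 2 == 0 -> writes to disk
    loaded, seen = load_checkpoint(path)
    print(len(loaded), 'https://example.com/3' in seen)  # 4 True
```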

Output Directory Structure

output/
├── {domain}/
│   ├── articles_YYYY-MM-DD.json    # full articles with text
│   ├── entities_YYYY-MM-DD.json    # entities only (for quick analysis)
│   └── failed_YYYY-MM-DD.json      # failed URLs (for retry)

Result Schema

Every result must carry quality and provenance metadata:

def build_result(url: str, content: dict, entities: dict, method: str) -> dict:
    return {
        'url': url,
        'method': method,                     # static|playwright|scrapy|failed
        'paywall': content.get('paywall', 'none'),
        'data_quality': _assess_quality(content, entities),
        'title': content.get('title'),
        'author': content.get('author'),
        'date_published': content.get('date_published'),
        'word_count': len((content.get('text') or '').split()),
        'text': content.get('text'),
        'entities': entities,
        'schema': content.get('schema', {}),
        'crawled_at': datetime.now().isoformat(),
    }

def _assess_quality(content: dict, entities: dict) -> str:
    text = content.get('text') or ''
    has_text = len(text.split()) >= 100
    has_entities = any(entities.get(k) for k in ['people', 'organizations'])
    has_meta = bool(content.get('title') and content.get('date_published'))

    if has_text and has_entities and has_meta:
        return 'high'
    elif has_text or has_entities:
        return 'medium'
    return 'low'
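A quick check of the quality rubric (helper repeated standalone with the leading underscore dropped; the sample article is invented):

```python
def assess_quality(content: dict, entities: dict) -> str:
    text = content.get('text') or ''
    has_text = len(text.split()) >= 100
    has_entities = any(entities.get(k) for k in ['people', 'organizations'])
    has_meta = bool(content.get('title') and content.get('date_published'))

    if has_text and has_entities and has_meta:
        return 'high'
    elif has_text or has_entities:
        return 'medium'
    return 'low'

article = {'text': 'palavra ' * 150, 'title': 'Obra anunciada',
           'date_published': '2024-03-15'}
print(assess_quality(article, {'people': [{'name': 'Ana'}]}))  # high
print(assess_quality({'text': 'curto'}, {}))                   # low
```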

Python Dependencies

pip install \
  requests \
  beautifulsoup4 \
  lxml html5lib \
  scrapy \
  playwright \
  trafilatura \
  pyyaml \
  python-dateutil

# Chromium browser for Playwright
playwright install chromium

Package          | Min version | Responsibility
requests         | 2.31+       | static HTTP, API calls
beautifulsoup4   | 4.12+       | fault-tolerant HTML parsing
lxml             | 4.9+        | robust alternative parser
html5lib         | 1.1+        | ultra-tolerant parsing (broken HTML)
scrapy           | 2.11+       | large-scale parallel crawling
playwright       | 1.40+       | JS/SPA rendering
trafilatura      | 1.8+        | article extraction (boilerplate removal)
pyyaml           | 6.0+        | declarative extraction config
python-dateutil  | 2.9+        | multi-format date parsing

Best Practices (Do)

  • Cascade: always try the lightest method first (static -> Playwright)
  • Incremental saving: save every 50 articles so a crash never loses progress
  • Resume mode: check already-processed URLs before starting (load the checkpoint)
  • Rate limiting: at least 0.5 s between requests to the same domain; exponential backoff on failure
  • Quality documentation: every result must include data_quality and method fields
  • Separation of concerns: fetch -> clean -> entity extraction (never mix the steps)
  • Declarative config: keep CSS selectors in YAML, not hardcoded in Python
  • Graceful degradation: if the LLM fails, return empty fields plus an error message; never raise an unhandled exception
  • Clean text for the LLM: always pass extracted, normalized text, never raw HTML

Anti-Patterns (Avoid)

  • Passing raw HTML to the LLM (wastes tokens, lowers entity precision)
  • Regex-only entity extraction (brittle against natural-language variation)
  • Hardcoding CSS selectors in Python (site layouts change constantly)
  • Ignoring encoding issues (UTF-8 vs. Latin-1 silently corrupts data)
  • Unlimited retries (use exponential backoff with a maximum attempt count)
  • Processing every page before saving (a crash loses everything)
  • Mixing score scales without explicit normalization (e.g. 0-1 vs. 0-100)
  • Using wait_until='load' in Playwright for lazy-loaded content (use 'networkidle' instead)

Safety Rules

  • Never scrape pages behind authentication without the user's explicit authorization.
  • Always respect robots.txt (Scrapy handles this by default; check manually for requests/Playwright).
  • Always rate-limit: at least 0.5 s between requests to the same domain.
  • Never store API keys in generated scripts; always use os.environ.get().
  • Never bypass hard paywalls; extract only publicly available content.
  • For soft paywalls, reveal only content already delivered to the client (DOM manipulation only, no server-side bypass).
