
Scrapling Skill Usage Guide

2026-04-01 Source: 网淘吧

Scrapling - Adaptive Web Scraping

"Effortless web scraping for the modern web."


Credits

Core library

API reverse-engineering methodology


Installation

# Core library (parser only)
pip install scrapling

# With fetchers (HTTP + browser automation) - RECOMMENDED
pip install "scrapling[fetchers]"
scrapling install

# With shell (CLI tools) - RECOMMENDED
pip install "scrapling[shell]"

# With AI (MCP server) - OPTIONAL
pip install "scrapling[ai]"

# Everything
pip install "scrapling[all]"

# Browser for stealth/dynamic mode
playwright install chromium

# For Cloudflare bypass (advanced)
pip install cloudscraper

Agent Instructions

When to Use Scrapling

Use Scrapling when:

  • Researching topics from websites
  • Extracting data from blogs, news sites, and documentation
  • Crawling multiple pages with a Spider
  • Gathering content for digests
  • Extracting brand data from any website
  • Reverse engineering APIs from websites

Do NOT use for:

  • X/Twitter (use the x-tweet-fetcher skill)
  • Sites that require login (unless credentials are provided)
  • Paywalled content (respect robots.txt)
  • Sites whose Terms of Service prohibit scraping
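The robots.txt rule above can be checked programmatically before fetching. A minimal sketch using only the standard library (is_allowed is a hypothetical helper, not part of scrapling):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against already-fetched robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Fetch the rules once per domain, e.g. Fetcher.get(f"{base}/robots.txt").text
rules = "User-agent: *\nDisallow: /private/"
print(is_allowed(rules, "https://example.com/blog/post"))  # True
print(is_allowed(rules, "https://example.com/private/x"))  # False
```

Checking once per domain and caching the parser keeps the overhead negligible.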

Quick Commands

1. Basic Fetching (most common)

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com')

# Extract content
title = page.css('h1::text').get()
paragraphs = page.css('p::text').getall()

2. Stealthy Fetching (anti-bot / Cloudflare)

from scrapling.fetchers import StealthyFetcher

StealthyFetcher.adaptive = True
page = StealthyFetcher.fetch('https://example.com', headless=True, solve_cloudflare=True)

3. Dynamic Fetching (full browser automation)

from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch('https://example.com', headless=True, network_idle=True)

4. Adaptive Parsing (survives design changes)

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com')

# First scrape - saves selectors
items = page.css('.product', auto_save=True)

# Later - if site changes, use adaptive=True to relocate
items = page.css('.product', adaptive=True)

5. Spider (multi-page)

from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com"]
    concurrent_requests = 3
    
    async def parse(self, response: Response):
        for item in response.css('.item'):
            yield {"item": item.css('h2::text').get()}
        
        # Follow links
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

MySpider().start()

6. CLI Usage

# Simple fetch to file
scrapling extract get https://example.com content.html

# Stealthy fetch (bypass anti-bot)
scrapling extract stealthy-fetch https://example.com content.html

# Interactive shell
scrapling shell https://example.com

Common Patterns

Extracting Article Content

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com/article')

# Try multiple selectors for title
title = (
    page.css('[itemprop="headline"]::text').get() or
    page.css('article h1::text').get() or
    page.css('h1::text').get()
)

# Get paragraphs
content = page.css('article p::text, .article-body p::text').getall()

print(f"Title: {title}")
print(f"Paragraphs: {len(content)}")

Researching Multiple Pages

from scrapling.spiders import Spider, Response

class ResearchSpider(Spider):
    name = "research"
    start_urls = ["https://news.ycombinator.com"]
    concurrent_requests = 5
    
    async def parse(self, response: Response):
        for item in response.css('.titleline a::text').getall()[:10]:
            yield {"title": item, "source": "HN"}
        
        more = response.css('.morelink::attr(href)').get()
        if more:
            yield response.follow(more)

ResearchSpider().start()

Crawling an Entire Site (easy mode)

Automatically crawl every page on a domain by following internal links:

from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse

class EasyCrawl(Spider):
    """Auto-crawl all pages on a domain."""
    
    name = "easy_crawl"
    start_urls = ["https://example.com"]
    concurrent_requests = 3
    
    def __init__(self):
        super().__init__()
        self.visited = set()
    
    async def parse(self, response: Response):
        # Extract content
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
            'h1': response.css('h1::text').get(),
        }
        
        # Follow internal links (limit to 50 pages)
        if len(self.visited) >= 50:
            return
        
        self.visited.add(response.url)
        
        links = response.css('a::attr(href)').getall()[:20]
        for link in links:
            full_url = urljoin(response.url, link)
            if full_url not in self.visited:
                yield response.follow(full_url)

# Usage
result = EasyCrawl()
result.start()

Sitemap Crawling

Crawl pages from sitemap.xml (a fallback link-discovery method):

from scrapling.fetchers import Fetcher
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse
import re

def get_sitemap_urls(url: str, max_urls: int = 100) -> list:
    """Extract URLs from sitemap.xml - also checks robots.txt."""
    
    parsed = urlparse(url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"
    
    sitemap_urls = [
        f"{base_url}/sitemap.xml",
        f"{base_url}/sitemap-index.xml",
        f"{base_url}/sitemap_index.xml",
        f"{base_url}/sitemap-news.xml",
    ]
    
    all_urls = []
    
    # First check robots.txt for sitemap URL
    try:
        robots = Fetcher.get(f"{base_url}/robots.txt")
        if robots.status == 200:
            sitemap_in_robots = re.findall(r'Sitemap:\s*(\S+)', robots.text, re.IGNORECASE)
            for sm in sitemap_in_robots:
                sitemap_urls.insert(0, sm)
    except Exception:
        pass
    
    # Try each sitemap location
    for sitemap_url in sitemap_urls:
        try:
            page = Fetcher.get(sitemap_url, timeout=10)
            if page.status != 200:
                continue
            
            text = page.text
            
            # Check if it's XML
            if '<?xml' in text or '<urlset' in text or '<sitemapindex' in text:
                urls = re.findall(r'<loc>([^<]+)</loc>', text)
                all_urls.extend(urls[:max_urls])
                print(f"Found {len(urls)} URLs in {sitemap_url}")
        except Exception:
            continue
    
    return list(set(all_urls))[:max_urls]

def crawl_from_sitemap(domain_url: str, max_pages: int = 50):
    """Crawl pages from sitemap."""
    
    print(f"Fetching sitemap for {domain_url}...")
    urls = get_sitemap_urls(domain_url)
    
    if not urls:
        print("No sitemap found. Use EasyCrawl instead!")
        return []
    
    print(f"Found {len(urls)} URLs, crawling first {max_pages}...")
    
    results = []
    for url in urls[:max_pages]:
        try:
            page = Fetcher.get(url, timeout=10)
            results.append({
                'url': url,
                'status': page.status,
                'title': page.css('title::text').get(),
            })
        except Exception as e:
            results.append({'url': url, 'error': str(e)[:50]})
    
    return results

# Usage
print("=== Sitemap Crawl ===")
results = crawl_from_sitemap('https://example.com', max_pages=10)
for r in results[:3]:
    print(f"  {r.get('title', r.get('error', 'N/A'))}")

# Alternative: easy crawl via link discovery (EasyCrawl takes no constructor
# arguments; edit its class attributes to change start_urls or the page limit)
print("\n=== Easy Crawl (Link Discovery) ===")
crawler = EasyCrawl()
crawler.start()

Firecrawl-Style Crawling (best of both worlds)

Modeled on how Firecrawl operates: combine sitemap discovery with link following:

from scrapling.fetchers import Fetcher
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse
import re

def firecrawl_crawl(url: str, max_pages: int = 50, use_sitemap: bool = True):
    """
    Firecrawl-style crawling:
    - use_sitemap=True: Discover URLs from sitemap first (default)
    - use_sitemap=False: Only follow HTML links (like sitemap:"skip")
    
    Matches Firecrawl's crawl behavior.
    """
    
    parsed = urlparse(url)
    domain = parsed.netloc
    
    # ========== Method 1: Sitemap Discovery ==========
    if use_sitemap:
        print(f"[Firecrawl] Discovering URLs from sitemap...")
        
        sitemap_urls = [
            f"{url.rstrip('/')}/sitemap.xml",
            f"{url.rstrip('/')}/sitemap-index.xml",
        ]
        
        all_urls = []
        
        # Try sitemaps
        for sm_url in sitemap_urls:
            try:
                page = Fetcher.get(sm_url, timeout=15)
                if page.status == 200:
                    # Handle bytes
                    text = page.body.decode('utf-8', errors='ignore') if isinstance(page.body, bytes) else str(page.body)
                    
                    if '<urlset' in text:
                        urls = re.findall(r'<loc>([^<]+)</loc>', text)
                        all_urls.extend(urls[:max_pages])
                        print(f"[Firecrawl] Found {len(urls)} URLs in {sm_url}")
            except Exception:
                continue
        
        if all_urls:
            print(f"[Firecrawl] Total: {len(all_urls)} URLs from sitemap")
            
            # Crawl discovered URLs
            results = []
            for page_url in all_urls[:max_pages]:
                try:
                    page = Fetcher.get(page_url, timeout=15)
                    results.append({
                        'url': page_url,
                        'status': page.status,
                        'title': page.css('title::text').get() if page.status == 200 else None,
                    })
                except Exception as e:
                    results.append({'url': page_url, 'error': str(e)[:50]})
            
            return results
    
    # ========== Method 2: Link Discovery (sitemap: skip) ==========
    print(f"[Firecrawl] Sitemap skip - using link discovery...")
    
    class LinkCrawl(Spider):
        name = "firecrawl_link"
        start_urls = [url]
        concurrent_requests = 3
        
        def __init__(self):
            super().__init__()
            self.visited = set()
            self.domain = domain
            self.results = []
        
        async def parse(self, response: Response):
            if len(self.results) >= max_pages:
                return
            
            self.results.append({
                'url': response.url,
                'status': response.status,
                'title': response.css('title::text').get(),
            })
            
            # Follow internal links
            links = response.css('a::attr(href)').getall()[:20]
            for link in links:
                full_url = urljoin(response.url, link)
                parsed_link = urlparse(full_url)
                
                if parsed_link.netloc == self.domain and full_url not in self.visited:
                    self.visited.add(full_url)
                    if len(self.visited) < max_pages:
                        yield response.follow(full_url)
    
    result = LinkCrawl()
    result.start()
    return result.results

# Usage
print("=== Firecrawl-Style (sitemap: include) ===")
results = firecrawl_crawl('https://www.cloudflare.com', max_pages=5, use_sitemap=True)
print(f"Crawled: {len(results)} pages")

print("\n=== Firecrawl-Style (sitemap: skip) ===")
results = firecrawl_crawl('https://example.com', max_pages=5, use_sitemap=False)
print(f"Crawled: {len(results)} pages")

Error Handling

from scrapling.fetchers import Fetcher, StealthyFetcher

try:
    page = Fetcher.get('https://example.com')
except Exception as e:
    # Try stealth mode
    page = StealthyFetcher.fetch('https://example.com', headless=True)
    
if page.status == 403:
    print("Blocked - try StealthyFetcher")
elif page.status == 200:
    print("Success!")

Session Management

from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate='chrome') as session:
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()

Multiple Session Types in a Spider

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]
    
    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
    
    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)

Advanced Parsing & Navigation

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')

# Multiple selection methods
quotes = page.css('.quote')           # CSS
quotes = page.xpath('//div[@class="quote"]')  # XPath
quotes = page.find_all('div', class_='quote')  # BeautifulSoup-style

# Navigation
first_quote = page.css('.quote')[0]
author = first_quote.css('.author::text').get()
parent = first_quote.parent

# Find similar elements
similar = first_quote.find_similar()

Advanced: API Reverse Engineering

"80% of web scraping is reverse engineering."

This section covers advanced techniques for discovering and replicating APIs directly from websites. These techniques often surface "hidden" data that is otherwise locked behind paid APIs.

1. API Endpoint Discovery

Many websites load their data through client-side requests. Use the browser DevTools to find them:

Steps:

  1. Open the browser DevTools (press F12)
  2. Go to the Network tab
  3. Reload the page
  4. Look for XHR/Fetch requests
  5. Check whether an endpoint returns JSON data

What to look for:

  • Requests to /api/* endpoints
  • Responses containing structured data (JSON)
  • The same endpoints serving both free and paid sections

Example pattern:

# Found in Network tab:
GET https://api.example.com/v1/users/transactions
Response: {"data": [...], "pagination": {...}}
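A quick heuristic helps triage captured responses: JSON content types, or bodies that start with { or [, are worth investigating as API endpoints. A small sketch (looks_like_api is a hypothetical helper):

```python
def looks_like_api(content_type: str, body_start: str) -> bool:
    """Heuristic: does a captured response look like a JSON API payload?"""
    if "json" in content_type.lower():
        return True
    # Fall back to inspecting the first non-whitespace character of the body
    return body_start.lstrip()[:1] in ("{", "[")

print(looks_like_api("application/json; charset=utf-8", '{"data": []}'))  # True
print(looks_like_api("text/html", "<!doctype html>"))                     # False
```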

2. JavaScript Analysis

Auth tokens are often generated client-side. Look for them in the .js files:

Steps:

  1. In the Network tab, check the Initiator column
  2. Click through to the .js file that issued the request
  3. Search for the auth header name (e.g., sol-aut, Authorization, X-API-Key)
  4. Find the function that generates the token

Common patterns:

  • Plain function names: generateToken(), createAuthHeader()
  • Obfuscated code: search for the header name directly
  • Random string generation: Math.random(), crypto.getRandomValues()
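The search in step 3 can be automated by downloading the site's .js files (e.g. with Fetcher) and scanning them for known header names. A minimal sketch (find_auth_headers is a hypothetical helper; the header list is just a starting point):

```python
import re

# Auth-header names commonly seen in the wild; extend as needed
AUTH_HEADER_RE = re.compile(
    r'["\'](sol-aut|authorization|x-api-key|x-auth-token)["\']',
    re.IGNORECASE,
)

def find_auth_headers(js_source: str) -> list:
    """Scan downloaded JS source for likely auth-header names."""
    return sorted({m.lower() for m in AUTH_HEADER_RE.findall(js_source)})

js = 'fetch(u, {headers: {"sol-aut": genToken(), "X-API-Key": key}})'
print(find_auth_headers(js))  # ['sol-aut', 'x-api-key']
```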

3. Replicating a Discovered API

Once you've found the endpoint and the auth pattern:

import requests
import random
import string

def generate_auth_token():
    """Replicate discovered token generation logic."""
    chars = string.ascii_letters + string.digits
    token = ''.join(random.choice(chars) for _ in range(40))
    # Insert fixed string at random position
    fixed = "B9dls0fK"
    pos = random.randint(0, len(token))
    return token[:pos] + fixed + token[pos:]

def scrape_api_endpoint(url):
    """Hit discovered API endpoint with replicated auth."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'application/json',
        'sol-aut': generate_auth_token(),  # Replicate discovered header
    }
    
    response = requests.get(url, headers=headers)
    return response.json()

4. Cloudscraper Bypass (Cloudflare)

For Cloudflare-protected endpoints, use cloudscraper:

# pip install cloudscraper
import cloudscraper

def create_scraper():
    """Create a cloudscraper session that bypasses Cloudflare."""
    scraper = cloudscraper.create_scraper(
        browser={
            'browser': 'chrome',
            'platform': 'windows',
            'desktop': True
        }
    )
    return scraper

# Usage
scraper = create_scraper()
response = scraper.get('https://api.example.com/endpoint')
data = response.json()

5. Full API Replication Pattern

import cloudscraper
import random
import string
import json

class APIReplicator:
    """Replicate discovered API from website."""
    
    def __init__(self, base_url):
        self.base_url = base_url
        self.session = cloudscraper.create_scraper()
    
    def generate_token(self, pattern="random"):
        """Replicate discovered token generation."""
        if pattern == "solscan":
            # 40-char random + fixed string at random position
            chars = string.ascii_letters + string.digits
            token = ''.join(random.choice(chars) for _ in range(40))
            fixed = "B9dls0fK"
            pos = random.randint(0, len(token))
            return token[:pos] + fixed + token[pos:]
        else:
            # Generic random token
            return ''.join(random.choices(string.ascii_letters + string.digits, k=32))
    
    def get(self, endpoint, headers=None, auth_header=None, auth_pattern="random"):
        """Make API request with discovered auth."""
        url = f"{self.base_url}{endpoint}"
        
        # Build headers
        request_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'application/json',
        }
        
        # Add discovered auth header
        if auth_header:
            request_headers[auth_header] = self.generate_token(auth_pattern)
        
        # Merge custom headers
        if headers:
            request_headers.update(headers)
        
        response = self.session.get(url, headers=request_headers)
        return response

# Usage example
api = APIReplicator("https://api.solscan.io")
data = api.get(
    "/account/transactions",
    auth_header="sol-aut",
    auth_pattern="solscan"
)
print(data)

6. Discovery Checklist

When tackling a new website:

Step | Action | Tool
1 | Open the DevTools Network tab | F12
2 | Reload the page, filter by XHR/Fetch | Network filter
3 | Look for JSON responses | Response tab
4 | Check if the same endpoints serve "premium" data | Compare requests
5 | Find the auth header in the JS files | Initiator column
6 | Extract the token-generation logic | JS debugger
7 | Replicate it in Python | Replicator class
8 | Test against the API | Run the script

Brand Data Extraction (Firecrawl alternative)

Extract brand data, colors, logos, and copy from any website:

from scrapling.fetchers import Fetcher
from urllib.parse import urljoin
import re

def extract_brand_data(url: str) -> dict:
    """Extract structured brand data from any website - Firecrawl style."""
    
    # Try a plain fetch first; fall back to stealth mode (handles anti-bot)
    try:
        page = Fetcher.get(url)
    except Exception:
        from scrapling.fetchers import StealthyFetcher
        page = StealthyFetcher.fetch(url, headless=True)
    
    # Helper to get text from element
    def get_text(elements):
        return elements[0].text if elements else None
    
    # Helper to get attribute
    def get_attr(elements, attr_name):
        return elements[0].attrib.get(attr_name) if elements else None
    
    # Brand name (try multiple selectors)
    brand_name = (
        get_text(page.css('[property="og:site_name"]')) or
        get_text(page.css('h1')) or
        get_text(page.css('title'))
    )
    
    # Tagline
    tagline = (
        get_text(page.css('[property="og:description"]')) or
        get_text(page.css('.tagline')) or
        get_text(page.css('.hero-text')) or
        get_text(page.css('header h2'))
    )
    
    # Logo URL
    logo_url = (
        get_attr(page.css('[rel="icon"]'), 'href') or
        get_attr(page.css('[rel="apple-touch-icon"]'), 'href') or
        get_attr(page.css('.logo img'), 'src')
    )
    if logo_url and not logo_url.startswith('http'):
        logo_url = urljoin(url, logo_url)
    
    # Favicon
    favicon = get_attr(page.css('[rel="icon"]'), 'href')
    favicon_url = urljoin(url, favicon) if favicon else None
    
    # OG Image
    og_image = get_attr(page.css('[property="og:image"]'), 'content')
    og_image_url = urljoin(url, og_image) if og_image else None
    
    # Screenshot (using external service)
    screenshot_url = f"https://image.thum.io/get/width/1200/crop/800/{url}"
    
    # Description
    description = (
        get_text(page.css('[property="og:description"]')) or
        get_attr(page.css('[name="description"]'), 'content')
    )
    
    # CTA text
    cta_text = (
        get_text(page.css('a[href*="signup"]')) or
        get_text(page.css('.cta')) or
        get_text(page.css('[class*="button"]'))
    )
    
    # Social links
    social_links = {}
    for platform in ['twitter', 'facebook', 'instagram', 'linkedin', 'youtube', 'github']:
        link = get_attr(page.css(f'a[href*="{platform}"]'), 'href')
        if link:
            social_links[platform] = link
    
    # Features (from feature grid/cards)
    features = []
    feature_cards = page.css('[class*="feature"], .feature-card, .benefit-item')
    for card in feature_cards[:6]:
        feature_text = get_text(card.css('h3, h4, p'))
        if feature_text:
            features.append(feature_text.strip())
    
    return {
        'brandName': brand_name,
        'tagline': tagline,
        'description': description,
        'features': features,
        'logoUrl': logo_url,
        'faviconUrl': favicon_url,
        'ctaText': cta_text,
        'socialLinks': social_links,
        'screenshotUrl': screenshot_url,
        'ogImageUrl': og_image_url
    }

# Usage
brand_data = extract_brand_data('https://example.com')
print(brand_data)

Brand Data CLI

# Extract brand data using the Python function above
python3 -c "
import json
import sys
sys.path.insert(0, '/path/to/skill')
from brand_extraction import extract_brand_data
data = extract_brand_data('$URL')
print(json.dumps(data, indent=2))
"

Feature Comparison

Feature | Status | Notes
Basic fetching | ✅ Works | Fetcher.get()
Stealth fetching | ✅ Works | StealthyFetcher.fetch()
Dynamic fetching | ✅ Works | DynamicFetcher.fetch()
Adaptive parsing | ✅ Works | auto_save + adaptive
Spider crawling | ✅ Works | async def parse()
CSS selectors | ✅ Works | .css()
XPath | ✅ Works | .xpath()
Session management | ✅ Works | FetcherSession, StealthySession
Proxy rotation | ✅ Works | ProxyRotator class
CLI tools | ✅ Works | scrapling extract
Brand data extraction | ✅ Works | extract_brand_data()
API reverse engineering | ✅ Works | APIReplicator class
Cloudscraper bypass | ✅ Works | cloudscraper integration
Easy site crawling | ✅ Works | EasyCrawl class
Sitemap crawling | ✅ Works | get_sitemap_urls()
MCP server | ❌ Excluded | Not needed
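The proxy-rotation row above refers to a ProxyRotator class that is not shown elsewhere in this guide. A minimal round-robin sketch of what it could look like (hypothetical implementation, not part of scrapling):

```python
from itertools import cycle

class ProxyRotator:
    """Round-robin over a proxy pool (hypothetical helper)."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._pool = cycle(proxies)

    def next(self) -> str:
        """Return the next proxy URL in rotation."""
        return next(self._pool)

rotator = ProxyRotator(["http://p1:8080", "http://p2:8080"])
print([rotator.next() for _ in range(3)])
# ['http://p1:8080', 'http://p2:8080', 'http://p1:8080']
```

If the fetcher you use accepts a per-request proxy argument (check the current scrapling signature), each call can then pass rotator.next().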

Tested Examples

IEEE Spectrum

page = Fetcher.get('https://spectrum.ieee.org/...')
title = page.css('h1::text').get()
content = page.css('article p::text').getall()

✅ Works

Hacker News

page = Fetcher.get('https://news.ycombinator.com')
stories = page.css('.titleline a::text').getall()

✅ Works

Example Domain

page = Fetcher.get('https://example.com')
title = page.css('h1::text').get()

✅ Works


🔧 Quick Troubleshooting

Problem | Solution
403/429 blocked | Use StealthyFetcher or cloudscraper
Cloudflare | Use StealthyFetcher or cloudscraper
JavaScript required | Use DynamicFetcher
Site changed | Use adaptive=True
Paid API exposed | Use API reverse engineering
CAPTCHA | Can't bypass: skip it or use the official API
Login required | Don't bypass: use the official API
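The escalation path in the table (plain fetch, then stealth, then cloudscraper) can be wrapped in a single helper. A sketch with a generic strategy list (fetch_with_fallback is hypothetical; plug in Fetcher.get, StealthyFetcher.fetch, or a cloudscraper session as strategies):

```python
def fetch_with_fallback(url, strategies):
    """Try each (name, fetch_fn) strategy in order; return the first 200 response."""
    last_error = None
    for name, fetch in strategies:
        try:
            page = fetch(url)
            if getattr(page, "status", None) == 200:
                return name, page
        except Exception as exc:  # blocked or network error: escalate
            last_error = exc
    raise RuntimeError(f"all strategies failed for {url}") from last_error

# Example wiring (assumes the scrapling imports from earlier sections):
# strategies = [("plain", Fetcher.get),
#               ("stealth", lambda u: StealthyFetcher.fetch(u, headless=True))]
# name, page = fetch_with_fallback("https://example.com", strategies)
```

Ordering cheap strategies first keeps the common case fast and reserves browser automation for pages that actually need it.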

Skill Graph

Related skills:

  • [[content-research]] - research pipeline
  • [[blogwatcher]] - RSS/feed monitoring
  • [[youtube-watcher]] - video content
  • [[chirp]] - Twitter/X interaction
  • [[newsletter-digest]] - content digests
  • [[x-tweet-fetcher]] - X/Twitter (use instead of Scrapling)

Changelog

v1.0.8 (2026-02-25)

  • Added: Firecrawl-style crawling - combines sitemap discovery with link following
  • Added: use_sitemap parameter - matches Firecrawl's sitemap "include"/"skip" behavior
  • Verified: cloudflare.com returns 2,447 URLs from its sitemap!

v1.0.7 (2026-02-25)

  • Fixed: EasyCrawl spider syntax - updated to match scrapling's actual Spider API
  • Verified: crawling works - tested crawling 20+ pages on example.com

v1.0.6 (2026-02-25)

  • Added: easy site crawling - auto-crawl every page on a domain via the EasyCrawl spider
  • Added: sitemap crawling - extract URLs from sitemap.xml and crawl them
  • Feature parity with Firecrawl for site crawling

v1.0.5 (2026-02-25)

  • Enhanced: API reverse-engineering methodology
    • Detailed step-by-step process based on @paoloanzn's work
    • Real Solscan case study with an exact timeline
    • Added: step-by-step methodology section
    • Added: real-example documentation (Solscan, March 2025 vs February 2026)
    • Added: 10-step discovery checklist
    • Documented: how to find auth headers in JS files
    • Documented: token-generation pattern extraction
    • Updated: cloudscraper integration with a multi-attempt pattern
    • Verified: Solscan is now patched (Cloudflare enabled on both endpoints)

v1.0.4 (2026-02-25)

  • Fixed: brand data extraction API - corrected selectors for scrapling's Response object
  • Fixed: .html → .text/.body
  • Fixed: .title() → page.css('title')
  • Fixed: .logo img::src → .logo img::attr(src)
  • Tested and verified working

v1.0.3 (2026-02-25)

  • Added: API reverse-engineering section
    • API endpoint discovery (Network tab analysis)
    • JavaScript analysis (finding auth logic)
    • Cloudscraper integration for Cloudflare bypass
    • Full APIReplicator class
    • Discovery checklist
  • Added cloudscraper to the installation section

v1.0.2 (2026-02-25)

  • Fully synced with the upstream GitHub README
  • Added the brand data extraction section
  • Lean, core-only version

v1.0.1 (2026-02-25)

  • Synced with the original Scrapling GitHub README

Last updated: 2026-02-25
