Scrapling: An Adaptive Modern Web Scraping Framework

Introduction

Scrapling is an adaptive web scraping framework that handles everything from single requests to large-scale crawls. Its parser can learn how a site changes and automatically relocate elements when pages are updated. Its fetchers bypass anti-bot systems such as Cloudflare Turnstile out of the box. Its Spider framework supports concurrent, multi-session crawling with pause/resume and automatic proxy rotation.

Core Features

  • Adaptive parsing: elements can still be located after the site structure changes
  • Anti-bot bypass: the built-in StealthyFetcher gets past Cloudflare
  • Multi-session support: one unified interface for HTTP requests and browser automation
  • Pause/resume: checkpoint-based crawl persistence
  • Streaming mode: scraped results are streamed in real time
  • Full async support: every component supports async/await

Installation

Basic installation

pip install scrapling

Full installation (fetchers and browsers included)

pip install "scrapling[fetchers]"
scrapling install

Install everything

pip install "scrapling[all]"
scrapling install

Docker installation

docker pull pyd4vinci/scrapling
# or from the GitHub Container Registry
docker pull ghcr.io/d4vinci/scrapling:latest

Basic Usage

A simple HTTP request

from scrapling.fetchers import Fetcher

# Create a fetcher
fetcher = Fetcher()

# Fetch the page
response = fetcher.fetch('https://example.com')

# Parse the data
title = response.css('title::text').get()
print(f"Title: {title}")

Spoofing request headers

from scrapling.fetchers import Fetcher

fetcher = Fetcher()

# Custom request headers
response = fetcher.fetch(
    'https://example.com',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    }
)

print(response.status_code)
print(response.css('h1::text').get())

Advanced Fetchers

StealthyFetcher - the anti-bot fetcher

from scrapling.fetchers import StealthyFetcher

# Enable adaptive mode
StealthyFetcher.adaptive = True

# Fetch the page (bypasses Cloudflare automatically)
page = StealthyFetcher.fetch(
    'https://example.com',
    headless=True,
    network_idle=True
)

# Extract data
products = page.css('.product', auto_save=True)
for product in products:
    name = product.css('h2::text').get()
    price = product.css('.price::text').get()
    print(f"{name}: {price}")

DynamicFetcher - fetching dynamic pages

from scrapling.fetchers import DynamicFetcher

# Fetch a dynamically loaded page
page = DynamicFetcher.fetch(
    'https://example.com',
    headless=False,      # show the browser window
    wait_for='.loaded',  # wait for a specific element to load
    timeout=30000
)

# Wait for and click elements
page.click('#load-more')
page.wait_for_selector('.new-items')

# Extract data
items = page.css('.item')
for item in items:
    print(item.css('.title::text').get())

Async fetching

import asyncio
from scrapling.fetchers import AsyncFetcher

async def fetch_multiple():
    fetcher = AsyncFetcher()

    # Fetch several pages concurrently
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3',
    ]

    tasks = [fetcher.fetch(url) for url in urls]
    responses = await asyncio.gather(*tasks)

    for response in responses:
        print(response.css('title::text').get())

# Run
asyncio.run(fetch_multiple())

Session Management

FetcherSession - HTTP sessions

from scrapling.fetchers import FetcherSession

# Create a session (cookies are managed automatically)
with FetcherSession() as session:
    # Log in
    login_response = session.post(
        'https://example.com/login',
        data={'username': 'user', 'password': 'pass'}
    )

    # Subsequent requests carry the login state automatically
    profile = session.fetch('https://example.com/profile')
    print(profile.css('.username::text').get())

StealthySession - stealth sessions

from scrapling.fetchers import StealthySession

with StealthySession() as session:
    # All requests share one browser context
    page1 = session.fetch('https://example.com/page1')
    page2 = session.fetch('https://example.com/page2')

    # Extract data
    data1 = page1.css('.data::text').getall()
    data2 = page2.css('.data::text').getall()

DynamicSession - dynamic sessions

from scrapling.fetchers import DynamicSession

with DynamicSession(headless=False) as session:
    # First page: log in
    page1 = session.fetch('https://example.com/login')
    page1.fill('#username', 'myuser')
    page1.fill('#password', 'mypass')
    page1.click('#submit')

    # Visit other pages after logging in
    page2 = session.fetch('https://example.com/dashboard')
    print(page2.css('.welcome::text').get())

Adaptive Parsing

Basic CSS selectors

from scrapling.fetchers import Fetcher

response = Fetcher().fetch('https://example.com')

# A single element
title = response.css('title::text').get()

# Multiple elements
links = response.css('a::attr(href)').getall()

# Nested selection
for item in response.css('.product'):
    name = item.css('h2::text').get()
    price = item.css('.price::text').get()
    link = item.css('a::attr(href)').get()
    print(f"{name}: {price} - {link}")

XPath selectors

from scrapling.fetchers import Fetcher

response = Fetcher().fetch('https://example.com')

# XPath selection
title = response.xpath('//title/text()').get()

# Attribute selection
links = response.xpath('//a/@href').getall()

# Conditional selection
items = response.xpath('//div[@class="product"][@data-available="true"]')

Adaptive selection (keeps working after the site structure changes)

from scrapling.fetchers import StealthyFetcher

StealthyFetcher.adaptive = True

page = StealthyFetcher.fetch('https://example.com')

# Save element fingerprints on the first scrape
products = page.css('.product', auto_save=True)

# After the site changes, relocate the elements with adaptive=True;
# they are found even if their CSS class names have changed
products = page.css('.product', adaptive=True)

for product in products:
    print(product.css('h2::text').get())

Finding similar elements

from scrapling.fetchers import Fetcher

response = Fetcher().fetch('https://example.com')

# Find one element
first_item = response.css('.item')[0]

# Find elements similar to it
similar_items = first_item.find_similar()

for item in similar_items:
    print(item.css('.title::text').get())

The Spider Framework

A basic Spider

from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        # Extract data
        for item in response.css('.product'):
            yield {
                "title": item.css('h2::text').get(),
                "price": item.css('.price::text').get(),
                "link": item.css('a::attr(href)').get(),
            }

        # Follow the next page
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Start the spider
if __name__ == "__main__":
    MySpider().start()

A multi-session Spider

from scrapling.spiders import Spider, Response

class MultiSessionSpider(Spider):
    name = "multi_session"
    start_urls = ["https://example.com/"]

    # Session configuration
    custom_settings = {
        'CONCURRENT_SESSIONS': 3,
        'SESSION_TYPE': 'stealthy',  # or 'dynamic', 'fetcher'
    }

    async def parse(self, response: Response):
        # Use different sessions as needed
        session_id = response.meta.get('session_id')

        yield {
            "url": response.url,
            "session": session_id,
            "title": response.css('title::text').get(),
        }

if __name__ == "__main__":
    MultiSessionSpider().start()

Streaming output

from scrapling.spiders import Spider, Response

class StreamingSpider(Spider):
    name = "streaming"
    start_urls = ["https://example.com/products"]

    async def parse(self, response: Response):
        for product in response.css('.product'):
            yield {
                "name": product.css('h2::text').get(),
                "price": product.css('.price::text').get(),
            }

# Stream results as they arrive
async def main():
    spider = StreamingSpider()

    async for item in spider.stream():
        print(f"Received: {item}")

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

Pause and resume

from scrapling.spiders import Spider, Response

class ResumableSpider(Spider):
    name = "resumable"
    start_urls = ["https://example.com/"]

    # Enable checkpoints
    custom_settings = {
        'CHECKPOINT_ENABLED': True,
        'CHECKPOINT_DIR': './checkpoints',
    }

    async def parse(self, response: Response):
        for item in response.css('.item'):
            yield {
                "title": item.css('h2::text').get(),
                "url": item.css('a::attr(href)').get(),
            }

        # Follow pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# First run
if __name__ == "__main__":
    spider = ResumableSpider()

    try:
        spider.start()
    except KeyboardInterrupt:
        print("Spider paused. Run again to resume.")

# Running again resumes automatically from the checkpoint
# spider.start()
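
Checkpointing itself is a simple idea that does not depend on Scrapling: persist the pending URL frontier and the set of already-crawled URLs, so a later run can pick up where the last one stopped. A minimal stdlib-only sketch of that pattern (this `Checkpoint` class and its file layout are illustrative, not Scrapling's actual checkpoint format):

```python
import json
import os
import tempfile

class Checkpoint:
    """Persist the crawl frontier so an interrupted run can resume."""

    def __init__(self, path):
        self.path = path
        if os.path.exists(path):
            # Resume: reload the saved state
            with open(path) as f:
                state = json.load(f)
            self.pending = state["pending"]
            self.done = set(state["done"])
        else:
            # Fresh start
            self.pending = []
            self.done = set()

    def add(self, url):
        # Queue a URL unless it was already crawled or queued
        if url not in self.done and url not in self.pending:
            self.pending.append(url)

    def next_url(self):
        # Pop the next URL and mark it as crawled
        url = self.pending.pop(0)
        self.done.add(url)
        return url

    def save(self):
        with open(self.path, "w") as f:
            json.dump({"pending": self.pending, "done": sorted(self.done)}, f)

state_file = os.path.join(tempfile.gettempdir(), "cp_demo.json")
if os.path.exists(state_file):
    os.remove(state_file)  # start the demo from a clean slate

# First run: queue three pages, crawl one, then stop
cp = Checkpoint(state_file)
for url in ["https://example.com/1", "https://example.com/2", "https://example.com/3"]:
    cp.add(url)
print(cp.next_url())  # https://example.com/1
cp.save()

# "Second run": the state is reloaded and crawling resumes at page 2
cp2 = Checkpoint(state_file)
print(cp2.next_url())  # https://example.com/2
```

A real implementation would also save in-flight items and write the file atomically, but the reload-and-continue logic is the core of what `CHECKPOINT_ENABLED` provides.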

Proxy Management

A basic proxy

from scrapling.fetchers import Fetcher

fetcher = Fetcher()

# Proxy for a single request
response = fetcher.fetch(
    'https://example.com',
    proxy='http://user:pass@proxy.example.com:8080'
)

Proxy rotation

from scrapling.fetchers import Fetcher
from scrapling.proxy import ProxyRotator

# Create a proxy rotator
rotator = ProxyRotator([
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
])

fetcher = Fetcher()

# The target URLs to crawl
urls = ['https://example.com/page1', 'https://example.com/page2']

# Rotate proxies automatically on every request
for url in urls:
    response = fetcher.fetch(url, proxy=rotator.get_proxy())
    print(response.status_code)
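
The rotation logic here is just round-robin over the proxy list. If a rotator class is not available in your installed version, an equivalent stdlib stand-in is a few lines (this `SimpleRotator` is a hypothetical helper, not part of Scrapling):

```python
from itertools import cycle

class SimpleRotator:
    """Cycle through a fixed proxy list, one proxy per request."""

    def __init__(self, proxies):
        self._pool = cycle(proxies)

    def get_proxy(self):
        # Return the next proxy, wrapping around at the end of the list
        return next(self._pool)

rotator = SimpleRotator([
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
])

# Two requests get the two proxies; the third wraps around
print(rotator.get_proxy())  # http://proxy1.example.com:8080
print(rotator.get_proxy())  # http://proxy2.example.com:8080
print(rotator.get_proxy())  # http://proxy1.example.com:8080
```

The same `get_proxy()` call slots directly into the `fetcher.fetch(url, proxy=...)` loop above.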

Spider proxy configuration

from scrapling.spiders import Spider, Response

class ProxySpider(Spider):
    name = "proxy"
    start_urls = ["https://example.com/"]

    custom_settings = {
        'PROXY_LIST': [
            'http://proxy1.example.com:8080',
            'http://proxy2.example.com:8080',
        ],
        'PROXY_ROTATION': 'cyclic',  # or 'random'
    }

    async def parse(self, response: Response):
        yield {
            "url": response.url,
            "title": response.css('title::text').get(),
        }

Data Export

JSON export

from scrapling.spiders import Spider, Response

class ExportSpider(Spider):
    name = "export"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {
                "name": item.css('h2::text').get(),
                "price": item.css('.price::text').get(),
            }

# Run and export
if __name__ == "__main__":
    spider = ExportSpider()
    result = spider.start()

    # Export to JSON
    result.items.to_json('products.json')

    # Export to JSON Lines
    result.items.to_jsonl('products.jsonl')

Custom pipelines

from scrapling.spiders import Spider, Response

class CustomPipeline:
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # Process each item
        item['processed'] = True
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Runs when the spider finishes
        print(f"Total items: {len(self.items)}")

class PipelineSpider(Spider):
    name = "pipeline"
    start_urls = ["https://example.com/"]

    custom_settings = {
        'ITEM_PIPELINES': [CustomPipeline],
    }

    async def parse(self, response: Response):
        yield {"title": response.css('title::text').get()}
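
The pipeline pattern itself is framework-independent: each stage's `process_item` receives whatever the previous stage returned. A framework-free sketch of that chaining (the `run_pipelines` helper and the two demo stages are illustrative, not Scrapling internals):

```python
def run_pipelines(item, pipelines, spider=None):
    """Pass an item through each pipeline stage in order."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item

class AddTimestamp:
    def process_item(self, item, spider):
        item["ts"] = "2026-03-17"  # fixed value for the demo
        return item

class NormalizePrice:
    def process_item(self, item, spider):
        # Strip the currency sign and convert to a float
        item["price"] = float(item["price"].lstrip("$"))
        return item

item = run_pipelines(
    {"name": "Widget", "price": "$9.99"},
    [AddTimestamp(), NormalizePrice()],
)
print(item)  # {'name': 'Widget', 'price': 9.99, 'ts': '2026-03-17'}
```

Because each stage returns the item, a stage can also drop an item (by raising or returning a sentinel, depending on the framework's convention) or enrich it, which is why pipeline order matters.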

Command-Line Tools

Quick fetch

# Fetch a URL directly, no code required
scrapling fetch https://example.com

# Save the output to a file
scrapling fetch https://example.com -o output.html

# Use stealth mode
scrapling fetch https://example.com --stealth

Interactive shell

# Start an interactive shell
scrapling shell https://example.com

# A response object is available inside the shell
>>> response.css('title::text').get()
>>> response.css('a::attr(href)').getall()

The extract command

# Extract data with a CSS selector
scrapling extract https://example.com "h2::text"

# Extract attributes
scrapling extract https://example.com "a::attr(href)"

Comparison with BeautifulSoup

# BeautifulSoup
from bs4 import BeautifulSoup
import requests

response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.find('title').text

# Scrapling
from scrapling.fetchers import Fetcher

response = Fetcher().fetch('https://example.com')
title = response.css('title::text').get()

# Scrapling advantages:
# 1. Built-in fetching, no extra imports needed
# 2. Both CSS and XPath selectors
# 3. Adaptive parsing that survives site changes
# 4. Built-in anti-bot bypass
# 5. Full async support

Performance

Scrapling outperforms most Python scraping libraries:

  • JSON serialization: about 10x faster than the standard library
  • Memory efficiency: optimized data structures and lazy loading
  • Concurrency: efficient built-in concurrency control
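
Claims like these are easy to verify against your own workload with a small `timeit` harness. The pattern below times any two callables; the two string-processing stand-ins are placeholders, to be replaced with real parse calls (e.g. a Scrapling selector query versus a `BeautifulSoup(...).select(...)` call on the same HTML):

```python
import timeit

def benchmark(label, fn, number=1000):
    """Time `number` calls of fn and report the total in milliseconds."""
    elapsed = timeit.timeit(fn, number=number)
    print(f"{label}: {elapsed * 1000:.1f} ms for {number} runs")
    return elapsed

# Stand-in workloads; substitute the parser calls you want to compare
html = "<html><title>demo</title></html>" * 100
fast = benchmark("split-based scan", lambda: html.split("<title>"))
slow = benchmark("char-by-char scan", lambda: [c for c in html if c == "<"])
```

Benchmarking on your own target pages matters more than headline numbers: selector complexity, document size, and network time usually dominate.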

Summary

Scrapling is a comprehensive modern scraping framework that combines the simplicity of Requests, the ease of use of BeautifulSoup, the power of Scrapy, and the browser automation of Playwright. Its adaptive parsing is especially well suited to scrapers that must be maintained over the long term, while the built-in anti-bot bypass lets developers focus on the data-extraction logic.


Tip: Scrapling requires Python 3.10 or newer. For sites behind strong anti-bot protection such as Cloudflare, use StealthyFetcher or DynamicFetcher.


Scrapling: An Adaptive Modern Web Scraping Framework
https://kingjem.github.io/2026/03/17/Scrapling:自适应现代网络爬虫框架/
Author: Ruhai, published March 17, 2026