Scrapling: An Adaptive Modern Web Scraping Framework

Introduction

Scrapling is an adaptive web scraping framework that handles everything from single requests to large-scale crawls. Its parser can learn how a site changes and automatically relocate elements when pages are updated. Its fetchers bypass anti-bot systems such as Cloudflare Turnstile out of the box. Its Spider framework supports concurrent, multi-session crawling with pause/resume and automatic proxy rotation.

Core Features

  • Adaptive parsing: elements can still be located after the site structure changes
  • Anti-bot bypass: the built-in StealthyFetcher gets past Cloudflare
  • Multi-session support: one unified interface for HTTP requests and browser automation
  • Pause/resume: checkpoint-based crawl persistence
  • Streaming mode: scraped results are streamed in real time
  • Full async support: every component supports async/await

Installation

Basic installation

pip install scrapling

Full installation (fetchers and browsers included)

pip install "scrapling[fetchers]"
scrapling install

Install everything

pip install "scrapling[all]"
scrapling install

Docker installation

docker pull pyd4vinci/scrapling
# or from the GitHub Container Registry
docker pull ghcr.io/d4vinci/scrapling:latest

Basic Usage

A simple HTTP request

from scrapling.fetchers import Fetcher

# Create a fetcher
fetcher = Fetcher()

# Fetch the page
response = fetcher.fetch('https://example.com')

# Parse the data
title = response.css('title::text').get()
print(f"Title: {title}")

Spoofing request headers

from scrapling.fetchers import Fetcher

fetcher = Fetcher()

# Custom request headers
response = fetcher.fetch(
    'https://example.com',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    }
)

print(response.status_code)
print(response.css('h1::text').get())

Advanced Fetchers

StealthyFetcher - the anti-bot fetcher

from scrapling.fetchers import StealthyFetcher

# Enable adaptive mode
StealthyFetcher.adaptive = True

# Fetch the page (bypasses Cloudflare automatically)
page = StealthyFetcher.fetch(
    'https://example.com',
    headless=True,
    network_idle=True
)

# Extract data
products = page.css('.product', auto_save=True)
for product in products:
    name = product.css('h2::text').get()
    price = product.css('.price::text').get()
    print(f"{name}: {price}")

DynamicFetcher - fetching dynamic pages

from scrapling.fetchers import DynamicFetcher

# Fetch a dynamically loaded page
page = DynamicFetcher.fetch(
    'https://example.com',
    headless=False,      # show the browser window
    wait_for='.loaded',  # wait for a specific element to load
    timeout=30000
)

# Wait for and click elements
page.click('#load-more')
page.wait_for_selector('.new-items')

# Extract data
items = page.css('.item')
for item in items:
    print(item.css('.title::text').get())

Async fetching

import asyncio
from scrapling.fetchers import AsyncFetcher

async def fetch_multiple():
    fetcher = AsyncFetcher()

    # Fetch several pages concurrently
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3',
    ]

    tasks = [fetcher.fetch(url) for url in urls]
    responses = await asyncio.gather(*tasks)

    for response in responses:
        print(response.css('title::text').get())

# Run
asyncio.run(fetch_multiple())

Session Management

FetcherSession - HTTP sessions

from scrapling.fetchers import FetcherSession

# Create a session (cookies are managed automatically)
with FetcherSession() as session:
    # Log in
    login_response = session.post(
        'https://example.com/login',
        data={'username': 'user', 'password': 'pass'}
    )

    # Subsequent requests carry the login state automatically
    profile = session.fetch('https://example.com/profile')
    print(profile.css('.username::text').get())

StealthySession - stealth sessions

from scrapling.fetchers import StealthySession

with StealthySession() as session:
    # All requests share one browser context
    page1 = session.fetch('https://example.com/page1')
    page2 = session.fetch('https://example.com/page2')

    # Extract data
    data1 = page1.css('.data::text').getall()
    data2 = page2.css('.data::text').getall()

DynamicSession - dynamic sessions

from scrapling.fetchers import DynamicSession

with DynamicSession(headless=False) as session:
    # First page: log in
    page1 = session.fetch('https://example.com/login')
    page1.fill('#username', 'myuser')
    page1.fill('#password', 'mypass')
    page1.click('#submit')

    # Visit other pages after logging in
    page2 = session.fetch('https://example.com/dashboard')
    print(page2.css('.welcome::text').get())

Adaptive Parsing

Basic CSS selectors

from scrapling.fetchers import Fetcher

response = Fetcher().fetch('https://example.com')

# A single element
title = response.css('title::text').get()

# Multiple elements
links = response.css('a::attr(href)').getall()

# Nested selection
for item in response.css('.product'):
    name = item.css('h2::text').get()
    price = item.css('.price::text').get()
    link = item.css('a::attr(href)').get()
    print(f"{name}: {price} - {link}")

XPath selectors

from scrapling.fetchers import Fetcher

response = Fetcher().fetch('https://example.com')

# XPath selection
title = response.xpath('//title/text()').get()

# Attribute selection
links = response.xpath('//a/@href').getall()

# Conditional selection
items = response.xpath('//div[@class="product"][@data-available="true"]')

Adaptive selection (keeps working after the site structure changes)

from scrapling.fetchers import StealthyFetcher

StealthyFetcher.adaptive = True

page = StealthyFetcher.fetch('https://example.com')

# Save element fingerprints on the first scrape
products = page.css('.product', auto_save=True)

# After the site changes, relocate the elements with adaptive=True;
# they are found even if their CSS class names have changed
products = page.css('.product', adaptive=True)

for product in products:
    print(product.css('h2::text').get())

Finding similar elements

from scrapling.fetchers import Fetcher

response = Fetcher().fetch('https://example.com')

# Find one element
first_item = response.css('.item')[0]

# Find elements similar to it
similar_items = first_item.find_similar()

for item in similar_items:
    print(item.css('.title::text').get())

The Spider Framework

A basic Spider

from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        # Extract data
        for item in response.css('.product'):
            yield {
                "title": item.css('h2::text').get(),
                "price": item.css('.price::text').get(),
                "link": item.css('a::attr(href)').get(),
            }

        # Follow the next page
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Start the spider
if __name__ == "__main__":
    MySpider().start()

A multi-session Spider

from scrapling.spiders import Spider, Response

class MultiSessionSpider(Spider):
    name = "multi_session"
    start_urls = ["https://example.com/"]

    # Session configuration
    custom_settings = {
        'CONCURRENT_SESSIONS': 3,
        'SESSION_TYPE': 'stealthy',  # or 'dynamic', 'fetcher'
    }

    async def parse(self, response: Response):
        # Use different sessions as needed
        session_id = response.meta.get('session_id')

        yield {
            "url": response.url,
            "session": session_id,
            "title": response.css('title::text').get(),
        }

if __name__ == "__main__":
    MultiSessionSpider().start()

Streaming output

from scrapling.spiders import Spider, Response

class StreamingSpider(Spider):
    name = "streaming"
    start_urls = ["https://example.com/products"]

    async def parse(self, response: Response):
        for product in response.css('.product'):
            yield {
                "name": product.css('h2::text').get(),
                "price": product.css('.price::text').get(),
            }

# Stream results as they arrive
async def main():
    spider = StreamingSpider()

    async for item in spider.stream():
        print(f"Received: {item}")

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

Pause and resume

from scrapling.spiders import Spider, Response

class ResumableSpider(Spider):
    name = "resumable"
    start_urls = ["https://example.com/"]

    # Enable checkpoints
    custom_settings = {
        'CHECKPOINT_ENABLED': True,
        'CHECKPOINT_DIR': './checkpoints',
    }

    async def parse(self, response: Response):
        for item in response.css('.item'):
            yield {
                "title": item.css('h2::text').get(),
                "url": item.css('a::attr(href)').get(),
            }

        # Follow pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# First run
if __name__ == "__main__":
    spider = ResumableSpider()

    try:
        spider.start()
    except KeyboardInterrupt:
        print("Spider paused. Run again to resume.")

# Running again resumes automatically from the checkpoint
# spider.start()
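
Checkpointing itself is a simple idea that does not depend on Scrapling: persist the pending URL frontier and the set of already-crawled URLs, so a later run can pick up where the last one stopped. A minimal stdlib-only sketch of that pattern (this `Checkpoint` class and its file layout are illustrative, not Scrapling's actual checkpoint format):

```python
import json
import os
import tempfile

class Checkpoint:
    """Persist the crawl frontier so an interrupted run can resume."""

    def __init__(self, path):
        self.path = path
        if os.path.exists(path):
            # Resume: reload the saved state
            with open(path) as f:
                state = json.load(f)
            self.pending = state["pending"]
            self.done = set(state["done"])
        else:
            # Fresh start
            self.pending = []
            self.done = set()

    def add(self, url):
        # Queue a URL unless it was already crawled or queued
        if url not in self.done and url not in self.pending:
            self.pending.append(url)

    def next_url(self):
        # Pop the next URL and mark it as crawled
        url = self.pending.pop(0)
        self.done.add(url)
        return url

    def save(self):
        with open(self.path, "w") as f:
            json.dump({"pending": self.pending, "done": sorted(self.done)}, f)

state_file = os.path.join(tempfile.gettempdir(), "cp_demo.json")
if os.path.exists(state_file):
    os.remove(state_file)  # start the demo from a clean slate

# First run: queue three pages, crawl one, then stop
cp = Checkpoint(state_file)
for url in ["https://example.com/1", "https://example.com/2", "https://example.com/3"]:
    cp.add(url)
print(cp.next_url())  # https://example.com/1
cp.save()

# "Second run": the state is reloaded and crawling resumes at page 2
cp2 = Checkpoint(state_file)
print(cp2.next_url())  # https://example.com/2
```

A real implementation would also save in-flight items and write the file atomically, but the reload-and-continue logic is the core of what `CHECKPOINT_ENABLED` provides.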

Proxy Management

A basic proxy

from scrapling.fetchers import Fetcher

fetcher = Fetcher()

# Proxy for a single request
response = fetcher.fetch(
    'https://example.com',
    proxy='http://user:pass@proxy.example.com:8080'
)

Proxy rotation

from scrapling.fetchers import Fetcher
from scrapling.proxy import ProxyRotator

# Create a proxy rotator
rotator = ProxyRotator([
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
])

fetcher = Fetcher()

# The target URLs to crawl
urls = ['https://example.com/page1', 'https://example.com/page2']

# Rotate proxies automatically on every request
for url in urls:
    response = fetcher.fetch(url, proxy=rotator.get_proxy())
    print(response.status_code)
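
The rotation logic here is just round-robin over the proxy list. If a rotator class is not available in your installed version, an equivalent stdlib stand-in is a few lines (this `SimpleRotator` is a hypothetical helper, not part of Scrapling):

```python
from itertools import cycle

class SimpleRotator:
    """Cycle through a fixed proxy list, one proxy per request."""

    def __init__(self, proxies):
        self._pool = cycle(proxies)

    def get_proxy(self):
        # Return the next proxy, wrapping around at the end of the list
        return next(self._pool)

rotator = SimpleRotator([
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
])

# Two requests get the two proxies; the third wraps around
print(rotator.get_proxy())  # http://proxy1.example.com:8080
print(rotator.get_proxy())  # http://proxy2.example.com:8080
print(rotator.get_proxy())  # http://proxy1.example.com:8080
```

The same `get_proxy()` call slots directly into the `fetcher.fetch(url, proxy=...)` loop above.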

Spider proxy configuration

from scrapling.spiders import Spider, Response

class ProxySpider(Spider):
    name = "proxy"
    start_urls = ["https://example.com/"]

    custom_settings = {
        'PROXY_LIST': [
            'http://proxy1.example.com:8080',
            'http://proxy2.example.com:8080',
        ],
        'PROXY_ROTATION': 'cyclic',  # or 'random'
    }

    async def parse(self, response: Response):
        yield {
            "url": response.url,
            "title": response.css('title::text').get(),
        }

Data Export

JSON export

from scrapling.spiders import Spider, Response

class ExportSpider(Spider):
    name = "export"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {
                "name": item.css('h2::text').get(),
                "price": item.css('.price::text').get(),
            }

# Run and export
if __name__ == "__main__":
    spider = ExportSpider()
    result = spider.start()

    # Export to JSON
    result.items.to_json('products.json')

    # Export to JSON Lines
    result.items.to_jsonl('products.jsonl')

Custom pipelines

from scrapling.spiders import Spider, Response

class CustomPipeline:
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # Process each item
        item['processed'] = True
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Runs when the spider finishes
        print(f"Total items: {len(self.items)}")

class PipelineSpider(Spider):
    name = "pipeline"
    start_urls = ["https://example.com/"]

    custom_settings = {
        'ITEM_PIPELINES': [CustomPipeline],
    }

    async def parse(self, response: Response):
        yield {"title": response.css('title::text').get()}
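
The pipeline pattern itself is framework-independent: each stage's `process_item` receives whatever the previous stage returned. A framework-free sketch of that chaining (the `run_pipelines` helper and the two demo stages are illustrative, not Scrapling internals):

```python
def run_pipelines(item, pipelines, spider=None):
    """Pass an item through each pipeline stage in order."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item

class AddTimestamp:
    def process_item(self, item, spider):
        item["ts"] = "2026-03-17"  # fixed value for the demo
        return item

class NormalizePrice:
    def process_item(self, item, spider):
        # Strip the currency sign and convert to a float
        item["price"] = float(item["price"].lstrip("$"))
        return item

item = run_pipelines(
    {"name": "Widget", "price": "$9.99"},
    [AddTimestamp(), NormalizePrice()],
)
print(item)  # {'name': 'Widget', 'price': 9.99, 'ts': '2026-03-17'}
```

Because each stage returns the item, a stage can also drop an item (by raising or returning a sentinel, depending on the framework's convention) or enrich it, which is why pipeline order matters.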

Command-Line Tools

Quick fetch

# Fetch a URL directly, no code required
scrapling fetch https://example.com

# Save the output to a file
scrapling fetch https://example.com -o output.html

# Use stealth mode
scrapling fetch https://example.com --stealth

Interactive shell

# Start an interactive shell
scrapling shell https://example.com

# A response object is available inside the shell
>>> response.css('title::text').get()
>>> response.css('a::attr(href)').getall()

The extract command

# Extract data with a CSS selector
scrapling extract https://example.com "h2::text"

# Extract attributes
scrapling extract https://example.com "a::attr(href)"

Comparison with BeautifulSoup

# BeautifulSoup
from bs4 import BeautifulSoup
import requests

response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.find('title').text

# Scrapling
from scrapling.fetchers import Fetcher

response = Fetcher().fetch('https://example.com')
title = response.css('title::text').get()

# Scrapling advantages:
# 1. Built-in fetching, no extra imports needed
# 2. Both CSS and XPath selectors
# 3. Adaptive parsing that survives site changes
# 4. Built-in anti-bot bypass
# 5. Full async support

Performance

Scrapling outperforms most Python scraping libraries:

  • JSON serialization: about 10x faster than the standard library
  • Memory efficiency: optimized data structures and lazy loading
  • Concurrency: efficient built-in concurrency control
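
Claims like these are easy to verify against your own workload with a small `timeit` harness. The pattern below times any two callables; the two string-processing stand-ins are placeholders, to be replaced with real parse calls (e.g. a Scrapling selector query versus a `BeautifulSoup(...).select(...)` call on the same HTML):

```python
import timeit

def benchmark(label, fn, number=1000):
    """Time `number` calls of fn and report the total in milliseconds."""
    elapsed = timeit.timeit(fn, number=number)
    print(f"{label}: {elapsed * 1000:.1f} ms for {number} runs")
    return elapsed

# Stand-in workloads; substitute the parser calls you want to compare
html = "<html><title>demo</title></html>" * 100
fast = benchmark("split-based scan", lambda: html.split("<title>"))
slow = benchmark("char-by-char scan", lambda: [c for c in html if c == "<"])
```

Benchmarking on your own target pages matters more than headline numbers: selector complexity, document size, and network time usually dominate.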

Summary

Scrapling is a comprehensive modern scraping framework that combines the simplicity of Requests, the ease of use of BeautifulSoup, the power of Scrapy, and the browser automation of Playwright. Its adaptive parsing is especially well suited to scrapers that must be maintained over the long term, while the built-in anti-bot bypass lets developers focus on the data-extraction logic.


Tip: Scrapling requires Python 3.10 or newer. For sites behind strong anti-bot protection such as Cloudflare, use StealthyFetcher or DynamicFetcher.


Scrapling: An Adaptive Modern Web Scraping Framework
https://kingjem.github.io/2026/03/17/Scrapling:自适应现代网络爬虫框架/
Author: Ruhai, published March 17, 2026