## Introduction

Scrapling is an adaptive web scraping framework that handles everything from single requests to large-scale crawls. Its parser learns how a website changes and automatically relocates elements when pages are updated. Its fetchers bypass anti-bot systems such as Cloudflare Turnstile out of the box. And its Spider framework supports concurrent, multi-session crawling with pause/resume and automatic proxy rotation.
## Core Features

- Adaptive parsing: elements can still be located after a site's structure changes
- Anti-bot bypass: the built-in StealthyFetcher gets past Cloudflare
- Multi-session support: a unified interface for HTTP requests and browser automation
- Pause/resume: checkpoint-based crawl persistence
- Streaming mode: scraped items are streamed in real time
- Full async support: every component supports async/await
## Installation

### Basic installation

```shell
pip install scrapling
```
### Full installation (fetchers and browsers)

```shell
pip install "scrapling[fetchers]"
scrapling install
```
### Install everything

```shell
pip install "scrapling[all]"
scrapling install
```
### Docker

```shell
docker pull pyd4vinci/scrapling
docker pull ghcr.io/d4vinci/scrapling:latest
```
## Basic Usage

### A simple HTTP request

```python
from scrapling.fetchers import Fetcher

fetcher = Fetcher()
response = fetcher.fetch('https://example.com')

title = response.css('title::text').get()
print(f"Title: {title}")
```
### Spoofing request headers

```python
from scrapling.fetchers import Fetcher

fetcher = Fetcher()
response = fetcher.fetch(
    'https://example.com',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    }
)

print(response.status_code)
print(response.css('h1::text').get())
```
## Advanced Fetchers

### StealthyFetcher: anti-bot fetching

```python
from scrapling.fetchers import StealthyFetcher

StealthyFetcher.adaptive = True

page = StealthyFetcher.fetch(
    'https://example.com',
    headless=True,
    network_idle=True
)

products = page.css('.product', auto_save=True)
for product in products:
    name = product.css('h2::text').get()
    price = product.css('.price::text').get()
    print(f"{name}: {price}")
```
### DynamicFetcher: dynamic pages

```python
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch(
    'https://example.com',
    headless=False,
    wait_for='.loaded',
    timeout=30000
)

page.click('#load-more')
page.wait_for_selector('.new-items')

items = page.css('.item')
for item in items:
    print(item.css('.title::text').get())
```
### Async fetching

```python
import asyncio
from scrapling.fetchers import AsyncFetcher

async def fetch_multiple():
    fetcher = AsyncFetcher()
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3',
    ]

    tasks = [fetcher.fetch(url) for url in urls]
    responses = await asyncio.gather(*tasks)

    for response in responses:
        print(response.css('title::text').get())

asyncio.run(fetch_multiple())
```
## Session Management

### FetcherSession: HTTP sessions

```python
from scrapling.fetchers import FetcherSession

with FetcherSession() as session:
    login_response = session.post(
        'https://example.com/login',
        data={'username': 'user', 'password': 'pass'}
    )

    profile = session.fetch('https://example.com/profile')
    print(profile.css('.username::text').get())
```
### StealthySession: stealth sessions

```python
from scrapling.fetchers import StealthySession

with StealthySession() as session:
    page1 = session.fetch('https://example.com/page1')
    page2 = session.fetch('https://example.com/page2')

    data1 = page1.css('.data::text').getall()
    data2 = page2.css('.data::text').getall()
```
### DynamicSession: browser sessions

```python
from scrapling.fetchers import DynamicSession

with DynamicSession(headless=False) as session:
    page1 = session.fetch('https://example.com/login')
    page1.fill('#username', 'myuser')
    page1.fill('#password', 'mypass')
    page1.click('#submit')

    page2 = session.fetch('https://example.com/dashboard')
    print(page2.css('.welcome::text').get())
```
## Adaptive Parsing

### Basic CSS selectors

```python
from scrapling.fetchers import Fetcher

response = Fetcher().fetch('https://example.com')

title = response.css('title::text').get()
links = response.css('a::attr(href)').getall()

for item in response.css('.product'):
    name = item.css('h2::text').get()
    price = item.css('.price::text').get()
    link = item.css('a::attr(href)').get()
    print(f"{name}: {price} - {link}")
```
### XPath selectors

```python
from scrapling.fetchers import Fetcher

response = Fetcher().fetch('https://example.com')

title = response.xpath('//title/text()').get()
links = response.xpath('//a/@href').getall()
items = response.xpath('//div[@class="product"][@data-available="true"]')
```
### Adaptive selection (keeps working after the site's structure changes)

```python
from scrapling.fetchers import StealthyFetcher

StealthyFetcher.adaptive = True
page = StealthyFetcher.fetch('https://example.com')

# First run: save fingerprints of the matched elements
products = page.css('.product', auto_save=True)

# Later runs: relocate the elements even if the selector no longer matches
products = page.css('.product', adaptive=True)

for product in products:
    print(product.css('h2::text').get())
```
### Finding similar elements

```python
from scrapling.fetchers import Fetcher

response = Fetcher().fetch('https://example.com')

first_item = response.css('.item')[0]
similar_items = first_item.find_similar()

for item in similar_items:
    print(item.css('.title::text').get())
```
## Spider Framework

### A basic Spider

```python
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {
                "title": item.css('h2::text').get(),
                "price": item.css('.price::text').get(),
                "link": item.css('a::attr(href)').get(),
            }

        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

if __name__ == "__main__":
    MySpider().start()
```
### Multi-session Spider

```python
from scrapling.spiders import Spider, Response

class MultiSessionSpider(Spider):
    name = "multi_session"
    start_urls = ["https://example.com/"]
    custom_settings = {
        'CONCURRENT_SESSIONS': 3,
        'SESSION_TYPE': 'stealthy',
    }

    async def parse(self, response: Response):
        session_id = response.meta.get('session_id')
        yield {
            "url": response.url,
            "session": session_id,
            "title": response.css('title::text').get(),
        }

if __name__ == "__main__":
    MultiSessionSpider().start()
```
### Streaming output

```python
from scrapling.spiders import Spider, Response

class StreamingSpider(Spider):
    name = "streaming"
    start_urls = ["https://example.com/products"]

    async def parse(self, response: Response):
        for product in response.css('.product'):
            yield {
                "name": product.css('h2::text').get(),
                "price": product.css('.price::text').get(),
            }

async def main():
    spider = StreamingSpider()
    async for item in spider.stream():
        print(f"Received: {item}")

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
```
### Pause and resume

```python
from scrapling.spiders import Spider, Response

class ResumableSpider(Spider):
    name = "resumable"
    start_urls = ["https://example.com/"]
    custom_settings = {
        'CHECKPOINT_ENABLED': True,
        'CHECKPOINT_DIR': './checkpoints',
    }

    async def parse(self, response: Response):
        for item in response.css('.item'):
            yield {
                "title": item.css('h2::text').get(),
                "url": item.css('a::attr(href)').get(),
            }

        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

if __name__ == "__main__":
    spider = ResumableSpider()
    try:
        spider.start()
    except KeyboardInterrupt:
        print("Spider paused. Run again to resume.")
```
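Scrapling's checkpoint format is internal, but the idea behind `CHECKPOINT_ENABLED` can be sketched in plain Python: persist the frontier of pending URLs plus the set of completed ones, and reload both on startup. The file layout below is purely illustrative, not Scrapling's actual on-disk format:

```python
import json
import tempfile
from pathlib import Path

# Illustrative checkpoint path; Scrapling manages its own CHECKPOINT_DIR.
CHECKPOINT = Path(tempfile.gettempdir()) / "scrapling_demo_state.json"

def save_checkpoint(pending, done):
    """Persist the crawl frontier so an interrupted run can be resumed."""
    CHECKPOINT.write_text(json.dumps({"pending": list(pending), "done": sorted(done)}))

def load_checkpoint():
    """Restore the frontier, or start fresh if no checkpoint exists."""
    if CHECKPOINT.exists():
        state = json.loads(CHECKPOINT.read_text())
        return list(state["pending"]), set(state["done"])
    return [], set()

# Simulate an interrupted crawl: one page scraped, two still pending.
save_checkpoint(["https://example.com/p2", "https://example.com/p3"],
                {"https://example.com/p1"})
pending, done = load_checkpoint()
print(pending)  # ['https://example.com/p2', 'https://example.com/p3']
print(done)     # {'https://example.com/p1'}
```

On restart the spider would skip everything in `done` and re-enqueue `pending`, which is why interrupting with Ctrl+C loses at most the in-flight requests.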
## Proxy Management

### A single proxy

```python
from scrapling.fetchers import Fetcher

fetcher = Fetcher()
response = fetcher.fetch(
    'https://example.com',
    proxy='http://user:pass@proxy.example.com:8080'
)
```
### Proxy rotation

```python
from scrapling.fetchers import Fetcher
from scrapling.proxy import ProxyRotator

rotator = ProxyRotator([
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
])

urls = ['https://example.com/page1', 'https://example.com/page2']

fetcher = Fetcher()
for url in urls:
    response = fetcher.fetch(url, proxy=rotator.get_proxy())
    print(response.status_code)
```
### Spider proxy configuration

```python
from scrapling.spiders import Spider, Response

class ProxySpider(Spider):
    name = "proxy"
    start_urls = ["https://example.com/"]
    custom_settings = {
        'PROXY_LIST': [
            'http://proxy1.example.com:8080',
            'http://proxy2.example.com:8080',
        ],
        'PROXY_ROTATION': 'cyclic',
    }

    async def parse(self, response: Response):
        yield {
            "url": response.url,
            "title": response.css('title::text').get(),
        }
```
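A cyclic strategy hands proxies out round-robin. Assuming that is what `'PROXY_ROTATION': 'cyclic'` means here (the setting name is taken from the example above), the behaviour is equivalent to `itertools.cycle`:

```python
from itertools import cycle

# Round-robin over the proxy pool, wrapping back to the first after the last.
proxies = cycle([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
])

# Each request takes the next proxy in turn.
assigned = [next(proxies) for _ in range(4)]
print(assigned)
# ['http://proxy1.example.com:8080', 'http://proxy2.example.com:8080',
#  'http://proxy1.example.com:8080', 'http://proxy2.example.com:8080']
```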
## Data Export

### JSON export

```python
from scrapling.spiders import Spider, Response

class ExportSpider(Spider):
    name = "export"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {
                "name": item.css('h2::text').get(),
                "price": item.css('.price::text').get(),
            }

if __name__ == "__main__":
    spider = ExportSpider()
    result = spider.start()
    result.items.to_json('products.json')
    result.items.to_jsonl('products.jsonl')
```
### Custom pipeline

```python
from scrapling.spiders import Spider, Response

class CustomPipeline:
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        item['processed'] = True
        self.items.append(item)
        return item

    def close_spider(self, spider):
        print(f"Total items: {len(self.items)}")

class PipelineSpider(Spider):
    name = "pipeline"
    start_urls = ["https://example.com/"]
    custom_settings = {
        'ITEM_PIPELINES': [CustomPipeline],
    }

    async def parse(self, response: Response):
        yield {"title": response.css('title::text').get()}
```
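Because the pipeline is an ordinary class, its `process_item` logic can be exercised on its own, without running a spider, which is handy for unit-testing before wiring it into `ITEM_PIPELINES`:

```python
class CustomPipeline:
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # Mark and collect each item exactly as the spider-driven run would.
        item["processed"] = True
        self.items.append(item)
        return item

    def close_spider(self, spider):
        print(f"Total items: {len(self.items)}")

# Feed the pipeline directly with dicts, as a spider's parse() would yield them.
pipeline = CustomPipeline()
for raw in [{"title": "Page 1"}, {"title": "Page 2"}]:
    pipeline.process_item(raw, spider=None)
pipeline.close_spider(spider=None)  # prints "Total items: 2"
```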
## Command-Line Tools

### Quick fetch

```shell
# Fetch a page and print it
scrapling fetch https://example.com

# Save the page to a file
scrapling fetch https://example.com -o output.html

# Fetch in stealth mode
scrapling fetch https://example.com --stealth
```
### Interactive shell

```shell
scrapling shell https://example.com
>>> response.css('title::text').get()
>>> response.css('a::attr(href)').getall()
```
### Extract command

```shell
scrapling extract https://example.com "h2::text"
scrapling extract https://example.com "a::attr(href)"
```
## Comparison with BeautifulSoup

```python
# BeautifulSoup: two libraries, two steps
from bs4 import BeautifulSoup
import requests

response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.find('title').text

# Scrapling: one library, one step
from scrapling.fetchers import Fetcher

response = Fetcher().fetch('https://example.com')
title = response.css('title::text').get()
```
## Performance

Scrapling outperforms most Python scraping libraries:

- JSON serialization: roughly 10x faster than the standard library
- Memory efficiency: optimized data structures and lazy loading
- Concurrency: efficient concurrency control built in
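The concurrency benefit is easy to demonstrate in isolation. Using `asyncio.sleep` as a stand-in for network latency (no Scrapling required), five 0.1 s "requests" finish in roughly 0.1 s when gathered concurrently, instead of the 0.5 s a sequential loop would take:

```python
import asyncio
import time

async def fake_fetch(url: str) -> str:
    await asyncio.sleep(0.1)  # simulated network latency
    return url

async def main() -> float:
    start = time.perf_counter()
    urls = [f"https://example.com/page{i}" for i in range(5)]
    # All five waits overlap, so total time is close to the longest single wait.
    await asyncio.gather(*(fake_fetch(u) for u in urls))
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"5 concurrent requests took {elapsed:.2f}s")  # ~0.10s, not 0.50s
```

The same overlap is what `AsyncFetcher` and the Spider framework's concurrent sessions exploit for real HTTP requests.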
## Summary

Scrapling is a comprehensive, modern scraping framework that combines the simplicity of Requests, the ease of use of BeautifulSoup, the power of Scrapy, and the browser automation of Playwright. Its adaptive parsing is particularly well suited to scrapers that must be maintained over the long term, and the built-in anti-bot bypasses let developers focus on extraction logic.
Tip: Scrapling requires Python 3.10 or newer. For sites behind strong anti-bot protection such as Cloudflare, use StealthyFetcher or DynamicFetcher.