blog-scraper
Blog Scraper
Scrape blog posts via RSS/Atom feeds (free) with optional Apify fallback for JS-heavy sites.
Quick Start
For RSS mode (free), the only dependency is `requests`. No API key is needed.

pip install requests
Scrape a blog's RSS feed:

python3 skills/blog-scraper/scripts/scrape_blogs.py --urls "https://example.com/blog" --days 30
Multiple blogs with keyword filter:

python3 skills/blog-scraper/scripts/scrape_blogs.py --urls "https://blog1.com,https://blog2.com" --keywords "AI,marketing" --output summary
Force Apify for JS-heavy sites:

python3 skills/blog-scraper/scripts/scrape_blogs.py --urls "https://example.com" --mode apify
How It Works
Auto Mode (default)
- For each URL, tries to discover an RSS/Atom feed:
  - Checks HTML `<link rel="alternate">` tags
  - Probes common paths: `/feed`, `/rss`, `/atom.xml`, `/feed.xml`, `/rss.xml`, `/blog/feed/index.xml`
- Parses discovered feeds (supports RSS 2.0 and Atom)
- If any URLs fail, falls back to the Apify actor `jupri/rss-xml-scraper` (if a token is available)
- Applies date and keyword filtering client-side
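The discovery step above can be sketched roughly as follows. This is a minimal illustration and not the script's actual code: the names `FeedLinkParser` and `candidate_feed_urls` are hypothetical, the restriction to RSS/Atom MIME types is an assumption, and so is resolving the probe paths against the site root.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Probe paths copied verbatim from the list above.
COMMON_FEED_PATHS = [
    "/feed", "/rss", "/atom.xml", "/feed.xml", "/rss.xml", "/blog/feed/index.xml",
]

# Assumption: only these MIME types are treated as feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects href values from <link rel="alternate"> tags that advertise a feed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel_values = attr.get("rel", "").lower().split()
        is_feed_type = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel_values and is_feed_type and attr.get("href"):
            self.hrefs.append(attr["href"])


def candidate_feed_urls(page_url, html):
    """Return feed URLs to try: advertised <link> feeds first, then common paths."""
    parser = FeedLinkParser()
    parser.feed(html)
    candidates = [urljoin(page_url, href) for href in parser.hrefs]
    candidates += [urljoin(page_url, path) for path in COMMON_FEED_PATHS]
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(candidates))
```

Each candidate would then be fetched (for example with `requests.get`) until one returns parseable feed XML. Trying advertised links before guessed paths avoids unnecessary requests on sites that declare their feed.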
RSS Mode
Only tries RSS feeds, no Apify fallback.
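Since RSS mode depends entirely on feed parsing, here is a minimal sketch of how the two supported formats (RSS 2.0 and Atom) might be read with the standard library. The function name `parse_feed` and the returned field names are assumptions for illustration; the script's real output schema is not documented here.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import parsedate_to_datetime

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def parse_feed(xml_text):
    """Return a list of {title, link, published} dicts from RSS 2.0 or Atom XML."""
    root = ET.fromstring(xml_text)
    posts = []
    if root.tag == "rss":
        # RSS 2.0: <rss><channel><item>..., dates in RFC 822 style.
        for item in root.iter("item"):
            pub = item.findtext("pubDate")
            posts.append({
                "title": (item.findtext("title") or "").strip(),
                "link": (item.findtext("link") or "").strip(),
                "published": parsedate_to_datetime(pub) if pub else None,
            })
    elif root.tag == ATOM_NS + "feed":
        # Atom: namespaced <feed><entry>..., dates in ISO 8601 style.
        for entry in root.iter(ATOM_NS + "entry"):
            link_el = entry.find(ATOM_NS + "link")
            stamp = (entry.findtext(ATOM_NS + "published")
                     or entry.findtext(ATOM_NS + "updated"))
            posts.append({
                "title": (entry.findtext(ATOM_NS + "title") or "").strip(),
                "link": link_el.get("href", "") if link_el is not None else "",
                # Older Python versions reject a trailing "Z", so normalize it.
                "published": (datetime.fromisoformat(stamp.strip().replace("Z", "+00:00"))
                              if stamp else None),
            })
    return posts
```

Anything that is neither an `<rss>` nor an Atom `<feed>` root yields an empty list, which a caller could treat as a failed URL.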
Apify Mode
Uses the Apify actor directly, skipping RSS discovery.
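To make the relationship between the three modes concrete, here is a small dispatch sketch. It is illustrative only: `scrape`, `rss_fetch`, and `apify_fetch` are hypothetical names, the fetchers are passed in as plain callables, and the mode value `"rss"` is inferred from the section heading rather than confirmed by the source.

```python
def scrape(urls, mode="auto", rss_fetch=None, apify_fetch=None, apify_token=None):
    """Dispatch across modes.

    rss_fetch(url) returns a list of posts or raises on failure.
    apify_fetch(urls, token) returns a list of posts.
    Both are hypothetical callables supplied by the caller.
    """
    if mode not in ("auto", "rss", "apify"):
        raise ValueError(f"unknown mode: {mode}")

    if mode == "apify":
        # Apify mode: skip RSS discovery entirely.
        return apify_fetch(urls, apify_token)

    posts, failed = [], []
    for url in urls:
        try:
            posts.extend(rss_fetch(url))
        except Exception:
            failed.append(url)

    # Auto mode only: retry failed URLs via Apify when a token is available.
    if mode == "auto" and failed and apify_token:
        posts.extend(apify_fetch(failed, apify_token))
    return posts
```

In this sketch, RSS mode and auto mode without a token behave identically: URLs whose feeds cannot be fetched are simply skipped.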
CLI Reference
| Flag | Default | Description |
|---|---|---|
| `--urls` | required | Blog URL(s), comma-separated |
| `--keywords` | none | Keywords to filter (comma-separated, OR logic) |
| `--days` | 30 | Only include posts from the last N days |
| | 50 | Max posts to return |
| `--mode` | auto | Mode to use: auto, RSS, or Apify (see How It Works) |
| `--output` | json | Output format (e.g. `json`, `summary`) |
| | env var | Apify token (only needed for Apify mode/fallback) |
| | 300 | Max seconds for Apify run |
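The filtering options in the table (keywords with OR logic, last N days, max posts) could combine along these lines. This is a sketch under assumptions, not the script's implementation: `filter_posts` is a hypothetical name, matching against title plus summary is a guess, and so are dropping undated posts and sorting newest first.

```python
from datetime import datetime, timedelta, timezone


def filter_posts(posts, keywords=None, days=30, max_posts=50, now=None):
    """Keep posts from the last `days` days that match ANY keyword (OR logic)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    wanted = [k.strip().lower() for k in (keywords or []) if k.strip()]

    kept = []
    for post in posts:
        published = post.get("published")
        # Assumption: posts without a date, or older than the cutoff, are dropped.
        if published is None or published < cutoff:
            continue
        text = f"{post.get('title', '')} {post.get('summary', '')}".lower()
        # Case-insensitive substring match; one matching keyword is enough.
        if wanted and not any(k in text for k in wanted):
            continue
        kept.append(post)

    # Assumption: newest posts first, then truncate to the limit.
    kept.sort(key=lambda p: p["published"], reverse=True)
    return kept[:max_posts]
```

Note that plain substring matching means a short keyword such as "AI" also matches inside longer words like "email". A word-boundary regex would be stricter, at the cost of missing variants.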
Cost
- RSS mode: Free (no API, no tokens)
- Apify mode: Uses the `jupri/rss-xml-scraper` actor, which consumes minimal Apify credits