blog-scraper

Scrape blog posts via RSS/Atom feeds (free) with optional Apify fallback for JS-heavy sites.


For RSS mode (free), the only dependency is `requests`, installed with `pip install requests`. No API key is needed.


python3 skills/blog-scraper/scripts/scrape_blogs.py --urls "https://example.com/blog" --days 30

python3 skills/blog-scraper/scripts/scrape_blogs.py --urls "https://blog1.com,https://blog2.com" --keywords "AI,marketing" --output summary

python3 skills/blog-scraper/scripts/scrape_blogs.py --urls "https://example.com" --mode apify


In `auto` mode (the default):

1. For each URL, tries to discover an RSS/Atom feed:
   - Checks HTML `<link rel="alternate">` tags
   - Probes common paths: `/feed`, `/rss`, `/atom.xml`, `/feed.xml`, `/rss.xml`, `/blog/feed`, `/index.xml`
2. Parses discovered feeds (supports RSS 2.0 and Atom)
3. If any URLs fail, falls back to the Apify actor `jupri/rss-xml-scraper` (if a token is available)
4. Applies date and keyword filtering client-side
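
The discovery step can be sketched as follows. This is a minimal illustration and not the actual code in `scrape_blogs.py`: the names `FeedLinkParser` and `feed_candidates` are hypothetical, and restricting `<link rel="alternate">` tags to the RSS and Atom MIME types is an assumption. The sketch only builds the list of candidate feed URLs; fetching each candidate (for example with `requests.get`) is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Paths listed in the discovery step above.
COMMON_FEED_PATHS = [
    "/feed", "/rss", "/atom.xml", "/feed.xml",
    "/rss.xml", "/blog/feed", "/index.xml",
]

# Assumption: only these MIME types are treated as feed links.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects href values from <link rel="alternate"> tags that advertise a feed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel_values = attr.get("rel", "").lower().split()
        is_feed_type = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel_values and is_feed_type and attr.get("href"):
            self.hrefs.append(attr["href"])


def feed_candidates(page_url, html):
    """Return candidate feed URLs: declared <link> feeds first, then common paths."""
    parser = FeedLinkParser()
    parser.feed(html)
    declared = [urljoin(page_url, href) for href in parser.hrefs]
    probed = [urljoin(page_url, path) for path in COMMON_FEED_PATHS]

    # De-duplicate while preserving order, so declared feeds are tried first.
    seen = set()
    candidates = []
    for url in declared + probed:
        if url not in seen:
            seen.add(url)
            candidates.append(url)
    return candidates
```

Trying declared feeds before the probed paths avoids unnecessary requests when the page already advertises its feed.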

In `rss` mode, only RSS feeds are tried; there is no Apify fallback.


In `apify` mode, the Apify actor is used directly, skipping RSS discovery.


| Flag | Default | Description |
| --- | --- | --- |
| `--urls` | required | Blog URL(s), comma-separated |
| `--keywords` | none | Keywords to filter (comma-separated, OR logic) |
| `--days` | 30 | Only include posts from last N days |
| `--max-posts` | 50 | Max posts to return |
| `--mode` | auto | `auto` (RSS + fallback), `rss` (RSS only), `apify` (Apify only) |
| `--output` | json | Output format: `json` or `summary` |
| `--token` | env var | Apify token (only needed for Apify mode/fallback) |
| `--timeout` | 300 | Max seconds for Apify run |
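
To make the `--keywords`, `--days`, and `--max-posts` semantics concrete, here is a rough sketch of client-side filtering. It is not taken from `scrape_blogs.py`: the function name `filter_posts`, the post dictionary keys (`title`, `summary`, `published`), and the choice of case-insensitive substring matching against title and summary are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def filter_posts(posts, keywords=None, days=30, max_posts=50, now=None):
    """Keep recent posts that match at least one keyword (OR logic).

    Each post is assumed to be a dict with 'title', 'summary', and a
    timezone-aware 'published' datetime (hypothetical shape).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    wanted = [k.strip().lower() for k in (keywords or []) if k.strip()]

    kept = []
    for post in posts:
        # Date filter: drop anything older than the cutoff.
        if post["published"] < cutoff:
            continue
        # Keyword filter: any single keyword match is enough.
        text = f"{post.get('title', '')} {post.get('summary', '')}".lower()
        if wanted and not any(keyword in text for keyword in wanted):
            continue
        kept.append(post)

    # Cap the result, mirroring --max-posts.
    return kept[:max_posts]
```

With no keywords given, every post inside the date window is kept, which matches the `none` default in the table.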
