blog-scraper


Blog Scraper


Scrape blog posts via RSS/Atom feeds (free) with optional Apify fallback for JS-heavy sites.

Quick Start


For RSS mode (free), the only dependency is requests (pip install requests). No API key is needed.

```bash
# Scrape a blog's RSS feed
python3 skills/blog-scraper/scripts/scrape_blogs.py \
  --urls "https://example.com/blog" --days 30

# Multiple blogs with keyword filter
python3 skills/blog-scraper/scripts/scrape_blogs.py \
  --urls "https://blog1.com,https://blog2.com" --keywords "AI,marketing" --output summary

# Force Apify for JS-heavy sites
python3 skills/blog-scraper/scripts/scrape_blogs.py \
  --urls "https://example.com" --mode apify
```

How It Works


Auto Mode (default)


  1. For each URL, tries to discover an RSS/Atom feed:
    • Checks HTML <link rel="alternate"> tags
    • Probes common paths: /feed, /rss, /atom.xml, /feed.xml, /rss.xml, /blog/feed, /index.xml
  2. Parses discovered feeds (supports RSS 2.0 and Atom)
  3. If any URLs fail, falls back to Apify jupri/rss-xml-scraper (if token available)
  4. Applies date and keyword filtering client-side
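
The discovery step (step 1) can be sketched with the standard library alone. This is an illustration, not the code in scrape_blogs.py: the names FeedLinkParser and candidate_feed_urls are hypothetical, and resolving the probe paths against the site root is an assumption about how the probing works.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that mark a <link rel="alternate"> as a feed.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

# The common paths listed in step 1.
COMMON_PATHS = ["/feed", "/rss", "/atom.xml", "/feed.xml",
                "/rss.xml", "/blog/feed", "/index.xml"]


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # href may be relative, so resolve it against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def candidate_feed_urls(page_url, html):
    """Return advertised feeds first, then the common-path probes, deduplicated."""
    parser = FeedLinkParser(page_url)
    parser.feed(html)
    probes = [urljoin(page_url, path) for path in COMMON_PATHS]
    seen, ordered = set(), []
    for url in parser.feeds + probes:
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered
```

A caller would fetch the page with requests, pass the HTML to candidate_feed_urls, and then request each candidate in order until one returns a parseable feed.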

RSS Mode


Only tries RSS feeds, no Apify fallback.
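
RSS mode depends entirely on feed parsing, so a minimal sketch of parsing both supported formats (RSS 2.0 and Atom) may help. It uses only the standard library. The function name parse_feed and the returned dictionary keys are assumptions for illustration, not the script's actual output schema.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 XML namespace


def _parse_date(text, rfc822):
    """Parse an RSS (RFC 822 style) or Atom (ISO 8601 style) date, or return None."""
    if not text:
        return None
    try:
        if rfc822:
            parsed = parsedate_to_datetime(text.strip())
        else:
            # Older Python versions do not accept a trailing "Z" in fromisoformat.
            parsed = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    except (TypeError, ValueError):
        return None
    # Assumption: treat dates without a timezone as UTC.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def parse_feed(xml_text):
    """Return a list of {title, link, published} dicts from RSS 2.0 or Atom."""
    root = ET.fromstring(xml_text)
    posts = []
    if root.tag == "rss":
        for item in root.iter("item"):
            posts.append({
                "title": (item.findtext("title") or "").strip(),
                "link": (item.findtext("link") or "").strip(),
                "published": _parse_date(item.findtext("pubDate"), rfc822=True),
            })
    elif root.tag == ATOM + "feed":
        for entry in root.iter(ATOM + "entry"):
            link = ""
            for el in entry.findall(ATOM + "link"):
                # A link without rel defaults to "alternate" in Atom.
                if el.get("rel", "alternate") == "alternate":
                    link = el.get("href", "")
                    break
            stamp = (entry.findtext(ATOM + "published")
                     or entry.findtext(ATOM + "updated"))
            posts.append({
                "title": (entry.findtext(ATOM + "title") or "").strip(),
                "link": link,
                "published": _parse_date(stamp, rfc822=False),
            })
    else:
        raise ValueError(f"unsupported feed root element: {root.tag}")
    return posts
```

Entries whose date cannot be parsed get published set to None, which leaves the decision about undated posts to the filtering step.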

Apify Mode


Uses Apify actor directly, skipping RSS discovery.
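
How the three modes relate can be summarized in a short dispatch sketch. The function scrape and its rss_fetch and apify_fetch callables are hypothetical stand-ins for whatever the script does internally. Requiring a token in apify mode is also an assumption, based on the note that the fallback only runs when a token is available.

```python
def scrape(urls, mode, token, rss_fetch, apify_fetch):
    """Dispatch between auto, rss, and apify modes.

    rss_fetch(url) returns a list of posts or raises on failure;
    apify_fetch(urls, token) returns a list of posts. Both are hypothetical.
    """
    if mode not in {"auto", "rss", "apify"}:
        raise ValueError(f"unknown mode: {mode}")

    if mode == "apify":
        # Apify mode skips RSS discovery entirely.
        if not token:
            raise ValueError("apify mode needs an Apify token")
        return apify_fetch(urls, token)

    posts, failed = [], []
    for url in urls:
        try:
            posts.extend(rss_fetch(url))
        except Exception:
            failed.append(url)  # remember URLs with no usable feed

    # Only auto mode falls back, and only when a token is available.
    if failed and mode == "auto" and token:
        posts.extend(apify_fetch(failed, token))
    return posts
```

In this sketch, rss mode silently skips URLs whose feeds cannot be found; a real tool would likely report them.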

CLI Reference


| Flag | Default | Description |
|------|---------|-------------|
| --urls | required | Blog URL(s), comma-separated |
| --keywords | none | Keywords to filter (comma-separated, OR logic) |
| --days | 30 | Only include posts from last N days |
| --max-posts | 50 | Max posts to return |
| --mode | auto | auto (RSS + fallback), rss (RSS only), apify (Apify only) |
| --output | json | Output format: json or summary |
| --token | env var | Apify token (only needed for Apify mode/fallback) |
| --timeout | 300 | Max seconds for Apify run |
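
The --keywords, --days, and --max-posts flags describe client-side filtering, which can be sketched as below. The function filter_posts is hypothetical. Three details are assumptions rather than documented behavior: keywords match case-insensitively as substrings of the title and summary, posts without a date are dropped, and results are sorted newest first.

```python
from datetime import datetime, timedelta, timezone


def filter_posts(posts, keywords=None, days=30, max_posts=50, now=None):
    """Keep recent posts matching ANY keyword (OR logic), newest first."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    wanted = [k.strip().lower() for k in (keywords or []) if k.strip()]

    kept = []
    for post in posts:
        published = post.get("published")
        # Assumption: posts without a usable date are excluded.
        if published is None or published < cutoff:
            continue
        text = f"{post.get('title', '')} {post.get('summary', '')}".lower()
        # OR logic: one matching keyword is enough; no keywords means no filter.
        if wanted and not any(k in text for k in wanted):
            continue
        kept.append(post)

    kept.sort(key=lambda p: p["published"], reverse=True)
    return kept[:max_posts]
```

Substring matching is simple but loose: a short keyword such as "AI" also matches inside longer words like "email", so a stricter variant would match on word boundaries.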

Cost


  • RSS mode: Free (no API, no tokens)
  • Apify mode: Uses jupri/rss-xml-scraper (minimal Apify credits)