China News Crawler Skill

Extracts article content from mainstream Chinese news platforms and outputs it in JSON and Markdown formats.

Self-contained and portable: this Skill bundles all of its own code and does not depend on any other code in the host project, so it can be copied directly into other projects. The only prerequisites are the Python packages listed under Dependency Requirements.

Supported Platforms

| Platform | ID | URL Example |
|----------|----|-------------|
| WeChat Official Account | wechat | `https://mp.weixin.qq.com/s/xxxxx` |
| Toutiao | toutiao | `https://www.toutiao.com/article/123456/` |
| NetEase News | netease | `https://www.163.com/news/article/ABC123.html` |
| Sohu News | sohu | `https://www.sohu.com/a/123456_789` |
| Tencent News | tencent | `https://news.qq.com/rain/a/20251016A07W8J00` |
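As a rough illustration of how a URL could be mapped to one of the platform IDs above, here is a minimal sketch that matches on hostname only. The `PLATFORM_HOSTS` mapping and the `detect_platform` function are hypothetical names built from the URL examples in the table; the Skill's actual `detector.py` is not shown here and may use different rules (for example, path patterns or additional hostnames).

```python
from typing import Optional
from urllib.parse import urlparse

# Hostnames copied from the URL examples in the table.
# This mapping is an illustrative assumption, not the Skill's detector.py.
PLATFORM_HOSTS = {
    "mp.weixin.qq.com": "wechat",
    "www.toutiao.com": "toutiao",
    "www.163.com": "netease",
    "www.sohu.com": "sohu",
    "news.qq.com": "tencent",
}


def detect_platform(url: str) -> Optional[str]:
    """Return the platform ID for a URL, or None if the host is not recognized."""
    host = (urlparse(url).hostname or "").lower()
    return PLATFORM_HOSTS.get(host)


if __name__ == "__main__":
    print(detect_platform("https://www.sohu.com/a/123456_789"))
```

An exact hostname match keeps the sketch simple; a real detector would probably also need to handle mobile subdomains and alternate URL shapes.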

Usage

Basic Usage

```bash
# Extract news, auto-detect platform, output JSON + Markdown
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL"

# Specify output directory
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL" --output ./output

# Output only JSON
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL" --format json

# Output only Markdown
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL" --format markdown

# List supported platforms
uv run .claude/skills/china-news-crawler/scripts/extract_news.py --list-platforms
```

Output Files

By default, the script writes both formats to the output directory (default: `./output`):

- `{news_id}.json`: structured JSON data
- `{news_id}.md`: Markdown-formatted article
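Downstream code can pick up the JSON file by its `news_id`. The following is a minimal sketch, assuming the `{news_id}.json` naming and the `./output` default described here; `load_article` is a hypothetical helper, not part of the Skill.

```python
import json
from pathlib import Path


def load_article(news_id: str, output_dir: str = "./output") -> dict:
    """Read the JSON file written for one article and return it as a dict."""
    path = Path(output_dir) / f"{news_id}.json"
    # Chinese text is expected, so read the file explicitly as UTF-8.
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```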

Workflow

1. Receive URL: the user provides a news link
2. Platform Detection: automatically identify the platform type
3. Content Extraction: call the corresponding crawler to retrieve and parse the content
4. Format Conversion: generate JSON and Markdown
5. Output Files: save to the specified directory

Output Formats

JSON Structure

```json
{
  "title": "Article Title",
  "news_url": "Original URL",
  "news_id": "Article ID",
  "meta_info": {
    "author_name": "Author/Source",
    "author_url": "",
    "publish_time": "2024-01-01 12:00"
  },
  "contents": [
    {"type": "text", "content": "Paragraph text", "desc": ""},
    {"type": "image", "content": "https://...", "desc": ""},
    {"type": "video", "content": "https://...", "desc": ""}
  ],
  "texts": ["Paragraph 1", "Paragraph 2"],
  "images": ["Image URL 1", "Image URL 2"],
  "videos": []
}
```
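To make the shape of this structure concrete, here is a sketch of matching data classes using only the standard library. The field names come from the JSON above; the class names (`MetaInfo`, `ContentItem`, `NewsItem`) and the `add` helper are hypothetical, and the Skill's own `models.py` (which may well be built on `pydantic`, one of the listed dependencies) is not reproduced here. The sketch also assumes that `texts`, `images`, and `videos` are flat views of the entries in `contents`, which the example suggests but does not state.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class MetaInfo:
    author_name: str = ""
    author_url: str = ""
    publish_time: str = ""


@dataclass
class ContentItem:
    type: str        # "text", "image", or "video"
    content: str     # paragraph text or media URL
    desc: str = ""


@dataclass
class NewsItem:
    title: str
    news_url: str
    news_id: str
    meta_info: MetaInfo = field(default_factory=MetaInfo)
    contents: List[ContentItem] = field(default_factory=list)
    texts: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)
    videos: List[str] = field(default_factory=list)

    def add(self, item: ContentItem) -> None:
        """Append an item and mirror its content into the matching flat list."""
        flat = {"text": self.texts, "image": self.images, "video": self.videos}
        self.contents.append(item)
        flat[item.type].append(item.content)


if __name__ == "__main__":
    news = NewsItem(title="Article Title", news_url="Original URL", news_id="Article ID")
    news.add(ContentItem("text", "Paragraph text"))
    print(asdict(news))
```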

Markdown Structure

```markdown
# Article Title

## Article Information
**Author**: xxx
**Publish Time**: 2024-01-01 12:00
**Original Link**: [Link](URL)

---

## Article Content

Paragraph content...

![Image](URL)

---

## Media Resources
### Images (N)
1. URL1
2. URL2
```
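The conversion from the JSON structure to this Markdown layout could look roughly like the sketch below. `to_markdown` is a hypothetical function written against the template shown here; it is not the Skill's `formatter.py`, and it leaves out videos because the template does not show how they are rendered.

```python
def to_markdown(article: dict) -> str:
    """Render an article dict (JSON structure) into the Markdown layout."""
    meta = article.get("meta_info", {})
    lines = [
        f"# {article['title']}",
        "",
        "## Article Information",
        f"**Author**: {meta.get('author_name', '')}",
        f"**Publish Time**: {meta.get('publish_time', '')}",
        f"**Original Link**: [Link]({article['news_url']})",
        "",
        "---",
        "",
        "## Article Content",
        "",
    ]
    # Body: paragraphs and inline images, in their original order.
    for item in article.get("contents", []):
        if item["type"] == "text":
            lines += [item["content"], ""]
        elif item["type"] == "image":
            lines += [f"![Image]({item['content']})", ""]
    # Trailing numbered list of all image URLs, only if there are any.
    images = article.get("images", [])
    if images:
        lines += ["---", "", "## Media Resources", f"### Images ({len(images)})"]
        lines += [f"{i}. {url}" for i, url in enumerate(images, 1)]
    return "\n".join(lines)
```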

Usage Examples

Extract WeChat Official Account Article

```bash
uv run .claude/skills/china-news-crawler/scripts/extract_news.py \
  "https://mp.weixin.qq.com/s/ebMzDPu2zMT_mRgYgtL6eQ"
```

Output:

```
[INFO] Platform detected: wechat (WeChat Official Account)
[INFO] Extracting content...
[INFO] Title: Article Title
[INFO] Author: Official Account Name
[INFO] Text paragraphs: 15
[INFO] Images: 3
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.json
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.md
```

Extract Toutiao Article

```bash
uv run .claude/skills/china-news-crawler/scripts/extract_news.py \
  "https://www.toutiao.com/article/7434425099895210546/"
```

Dependency Requirements

This Skill requires the following Python packages (usually pre-installed in the main project):

- `parsel`
- `pydantic`
- `requests`
- `curl-cffi`
- `tenacity`
- `demjson3`

Error Handling

| Error Type | Description | Solution |
|------------|-------------|----------|
| `Unrecognized Platform` | URL does not match any supported platform | Check that the URL is correct |
| `Unsupported Platform` | Non-Chinese site | This Skill only supports Chinese news sites |
| `Extraction Failed` | Network error or page structure change | Retry or check that the URL is valid |
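Since only `Extraction Failed` is worth retrying (a wrong or unsupported URL will not fix itself), a caller might separate the error types along these lines. This is a sketch under assumptions: the exception classes and `extract_with_retry` are hypothetical names, and the retry loop is written with the standard library rather than `tenacity` so that it stays self-contained. How the Skill itself raises and retries these errors is not shown here.

```python
import time


class UnrecognizedPlatformError(ValueError):
    """URL does not match any supported platform (hypothetical class)."""


class ExtractionFailedError(RuntimeError):
    """Network error or page structure change (hypothetical class)."""


def extract_with_retry(extract, url, attempts=3, delay=1.0):
    """Call extract(url), retrying only when it raises ExtractionFailedError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return extract(url)
        except ExtractionFailedError as exc:
            last_error = exc
            if attempt < attempts:
                # Linear backoff: wait a little longer after each failure.
                time.sleep(delay * attempt)
    raise last_error
```

Any other exception, such as the platform errors, propagates immediately without a retry.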

Notes

- For educational and research purposes only
- Do not perform large-scale crawling
- Respect the target website's robots.txt and terms of service
- WeChat Official Accounts may require valid cookies (the current default configuration usually works)
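One way to act on the robots.txt note before fetching a page is Python's standard `urllib.robotparser`. The sketch below checks a URL against robots.txt text that has already been downloaded; `is_allowed` is a hypothetical helper, and nothing here implies that the Skill performs this check itself.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "https://example.com/news/1.html"))
```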

Directory Structure

```
china-news-crawler/
├── SKILL.md                      # [Required] Skill definition file
├── references/
│   └── platform-patterns.md      # Platform URL pattern description
└── scripts/
    ├── extract_news.py           # CLI entry script
    ├── models.py                 # Data models
    ├── detector.py               # Platform detection
    ├── formatter.py              # Markdown formatting
    └── crawlers/                 # Crawler modules
        ├── __init__.py
        ├── base.py               # BaseNewsCrawler base class
        ├── fetchers.py           # HTTP fetching strategies
        ├── wechat.py             # WeChat Official Accounts
        ├── toutiao.py            # Toutiao
        ├── netease.py            # NetEase News
        ├── sohu.py               # Sohu News
        └── tencent.py            # Tencent News
```

References

- Platform URL Pattern Description: `references/platform-patterns.md`
