# News Extractor Skill

Extract article content from mainstream news platforms and output in JSON and Markdown formats.
## Supported Platforms

| Platform | ID | URL Example |
|---|---|---|
| WeChat Official Accounts | wechat | `https://mp.weixin.qq.com/s/...` |
| Toutiao | toutiao | `https://www.toutiao.com/article/.../` |
| NetEase News | netease | |
| Sohu News | sohu | |
| Tencent News | tencent | |
## Dependency Installation

This skill uses uv for dependency management. Install dependencies before first use:

```bash
cd ~/.claude/skills/news-extractor
uv sync
```

**Important:** All scripts must be executed with `uv run`, not directly with `python`. `uv run` automatically uses dependencies from the project's virtual environment.

### Dependency List
| Package | Purpose |
|---|---|
| pydantic | Data model validation |
| requests | HTTP requests |
| curl_cffi | Browser simulation crawling |
| tenacity | Retry mechanism |
| parsel | HTML/XPath parsing |
| demjson3 | Non-standard JSON parsing |
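Since uv resolves dependencies from `pyproject.toml`, the list above would correspond to a project table roughly like the following. This is a hypothetical fragment, not the skill's actual file; the `name` and `requires-python` values are assumptions, and no version constraints are shown:

```toml
[project]
name = "news-extractor"
requires-python = ">=3.10"
dependencies = [
    "pydantic",
    "requests",
    "curl_cffi",
    "tenacity",
    "parsel",
    "demjson3",
]
```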
## Usage

### Basic Usage

```bash
# Extract news, auto-detect platform, output JSON + Markdown
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL"

# Specify output directory
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --output ./output

# Output only JSON
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --format json

# Output only Markdown
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --format markdown

# List supported platforms
uv run .claude/skills/news-extractor/scripts/extract_news.py --list-platforms
```
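The CLI surface above (a URL argument plus `--output`, `--format`, and `--list-platforms`) could be declared with `argparse` roughly as follows. This is a sketch mirroring the documented flags, not the script's actual source; the default values and `choices` list are assumptions:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented extract_news.py flags."""
    parser = argparse.ArgumentParser(prog="extract_news.py")
    # URL is optional so that --list-platforms can run without one.
    parser.add_argument("url", nargs="?", help="News article URL")
    parser.add_argument("--output", default="./output",
                        help="Output directory (default: ./output)")
    parser.add_argument("--format", choices=["json", "markdown", "both"],
                        default="both", help="Which format(s) to write")
    parser.add_argument("--list-platforms", action="store_true",
                        help="List supported platforms and exit")
    return parser
```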
## Output Files

By default, the script writes two formats to the specified directory (default `./output`):

- `{news_id}.json` - structured JSON data
- `{news_id}.md` - Markdown-formatted article
## Workflow

1. **Receive URL** - User provides a news link
2. **Platform Detection** - Automatically identify the platform type
3. **Content Extraction** - Call the corresponding crawler to fetch and parse content
4. **Format Conversion** - Generate JSON and Markdown
5. **Output Files** - Save to the specified directory
## Output Formats

### JSON Structure

```json
{
  "title": "Article Title",
  "news_url": "Original Link",
  "news_id": "Article ID",
  "meta_info": {
    "author_name": "Author/Source",
    "author_url": "",
    "publish_time": "2024-01-01 12:00"
  },
  "contents": [
    {"type": "text", "content": "Paragraph text", "desc": ""},
    {"type": "image", "content": "https://...", "desc": ""},
    {"type": "video", "content": "https://...", "desc": ""}
  ],
  "texts": ["Paragraph 1", "Paragraph 2"],
  "images": ["Image URL 1", "Image URL 2"],
  "videos": []
}
```
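For reference, the JSON structure above maps onto data models roughly like these. This is a stdlib `dataclasses` sketch for illustration only; since the skill lists pydantic as a dependency, its actual models are presumably pydantic `BaseModel`s:

```python
from dataclasses import dataclass, field


@dataclass
class ContentBlock:
    type: str       # "text" | "image" | "video"
    content: str    # paragraph text, or a media URL
    desc: str = ""  # optional caption/description


@dataclass
class MetaInfo:
    author_name: str = ""
    author_url: str = ""
    publish_time: str = ""


@dataclass
class NewsArticle:
    title: str
    news_url: str
    news_id: str
    meta_info: MetaInfo = field(default_factory=MetaInfo)
    contents: list[ContentBlock] = field(default_factory=list)
    texts: list[str] = field(default_factory=list)
    images: list[str] = field(default_factory=list)
    videos: list[str] = field(default_factory=list)
```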
### Markdown Structure
```markdown
# Article Title

## Article Information

- Author: xxx
- Publish Time: 2024-01-01 12:00
- Original Link: Link

## Article Content

Paragraph content...

## Media Resources

### Images (N)

- URL1
- URL2
```

## Usage Examples
### Extract WeChat Official Account Article

```bash
uv run .claude/skills/news-extractor/scripts/extract_news.py \
  "https://mp.weixin.qq.com/s/ebMzDPu2zMT_mRgYgtL6eQ"
```

Output:

```
[INFO] Platform detected: wechat (WeChat Official Accounts)
[INFO] Extracting content...
[INFO] Title: Article Title
[INFO] Author: Official Account Name
[INFO] Text paragraphs: 15
[INFO] Images: 3
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.json
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.md
```

### Extract Toutiao Article
```bash
uv run .claude/skills/news-extractor/scripts/extract_news.py \
  "https://www.toutiao.com/article/7434425099895210546/"
```

## Error Handling
| Description | Solution |
|---|---|
| URL does not match any supported platform | Check that the URL is correct |
| Unsupported site | This skill only supports the listed news sites |
| Network error or page structure change | Retry, or check that the URL is valid |
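The "retry" advice for transient network errors can be illustrated with a minimal stdlib helper. This is a sketch only; the skill depends on tenacity, which provides the same behavior (backoff, exception filters, stop conditions) with far more control:

```python
import time


def with_retries(fn, attempts: int = 3, delay: float = 0.0):
    """Call fn() up to `attempts` times, re-raising the last failure."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:  # in practice, catch network errors only
            last_exc = exc
            time.sleep(delay)
    raise last_exc
```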
## Notes

- For educational and research purposes only
- Do not perform large-scale crawling
- Respect the target website's robots.txt and terms of service
- WeChat Official Accounts may require valid Cookies (the default configuration usually works)
## References

- Platform URL Pattern Instructions