article-extractor


Article Extractor


This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter, then saves clean, readable text.

When to Use This Skill


Activate when the user:
  • Provides an article/blog URL and wants the text content
  • Asks to "download this article"
  • Wants to "extract the content from [URL]"
  • Asks to "save this blog post as text"
  • Needs clean article text without distractions

How It Works


Priority Order:


  1. Check if tools are installed (reader or trafilatura)
  2. Download and extract article using best available tool
  3. Clean up the content (remove extra whitespace, format properly)
  4. Save to file with article title as filename
  5. Confirm location and show preview

Installation Check


Check for article extraction tools in this order:

Option 1: reader (Recommended - Mozilla's Readability)


```bash
command -v reader
```

If not installed:

```bash
npm install -g @mozilla/readability-cli
```

or

```bash
npm install -g reader-cli
```

Option 2: trafilatura (Python-based, very good)


```bash
command -v trafilatura
```

If not installed:

```bash
pip3 install trafilatura
```

Option 3: Fallback (curl + simple parsing)


If neither tool is available, fall back to basic `curl` + text extraction (less reliable, but requires no dependencies).

Extraction Methods


Method 1: Using reader (Best for most articles)


```bash
# Extract article
reader "URL" > article.txt
```

**Pros:**
- Based on Mozilla's Readability algorithm
- Excellent at removing clutter
- Preserves article structure

Method 2: Using trafilatura (Best for blogs/news)


```bash
# Extract article
trafilatura --URL "URL" --output-format txt > article.txt

# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt
```

**Pros:**
- Very accurate extraction
- Good with various site structures
- Handles multiple languages

**Options:**
- `--no-comments`: Skip comment sections
- `--no-tables`: Skip data tables
- `--precision`: Favor precision over recall
- `--recall`: Extract more content (may include some noise)

Method 3: Fallback (curl + basic parsing)


```bash
# Download and extract basic content
curl -s "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
        self.current_tag = None

    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}:
                self.in_content = True
                self.current_tag = tag

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt
```

**Note:** This is less reliable, but works without dependencies.

Getting Article Title


Extract the title to use as the filename:

Using reader:


```bash
# reader outputs markdown with the title at the top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')
```

Using trafilatura:


```bash
# Get metadata including the title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")
```

Using curl (fallback):


```bash
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
```
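As a worked example of this fallback pipeline (the HTML string here is illustrative, and note that `grep -P` requires GNU grep with PCRE support):

```bash
# Illustrative HTML; in practice this comes from curl
HTML='<html><head><title>My Post - Example Site</title></head><body></body></html>'
# \K discards the matched '<title>' prefix, keeping only the title text
TITLE=$(echo "$HTML" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
echo "$TITLE"
# → My Post
```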

Filename Creation


Clean the title for the filesystem:

```bash
# Get title
TITLE="Article Title from Website"

# Clean for filesystem (replace separators, drop forbidden characters, limit length)
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr -d '?"<>' | tr '|' '-' | cut -c 1-100 | sed 's/ *$//')

# Add extension
FILENAME="${FILENAME}.txt"
```
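A worked example of the cleaning pipeline, using `tr -d` for characters that should be dropped outright (the title is illustrative):

```bash
# Illustrative title containing several filesystem-unsafe characters
TITLE='What is "AI"? Notes: Part 1 | My Blog'
# Replace / : | with '-', delete ? " < > entirely, cap length, trim trailing spaces
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr -d '?"<>' | tr '|' '-' | cut -c 1-100 | sed 's/ *$//')
FILENAME="${FILENAME}.txt"
echo "$FILENAME"
# → What is AI Notes- Part 1 - My Blog.txt
```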

Complete Workflow


```bash
ARTICLE_URL="https://example.com/article"

# Check for tools
if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt
        # Get title (first line after # in markdown)
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;
    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")
        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;
    fallback)
        # Get title
        TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}  # Remove site name
        TITLE=${TITLE%% | *}  # Remove site name (alternate separator)
        # Get content (basic extraction)
        curl -s "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}

    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main'}:
                self.in_content = True
            if tag in {'h1', 'h2', 'h3'}:
                self.content.append('\n')

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
        ;;
esac

# Clean filename
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr -d '?"<>' | tr '|' '-' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
FILENAME="${FILENAME}.txt"

# Move to final filename
mv temp_article.txt "$FILENAME"

# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"
```

Error Handling


Common Issues


1. Tool not installed
  • Try alternate tool (reader → trafilatura → fallback)
  • Offer to install: "Install reader with: npm install -g reader-cli"
2. Paywall or login required
  • Extraction tools may fail
  • Inform user: "This article requires authentication. Cannot extract."
3. Invalid URL
  • Check URL format
  • Try with and without redirects
4. No content extracted
  • Site may use heavy JavaScript
  • Try fallback method
  • Inform user if extraction fails
5. Special characters in title
  • Clean the title for the filesystem
  • Strip or replace `/`, `:`, `?`, `"`, `<`, `>`, `|`
  • Replace with `-` or remove entirely
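For issue 4 above, a minimal guard that fails fast when extraction produced nothing (a sketch; the function name and wording are assumptions, not part of the original workflow):

```bash
# Succeeds only when the file exists and is non-empty (the -s test)
check_extraction() {
    if [ ! -s "$1" ]; then
        echo "Error: no content extracted from '$1'. The site may require JavaScript or a login."
        return 1
    fi
}
```

Call it right after extraction, e.g. `check_extraction temp_article.txt || exit 1`.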

Output Format


Saved File Contains:


  • Article title (if available)
  • Author (if available from tool)
  • Main article text
  • Section headings
  • No navigation, ads, or clutter

What Gets Removed:


  • Navigation menus
  • Ads and promotional content
  • Newsletter signup forms
  • Related articles sidebars
  • Comment sections (optional)
  • Social media buttons
  • Cookie notices

Tips for Best Results


1. Use reader for most articles
  • Best all-around tool
  • Based on Firefox Reader View
  • Works on most news sites and blogs
2. Use trafilatura for:
  • Academic articles
  • News sites
  • Blogs with complex layouts
  • Non-English content
3. Fallback method limitations:
  • May include some noise
  • Less accurate paragraph detection
  • Better than nothing for simple sites
4. Check extraction quality:
  • Always show preview to user
  • Ask if it looks correct
  • Offer to try different tool if needed
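One way to implement tip 4's quality check is a word-count heuristic (a sketch; the 50-word threshold and function name are assumptions):

```bash
# Warn when the extracted article looks suspiciously short
check_quality() {
    local file="$1" min="${2:-50}"
    local words
    words=$(wc -w < "$file")
    if [ "$words" -lt "$min" ]; then
        echo "Warning: only $words words extracted - consider trying another tool."
        return 1
    fi
}
```

A failing check is a good prompt to re-run extraction with the next tool in the priority order.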

Example Usage


Simple extraction:

```bash
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"
```

**With error handling:**

```bash
if ! reader "$URL" > temp.txt 2>/dev/null; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura."
        exit 1
    fi
fi
```

Best Practices


  • ✅ Always show preview after extraction (first 10 lines)
  • ✅ Verify extraction succeeded before saving
  • ✅ Clean filename for filesystem compatibility
  • ✅ Try fallback method if primary fails
  • ✅ Inform user which tool was used
  • ✅ Keep filename length reasonable (< 100 chars)

After Extraction


Display to user:
  1. "✓ Extracted: [Article Title]"
  2. "✓ Saved to: [filename]"
  3. Show preview (first 10-15 lines)
  4. File size and location
Ask if needed:
  • "Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
  • "Should I extract another article?"