article-extractor


Article Extractor


This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter, then saves clean, readable text.

When to Use This Skill


Activate when the user:
  • Provides an article/blog URL and wants the text content
  • Asks to "download this article"
  • Wants to "extract the content from [URL]"
  • Asks to "save this blog post as text"
  • Needs clean article text without distractions

How It Works


Priority Order:


  1. Check if tools are installed (reader or trafilatura)
  2. Download and extract article using best available tool
  3. Clean up the content (remove extra whitespace, format properly)
  4. Save to file with article title as filename
  5. Confirm location and show preview

Installation Check


Check for article extraction tools in this order:

Option 1: reader (Recommended - Mozilla's Readability)


```bash
command -v reader
```

If not installed:

```bash
npm install -g @mozilla/readability-cli
```

or

```bash
npm install -g reader-cli
```

Option 2: trafilatura (Python-based, very good)


```bash
command -v trafilatura
```

If not installed:

```bash
pip3 install trafilatura
```

Option 3: Fallback (curl + simple parsing)


If neither tool is available, fall back to basic `curl` + text extraction (less reliable, but requires no dependencies).

Extraction Methods


Method 1: Using reader (Best for most articles)


```bash
# Extract article
reader "URL" > article.txt
```

**Pros:**
- Based on Mozilla's Readability algorithm
- Excellent at removing clutter
- Preserves article structure

Method 2: Using trafilatura (Best for blogs/news)


```bash
# Extract article
trafilatura --URL "URL" --output-format txt > article.txt

# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt
```

**Pros:**
- Very accurate extraction
- Good with various site structures
- Handles multiple languages

**Options:**
- `--no-comments`: Skip comment sections
- `--no-tables`: Skip data tables
- `--precision`: Favor precision over recall
- `--recall`: Extract more content (may include some noise)

Method 3: Fallback (curl + basic parsing)


```bash
# Download and extract basic content
curl -s "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
        self.current_tag = None

    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}:
                self.in_content = True
                self.current_tag = tag

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt
```

**Note:** This is less reliable, but works without dependencies.

Getting Article Title


Extract the title to use as the filename:

Using reader:


```bash
# reader outputs markdown with the title at the top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')
```

Using trafilatura:


```bash
# Get metadata including the title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")
```

Using curl (fallback):


```bash
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
```
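As a worked example of this fallback pipeline (the HTML string here is illustrative, and note that `grep -P` requires GNU grep with PCRE support):

```bash
# Illustrative HTML; in practice this comes from curl
HTML='<html><head><title>My Post - Example Site</title></head><body></body></html>'
# \K discards the matched '<title>' prefix, keeping only the title text
TITLE=$(echo "$HTML" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
echo "$TITLE"
# → My Post
```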

Filename Creation


Clean the title for the filesystem:

```bash
# Get title
TITLE="Article Title from Website"

# Clean for filesystem (replace separators, drop forbidden characters, limit length)
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr -d '?"<>' | tr '|' '-' | cut -c 1-100 | sed 's/ *$//')

# Add extension
FILENAME="${FILENAME}.txt"
```
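A worked example of the cleaning pipeline, using `tr -d` for characters that should be dropped outright (the title is illustrative):

```bash
# Illustrative title containing several filesystem-unsafe characters
TITLE='What is "AI"? Notes: Part 1 | My Blog'
# Replace / : | with '-', delete ? " < > entirely, cap length, trim trailing spaces
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr -d '?"<>' | tr '|' '-' | cut -c 1-100 | sed 's/ *$//')
FILENAME="${FILENAME}.txt"
echo "$FILENAME"
# → What is AI Notes- Part 1 - My Blog.txt
```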

Complete Workflow


```bash
ARTICLE_URL="https://example.com/article"

# Check for tools
if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt
        # Get title (first line after # in markdown)
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;
    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")
        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;
    fallback)
        # Get title
        TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}  # Remove site name
        TITLE=${TITLE%% | *}  # Remove site name (alternate separator)
        # Get content (basic extraction)
        curl -s "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}

    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main'}:
                self.in_content = True
            if tag in {'h1', 'h2', 'h3'}:
                self.content.append('\n')

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
        ;;
esac

# Clean filename
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr -d '?"<>' | tr '|' '-' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
FILENAME="${FILENAME}.txt"

# Move to final filename
mv temp_article.txt "$FILENAME"

# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"
```

Error Handling


Common Issues


1. Tool not installed
  • Try alternate tool (reader → trafilatura → fallback)
  • Offer to install: "Install reader with: npm install -g reader-cli"
2. Paywall or login required
  • Extraction tools may fail
  • Inform user: "This article requires authentication. Cannot extract."
3. Invalid URL
  • Check URL format
  • Try with and without redirects
4. No content extracted
  • Site may use heavy JavaScript
  • Try fallback method
  • Inform user if extraction fails
5. Special characters in title
  • Clean the title for the filesystem
  • Strip or replace `/`, `:`, `?`, `"`, `<`, `>`, `|`
  • Replace with `-` or remove entirely
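For issue 4 above, a minimal guard that fails fast when extraction produced nothing (a sketch; the function name and wording are assumptions, not part of the original workflow):

```bash
# Succeeds only when the file exists and is non-empty (the -s test)
check_extraction() {
    if [ ! -s "$1" ]; then
        echo "Error: no content extracted from '$1'. The site may require JavaScript or a login."
        return 1
    fi
}
```

Call it right after extraction, e.g. `check_extraction temp_article.txt || exit 1`.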

Output Format


Saved File Contains:


  • Article title (if available)
  • Author (if available from tool)
  • Main article text
  • Section headings
  • No navigation, ads, or clutter

What Gets Removed:


  • Navigation menus
  • Ads and promotional content
  • Newsletter signup forms
  • Related articles sidebars
  • Comment sections (optional)
  • Social media buttons
  • Cookie notices

Tips for Best Results


1. Use reader for most articles
  • Best all-around tool
  • Based on Firefox Reader View
  • Works on most news sites and blogs
2. Use trafilatura for:
  • Academic articles
  • News sites
  • Blogs with complex layouts
  • Non-English content
3. Fallback method limitations:
  • May include some noise
  • Less accurate paragraph detection
  • Better than nothing for simple sites
4. Check extraction quality:
  • Always show preview to user
  • Ask if it looks correct
  • Offer to try different tool if needed
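One way to implement tip 4's quality check is a word-count heuristic (a sketch; the 50-word threshold and function name are assumptions):

```bash
# Warn when the extracted article looks suspiciously short
check_quality() {
    local file="$1" min="${2:-50}"
    local words
    words=$(wc -w < "$file")
    if [ "$words" -lt "$min" ]; then
        echo "Warning: only $words words extracted - consider trying another tool."
        return 1
    fi
}
```

A failing check is a good prompt to re-run extraction with the next tool in the priority order.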

Example Usage


Simple extraction:

```bash
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"
```

**With error handling:**

```bash
if ! reader "$URL" > temp.txt 2>/dev/null; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura."
        exit 1
    fi
fi
```

Best Practices


  • ✅ Always show preview after extraction (first 10 lines)
  • ✅ Verify extraction succeeded before saving
  • ✅ Clean filename for filesystem compatibility
  • ✅ Try fallback method if primary fails
  • ✅ Inform user which tool was used
  • ✅ Keep filename length reasonable (< 100 chars)

After Extraction


Display to user:
  1. "✓ Extracted: [Article Title]"
  2. "✓ Saved to: [filename]"
  3. Show preview (first 10-15 lines)
  4. File size and location
Ask if needed:
  • "Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
  • "Should I extract another article?"