- Check for `reader`: `command -v reader`
- Install `reader`: `npm install -g @mozilla/readability-cli`
- Check for `trafilatura`: `command -v trafilatura`
- Install `trafilatura`: `pip3 install trafilatura`

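The same availability check can be done from Python. This is a minimal sketch, assuming the command names `reader` and `trafilatura` shown above; the helper `pick_extractor`, its `which` parameter, and the preference order are hypothetical and introduced only for illustration.

```python
import shutil

# Assumed preference order: try `reader` first, then `trafilatura`.
EXTRACTORS = ("reader", "trafilatura")

def pick_extractor(which=shutil.which):
    """Return the name of the first extractor found on PATH, or None."""
    for name in EXTRACTORS:
        if which(name) is not None:
            return name
    return None
```

Calling `pick_extractor()` with no arguments consults the real `PATH`; passing a custom `which` makes the lookup easy to test without installing either tool.
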
**Pros (`reader`):**
- Based on Mozilla's Readability algorithm
- Excellent at removing clutter
- Preserves article structure

**Pros (`trafilatura`):**
- Very accurate extraction
- Good with various site structures
- Handles multiple languages

**Options:**
- `--no-comments`: Skip comment sections
- `--no-tables`: Skip data tables
- `--precision`: Favor precision over recall
- `--recall`: Extract more content (may include some noise)
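
The options above can be combined on one command line. The sketch below builds that command from Python; it uses only the flags listed above plus `--URL` and `--output-format txt`, which appear in this document's error-handling example. The function names `build_trafilatura_cmd` and `extract`, and the `comments`, `tables`, and `mode` parameters, are hypothetical.

```python
import subprocess

def build_trafilatura_cmd(url, comments=True, tables=True, mode=None):
    """Build an argv list for the trafilatura CLI from the options above."""
    cmd = ["trafilatura", "--URL", url, "--output-format", "txt"]
    if not comments:
        cmd.append("--no-comments")   # skip comment sections
    if not tables:
        cmd.append("--no-tables")     # skip data tables
    if mode in ("precision", "recall"):
        cmd.append(f"--{mode}")       # less noise vs. more content
    elif mode is not None:
        raise ValueError("mode must be 'precision', 'recall', or None")
    return cmd

def extract(url, **options):
    """Run trafilatura and return its output (needs trafilatura on PATH)."""
    result = subprocess.run(build_trafilatura_cmd(url, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `extract("https://example.com/article", comments=False, mode="precision")` would favor a clean result over completeness.
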

```python
# Methods of an html.parser.HTMLParser subclass (class header and __init__ not shown)
def handle_starttag(self, tag, attrs):
    # Start collecting text once a content-bearing tag opens
    if tag not in self.skip_tags:
        if tag in {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}:
            self.in_content = True
            self.current_tag = tag

def handle_data(self, data):
    # Keep non-empty text seen after content has started
    if self.in_content and data.strip():
        self.content.append(data.strip())

def get_content(self):
    # Join the collected fragments as paragraphs
    return '\n\n'.join(self.content)
```

**Note:** This approach is less reliable, but it works without any dependencies.
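
The handler methods above are fragments, so here is one way they might fit into a complete, runnable parser. This is a sketch under stated assumptions: the class name `ArticleExtractor`, the contents of `skip_tags`, the `skip_depth` counter, the `handle_endtag` method, and the `extract_text` helper are not in the original snippet and are added only so the example runs on its own.

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    CONTENT_TAGS = {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

    def __init__(self):
        super().__init__()
        # Assumed set of elements whose text should be ignored
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
        self.skip_depth = 0       # > 0 while inside a skipped element
        self.in_content = False
        self.content = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in self.CONTENT_TAGS:
            self.in_content = True

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only after content has started and outside skipped elements
        if self.in_content and self.skip_depth == 0 and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

def extract_text(html):
    """Feed an HTML string through the parser and return the collected text."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_content()
```

As in the original fragment, `in_content` is never reset, and text inside inline tags such as `<b>` becomes its own paragraph, which is part of why this approach is less reliable than a dedicated extractor.
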

```bash
# Page title with a trailing " - Site" or " | Site" suffix removed (-P needs GNU grep)
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
```

```bash
ARTICLE_URL="https://example.com/article"
```

```python
# Methods of an html.parser.HTMLParser subclass (class header and __init__ not shown)
def handle_starttag(self, tag, attrs):
    if tag not in self.skip_tags:
        # Start collecting text once a content-bearing tag opens
        if tag in {'p', 'article', 'main'}:
            self.in_content = True
        # Add a line break before each heading
        if tag in {'h1', 'h2', 'h3'}:
            self.content.append('\n')

def handle_data(self, data):
    # Keep non-empty text seen after content has started
    if self.in_content and data.strip():
        self.content.append(data.strip())

def get_content(self):
    # Join the collected fragments as paragraphs
    return '\n\n'.join(self.content)
```

Filename cleanup: the characters `/`, `:`, `?`, `"`, `<`, `>`, `|` are replaced with `-`.

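
The title cleanup done by the `grep`/`sed` pipeline can be mirrored in Python, which avoids the dependency on GNU grep. This is a rough sketch: the function names `clean_title` and `safe_filename` are hypothetical, and the character set `/:?"<>|` with `-` as the replacement is an assumption read from the fragment above.

```python
import re

# Assumed set of characters that are unsafe in filenames
UNSAFE_CHARS = '/:?"<>|'

def clean_title(html):
    """Return the <title> text without a trailing ' - Site' or ' | Site' suffix."""
    match = re.search(r'<title>([^<]+)', html)
    if match is None:
        return None
    title = match.group(1)
    title = re.sub(r' - .*', '', title)    # mirrors sed 's/ - .*//'
    title = re.sub(r' \| .*', '', title)   # mirrors sed 's/ | .*//'
    return title.strip()

def safe_filename(title, replacement='-'):
    """Replace each unsafe character with the replacement character."""
    return ''.join(replacement if ch in UNSAFE_CHARS else ch for ch in title)
```

A page titled `How to Cook - My Blog` would yield `How to Cook`, which `safe_filename` leaves unchanged because it contains none of the listed characters.
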
**With error handling:**
```bash
if ! reader "$URL" > temp.txt 2>/dev/null; then
    # reader failed or is not installed: fall back to trafilatura if available
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura." >&2
        exit 1
    fi
fi
```
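
The same fallback chain could be written in Python. This is a sketch rather than a drop-in replacement: the function name `extract_article` and the injectable `run` and `which` parameters are hypothetical, and unlike the shell version it also treats empty output as a failure.

```python
import shutil
import subprocess

def extract_article(url, run=subprocess.run, which=shutil.which):
    """Try `reader` first, then `trafilatura`; return the extracted text."""
    if which("reader"):
        result = run(["reader", url], capture_output=True, text=True)
        # Assumption: empty output counts as a failed extraction
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout
    if which("trafilatura"):
        result = run(["trafilatura", "--URL", url, "--output-format", "txt"],
                     capture_output=True, text=True)
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout
    raise RuntimeError("Could not extract article. Install reader or trafilatura.")
```

Passing `run` and `which` as parameters keeps the function testable without network access or either tool installed.
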