web-download

Web Download

Overview

Conduct web research for each node in node-list.txt, then collect and save verifiable, traceable reference materials. Multiple sub-agents work in parallel, each responsible for collecting materials for one or more nodes.

Workflow

1. Configure Parameters (Mandatory Before Starting)

Use the AskUserQuestion tool to ask the user for the configuration parameters below, so that the API call frequency stays reasonable.

Question: How many sub-agents should run in parallel?
Options:
- 1: Most conservative, suitable for resource-constrained scenarios
- 2: Recommended default, balances efficiency and stability
- 3: Moderate, suitable when there are many nodes
- Agent decides: adjusts to the number of nodes (max 3)

Question: What is the maximum number of Web Search calls per node?
Options:
- 1: Quickly collects basic information
- 2: Recommended default, balances coverage and efficiency
- 3: In-depth collection, suitable for important nodes

Question: What is the maximum number of Web Fetch calls per search?
Options:
- 1: Reads only the most relevant result
- 2: Reads the top 2 relevant results
- 3: Recommended default, covers the search results thoroughly

Question: What is the maximum number of webpages/documents to save per search?
Options:
- 1: Saves only the most relevant material
- 2: Saves the top 2 relevant materials
- 3: Recommended default, ensures diverse materials

Default Configuration (to avoid excessive API calls):
  • Number of sub-agents: Max 2
  • Searches per node: Max 2
  • Web Fetch calls per search: Max 3
  • Saved materials per search: Max 3
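
The defaults and limits above could be captured in a small configuration object. This is only a sketch: the names `ResearchConfig`, `MAX_OPTION`, and `auto_sub_agents` are illustrative assumptions and are not part of the skill itself. The cap of 3 reflects the largest option offered in each question, and the "agent decides" heuristic shown here (one sub-agent per node, capped at 3) is one possible reading of that option.

```python
from dataclasses import dataclass

MAX_OPTION = 3  # largest value offered by any of the four questions


@dataclass(frozen=True)
class ResearchConfig:
    """Hypothetical container for the four user-configurable limits."""

    sub_agents: int = 2          # default: at most 2 sub-agents
    searches_per_node: int = 2   # default: at most 2 Web Searches per node
    fetches_per_search: int = 3  # default: at most 3 Web Fetches per search
    saves_per_search: int = 3    # default: at most 3 saved pages per search

    def __post_init__(self) -> None:
        # Reject any value outside the range the questions offer.
        for name, value in vars(self).items():
            if not 1 <= value <= MAX_OPTION:
                raise ValueError(
                    f"{name} must be between 1 and {MAX_OPTION}, got {value}"
                )


def auto_sub_agents(node_count: int) -> int:
    """Assumed 'agent decides' heuristic: one sub-agent per node, capped at 3."""
    return max(1, min(node_count, MAX_OPTION))
```

Calling `ResearchConfig()` with no arguments yields the default configuration, while an out-of-range value such as `ResearchConfig(sub_agents=4)` raises a `ValueError`.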

2. Read Node List

Read the list of nodes to process from node-list.txt.

3. Parallel Research Strategy

Launch sub-agents according to the user configuration (run them in parallel with the Task tool):
  • Never launch more sub-agents than the user-specified limit
  • Each sub-agent handles one or more nodes
  • Distribute the node list evenly among the sub-agents
Example Allocation (6 nodes, 2 sub-agents):
Sub-agent 1: Node 1, Node 2, Node 3
Sub-agent 2: Node 4, Node 5, Node 6
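
The even distribution in the example allocation can be sketched as a small helper. The function name `distribute_nodes` is hypothetical, and splitting into contiguous chunks is an assumption chosen to match the example; any other even split would satisfy the rules equally well.

```python
def distribute_nodes(nodes: list[str], sub_agents: int) -> list[list[str]]:
    """Split nodes into contiguous, near-equal chunks, one per sub-agent."""
    if sub_agents < 1:
        raise ValueError("sub_agents must be at least 1")
    agents = min(sub_agents, len(nodes))  # never start an idle sub-agent
    if agents == 0:
        return []
    base, extra = divmod(len(nodes), agents)
    chunks, start = [], 0
    for i in range(agents):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(nodes[start:start + size])
        start += size
    return chunks
```

With six nodes and two sub-agents this returns two chunks of three nodes each, as in the example allocation. When the node count does not divide evenly, the earlier sub-agents receive one extra node.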

4. Deep Retrieval Method (Strict Restrictions)

Search Strategy (search count strictly limited):
  • Run at most the user-configured number of Web Searches per node
  • Prefer different keyword combinations to obtain diverse results
  • Include both Chinese and English searches (within the limit)
Search Keyword Construction (choose within the limit):
1st search: "{node name}"
2nd search: "{node name} principle tutorial" or "{node name} guide"
Web Fetch Restrictions:
  • Run at most the user-configured number of Web Fetches per search
  • Prioritize official documentation and authoritative sources
  • Skip duplicate or low-quality URLs
Save Restrictions:
  • Save at most the user-configured number of webpages per search
  • Prioritize complete, content-rich materials
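
The keyword templates and the search limit could be combined roughly as follows. This is a sketch with a hypothetical helper name, `build_queries`. Using the second alternative ("guide") as a third query when the limit allows it is an assumption, since the text above only defines the first two searches.

```python
def build_queries(node_name: str, max_searches: int) -> list[str]:
    """Return at most max_searches queries for one node, in template order."""
    candidates = [
        node_name,                           # 1st search: the node name itself
        f"{node_name} principle tutorial",   # 2nd search, first alternative
        f"{node_name} guide",                # assumed 3rd search if the limit allows
    ]
    return candidates[:max(0, max_searches)]
```

With the default limit of 2, `build_queries("Docker", 2)` yields the node name followed by the "principle tutorial" variant.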

5. Material Collection and Saving

Target Material Types:
  • Technical documentation and official guides
  • Academic papers and research reports
  • Technical blogs and tutorials
  • Practical case studies and code examples
Saving Rules:
  1. Create a materials/ directory to store all materials
  2. Use the web_reader tool to obtain the complete webpage content
  3. Save each material as a separate file, using the naming format {node index}_{source identifier}.{ext}
  4. Supported file formats:
    • .md - Markdown content
    • .txt - Plain text content
    • .json - Structured data

6. Output Format

Create a download.txt file:
Node 1 content: {node1_material1.md: source URL1}, {node1_material2.md: source URL2}
Node 2 content: {node2_material1.md: source URL1}, {node2_material2.md: source URL2}
...
File Naming Specification:
  • Use the format {serial number}_{brief description}.{extension}
  • The serial number corresponds to the line number in node-list.txt
  • The brief description reflects the material's topic
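
A minimal sketch of how the naming format and a download.txt line might be produced. Both helper names (`material_filename`, `download_line`) are hypothetical. The ASCII-only slug is an assumption, so a description written entirely in non-ASCII characters would need a different slug rule.

```python
import re


def material_filename(index: int, description: str, ext: str = "md") -> str:
    """Build '{serial number}_{brief description}.{extension}'."""
    # Lowercase, then collapse every run of non-alphanumeric characters into "_".
    slug = re.sub(r"[^a-z0-9]+", "_", description.lower()).strip("_")
    return f"{index}_{slug}.{ext}"


def download_line(node_name: str, materials: dict[str, str]) -> str:
    """Format one download.txt line: node name, then {file: source URL} pairs."""
    pairs = ", ".join(f"{{{name}: {url}}}" for name, url in materials.items())
    return f"{node_name}: {pairs}"
```

For instance, `material_filename(1, "Hooks Intro")` gives `1_hooks_intro.md`, where the index is the node's line number in node-list.txt.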

Scripts

scripts/parallel_fetch.py

Parallel downloading tool for accelerating content retrieval from multiple URLs.
Features:
  • Concurrent downloading of multiple webpages
  • Automatic retries for failed requests
  • Progress display and error reporting
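
The script's source is not shown here, so the following is only a sketch of how the three listed features (concurrency, retries, progress and error reporting) could fit together using the Python standard library. All function names and parameters are assumptions and do not describe the actual contents of scripts/parallel_fetch.py.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fetch_url(url: str, timeout: float = 10.0) -> str:
    """Download one page with the standard library and decode it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_retry(url: str, fetch: Callable[[str], str],
                     retries: int = 2, delay: float = 1.0) -> str:
    """Call fetch(url), retrying up to `retries` extra times on any error."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: report the last error to the caller
            time.sleep(delay * (attempt + 1))  # wait a little longer each time
    raise ValueError("retries must be >= 0")


def parallel_fetch(urls, fetch=fetch_url, workers=2, retries=2, delay=1.0):
    """Fetch URLs concurrently; return (pages by URL, error messages by URL)."""
    pages, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Keying by URL also drops duplicate URLs.
        futures = {
            url: pool.submit(fetch_with_retry, url, fetch, retries, delay)
            for url in urls
        }
        for done, (url, future) in enumerate(futures.items(), start=1):
            try:
                pages[url] = future.result()
            except Exception as exc:
                errors[url] = str(exc)
            print(f"[{done}/{len(futures)}] {url}")  # simple progress display
    return pages, errors
```

Keeping `workers` at or below the configured sub-agent and fetch limits would keep such a helper consistent with the rate limits described in the workflow.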

scripts/validate_sources.py

Verify material integrity and accessibility.
Features:
  • Check the integrity of downloaded materials
  • Verify URL accessibility
  • Generate material quality reports
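
As an illustration of the integrity check only, the sketch below parses a download.txt line and reports files that are missing or empty. The helper names and the regular expression are assumptions based on the output format shown in this document, not the real contents of scripts/validate_sources.py. URL accessibility checks and report generation are left out.

```python
import re
from pathlib import Path

# Matches "{filename: url}" entries; assumes neither part contains spaces or braces.
ENTRY = re.compile(r"\{([^:{}\s]+):\s*([^{}\s]+)\}")


def parse_download_line(line: str) -> list[tuple[str, str]]:
    """Extract (filename, source URL) pairs from one download.txt line."""
    return ENTRY.findall(line)


def missing_or_empty(filenames: list[str], materials_dir: Path) -> list[str]:
    """Return the filenames that are absent from materials_dir or have no content."""
    problems = []
    for name in filenames:
        path = materials_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems
```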

Examples

Example: Node Research (Default Configuration)

User Configuration: 2 sub-agents, 2 searches per node, 3 fetches per search, 3 saved materials per search
Input (node-list.txt):
Introduction to React Hooks
Docker Containerization Technology
Search Strategy (strictly limited):
Node 1: Introduction to React Hooks
- Search 1: "React Hooks introduction tutorial"
  - Fetch: official documentation, technical blogs (max 3 fetches)
  - Save: up to 3 most relevant materials
- Search 2: "React Hooks best practices"
  - Fetch: articles on best practices (max 3 fetches)
  - Save: up to 3 most relevant materials
Output (download.txt):
Introduction to React Hooks: {1_hooks_intro.md: https://react.dev/learn}, {1_hooks_guide.md: https://www.runoob.com/reactjs/react-hooks.html}, {1_hooks_best_practices.md: https://blog.logrocket.com/guide-to-react-hooks/}
Docker Containerization Technology: {2_docker_intro.md: https://docs.docker.com/get-started/}, {2_docker_tutorial.md: https://yeasy.gitbook.io/docker_practice/}

Example: Quick Collection (Low-Configuration Mode)

User Configuration: 1 sub-agent, 1 search per node, 1 fetch per search, save 1 material
Applicable Scenarios: Quick verification, resource-constrained environments, process testing
Features:
  • Minimizes API calls
  • Completes collection quickly
  • Basic but sufficient materials

Materials Directory Structure

materials/
├── 1_hooks_intro.md
├── 1_hooks_guide.md
├── 1_hooks_best_practices.md
├── 2_docker_intro.md
├── 2_docker_tutorial.md
├── 3_microservices_patterns.md
└── 3_microservices_guide.md

Troubleshooting

Problem | Solution
No materials found for a node | Try different keywords to expand the search scope
Unable to retrieve webpage content | Use the web_reader tool to get the complete content
Poor material quality | Prioritize official documentation and authoritative sources
Parallel request failure | Reduce concurrency and add a retry mechanism
Duplicate materials | Deduplicate and merge similar content

Quality Standards

For each node, collect:
  • At least 2-3 high-quality sources
  • Coverage of different perspectives (theory + practice)
  • Sources in priority order: Official documentation > Authoritative tutorials > Technical blogs > Personal notes
  • Recent materials: in fast-moving technical fields, prefer materials from the last 1-2 years
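
The priority order and the preference for recent material could be applied with a simple sort. This is a sketch: the `Source` type, the category labels, and the example.com URLs in the usage note are illustrative assumptions.

```python
from dataclasses import dataclass

# Lower number = higher priority, following the ranking in the list above.
PRIORITY = {"official": 0, "tutorial": 1, "blog": 2, "notes": 3}


@dataclass(frozen=True)
class Source:
    url: str
    kind: str   # one of the PRIORITY keys
    year: int   # publication or last-update year


def rank_sources(sources: list[Source], limit: int = 3) -> list[Source]:
    """Sort by source priority first, then newest first, and keep the top `limit`."""
    ordered = sorted(sources, key=lambda s: (PRIORITY[s.kind], -s.year))
    return ordered[:limit]
```

Given a 2022 official page at https://example.com/docs and a 2024 blog post at https://example.com/blog, the official page ranks first because source type takes precedence over recency in this sketch.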