read-source
When to Use
Use `witan read` to convert source documents into LLM-ready text. This is for source material: PDFs, Word docs, presentations, HTML pages, and Markdown files that contain data you need to extract.

- PDF → plain text
- Word (.doc, .docx) → markdown
- PowerPoint (.ppt, .pptx) → markdown
- HTML → markdown
- Markdown (.md) → outline support via `--outline`

This is not for reading spreadsheet data (.xlsx, .xls); use spreadsheet-specific tools for that.
Setup
Files are cached server-side by content hash so repeated operations skip re-upload. If `WITAN_STATELESS=1` is set (or `--stateless` is passed), files are processed but not stored.

The CLI automatically applies per-attempt request timeouts and retries transient API failures (`408`, `429`, `500`, `502`, `503`, `504`, plus timeout/network errors). Non-retryable `4xx` responses fail immediately.
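The retry rule described in this section can be illustrated with a small classifier. This is only a sketch of the policy as stated here, not the CLI's actual code; `is_retryable_status` is a hypothetical helper name, and treating every status not listed above as non-retryable is an assumption of the sketch.

```shell
# Hypothetical helper: succeed (exit status 0) for the HTTP statuses this
# section lists as transient, fail (exit status 1) for everything else.
# Assumption: statuses not named in the text are treated as non-retryable.
is_retryable_status() {
  case "$1" in
    408|429|500|502|503|504) return 0 ;;
    *) return 1 ;;
  esac
}

# Show the decision for a few sample statuses.
for status in 404 429 503; do
  if is_retryable_status "$status"; then
    echo "$status: retry"
  else
    echo "$status: fail immediately"
  fi
done
```

A wrapper script that calls the API directly could use a check like this to decide whether another attempt is worthwhile; when going through the CLI, the retries already happen automatically.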
Quick Reference
```bash
# Get document structure first
witan read report.pdf --outline
witan read slides.pptx --outline

# Read specific sections
witan read report.pdf --pages 1-5
witan read slides.pptx --slides 1-3
witan read notes.docx --offset 50 --limit 100

# Read from URLs
witan read https://example.com/report.pdf --outline
witan read https://example.com/data.csv

# JSON output for automation
witan read report.pdf --json
witan read report.pdf --outline --json
```

Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| Non-zero | Error (bad arguments, network failure, unsupported format) |
Navigation Strategy
Go directly with `--pages`, `--slides`, or `--offset`/`--limit` when you know where to look. Use `--outline` when you don't; it gives you the document structure so you can target the right section.

PDF workflow:
- `witan read report.pdf --outline` → see chapter/section structure with page ranges
- `witan read report.pdf --pages 12-15` → read the section you need

PPTX workflow:
- `witan read deck.pptx --outline` → see slide titles
- `witan read deck.pptx --slides 5-8` → read specific slides

Text/DOCX workflow:
- `witan read notes.docx --outline` → see heading structure with line offsets
- `witan read notes.docx --offset 120 --limit 50` → read a section
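The outline-then-read workflow can be scripted by looking up a section's page range in saved outline text. The sketch below assumes outline lines of the form `Title [pages A-B]`, as in the sample outline shown under Output Format; `section_pages` is a hypothetical helper, not part of the CLI, and the sample outline is fed in through a here-document so the snippet runs without `witan` installed.

```shell
# Hypothetical helper: read outline text on stdin and print the page range
# ("A-B") of the first section whose title matches $1 exactly.
# Assumption: each outline line looks like "Title [pages A-B]".
section_pages() {
  awk -v title="$1" '
    {
      line = $0
      sub(/^[ \t]+/, "", line)              # drop outline indentation
      prefix = title " [pages "
      if (index(line, prefix) == 1) {       # literal match at start of line
        range = substr(line, length(prefix) + 1)
        sub(/\].*$/, "", range)             # keep only "A-B"
        print range
        exit
      }
    }'
}

# Demo on a sample outline; prints 3-5
section_pages "Financial Summary" <<'EOF'
Introduction [pages 1-2]
  Background [pages 1-1]
Results [pages 3-8]
  Financial Summary [pages 3-5]
EOF
```

In a real run, the outline would come from `witan read report.pdf --outline > outline.txt`, and the printed range would be passed to `witan read report.pdf --pages`. A section longer than the per-read page limit still has to be read in several calls.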
Command Reference
`witan read <file-or-url> [flags]`

| Flag | Default | Description |
|---|---|---|
| `--pages` | — | PDF page range (e.g. `1-5`) |
| `--slides` | — | Presentation slide range (e.g. `1-3`) |
| `--offset` | `1` | Start line (1-indexed) |
| `--limit` | `2000` | Maximum lines to return |
| `--outline` | `false` | Show document structure instead of content |
| `--json` | `false` | Output full JSON response |
Pagination Limits
| Constraint | Value |
|---|---|
| Max PDF pages per read | 10 |
| Max PPTX slides per read | 10 |
| Default line limit | 2000 |
| Max file size | 25 MB |
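Because a single read covers at most 10 PDF pages, a longer document has to be read in chunks. The following sketch only prints the commands it would run, so it works without `witan` installed; `page_chunks` is a hypothetical helper, and the total page count is assumed to be known already (for example from the outline).

```shell
# Hypothetical helper: print one "witan read" command per chunk of pages,
# keeping each chunk within the per-read limit (default 10).
# Usage: page_chunks <file> <total-pages> [max-pages-per-read]
page_chunks() (
  file=$1
  total=$2
  max=${3:-10}
  start=1
  while [ "$start" -le "$total" ]; do
    end=$((start + max - 1))
    if [ "$end" -gt "$total" ]; then
      end=$total                      # clamp the last chunk to the final page
    fi
    echo "witan read $file --pages $start-$end"
    start=$((end + 1))
  done
)

# For a 25-page document this prints three commands:
# pages 1-10, 11-20, and 21-25.
page_chunks report.pdf 25
```

The same loop would apply to presentations by swapping `--pages` for `--slides`, since the slide limit per read is also 10.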
Pipeline: Source → Spreadsheet
The typical flow for reading source material and populating a spreadsheet:
- Explore: `witan read source.pdf --outline` to understand structure
- Read: `witan read source.pdf --pages 3-8` to get the data
- Parse: extract values from the text (LLM or regex)
- Write: `witan xlsx exec model.xlsx --input-json '...'` to populate the spreadsheet
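The Parse step can be as simple as a regex pass over the line-numbered text. The sketch below assumes lines shaped like `3 Q1: $1,250,000` (a line number, a label, a colon, and a dollar amount), which matches the sample under Output Format; `parse_amounts` is a hypothetical helper. The JSON shape expected by `witan xlsx exec --input-json` is not described here, so the sketch stops at tab-separated label/value pairs.

```shell
# Hypothetical helper: read line-numbered text on stdin and print
# "<label><TAB><amount>" for every line that ends in a dollar amount.
# Assumption: lines look like "<line number> <label>: $<amount>".
parse_amounts() {
  awk '
    match($0, /: *\$[0-9,]+$/) {
      label = substr($0, 1, RSTART - 1)
      sub(/^[ \t]*[0-9]+[ \t]+/, "", label)   # drop the leading line number
      amount = substr($0, RSTART)
      gsub(/[^0-9]/, "", amount)              # keep digits only
      printf "%s\t%s\n", label, amount
    }'
}

# Demo on sample content; lines without an amount are skipped.
parse_amounts <<'EOF'
1 Revenue Summary
2
3 Q1: $1,250,000
4 Q2: $1,380,000
EOF
```

For irregular layouts (tables split across lines, mixed currencies), handing the text to an LLM is likely the more robust option, as the Parse step already suggests.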
Output Format
Content mode (default): line-numbered text to stdout, metadata to stderr.

```
1 Revenue Summary
2
3 Q1: $1,250,000
4 Q2: $1,380,000

text/plain [15 pages, 10 read, 847 lines total, showing 1–847]
```

Outline mode (`--outline`): indented structure to stdout.

```
Introduction [pages 1-2]
  Background [pages 1-1]
  Methodology [pages 2-2]
Results [pages 3-8]
  Financial Summary [pages 3-5]
  Projections [pages 6-8]
Appendix [pages 9-15]

[15 pages]
```

Error Guide
| Error | Fix |
|---|---|
| File not found or unreadable | Check file path exists and is readable |
| URL not accessible | Check the URL is accessible |
| File too large | File exceeds 25 MB limit |
| Content type not set | Set Content-Type header (API only) |
| Empty outline | Document has no bookmarks/headings; use offset/limit to navigate |
| Truncated text | Use `--offset`/`--limit` to read the remaining lines |
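For truncated text, one way to continue reading is to derive the next `--offset` from the metadata line that content mode writes to stderr. This is a sketch under the assumption that the line always has the shape shown under Output Format (`... N lines total, showing A–B]`); `next_offset` is a hypothetical helper, and the 5000-line example below is made up for illustration.

```shell
# Hypothetical helper: given the stderr metadata line as $1, print the next
# --offset to request, or print nothing when the last line was already shown.
# Assumption: the line ends with "<N> lines total, showing <A>–<B>]".
next_offset() (
  meta=$1
  total=$(printf '%s\n' "$meta" | sed -n 's/.* \([0-9][0-9]*\) lines total.*/\1/p')
  last=$(printf '%s\n' "$meta" | sed -n 's/.*[^0-9]\([0-9][0-9]*\)\]$/\1/p')
  if [ -n "$total" ] && [ -n "$last" ] && [ "$last" -lt "$total" ]; then
    echo $((last + 1))
  fi
)

# Illustrative metadata for a 5000-line document read with the default limit;
# prints 2001
next_offset 'text/plain [15 pages, 10 read, 5000 lines total, showing 1–2000]'
```

With that result, the follow-up read would be `witan read notes.docx --offset 2001 --limit 2000`, repeated until `next_offset` prints nothing.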