indexion-segment
indexion segment
Split text into contextual segments using divergence-based, TF-IDF, or punctuation strategies.
When to Use
- User needs to chunk text for RAG or embedding pipelines
- User wants to split a document into meaningful sections
- User asks to segment text for processing
- Preparing text for similarity analysis at sub-document level
Usage
```bash
# Default window divergence strategy
indexion segment <input-file> <output-dir>

# TF-IDF based segmentation
indexion segment --strategy=tfidf <input-file> <output-dir>

# Punctuation-based segmentation
indexion segment --strategy=punctuation <input-file> <output-dir>

# Custom segment sizes
indexion segment --min-size=200 --max-size=3000 --target-size=800 document.txt output/

# Custom divergence threshold
indexion segment --threshold=0.5 document.txt output/

# Adaptive threshold mode (default)
indexion segment --adaptive document.txt output/

# Hybrid NCD+TF-IDF mode
indexion segment --hybrid --ncd-weight=0.6 --tfidf-weight=0.4 document.txt output/

# Custom window size
indexion segment --window-size=5 document.txt output/

# Custom output prefix
indexion segment --prefix=chunk document.txt output/
```
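For RAG or embedding pipelines there is usually more than one input file. The sketch below wraps the documented `indexion segment <input-file> <output-dir>` form in a loop. The function name `segment_all` and the one-directory-per-input layout are illustrative choices, not part of the tool, and the output directory is created up front because it is not stated here whether `indexion` creates it.

```bash
# Segment every .txt file in a directory, one output directory per input file.
# Assumptions: `indexion` is on PATH; segment_all and the directory layout
# are hypothetical helpers for illustration only.
segment_all() {
  local src_dir=$1 out_root=$2
  local f name
  for f in "$src_dir"/*.txt; do
    [ -e "$f" ] || continue            # skip the literal glob when nothing matches
    name=$(basename "$f" .txt)
    mkdir -p "$out_root/$name"         # harmless if indexion also creates it
    indexion segment "$f" "$out_root/$name" || return 1
  done
}
```

Calling `segment_all docs/ segments/` would then run the default strategy once per file and stop at the first failure.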
Options
| Option | Default | Description |
|---|---|---|
| `--strategy` | window | Strategy: window, tfidf, punctuation |
| `--min-size` | 100 | Minimum segment characters |
| `--max-size` | 2000 | Maximum segment characters |
| `--target-size` | 500 | Target segment characters |
| `--threshold` | 0.42 | Divergence threshold |
| `--window-size` | 3 | Window size |
| `--adaptive` | true | Adaptive threshold mode |
| `--hybrid` | false | NCD+TF-IDF hybrid mode |
| `--ncd-weight` | 0.5 | NCD weight in hybrid mode |
| `--tfidf-weight` | 0.5 | TF-IDF weight in hybrid mode |
| `--prefix` | segment | Output file prefix |
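To see how a run relates to the minimum and maximum sizes above, a small check over the output directory can help. This is a sketch that assumes each segment is written as its own file directly inside `<output-dir>`; the helper name `check_segment_sizes` is hypothetical, and whether the tool treats the size options as hard limits is not stated here.

```bash
# Print "<chars>\t<file>" for every file whose character count is outside [min, max].
# Assumption: one segment per file, stored directly in the output directory.
check_segment_sizes() {
  local dir=$1 min=${2:-100} max=${3:-2000}
  local f n
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    n=$(wc -m < "$f" | tr -d ' ')      # wc -m counts characters in the current locale
    if [ "$n" -lt "$min" ] || [ "$n" -gt "$max" ]; then
      printf '%s\t%s\n' "$n" "$f"
    fi
  done
}
```

For example, `check_segment_sizes output/ 200 3000` lists anything outside the custom sizes used in the usage examples; no output means every file is within range.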
Strategies
| Strategy | Description |
|---|---|
| `window` | Sliding window divergence detection |
| `tfidf` | TF-IDF based topic change detection |
| `punctuation` | Punctuation and sentence boundary based splitting |
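The hybrid options mention NCD without defining it. Assuming NCD here means normalized compression distance, the standard definition for two texts $x$ and $y$, with $C(\cdot)$ the compressed length under some compressor, is given first below. The second line is only a guess at how the two weights might combine the scores (a weighted sum); the symbols $d_{\mathrm{ncd}}$, $d_{\mathrm{tfidf}}$ and $d_{\mathrm{hybrid}}$ are illustrative and not taken from the tool.

```latex
% Standard normalized compression distance
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}

% Assumed (not confirmed) weighted combination in hybrid mode
d_{\mathrm{hybrid}} = w_{\mathrm{ncd}} \, d_{\mathrm{ncd}} + w_{\mathrm{tfidf}} \, d_{\mathrm{tfidf}}
```

Under that assumption, `--ncd-weight=0.6 --tfidf-weight=0.4` would let the compression-based score contribute slightly more than the term-based one.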
Workflow
- Run `indexion segment <input-file> <output-dir>` to split text with defaults
- Adjust `--threshold` and `--target-size` to tune segmentation granularity
- Use `--hybrid` mode for better accuracy on mixed-content documents
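The tuning step can be sketched as a small sweep: run the same input at several `--threshold` values and compare how many files each run produces. The function name `sweep_thresholds` and the `threshold-<value>` directory naming are hypothetical. How a fixed threshold interacts with the default adaptive mode is not specified here, so treat the counts as something to observe rather than predict.

```bash
# Run the segmenter at several --threshold values and print "<threshold>\t<file count>".
# Assumptions: `indexion` is on PATH and writes its segments as files under <output-dir>.
sweep_thresholds() {
  local input=$1 out_root=$2
  shift 2
  local t out count
  for t in "$@"; do
    out="$out_root/threshold-$t"
    mkdir -p "$out"
    indexion segment --threshold="$t" "$input" "$out" || return 1
    count=$(find "$out" -type f | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$t" "$count"
  done
}
```

A call such as `sweep_thresholds document.txt sweep/ 0.3 0.42 0.5` keeps each run in its own directory, so the resulting segments can be inspected side by side before settling on a value.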