indexion-segment

indexion segment

Split text into contextual segments using divergence-based, TF-IDF, or punctuation strategies.

When to Use

- User needs to chunk text for RAG or embedding pipelines
- User wants to split a document into meaningful sections
- User asks to segment text for processing
- User is preparing text for similarity analysis at the sub-document level

Usage

```bash
# Default window divergence strategy
indexion segment <input-file> <output-dir>

# TF-IDF based segmentation
indexion segment --strategy=tfidf <input-file> <output-dir>

# Punctuation-based segmentation
indexion segment --strategy=punctuation <input-file> <output-dir>

# Custom segment sizes
indexion segment --min-size=200 --max-size=3000 --target-size=800 document.txt output/

# Custom divergence threshold
indexion segment --threshold=0.5 document.txt output/

# Adaptive threshold mode (default)
indexion segment --adaptive document.txt output/

# Hybrid NCD+TF-IDF mode
indexion segment --hybrid --ncd-weight=0.6 --tfidf-weight=0.4 document.txt output/

# Custom window size
indexion segment --window-size=5 document.txt output/

# Custom output prefix
indexion segment --prefix=chunk document.txt output/
```
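When the command is driven from a script (for example, inside a RAG ingestion job), it can help to assemble the argument list programmatically. The following is a minimal sketch, not part of indexion itself: the helper name `build_segment_command` is hypothetical, and it only emits flags that appear in the usage examples above.

```python
import shlex
from typing import List, Optional


def build_segment_command(
    input_file: str,
    output_dir: str,
    strategy: Optional[str] = None,
    threshold: Optional[float] = None,
    target_size: Optional[int] = None,
    prefix: Optional[str] = None,
    hybrid: bool = False,
) -> List[str]:
    """Build an argv list for `indexion segment` (hypothetical helper).

    Only flags shown in this document are emitted; anything left as None
    is omitted so the tool's own defaults apply.
    """
    cmd = ["indexion", "segment"]
    if strategy is not None:
        cmd.append(f"--strategy={strategy}")
    if threshold is not None:
        cmd.append(f"--threshold={threshold}")
    if target_size is not None:
        cmd.append(f"--target-size={target_size}")
    if prefix is not None:
        cmd.append(f"--prefix={prefix}")
    if hybrid:
        cmd.append("--hybrid")
    # Positional arguments come last: input file, then output directory.
    cmd += [input_file, output_dir]
    return cmd


def as_shell_string(cmd: List[str]) -> str:
    """Render the argv list as a copy-pasteable shell command."""
    return shlex.join(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where `indexion` is on the `PATH`; passing a list rather than a string avoids shell quoting problems with unusual file names.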

Options

| Option | Default | Description |
| --- | --- | --- |
| `--strategy=NAME` | `window` | Strategy: `window`, `tfidf`, or `punctuation` |
| `--min-size=INT` | 100 | Minimum segment size in characters |
| `--max-size=INT` | 2000 | Maximum segment size in characters |
| `--target-size=INT` | 500 | Target segment size in characters |
| `--threshold=FLOAT` | 0.42 | Divergence threshold |
| `--window-size=INT` | 3 | Window size |
| `--adaptive` | true | Adaptive threshold mode |
| `--hybrid` | false | NCD+TF-IDF hybrid mode |
| `--ncd-weight=FLOAT` | 0.5 | NCD weight in hybrid mode |
| `--tfidf-weight=FLOAT` | 0.5 | TF-IDF weight in hybrid mode |
| `--prefix=NAME` | `segment` | Output file prefix |
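To give a feel for what `--ncd-weight` and `--tfidf-weight` might control, the sketch below blends a normalized compression distance (NCD) with a TF-IDF cosine distance. This is an illustration under assumptions, not indexion's implementation: the function names, the use of `zlib` as the compressor, the smoothed IDF formula, and the normalization by the weight sum are all choices made here for the example.

```python
import math
import re
import zlib
from collections import Counter
from typing import Dict, List


def ncd(a: str, b: str) -> float:
    """Normalized compression distance, using zlib as the compressor (assumption)."""
    ca = len(zlib.compress(a.encode("utf-8")))
    cb = len(zlib.compress(b.encode("utf-8")))
    cab = len(zlib.compress((a + b).encode("utf-8")))
    return (cab - min(ca, cb)) / max(ca, cb)


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def idf_table(docs: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency over a list of text windows."""
    n = len(docs)
    df: Counter = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_distance(a: str, b: str, idf: Dict[str, float]) -> float:
    """Cosine distance (1 - cosine similarity) between TF-IDF vectors."""
    va = {t: c * idf.get(t, 1.0) for t, c in Counter(tokenize(a)).items()}
    vb = {t: c * idf.get(t, 1.0) for t, c in Counter(tokenize(b)).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty window as maximally different
    return 1.0 - dot / (norm_a * norm_b)


def hybrid_divergence(
    a: str,
    b: str,
    idf: Dict[str, float],
    ncd_weight: float = 0.5,
    tfidf_weight: float = 0.5,
) -> float:
    """Weighted blend of the two distances, normalized by the weight sum (assumption)."""
    total = ncd_weight + tfidf_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (ncd_weight * ncd(a, b) + tfidf_weight * tfidf_distance(a, b, idf)) / total
```

In this sketch, raising `ncd_weight` makes the score more sensitive to surface-level changes in character patterns, while raising `tfidf_weight` emphasizes vocabulary shifts.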

Strategies

| Strategy | Description |
| --- | --- |
| `window` (default) | Sliding-window divergence detection |
| `tfidf` | TF-IDF based topic-change detection |
| `punctuation` | Punctuation and sentence-boundary based splitting |
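The idea behind sliding-window divergence detection can be sketched as follows: compare the text in a window before each candidate position with the window after it, and place a boundary where the divergence peaks above a threshold. This is a conceptual sketch only. The function names are hypothetical, the choice of NCD as the default divergence is an assumption suggested by the hybrid mode's name, and the source does not state how indexion measures windows or picks peaks.

```python
import re
import zlib
from typing import Callable, List


def ncd(a: str, b: str) -> float:
    """Normalized compression distance with zlib (assumed divergence measure)."""
    ca = len(zlib.compress(a.encode("utf-8")))
    cb = len(zlib.compress(b.encode("utf-8")))
    cab = len(zlib.compress((a + b).encode("utf-8")))
    return (cab - min(ca, cb)) / max(ca, cb)


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def window_boundaries(
    sentences: List[str],
    window_size: int = 3,
    threshold: float = 0.42,
    divergence: Callable[[str, str], float] = ncd,
) -> List[int]:
    """Return sentence indices at which a new segment starts."""
    positions = range(window_size, len(sentences) - window_size + 1)
    scores = [
        divergence(
            " ".join(sentences[i - window_size:i]),
            " ".join(sentences[i:i + window_size]),
        )
        for i in positions
    ]
    boundaries = []
    for k, i in enumerate(positions):
        # Keep only local peaks so one topic shift yields one boundary.
        left_ok = k == 0 or scores[k] >= scores[k - 1]
        right_ok = k == len(scores) - 1 or scores[k] > scores[k + 1]
        if scores[k] > threshold and left_ok and right_ok:
            boundaries.append(i)
    return boundaries


def cut(sentences: List[str], boundaries: List[int]) -> List[str]:
    """Join sentences into segments at the given boundary indices."""
    if not sentences:
        return []
    segments, start = [], 0
    for b in boundaries:
        segments.append(" ".join(sentences[start:b]))
        start = b
    segments.append(" ".join(sentences[start:]))
    return segments
```

Under these assumptions, a larger `window_size` smooths out local noise at the cost of missing short sections, and a higher `threshold` produces fewer, longer segments.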

Workflow

1. Run `indexion segment <input-file> <output-dir>` to split text with the defaults.
2. Adjust `--threshold` and `--target-size` to tune segmentation granularity.
3. Use `--hybrid` mode for better accuracy on mixed-content documents.
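To illustrate why the target size affects granularity, here is a small sketch that greedily packs sentences into segments until a target length is reached, with a hard cap on segment length. It is a simplified model, not indexion's algorithm: the function name `pack_sentences` is hypothetical, and a minimum-size rule like `--min-size` is left out for brevity.

```python
from typing import List


def pack_sentences(
    sentences: List[str],
    target_size: int = 500,
    max_size: int = 2000,
) -> List[str]:
    """Greedily pack sentences into segments of roughly target_size characters."""
    segments: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_size:
            # Adding this sentence would exceed the cap: close the segment first.
            segments.append(current)
            current = sentence
        else:
            current = candidate
        if len(current) >= target_size:
            # Target reached: close the segment and start a new one.
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments
```

In this model, lowering `target_size` yields more, shorter segments from the same input, which is the same direction of effect the workflow step describes for `--target-size`.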
