indexion-segment

indexion segment

Split text into contextual segments using divergence-based, TF-IDF, or punctuation strategies.

When to Use

  • User needs to chunk text for RAG or embedding pipelines
  • User wants to split a document into meaningful sections
  • User asks to segment text for processing
  • Preparing text for similarity analysis at sub-document level

Usage

Default window divergence strategy

indexion segment <input-file> <output-dir>

TF-IDF based segmentation

indexion segment --strategy=tfidf <input-file> <output-dir>
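
To give a feel for what TF-IDF based topic change detection can look like, here is a minimal Python sketch. It is not indexion's implementation: the function names, the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for illustration. It scores adjacent text blocks by cosine distance between their TF-IDF vectors, so a jump in distance suggests a topic change.

```python
import math
from collections import Counter

def tfidf_vectors(blocks):
    """Return one sparse TF-IDF vector (a dict) per text block."""
    tokenized = [block.lower().split() for block in blocks]
    n = len(tokenized)
    # Document frequency: in how many blocks each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption) keeps weights positive for common terms.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / norm if norm else 1.0

blocks = [
    "cats purr and cats sleep",
    "cats sleep all day",
    "stocks fell as bond yields rose",
]
vecs = tfidf_vectors(blocks)
same_topic = cosine_distance(vecs[0], vecs[1])    # shared vocabulary, lower distance
topic_change = cosine_distance(vecs[1], vecs[2])  # no shared terms, distance 1.0
```

In this toy input the second and third blocks share no words, so their distance is the maximum and would be the natural place for a segment boundary.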

Punctuation-based segmentation

indexion segment --strategy=punctuation <input-file> <output-dir>

Custom segment sizes

indexion segment --min-size=200 --max-size=3000 --target-size=800 document.txt output/
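
How indexion applies the size limits internally is not described here. As a rough mental model only, the hypothetical helper below merges any segment shorter than the minimum into its predecessor and hard-splits anything longer than the maximum. The function name and the merge-then-split order are assumptions, and `--target-size` is not modeled.

```python
def enforce_sizes(segments, min_size=100, max_size=2000):
    """Merge too-short segments into the previous one, then split too-long ones.

    Illustrative only: a leading short segment, or the tail of a hard split,
    can still end up below min_size. A fuller version would rebalance those.
    """
    merged = []
    for seg in segments:
        if merged and len(seg) < min_size:
            merged[-1] += seg          # absorb the short segment
        else:
            merged.append(seg)
    result = []
    for seg in merged:
        # Cut at fixed character offsets; no attempt to respect word boundaries.
        for start in range(0, len(seg), max_size):
            result.append(seg[start:start + max_size])
    return result

segments = ["a" * 150, "b" * 20, "c" * 450]
sized = enforce_sizes(segments, min_size=100, max_size=200)
lengths = [len(s) for s in sized]
```

With these inputs the 20-character segment is absorbed into the first one and the 450-character segment is cut into three pieces, while the concatenated text stays unchanged.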

Custom divergence threshold

indexion segment --threshold=0.5 document.txt output/

Adaptive threshold mode (default)

indexion segment --adaptive document.txt output/
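
The difference between a fixed and an adaptive threshold can be illustrated with a short Python sketch. The rule used here for the adaptive case (mean plus one standard deviation of the observed divergence scores) is purely an assumption; this document does not say how indexion derives its adaptive cut-off or how it interacts with `--threshold`.

```python
from statistics import mean, pstdev

def boundaries(scores, threshold=0.42, adaptive=False):
    """Return the indices whose divergence score exceeds the cut-off."""
    if not scores:
        return []
    if adaptive:
        # Assumed rule: the cut-off follows the score distribution of this document.
        cutoff = mean(scores) + pstdev(scores)
    else:
        cutoff = threshold
    return [i for i, score in enumerate(scores) if score > cutoff]

# Hypothetical divergence scores between consecutive text windows.
scores = [0.30, 0.35, 0.80, 0.32, 0.45, 0.31]
fixed_cuts = boundaries(scores, threshold=0.42)
adaptive_cuts = boundaries(scores, adaptive=True)
```

On these made-up scores the fixed threshold also cuts at the mild bump (0.45), while the assumed adaptive rule keeps only the clear outlier.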

Hybrid NCD+TF-IDF mode

indexion segment --hybrid --ncd-weight=0.6 --tfidf-weight=0.4 document.txt output/
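
NCD is assumed here to mean normalized compression distance, which compares two texts by how well they compress together. The sketch below computes it with zlib and blends it with a TF-IDF distance as a weighted average. Both the choice of compressor and the normalization by the weight sum are assumptions for illustration, not a description of indexion's internals.

```python
import zlib

def ncd(x: str, y: str) -> float:
    """Normalized compression distance with zlib as the compressor."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = len(zlib.compress(bx)), len(zlib.compress(by))
    cxy = len(zlib.compress(bx + by))
    return (cxy - min(cx, cy)) / max(cx, cy)

def hybrid_score(ncd_value, tfidf_value, ncd_weight=0.5, tfidf_weight=0.5):
    """Weighted average of two divergence scores (weights need not sum to 1)."""
    total = ncd_weight + tfidf_weight
    return (ncd_weight * ncd_value + tfidf_weight * tfidf_value) / total

text_a = "the quick brown fox jumps over the lazy dog " * 20
text_b = "interest rates and bond yields moved sharply higher today " * 20

near = ncd(text_a, text_a)   # identical texts compress well together
far = ncd(text_a, text_b)    # unrelated texts do not
blended = hybrid_score(0.8, 0.3, ncd_weight=0.6, tfidf_weight=0.4)
```

NCD is sensitive to surface form and repetition while TF-IDF reacts to vocabulary, which is one plausible reason to combine them on mixed-content documents.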

Custom window size

indexion segment --window-size=5 document.txt output/

Custom output prefix

indexion segment --prefix=chunk document.txt output/

Options

| Option | Default | Description |
| --- | --- | --- |
| --strategy=NAME | window | Strategy: window, tfidf, punctuation |
| --min-size=INT | 100 | Minimum segment characters |
| --max-size=INT | 2000 | Maximum segment characters |
| --target-size=INT | 500 | Target segment characters |
| --threshold=FLOAT | 0.42 | Divergence threshold |
| --window-size=INT | 3 | Window size |
| --adaptive | true | Adaptive threshold mode |
| --hybrid | false | NCD+TF-IDF hybrid mode |
| --ncd-weight=FLOAT | 0.5 | NCD weight in hybrid mode |
| --tfidf-weight=FLOAT | 0.5 | TF-IDF weight in hybrid mode |
| --prefix=NAME | segment | Output file prefix |

Strategies

| Strategy | Description |
| --- | --- |
| window (default) | Sliding window divergence detection |
| tfidf | TF-IDF based topic change detection |
| punctuation | Punctuation/sentence boundary based |
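
As an illustration of sliding window divergence detection in general, the Python sketch below compares the window of sentences before each position with the window after it, and places a cut where the divergence peaks above a threshold. Everything here is an assumption for illustration: the Jaccard word-set distance is a simple stand-in for whatever divergence measure indexion actually uses, and the peak-picking rule is hypothetical.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Stand-in divergence: 1 - |A & B| / |A | B| over lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return 1.0 - len(words_a & words_b) / len(union) if union else 0.0

def window_boundaries(sentences, window_size=3, threshold=0.42,
                      divergence=jaccard_distance):
    """Return sentence indices where a new segment would start."""
    scores = {}
    for i in range(window_size, len(sentences) - window_size + 1):
        left = " ".join(sentences[i - window_size:i])
        right = " ".join(sentences[i:i + window_size])
        scores[i] = divergence(left, right)
    # Keep only local peaks above the threshold so one topic shift gives one cut.
    return [
        i for i, score in scores.items()
        if score > threshold
        and score > scores.get(i - 1, -1.0)
        and score >= scores.get(i + 1, -1.0)
    ]

sentences = [
    "cats sleep", "cats purr", "cats sleep", "cats purr",
    "stocks fell", "stocks rose", "stocks fell", "stocks rose",
]
cuts = window_boundaries(sentences, window_size=2)
```

A larger window smooths out sentence-level noise at the cost of locating boundaries less precisely, which is the trade-off a window size setting typically controls.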

Workflow

  1. Run indexion segment <input-file> <output-dir> to split text with defaults
  2. Adjust --threshold and --target-size to tune segmentation granularity
  3. Use --hybrid mode for better accuracy on mixed-content documents
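
When tuning over several runs, it can help to build the command line programmatically. The Python helper below is a hypothetical convenience wrapper, not part of indexion. It only assembles an argument list from the flags listed in the options table, and actually running it assumes the `indexion` binary is on your PATH.

```python
import shlex

def build_segment_command(input_file, output_dir, **options):
    """Build an argv list for `indexion segment` from keyword options.

    Keyword names map to flags by replacing underscores with hyphens,
    e.g. target_size=800 becomes --target-size=800.
    """
    argv = ["indexion", "segment"]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)                 # boolean switch such as --hybrid
        elif value is not False and value is not None:
            argv.append(f"{flag}={value}")
    argv += [input_file, output_dir]
    return argv

cmd = build_segment_command(
    "document.txt", "output/",
    threshold=0.5, target_size=800, hybrid=True,
)
printable = shlex.join(cmd)
# To execute: subprocess.run(cmd, check=True)
```

Passing the list form to `subprocess.run` avoids shell quoting issues with file names that contain spaces.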