semantic-search


Semantic Search Skill


Search through files and directories for content using keyword matching and basic semantic analysis.

When to Use


USE this skill when:
  • Finding code that implements a feature
  • Searching documentation for topics
  • Locating files by their content
  • Finding similar code patterns
  • Researching codebase structure

When NOT to Use


DON'T use this skill when:
  • Searching binary files → use file tools
  • Exact regex patterns → use grep
  • Searching very large repos (>100k files) → use indexed search

Installation


```bash
cd /job
npm install natural compromise
```

Features


  • Keyword Search: Simple text matching across files
  • Stemming: Matches word variations (run, running, ran)
  • Relevance Scoring: Ranks results by query-term overlap (a TF helper is included as a building block for TF-IDF)
  • File Filtering: Filter by extension, path patterns
  • Context: Shows surrounding lines for each match
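The stemming feature relies on natural's `PorterStemmer`; a minimal sketch of the idea, using a naive suffix-stripper as a stand-in (not the real Porter algorithm), shows why stemming improves recall:

```javascript
// Naive suffix-stripping stemmer - a simplified stand-in for
// natural.PorterStemmer, just to illustrate the idea.
function naiveStem(word) {
  return word.toLowerCase().replace(/(ing|ed|es|s)$/, '');
}

// All variants collapse to the same stem, so a query for "search"
// also matches lines containing "searching" or "searched".
const variants = ['search', 'searching', 'searched', 'searches'];
console.log(variants.map(naiveStem)); // every variant reduces to 'search'
```

The real Porter stemmer handles many more suffix classes (doubled consonants, `-ation`, `-ization`, etc.), which is why the skill depends on `natural` rather than a regex like this.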

Usage


Basic Search


```javascript
const { searchFiles } = require('./semantic-search');

const results = await searchFiles('.', {
  query: 'authentication middleware',
  extensions: ['.js', '.ts'],
  maxResults: 20
});

console.log(results);
```

Advanced Search


```javascript
const results = await searchFiles('/path/to/code', {
  query: 'error handling database',
  excludeDirs: ['node_modules', 'dist', '.git'],
  extensions: ['.js', '.ts', '.py'],
  contextLines: 3,
  maxResults: 50,
  minScore: 0.3
});
```

Node.js Implementation


```javascript
const fs = require('fs');
const path = require('path');
const natural = require('natural');

class SemanticSearcher {
  constructor(options = {}) {
    this.stemmer = natural.PorterStemmer;
    this.tokenizer = new natural.WordTokenizer();
    this.maxFileSize = options.maxFileSize || 1024 * 1024; // 1MB
    this.excludeDirs = options.excludeDirs || [
      'node_modules', 'dist', 'build', '.git', 'vendor',
      '__pycache__', '.next', '.nuxt'
    ];
  }

  // Lowercase, split into words, and stem each token so that
  // variations like "run" / "running" compare equal.
  tokenize(text) {
    return this.tokenizer.tokenize(text.toLowerCase())
      .map(token => this.stemmer.stem(token));
  }

  // Normalized term frequency. Not used by scoreDocument below;
  // kept as a building block for a full TF-IDF ranking.
  calculateTF(tokens) {
    const tf = {};
    tokens.forEach(token => {
      tf[token] = (tf[token] || 0) + 1;
    });
    const maxFreq = Math.max(...Object.values(tf));
    Object.keys(tf).forEach(key => {
      tf[key] /= maxFreq;
    });
    return tf;
  }

  // Fraction of document tokens that appear in the query (0.0-1.0).
  scoreDocument(queryTokens, docTokens) {
    const querySet = new Set(queryTokens);
    let score = 0;
    docTokens.forEach(token => {
      if (querySet.has(token)) score++;
    });
    return score / Math.max(docTokens.length, 1);
  }

  async searchFiles(rootDir, query, options = {}) {
    const queryTokens = this.tokenize(query);
    const results = [];
    const files = await this.walkDirectory(rootDir, options);

    for (const file of files) {
      try {
        const content = await fs.promises.readFile(file, 'utf-8');
        const tokens = this.tokenize(content);
        const score = this.scoreDocument(queryTokens, tokens);

        if (score > (options.minScore || 0.1)) {
          const lines = content.split('\n');
          const matchLines = this.findMatchingLines(lines, queryTokens, options.contextLines || 2);

          results.push({
            file: path.relative(rootDir, file),
            score: score.toFixed(3),
            matches: matchLines,
            totalLines: lines.length
          });
        }
      } catch (e) {
        // Skip unreadable files (binary, permission errors, etc.)
      }
    }

    return results.sort((a, b) => parseFloat(b.score) - parseFloat(a.score))
      .slice(0, options.maxResults || 20);
  }

  async walkDirectory(dir, options = {}) {
    const files = [];
    const extensions = options.extensions || null;

    // Arrow function so `this` (excludeDirs, maxFileSize) stays bound
    // through the recursive calls.
    const walk = async (currentDir) => {
      const entries = await fs.promises.readdir(currentDir, { withFileTypes: true });

      for (const entry of entries) {
        if (entry.isDirectory()) {
          if (!this.excludeDirs.includes(entry.name)) {
            await walk(path.join(currentDir, entry.name));
          }
        } else if (entry.isFile()) {
          if (!extensions || extensions.some(ext => entry.name.endsWith(ext))) {
            const filePath = path.join(currentDir, entry.name);
            const stats = await fs.promises.stat(filePath);
            if (stats.size <= this.maxFileSize) {
              files.push(filePath);
            }
          }
        }
      }
    };

    await walk(dir);
    return files;
  }

  findMatchingLines(lines, queryTokens, contextLines) {
    const matches = [];

    lines.forEach((line, index) => {
      const lineTokens = this.tokenize(line);
      const matchCount = lineTokens.filter(t => queryTokens.includes(t)).length;

      if (matchCount > 0) {
        const start = Math.max(0, index - contextLines);
        const end = Math.min(lines.length, index + contextLines + 1);

        matches.push({
          lineNumber: index + 1,
          content: line.trim(),
          context: lines.slice(start, end).join('\n'),
          matchScore: matchCount
        });
      }
    });

    return matches.slice(0, 10);
  }
}

// Usage (wrapped in an async IIFE, since top-level await is not
// available in a CommonJS script)
(async () => {
  const searcher = new SemanticSearcher();
  const results = await searcher.searchFiles('.', 'authentication', {
    extensions: ['.js', '.ts'],
    maxResults: 10
  });

  console.log(JSON.stringify(results, null, 2));
})();
```

Command Line Usage



Search for authentication code


```bash
node index.js search "auth middleware" --ext .js,.ts --max 10
```

Search with context


```bash
node index.js search "error handling" --context 5
```

Search specific directory


```bash
node index.js search "database" --dir src/
```
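The `index.js` entry point itself is not shown in this skill; a plausible sketch of the flag parsing the commands above imply (the flag names `--ext`, `--max`, `--context`, `--dir` come from the examples; everything else is an assumption, and the real `index.js` may differ):

```javascript
// Hypothetical CLI argument parser matching the documented commands,
// e.g. node index.js search "auth middleware" --ext .js,.ts --max 10
function parseArgs(argv) {
  const [command, query, ...rest] = argv;
  const opts = { command, query };
  // Flags come in "--name value" pairs.
  for (let i = 0; i < rest.length; i += 2) {
    switch (rest[i]) {
      case '--ext':     opts.extensions = rest[i + 1].split(','); break;
      case '--max':     opts.maxResults = parseInt(rest[i + 1], 10); break;
      case '--context': opts.contextLines = parseInt(rest[i + 1], 10); break;
      case '--dir':     opts.dir = rest[i + 1]; break;
    }
  }
  return opts;
}

const opts = parseArgs(process.argv.slice(2));
console.log(opts);
```

The parsed `opts` object maps directly onto the options accepted by `searchFiles`.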

Output Format


```json
{
  "query": "authentication middleware",
  "totalMatches": 5,
  "results": [
    {
      "file": "src/middleware/auth.js",
      "score": "0.847",
      "matches": [
        {
          "lineNumber": 42,
          "content": "function authenticateUser(token) {",
          "context": "...",
          "matchScore": 3
        }
      ]
    }
  ]
}
```
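Results in this shape can be consumed directly; for example, flattening them into grep-style `file:line` output (the sample object below is built from the format above):

```javascript
// Sample search output in the documented shape.
const output = {
  query: 'authentication middleware',
  totalMatches: 5,
  results: [
    {
      file: 'src/middleware/auth.js',
      score: '0.847',
      matches: [
        { lineNumber: 42, content: 'function authenticateUser(token) {', matchScore: 3 }
      ]
    }
  ]
};

// Flatten nested matches into "file:line: content" lines.
const summary = output.results.flatMap(r =>
  r.matches.map(m => `${r.file}:${m.lineNumber}: ${m.content}`)
);
console.log(summary.join('\n'));
// src/middleware/auth.js:42: function authenticateUser(token) {
```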

Quick Tips


  • Use specific terms: "JWT validation" rather than just "auth"
  • Filter by extension: restricting to `.js` or `.ts` cuts noise from docs and configs
  • Multiple query words improve ranking accuracy
  • Use camelCase identifiers when searching code

Notes


  • Searches text files only
  • Case-insensitive matching
  • Stemming improves recall
  • Scores range from 0.0 to 1.0
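The 0.0-1.0 score bound follows from the scoring rule in `scoreDocument`: matched tokens divided by total document tokens. A standalone copy of that rule makes the bound easy to verify:

```javascript
// Standalone copy of the scoring rule used by the searcher:
// matched tokens / total tokens, which is always between 0.0 and 1.0.
function scoreDocument(queryTokens, docTokens) {
  const querySet = new Set(queryTokens);
  let hits = 0;
  docTokens.forEach(t => { if (querySet.has(t)) hits++; });
  return hits / Math.max(docTokens.length, 1);
}

const q = ['auth', 'middlewar']; // stemmed query tokens
console.log(scoreDocument(q, ['auth', 'middlewar']));              // 1 - every token matches
console.log(scoreDocument(q, ['auth', 'router', 'config', 'db'])); // 0.25 - one of four tokens
console.log(scoreDocument(q, []));                                 // 0 - empty document
```

Note the denominator is the document's length, so long files need proportionally more matching tokens to clear `minScore`.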