Local LLM Provider

Connect to local LLM endpoints (Ollama, llama.cpp, vLLM) with automatic fallback to cloud providers. This skill enables the agent to leverage local GPU/CPU inference while maintaining reliability through intelligent fallback.

When to Use

  • Running LLM inference locally for privacy (data never leaves your machine)
  • Using models not available via cloud APIs (e.g., fine-tuned models, Llama variants)
  • Reducing API costs for high-volume tasks
  • Working offline or with intermittent connectivity
  • Needing low-latency responses for interactive tasks

Setup

No additional setup required if Ollama is already running. Otherwise:

Ollama Setup

```bash
# Install Ollama (download from https://ollama.com)

# Pull a model
ollama pull llama3.2

# Start the server (default: http://localhost:11434)
ollama serve
```

llama.cpp Server Setup

```bash
# Build llama-server
make llama-server

# Start the server
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 133000 --host 127.0.0.1 --port 8080
```

vLLM Server Setup

```bash
# Install vLLM
pip install vllm

# Start the server
vllm serve meta-llama/Llama-3.1-8B-Instruct
```

Usage

Query a local model

```bash
node /job/.pi/skills/local-llm-provider/query.js "What is 2+2?" --model llama3.2
```

Query with custom parameters

```bash
node /job/.pi/skills/local-llm-provider/query.js "Explain quantum computing" --model mixtral --temp 0.8 --max-tokens 500
```

List available models

```bash
node /job/.pi/skills/local-llm-provider/list-models.js
```

Check server health

```bash
node /job/.pi/skills/local-llm-provider/health.js
```

Stream responses

```bash
node /job/.pi/skills/local-llm-provider/query.js "Tell me a story" --stream
```

Configuration

Create a config.json in the skill directory for persistent settings:

```json
{
  "providers": [
    {
      "name": "ollama",
      "url": "http://localhost:11434",
      "enabled": true,
      "fallback_order": 1
    },
    {
      "name": "llamacpp",
      "url": "http://localhost:8080/v1",
      "enabled": false,
      "fallback_order": 2
    },
    {
      "name": "vllm",
      "url": "http://localhost:8000/v1",
      "enabled": false,
      "fallback_order": 3
    }
  ],
  "default_model": "llama3.2",
  "fallback_to_cloud": true,
  "cloud_provider": "anthropic",
  "timeout_ms": 120000
}
```
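
As a rough sketch of how these settings might be consumed, the snippet below loads config.json from the skill directory and falls back to defaults when the file is missing. loadConfig and DEFAULTS are illustrative names, not part of the skill's documented API.

```javascript
// Illustrative only: load config.json from the skill directory, falling back
// to defaults that mirror the example configuration above.
const fs = require('fs');
const path = require('path');

const DEFAULTS = {
  providers: [
    { name: 'ollama', url: 'http://localhost:11434', enabled: true, fallback_order: 1 }
  ],
  default_model: 'llama3.2',
  fallback_to_cloud: true,
  cloud_provider: 'anthropic',
  timeout_ms: 120000
};

function loadConfig(skillDir = __dirname) {
  const configPath = path.join(skillDir, 'config.json');
  if (!fs.existsSync(configPath)) return DEFAULTS;
  const userConfig = JSON.parse(fs.readFileSync(configPath, 'utf8'));
  // Shallow merge: user-provided keys override the defaults.
  return { ...DEFAULTS, ...userConfig };
}
```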

Provider Fallback

The skill implements intelligent fallback:
  1. Primary: Try local Ollama first
  2. Secondary: Try llama.cpp server
  3. Tertiary: Try vLLM server
  4. Fallback: Use cloud provider (if enabled)
Each provider failure triggers automatic retry with the next available provider.
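
To make the order concrete, a fallback loop over this configuration could look roughly like the sketch below. tryProvider is a hypothetical helper that sends the prompt to a single endpoint and throws on connection errors or timeouts; it is not part of the skill's documented API.

```javascript
// Illustrative sketch of the fallback order described above.
async function completeWithFallback(prompt, config, tryProvider) {
  // Try enabled local providers in ascending fallback_order.
  const candidates = config.providers
    .filter(p => p.enabled)
    .sort((a, b) => a.fallback_order - b.fallback_order);

  const errors = [];
  for (const provider of candidates) {
    try {
      return await tryProvider(provider, prompt); // first success wins
    } catch (err) {
      errors.push(`${provider.name}: ${err.message}`);
    }
  }

  // Last resort: the cloud provider, if enabled in the config.
  if (config.fallback_to_cloud) {
    return tryProvider({ name: config.cloud_provider }, prompt);
  }

  throw new Error(`All providers failed (${errors.join('; ')})`);
}
```

Sorting by fallback_order matches the config.json example above, where Ollama is tried first and vLLM last.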

Supported Models

Ollama

  • llama3.2, llama3.1, llama3
  • mistral, mixtral
  • qwen2.5, qwen2
  • phi3, phi4
  • gemma2, gemma
  • codellama
  • and many more

llama.cpp

  • Any GGUF format model
  • Mistral variants
  • Llama variants
  • Qwen variants

vLLM

  • Llama 3.1, 3.0
  • Mistral
  • Qwen
  • Any HuggingFace model

API Integration

As a library

```javascript
const { LocalLLMProvider } = require('./provider.js');

const provider = new LocalLLMProvider({
  providers: [
    { name: 'ollama', url: 'http://localhost:11434', enabled: true },
    { name: 'anthropic', api_key: process.env.ANTHROPIC_API_KEY, enabled: false }
  ],
  default_model: 'llama3.2',
  fallback_to_cloud: true
});

const response = await provider.complete('Hello, how are you?');
console.log(response);
```

Output Format

The query returns JSON:

```json
{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I'm doing well, thank you for asking!",
  "tokens": 42,
  "duration_ms": 1500,
  "done": true
}
```

When streaming:

```json
{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I",
  "tokens": 1,
  "done": false
}
```

On fallback failure:

```json
{
  "success": false,
  "error": "All providers failed",
  "providers_tried": ["ollama", "llamacpp"],
  "last_error": "Connection refused"
}
```
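
A caller can rely on this shape to branch on success. The sketch below invokes query.js from another Node script and inspects the result; it assumes query.js prints a single JSON object to stdout, which the documentation above does not explicitly guarantee.

```javascript
// Illustrative: run query.js as a child process and parse its JSON output.
const { execFileSync } = require('child_process');

const stdout = execFileSync('node', [
  '/job/.pi/skills/local-llm-provider/query.js',
  'What is 2+2?',
  '--model', 'llama3.2'
], { encoding: 'utf8' });

const result = JSON.parse(stdout);
if (result.success) {
  console.log(`[${result.provider}/${result.model}] ${result.response}`);
} else {
  // Fields from the "fallback failure" shape above.
  console.error(`Failed after ${result.providers_tried.join(', ')}: ${result.last_error}`);
}
```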

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| OLLAMA_BASE_URL | Ollama server URL | http://localhost:11434 |
| LLAMACPP_BASE_URL | llama.cpp server URL | http://localhost:8080/v1 |
| VLLM_BASE_URL | vLLM server URL | http://localhost:8000/v1 |
| LOCAL_LLM_DEFAULT_MODEL | Default model to use | llama3.2 |
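
The documentation does not state whether these variables take precedence over config.json; the sketch below simply shows one way to read them with the documented defaults.

```javascript
// Illustrative: resolve endpoint URLs and the default model from the
// environment, falling back to the defaults listed in the table above.
const endpoints = {
  ollama: process.env.OLLAMA_BASE_URL || 'http://localhost:11434',
  llamacpp: process.env.LLAMACPP_BASE_URL || 'http://localhost:8080/v1',
  vllm: process.env.VLLM_BASE_URL || 'http://localhost:8000/v1'
};
const defaultModel = process.env.LOCAL_LLM_DEFAULT_MODEL || 'llama3.2';

console.log(endpoints, defaultModel);
```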

Limitations

  • Requires a local server to be running
  • Model quality is constrained by local hardware (larger models need more memory and compute)
  • Not all models support all features (e.g., function calling)
  • Providers expose different API formats (Ollama's native API vs. OpenAI-compatible /v1 endpoints)

Tips

  1. For best performance: Use Ollama with GPU acceleration
  2. For variety: Pull multiple models (ollama pull mixtral)
  3. For privacy: Always use local providers first
  4. For reliability: Keep cloud fallback enabled for critical tasks
  5. For speed: Use smaller models (7B) for simple tasks, larger models (70B) for complex reasoning