Local LLM Provider

Connect to local LLM endpoints (Ollama, llama.cpp, vLLM) with automatic fallback to cloud providers. This skill enables the agent to leverage local GPU/CPU inference while maintaining reliability through intelligent fallback.

When to Use

  • Running LLM inference locally for privacy (data never leaves your machine)
  • Using models not available via cloud APIs (e.g., fine-tuned models, Llama variants)
  • Reducing API costs for high-volume tasks
  • Working offline or with intermittent connectivity
  • Needing low-latency responses for interactive tasks

Setup

No additional setup required if Ollama is already running. Otherwise:

Ollama Setup

```bash
# Install Ollama (download from https://ollama.com)

# Pull a model
ollama pull llama3.2

# Start the server (default: http://localhost:11434)
ollama serve
```

llama.cpp Server Setup

```bash
# Build llama-server
make llama-server

# Start the server
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 133000 --host 127.0.0.1 --port 8080
```

vLLM Server Setup

```bash
# Install vLLM
pip install vllm

# Start the server
vllm serve meta-llama/Llama-3.1-8B-Instruct
```

Usage

Query a local model

```bash
node /job/.pi/skills/local-llm-provider/query.js "What is 2+2?" --model llama3.2
```

Query with custom parameters

```bash
node /job/.pi/skills/local-llm-provider/query.js "Explain quantum computing" --model mixtral --temp 0.8 --max-tokens 500
```

List available models

```bash
node /job/.pi/skills/local-llm-provider/list-models.js
```

Check server health

```bash
node /job/.pi/skills/local-llm-provider/health.js
```

Stream responses

```bash
node /job/.pi/skills/local-llm-provider/query.js "Tell me a story" --stream
```

Configuration

Create a config.json in the skill directory for persistent settings:

```json
{
  "providers": [
    {
      "name": "ollama",
      "url": "http://localhost:11434",
      "enabled": true,
      "fallback_order": 1
    },
    {
      "name": "llamacpp",
      "url": "http://localhost:8080/v1",
      "enabled": false,
      "fallback_order": 2
    },
    {
      "name": "vllm",
      "url": "http://localhost:8000/v1",
      "enabled": false,
      "fallback_order": 3
    }
  ],
  "default_model": "llama3.2",
  "fallback_to_cloud": true,
  "cloud_provider": "anthropic",
  "timeout_ms": 120000
}
```
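
As a rough sketch of how these settings might be consumed, the snippet below loads config.json from the skill directory and falls back to defaults when the file is missing. loadConfig and DEFAULTS are illustrative names, not part of the skill's documented API.

```javascript
// Illustrative only: load config.json from the skill directory, falling back
// to defaults that mirror the example configuration above.
const fs = require('fs');
const path = require('path');

const DEFAULTS = {
  providers: [
    { name: 'ollama', url: 'http://localhost:11434', enabled: true, fallback_order: 1 }
  ],
  default_model: 'llama3.2',
  fallback_to_cloud: true,
  cloud_provider: 'anthropic',
  timeout_ms: 120000
};

function loadConfig(skillDir = __dirname) {
  const configPath = path.join(skillDir, 'config.json');
  if (!fs.existsSync(configPath)) return DEFAULTS;
  const userConfig = JSON.parse(fs.readFileSync(configPath, 'utf8'));
  // Shallow merge: user-provided keys override the defaults.
  return { ...DEFAULTS, ...userConfig };
}
```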

Provider Fallback

The skill implements intelligent fallback:
  1. Primary: Try local Ollama first
  2. Secondary: Try llama.cpp server
  3. Tertiary: Try vLLM server
  4. Fallback: Use cloud provider (if enabled)
Each provider failure triggers automatic retry with the next available provider.
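
To make the order concrete, a fallback loop over this configuration could look roughly like the sketch below. tryProvider is a hypothetical helper that sends the prompt to a single endpoint and throws on connection errors or timeouts; it is not part of the skill's documented API.

```javascript
// Illustrative sketch of the fallback order described above.
async function completeWithFallback(prompt, config, tryProvider) {
  // Try enabled local providers in ascending fallback_order.
  const candidates = config.providers
    .filter(p => p.enabled)
    .sort((a, b) => a.fallback_order - b.fallback_order);

  const errors = [];
  for (const provider of candidates) {
    try {
      return await tryProvider(provider, prompt); // first success wins
    } catch (err) {
      errors.push(`${provider.name}: ${err.message}`);
    }
  }

  // Last resort: the cloud provider, if enabled in the config.
  if (config.fallback_to_cloud) {
    return tryProvider({ name: config.cloud_provider }, prompt);
  }

  throw new Error(`All providers failed (${errors.join('; ')})`);
}
```

Sorting by fallback_order matches the config.json example above, where Ollama is tried first and vLLM last.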

Supported Models

Ollama

  • llama3.2, llama3.1, llama3
  • mistral, mixtral
  • qwen2.5, qwen2
  • phi3, phi4
  • gemma2, gemma
  • codellama
  • and many more

llama.cpp

  • Any GGUF format model
  • Mistral variants
  • Llama variants
  • Qwen variants

vLLM

  • Llama 3.1, 3.0
  • Mistral
  • Qwen
  • Any HuggingFace model

API Integration

As a library

```javascript
const { LocalLLMProvider } = require('./provider.js');

const provider = new LocalLLMProvider({
  providers: [
    { name: 'ollama', url: 'http://localhost:11434', enabled: true },
    { name: 'anthropic', api_key: process.env.ANTHROPIC_API_KEY, enabled: false }
  ],
  default_model: 'llama3.2',
  fallback_to_cloud: true
});

const response = await provider.complete('Hello, how are you?');
console.log(response);
```

Output Format

The query returns JSON:

```json
{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I'm doing well, thank you for asking!",
  "tokens": 42,
  "duration_ms": 1500,
  "done": true
}
```

When streaming:

```json
{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I",
  "tokens": 1,
  "done": false
}
```

On fallback failure:

```json
{
  "success": false,
  "error": "All providers failed",
  "providers_tried": ["ollama", "llamacpp"],
  "last_error": "Connection refused"
}
```
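
A caller can rely on this shape to branch on success. The sketch below invokes query.js from another Node script and inspects the result; it assumes query.js prints a single JSON object to stdout, which the documentation above does not explicitly guarantee.

```javascript
// Illustrative: run query.js as a child process and parse its JSON output.
const { execFileSync } = require('child_process');

const stdout = execFileSync('node', [
  '/job/.pi/skills/local-llm-provider/query.js',
  'What is 2+2?',
  '--model', 'llama3.2'
], { encoding: 'utf8' });

const result = JSON.parse(stdout);
if (result.success) {
  console.log(`[${result.provider}/${result.model}] ${result.response}`);
} else {
  // Fields from the "fallback failure" shape above.
  console.error(`Failed after ${result.providers_tried.join(', ')}: ${result.last_error}`);
}
```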

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| OLLAMA_BASE_URL | Ollama server URL | http://localhost:11434 |
| LLAMACPP_BASE_URL | llama.cpp server URL | http://localhost:8080/v1 |
| VLLM_BASE_URL | vLLM server URL | http://localhost:8000/v1 |
| LOCAL_LLM_DEFAULT_MODEL | Default model to use | llama3.2 |
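
The documentation does not state whether these variables take precedence over config.json; the sketch below simply shows one way to read them with the documented defaults.

```javascript
// Illustrative: resolve endpoint URLs and the default model from the
// environment, falling back to the defaults listed in the table above.
const endpoints = {
  ollama: process.env.OLLAMA_BASE_URL || 'http://localhost:11434',
  llamacpp: process.env.LLAMACPP_BASE_URL || 'http://localhost:8080/v1',
  vllm: process.env.VLLM_BASE_URL || 'http://localhost:8000/v1'
};
const defaultModel = process.env.LOCAL_LLM_DEFAULT_MODEL || 'llama3.2';

console.log(endpoints, defaultModel);
```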

Limitations

  • Requires a local server to be running
  • Model quality is constrained by local hardware (larger models need more memory and compute)
  • Not all models support all features (e.g., function calling)
  • Providers expose different API formats (Ollama's native API vs. OpenAI-compatible /v1 endpoints)

Tips

  1. For best performance: Use Ollama with GPU acceleration
  2. For variety: Pull multiple models (ollama pull mixtral)
  3. For privacy: Always use local providers first
  4. For reliability: Keep cloud fallback enabled for critical tasks
  5. For speed: Use smaller models (7B) for simple tasks, larger models (70B) for complex reasoning