# Local LLM Provider
Connect to local LLM endpoints (Ollama, llama.cpp, vLLM) with automatic fallback to cloud providers. This skill enables the agent to leverage local GPU/CPU inference while maintaining reliability through intelligent fallback.
## When to Use
- Running LLM inference locally for privacy (data never leaves your machine)
- Using models not available via cloud APIs (e.g., fine-tuned models, Llama variants)
- Reducing API costs for high-volume tasks
- Working offline or with intermittent connectivity
- Needing low-latency responses for interactive tasks
## Setup
No additional setup required if Ollama is already running. Otherwise:
### Ollama Setup
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.2

# Start the server (default: http://localhost:11434)
ollama serve
```
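To confirm the server is actually reachable before the skill queries it, you can hit Ollama's `/api/tags` endpoint, which lists every pulled model. A minimal sketch in plain Node 18+ (illustrative; not one of this skill's scripts):

```javascript
// check-ollama.js: verify Ollama is up and a model is pulled (illustrative, Node 18+)
const OLLAMA_URL = 'http://localhost:11434';

async function checkOllama(model) {
  const res = await fetch(`${OLLAMA_URL}/api/tags`); // Ollama's model-list endpoint
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const { models } = await res.json(); // { models: [{ name: "llama3.2:latest", ... }] }
  const found = models.some((m) => m.name.startsWith(model));
  console.log(found ? `${model} is available` : `${model} is not pulled yet`);
}

checkOllama('llama3.2').catch((err) => {
  console.error('Ollama is not reachable:', err.message);
  process.exit(1);
});
```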
### llama.cpp Server Setup
```bash
# Build llama-server
make llama-server

# Start the server
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 133000 --host 127.0.0.1 --port 8080
```
### vLLM Server Setup
```bash
# Install vLLM
pip install vllm

# Start the server
vllm serve meta-llama/Llama-3.1-8B-Instruct
```
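llama.cpp's server and vLLM both speak the OpenAI-compatible API, so one probe covers either: `GET /v1/models` returns what the server is serving. A minimal sketch (Node 18+; pass the base URL of whichever server you started):

```javascript
// check-openai-compat.js: list models from an OpenAI-compatible server (illustrative)
const BASE_URL = process.argv[2] || 'http://localhost:8080/v1'; // llama.cpp default; vLLM uses :8000/v1

async function listModels() {
  const res = await fetch(`${BASE_URL}/models`);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const body = await res.json(); // { object: "list", data: [{ id: "..." }, ...] }
  for (const m of body.data) console.log(m.id);
}

listModels().catch((err) => {
  console.error(`Server at ${BASE_URL} is not reachable:`, err.message);
  process.exit(1);
});
```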
## Usage
### Query a local model
```bash
node /job/.pi/skills/local-llm-provider/query.js "What is 2+2?" --model llama3.2
```

### Query with custom parameters
```bash
node /job/.pi/skills/local-llm-provider/query.js "Explain quantum computing" --model mixtral --temp 0.8 --max-tokens 500
```

### List available models
```bash
node /job/.pi/skills/local-llm-provider/list-models.js
```

### Check server health
```bash
node /job/.pi/skills/local-llm-provider/health.js
```

### Stream responses
```bash
node /job/.pi/skills/local-llm-provider/query.js "Tell me a story" --stream
```

## Configuration
Create a `config.json` in the skill directory for persistent settings:
```json
{
  "providers": [
    {
      "name": "ollama",
      "url": "http://localhost:11434",
      "enabled": true,
      "fallback_order": 1
    },
    {
      "name": "llamacpp",
      "url": "http://localhost:8080/v1",
      "enabled": false,
      "fallback_order": 2
    },
    {
      "name": "vllm",
      "url": "http://localhost:8000/v1",
      "enabled": false,
      "fallback_order": 3
    }
  ],
  "default_model": "llama3.2",
  "fallback_to_cloud": true,
  "cloud_provider": "anthropic",
  "timeout_ms": 120000
}
```
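`fallback_order` controls the order in which providers are tried, and `timeout_ms` bounds each attempt. A sketch of how a consumer might load and normalize this file (illustrative; the skill's actual loader may differ):

```javascript
// load-config.js: read config.json and order the enabled providers (illustrative)
const fs = require('fs');
const path = require('path');

function loadConfig(dir) {
  const raw = fs.readFileSync(path.join(dir, 'config.json'), 'utf8');
  const config = JSON.parse(raw);
  // Keep only enabled providers, tried in ascending fallback_order
  config.providers = config.providers
    .filter((p) => p.enabled)
    .sort((a, b) => a.fallback_order - b.fallback_order);
  return config;
}

console.log(loadConfig(__dirname).providers.map((p) => p.name)); // e.g. [ 'ollama' ]
```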
## Provider Fallback
The skill implements intelligent fallback:
- Primary: Try local Ollama first
- Secondary: Try llama.cpp server
- Tertiary: Try vLLM server
- Fallback: Use cloud provider (if enabled)
Each provider failure triggers automatic retry with the next available provider.
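In code, that loop amounts to trying each provider in order, aborting attempts that exceed `timeout_ms`, and reporting the last error only when every provider has failed. A minimal sketch of the idea (hypothetical names and endpoint shape, not the skill's actual internals):

```javascript
// fallback.js: try providers in order until one answers (illustrative, Node 18+)
async function completeWithFallback(providers, prompt, timeoutMs = 120000) {
  const tried = [];
  let lastError = null;

  for (const provider of providers) { // already filtered and sorted by fallback_order
    tried.push(provider.name);
    try {
      const res = await fetch(`${provider.url}/api/generate`, { // endpoint shape varies per provider
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ model: provider.model, prompt, stream: false }),
        signal: AbortSignal.timeout(timeoutMs), // enforce the per-attempt timeout
      });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return { success: true, provider: provider.name, response: await res.json() };
    } catch (err) {
      lastError = err; // fall through to the next provider
    }
  }
  return {
    success: false,
    error: 'All providers failed',
    providers_tried: tried,
    last_error: String(lastError),
  };
}
```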
## Supported Models

### Ollama
- llama3.2, llama3.1, llama3
- mistral, mixtral
- qwen2.5, qwen2
- phi3, phi4
- gemma2, gemma
- codellama
- and many more
### llama.cpp
- Any GGUF format model
- Mistral variants
- Llama variants
- Qwen variants
### vLLM
- Llama 3.1, 3.0
- Mistral
- Qwen
- Any HuggingFace model
## API Integration

### As a library
```javascript
const { LocalLLMProvider } = require('./provider.js');

const provider = new LocalLLMProvider({
  providers: [
    { name: 'ollama', url: 'http://localhost:11434', enabled: true },
    { name: 'anthropic', api_key: process.env.ANTHROPIC_API_KEY, enabled: false }
  ],
  default_model: 'llama3.2',
  fallback_to_cloud: true
});

// Wrap in an async function: top-level await is unavailable in CommonJS modules
(async () => {
  const response = await provider.complete('Hello, how are you?');
  console.log(response);
})();
```
## Output Format
The query returns JSON:
```json
{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I'm doing well, thank you for asking!",
  "tokens": 42,
  "duration_ms": 1500,
  "done": true
}
```

When streaming:
```json
{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I",
  "tokens": 1,
  "done": false
}
```

On fallback failure:
```json
{
  "success": false,
  "error": "All providers failed",
  "providers_tried": ["ollama", "llamacpp"],
  "last_error": "Connection refused"
}
```
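Since every outcome is a single JSON object, callers can shell out to `query.js` and branch on `success`. A sketch of such a wrapper (hypothetical caller, not shipped with the skill):

```javascript
// run-query.js: invoke query.js and branch on its JSON result (illustrative)
const { execFile } = require('child_process');

const args = ['/job/.pi/skills/local-llm-provider/query.js', 'What is 2+2?', '--model', 'llama3.2'];
execFile('node', args, (err, stdout) => {
  if (err) return console.error('query.js failed to run:', err.message);
  const result = JSON.parse(stdout);
  if (result.success) {
    console.log(`[${result.provider}/${result.model}] ${result.response}`);
  } else {
    console.error(`Tried ${result.providers_tried.join(', ')}; last error: ${result.last_error}`);
  }
});
```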
## Environment Variables
| Variable | Description | Default |
|---|---|---|
| | Ollama server URL | `http://localhost:11434` |
| | llama.cpp server URL | `http://localhost:8080/v1` |
| | vLLM server URL | `http://localhost:8000/v1` |
| | Default model to use | `llama3.2` |
## Limitations
- Requires local server to be running
- Model quality depends on local hardware
- Not all models support all features (e.g., function calling)
- Some providers have different API formats
## Tips
- For best performance: Use Ollama with GPU acceleration
- For variety: Pull multiple models (`ollama pull mixtral`)
- For privacy: Always use local providers first
- For reliability: Keep cloud fallback enabled for critical tasks
- For speed: Use smaller models (7B) for simple tasks, larger ones (70B) for complex reasoning