# Ollama Local Inference

Run LLMs locally for cost savings, privacy, and offline development.

## Quick Start

```bash
# Install Ollama (official install script)
curl -fsSL https://ollama.com/install.sh | sh

# Pull models
ollama pull deepseek-r1:70b      # Reasoning (GPT-4 level)
ollama pull qwen2.5-coder:32b    # Coding
ollama pull nomic-embed-text     # Embeddings

# Start server
ollama serve
```
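With the server up, the HTTP API can be exercised directly before wiring in any framework. A minimal standard-library sketch; the host, port, and `/api/generate` request shape follow Ollama's defaults, and the model name assumes `qwen2.5-coder:32b` has been pulled:

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # Ollama's default port

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(prompt: str, model: str = "qwen2.5-coder:32b") -> str:
    """Send a one-shot (non-streaming) generation request, return the text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_HOST}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With `"stream": false`, the server returns a single JSON object whose `response` field holds the full completion.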

## Recommended Models (M4 Max 256GB)

| Task       | Model               | Size   | Notes                 |
|------------|---------------------|--------|-----------------------|
| Reasoning  | `deepseek-r1:70b`   | ~42GB  | GPT-4 level           |
| Coding     | `qwen2.5-coder:32b` | ~35GB  | 73.7% Aider benchmark |
| Embeddings | `nomic-embed-text`  | ~0.5GB | 768 dims, fast        |
| General    | `llama3.3:70b`      | ~40GB  | Good all-around       |
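The sizes above determine how many models can stay resident at once. A quick sanity check, using the approximate sizes from the table and assuming the 256GB unified-memory budget:

```python
# Approximate loaded sizes from the table above (GB).
MODEL_SIZES_GB = {
    "deepseek-r1:70b": 42,
    "qwen2.5-coder:32b": 35,
    "nomic-embed-text": 0.5,
    "llama3.3:70b": 40,
}

def fits_in_memory(models: list[str], budget_gb: float = 256) -> bool:
    """Check whether keeping these models loaded stays within the memory budget."""
    return sum(MODEL_SIZES_GB[m] for m in models) <= budget_gb

# All four recommended models total ~117.5GB, comfortably under 256GB:
print(fits_in_memory(list(MODEL_SIZES_GB)))  # True
```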

## LangChain Integration

```python
from langchain_ollama import ChatOllama, OllamaEmbeddings

# Chat model
llm = ChatOllama(
    model="deepseek-r1:70b",
    base_url="http://localhost:11434",
    temperature=0.0,
    num_ctx=32768,      # Context window
    keep_alive="5m",    # Keep model loaded
)

# Embeddings
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434",
)

# Generate
response = await llm.ainvoke("Explain async/await")
vector = await embeddings.aembed_query("search text")
```

## Tool Calling with Ollama

```python
from langchain_core.tools import tool

@tool
def search_docs(query: str) -> str:
    """Search the document database."""
    return f"Found results for: {query}"

# Bind tools
llm_with_tools = llm.bind_tools([search_docs])
response = await llm_with_tools.ainvoke("Search for Python patterns")
```
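`bind_tools` only advertises the tools; executing whatever the model asks for is up to the caller. A sketch of that dispatch step, assuming LangChain's `response.tool_calls` shape of `[{"name": ..., "args": ...}]` and using a plain callable as a stand-in tool:

```python
def run_tool_calls(tool_calls: list[dict], tools: dict) -> list:
    """Execute each tool call the model requested; `tools` maps name -> callable."""
    results = []
    for call in tool_calls:
        handler = tools[call["name"]]            # look up the tool by name
        results.append(handler(**call["args"]))  # invoke with the model's arguments
    return results

# Stand-in for the bound search_docs tool:
tools = {"search_docs": lambda query: f"Found results for: {query}"}
calls = [{"name": "search_docs", "args": {"query": "Python patterns"}}]
print(run_tool_calls(calls, tools))  # ['Found results for: Python patterns']
```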

## Structured Output

```python
from pydantic import BaseModel, Field

class CodeAnalysis(BaseModel):
    language: str = Field(description="Programming language")
    complexity: int = Field(ge=1, le=10)
    issues: list[str] = Field(description="Found issues")

structured_llm = llm.with_structured_output(CodeAnalysis)
result = await structured_llm.ainvoke("Analyze this code: ...")
# result is a typed CodeAnalysis object
```

## Provider Factory Pattern

```python
import os

from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI

def get_llm_provider(task_type: str = "general"):
    """Auto-switch between Ollama and cloud APIs."""
    if os.getenv("OLLAMA_ENABLED") == "true":
        models = {
            "reasoning": "deepseek-r1:70b",
            "coding": "qwen2.5-coder:32b",
            "general": "llama3.3:70b",
        }
        return ChatOllama(
            model=models.get(task_type, "llama3.3:70b"),
            keep_alive="5m",
        )
    else:
        # Fall back to cloud API
        return ChatOpenAI(model="gpt-5.2")

# Usage
llm = get_llm_provider(task_type="coding")
```

## Environment Configuration

```bash
# .env.local
OLLAMA_ENABLED=true
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL_REASONING=deepseek-r1:70b
OLLAMA_MODEL_CODING=qwen2.5-coder:32b
OLLAMA_MODEL_EMBED=nomic-embed-text

# Performance tuning (Apple Silicon)
OLLAMA_MAX_LOADED_MODELS=3    # Keep 3 models in memory
OLLAMA_KEEP_ALIVE=5m          # 5 minute keep-alive
```
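These variables can feed the provider factory instead of hard-coded model names. A sketch, assuming the `.env.local` values are already exported into the process environment; `model_from_env` is a hypothetical helper, not part of any library, and its fallback defaults mirror the values above:

```python
import os

def model_from_env(task_type: str) -> str:
    """Resolve a task type to a model name via the environment, with defaults."""
    env_keys = {
        "reasoning": ("OLLAMA_MODEL_REASONING", "deepseek-r1:70b"),
        "coding": ("OLLAMA_MODEL_CODING", "qwen2.5-coder:32b"),
        "embed": ("OLLAMA_MODEL_EMBED", "nomic-embed-text"),
    }
    key, default = env_keys.get(task_type, env_keys["reasoning"])
    return os.getenv(key, default)

print(model_from_env("coding"))  # qwen2.5-coder:32b unless OLLAMA_MODEL_CODING is set
```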

## CI Integration

```yaml
# GitHub Actions (self-hosted runner)
jobs:
  test:
    runs-on: self-hosted  # M4 Max runner
    env:
      OLLAMA_ENABLED: "true"
    steps:
      - name: Pre-warm models
        run: |
          curl -s http://localhost:11434/api/embeddings \
            -d '{"model":"nomic-embed-text","prompt":"warmup"}' > /dev/null
      - name: Run tests
        run: pytest tests/
```

## Cost Comparison

| Provider     | Monthly Cost       | Latency     |
|--------------|--------------------|-------------|
| Cloud APIs   | ~$675/month        | 200-500ms   |
| Ollama Local | ~$50 (electricity) | 50-200ms    |
| Savings      | 93%                | 2-3x faster |
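The savings row follows directly from the two cost figures:

```python
cloud_monthly = 675.0  # ~$675/month for cloud APIs
local_monthly = 50.0   # ~$50/month electricity for local inference

# (675 - 50) / 675 = 92.6%, which rounds to the 93% in the table.
savings_pct = round((cloud_monthly - local_monthly) / cloud_monthly * 100)
print(f"{savings_pct}% saved")  # 93% saved
```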

## Best Practices

- **DO** use `keep_alive="5m"` in CI (avoid cold starts)
- **DO** pre-warm models before the first call
- **DO** set `num_ctx=32768` on Apple Silicon
- **DO** use a provider factory for cloud/local switching
- **DON'T** use `keep_alive=-1` (wastes memory)
- **DON'T** skip pre-warming in CI (30-60s cold start)
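Pre-warming appears in both the CI snippet and the DOs above. A standard-library sketch of the same warm-up call; the default host and the `/api/embeddings` request body with `model` and `prompt` fields follow Ollama's API, and `prewarm_request` itself is a hypothetical helper:

```python
import json
import urllib.request

def prewarm_request(model: str = "nomic-embed-text",
                    host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build the throwaway embeddings request that forces Ollama to load a model."""
    body = json.dumps({"model": model, "prompt": "warmup"}).encode()
    return urllib.request.Request(
        f"{host}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def prewarm(model: str = "nomic-embed-text") -> None:
    """Fire the request and discard the result; loading the model is the point."""
    urllib.request.urlopen(prewarm_request(model)).read()
```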

## Troubleshooting

```bash
# Check if Ollama is running
curl -s http://localhost:11434/api/tags

# List installed models
ollama list

# Show loaded models and memory usage
ollama ps

# Pull a specific quantization variant
ollama pull deepseek-r1:70b-q4_K_M
```
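The same checks can be scripted. `/api/tags` is the endpoint behind `ollama list`; a sketch that assumes its documented `{"models": [{"name": ...}]}` response shape:

```python
import json
import urllib.request

def parse_model_names(tags_response: dict) -> list[str]:
    """Extract model names from an /api/tags payload."""
    return [m["name"] for m in tags_response.get("models", [])]

def installed_models(host: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models are installed."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return parse_model_names(json.load(resp))
```

If the `urlopen` call raises a connection error, the server is not running; start it with `ollama serve`.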

## Related Skills

- `embeddings` - Embedding patterns (works with nomic-embed-text)
- `llm-evaluation` - Testing with local models
- `cost-optimization` - Broader cost strategies

## Capability Details

### setup

Keywords: setup, install, configure, ollama

Solves:
- Set up Ollama locally
- Configure for development
- Install models

### model-selection

Keywords: model, llama, mistral, qwen, selection

Solves:
- Choose an appropriate model
- Compare model capabilities
- Balance speed vs. quality

### provider-template

Keywords: provider, template, python, implementation

Solves:
- Ollama provider template
- Python implementation
- Drop-in LLM provider