# Ollama Local Inference

Run LLMs locally for cost savings, privacy, and offline development.

## Quick Start

```bash
# Install Ollama (official install script)
curl -fsSL https://ollama.com/install.sh | sh

# Pull models
ollama pull deepseek-r1:70b      # Reasoning (GPT-4 level)
ollama pull qwen2.5-coder:32b    # Coding
ollama pull nomic-embed-text     # Embeddings

# Start server
ollama serve
```
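With the server up, the HTTP API can be exercised directly before wiring in any framework. A minimal standard-library sketch; the host, port, and `/api/generate` request shape follow Ollama's defaults, and the model name assumes `qwen2.5-coder:32b` has been pulled:

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # Ollama's default port

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(prompt: str, model: str = "qwen2.5-coder:32b") -> str:
    """Send a one-shot (non-streaming) generation request, return the text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_HOST}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With `"stream": false`, the server returns a single JSON object whose `response` field holds the full completion.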

## Recommended Models (M4 Max 256GB)

| Task       | Model               | Size   | Notes                 |
|------------|---------------------|--------|-----------------------|
| Reasoning  | `deepseek-r1:70b`   | ~42GB  | GPT-4 level           |
| Coding     | `qwen2.5-coder:32b` | ~35GB  | 73.7% Aider benchmark |
| Embeddings | `nomic-embed-text`  | ~0.5GB | 768 dims, fast        |
| General    | `llama3.3:70b`      | ~40GB  | Good all-around       |
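The sizes above determine how many models can stay resident at once. A quick sanity check, using the approximate sizes from the table and assuming the 256GB unified-memory budget:

```python
# Approximate loaded sizes from the table above (GB).
MODEL_SIZES_GB = {
    "deepseek-r1:70b": 42,
    "qwen2.5-coder:32b": 35,
    "nomic-embed-text": 0.5,
    "llama3.3:70b": 40,
}

def fits_in_memory(models: list[str], budget_gb: float = 256) -> bool:
    """Check whether keeping these models loaded stays within the memory budget."""
    return sum(MODEL_SIZES_GB[m] for m in models) <= budget_gb

# All four recommended models total ~117.5GB, comfortably under 256GB:
print(fits_in_memory(list(MODEL_SIZES_GB)))  # True
```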

## LangChain Integration

```python
from langchain_ollama import ChatOllama, OllamaEmbeddings

# Chat model
llm = ChatOllama(
    model="deepseek-r1:70b",
    base_url="http://localhost:11434",
    temperature=0.0,
    num_ctx=32768,      # Context window
    keep_alive="5m",    # Keep model loaded
)

# Embeddings
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434",
)

# Generate
response = await llm.ainvoke("Explain async/await")
vector = await embeddings.aembed_query("search text")
```

## Tool Calling with Ollama

```python
from langchain_core.tools import tool

@tool
def search_docs(query: str) -> str:
    """Search the document database."""
    return f"Found results for: {query}"

# Bind tools
llm_with_tools = llm.bind_tools([search_docs])
response = await llm_with_tools.ainvoke("Search for Python patterns")
```
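`bind_tools` only advertises the tools; executing whatever the model asks for is up to the caller. A sketch of that dispatch step, assuming LangChain's `response.tool_calls` shape of `[{"name": ..., "args": ...}]` and using a plain callable as a stand-in tool:

```python
def run_tool_calls(tool_calls: list[dict], tools: dict) -> list:
    """Execute each tool call the model requested; `tools` maps name -> callable."""
    results = []
    for call in tool_calls:
        handler = tools[call["name"]]            # look up the tool by name
        results.append(handler(**call["args"]))  # invoke with the model's arguments
    return results

# Stand-in for the bound search_docs tool:
tools = {"search_docs": lambda query: f"Found results for: {query}"}
calls = [{"name": "search_docs", "args": {"query": "Python patterns"}}]
print(run_tool_calls(calls, tools))  # ['Found results for: Python patterns']
```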

## Structured Output

```python
from pydantic import BaseModel, Field

class CodeAnalysis(BaseModel):
    language: str = Field(description="Programming language")
    complexity: int = Field(ge=1, le=10)
    issues: list[str] = Field(description="Found issues")

structured_llm = llm.with_structured_output(CodeAnalysis)
result = await structured_llm.ainvoke("Analyze this code: ...")
# result is a typed CodeAnalysis object
```

## Provider Factory Pattern

```python
import os

from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI

def get_llm_provider(task_type: str = "general"):
    """Auto-switch between Ollama and cloud APIs."""
    if os.getenv("OLLAMA_ENABLED") == "true":
        models = {
            "reasoning": "deepseek-r1:70b",
            "coding": "qwen2.5-coder:32b",
            "general": "llama3.3:70b",
        }
        return ChatOllama(
            model=models.get(task_type, "llama3.3:70b"),
            keep_alive="5m",
        )
    else:
        # Fall back to cloud API
        return ChatOpenAI(model="gpt-5.2")

# Usage
llm = get_llm_provider(task_type="coding")
```

## Environment Configuration

```bash
# .env.local
OLLAMA_ENABLED=true
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL_REASONING=deepseek-r1:70b
OLLAMA_MODEL_CODING=qwen2.5-coder:32b
OLLAMA_MODEL_EMBED=nomic-embed-text

# Performance tuning (Apple Silicon)
OLLAMA_MAX_LOADED_MODELS=3    # Keep 3 models in memory
OLLAMA_KEEP_ALIVE=5m          # 5 minute keep-alive
```
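These variables can feed the provider factory instead of hard-coded model names. A sketch, assuming the `.env.local` values are already exported into the process environment; `model_from_env` is a hypothetical helper, not part of any library, and its fallback defaults mirror the values above:

```python
import os

def model_from_env(task_type: str) -> str:
    """Resolve a task type to a model name via the environment, with defaults."""
    env_keys = {
        "reasoning": ("OLLAMA_MODEL_REASONING", "deepseek-r1:70b"),
        "coding": ("OLLAMA_MODEL_CODING", "qwen2.5-coder:32b"),
        "embed": ("OLLAMA_MODEL_EMBED", "nomic-embed-text"),
    }
    key, default = env_keys.get(task_type, env_keys["reasoning"])
    return os.getenv(key, default)

print(model_from_env("coding"))  # qwen2.5-coder:32b unless OLLAMA_MODEL_CODING is set
```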

## CI Integration

```yaml
# GitHub Actions (self-hosted runner)
jobs:
  test:
    runs-on: self-hosted  # M4 Max runner
    env:
      OLLAMA_ENABLED: "true"
    steps:
      - name: Pre-warm models
        run: |
          curl -s http://localhost:11434/api/embeddings \
            -d '{"model":"nomic-embed-text","prompt":"warmup"}' > /dev/null
      - name: Run tests
        run: pytest tests/
```

## Cost Comparison

| Provider     | Monthly Cost       | Latency     |
|--------------|--------------------|-------------|
| Cloud APIs   | ~$675/month        | 200-500ms   |
| Ollama Local | ~$50 (electricity) | 50-200ms    |
| Savings      | 93%                | 2-3x faster |
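The savings row follows directly from the two cost figures:

```python
cloud_monthly = 675.0  # ~$675/month for cloud APIs
local_monthly = 50.0   # ~$50/month electricity for local inference

# (675 - 50) / 675 = 92.6%, which rounds to the 93% in the table.
savings_pct = round((cloud_monthly - local_monthly) / cloud_monthly * 100)
print(f"{savings_pct}% saved")  # 93% saved
```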

## Best Practices

- **DO** use `keep_alive="5m"` in CI (avoid cold starts)
- **DO** pre-warm models before the first call
- **DO** set `num_ctx=32768` on Apple Silicon
- **DO** use a provider factory for cloud/local switching
- **DON'T** use `keep_alive=-1` (wastes memory)
- **DON'T** skip pre-warming in CI (30-60s cold start)
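Pre-warming appears in both the CI snippet and the DOs above. A standard-library sketch of the same warm-up call; the default host and the `/api/embeddings` request body with `model` and `prompt` fields follow Ollama's API, and `prewarm_request` itself is a hypothetical helper:

```python
import json
import urllib.request

def prewarm_request(model: str = "nomic-embed-text",
                    host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build the throwaway embeddings request that forces Ollama to load a model."""
    body = json.dumps({"model": model, "prompt": "warmup"}).encode()
    return urllib.request.Request(
        f"{host}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def prewarm(model: str = "nomic-embed-text") -> None:
    """Fire the request and discard the result; loading the model is the point."""
    urllib.request.urlopen(prewarm_request(model)).read()
```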

## Troubleshooting

```bash
# Check if Ollama is running
curl -s http://localhost:11434/api/tags

# List installed models
ollama list

# Show loaded models and memory usage
ollama ps

# Pull a specific quantization variant
ollama pull deepseek-r1:70b-q4_K_M
```
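The same checks can be scripted. `/api/tags` is the endpoint behind `ollama list`; a sketch that assumes its documented `{"models": [{"name": ...}]}` response shape:

```python
import json
import urllib.request

def parse_model_names(tags_response: dict) -> list[str]:
    """Extract model names from an /api/tags payload."""
    return [m["name"] for m in tags_response.get("models", [])]

def installed_models(host: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models are installed."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return parse_model_names(json.load(resp))
```

If the `urlopen` call raises a connection error, the server is not running; start it with `ollama serve`.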

## Related Skills

- `embeddings` - Embedding patterns (works with nomic-embed-text)
- `llm-evaluation` - Testing with local models
- `cost-optimization` - Broader cost strategies

## Capability Details

### setup

Keywords: setup, install, configure, ollama

Solves:
- Set up Ollama locally
- Configure for development
- Install models

### model-selection

Keywords: model, llama, mistral, qwen, selection

Solves:
- Choose an appropriate model
- Compare model capabilities
- Balance speed vs. quality

### provider-template

Keywords: provider, template, python, implementation

Solves:
- Ollama provider template
- Python implementation
- Drop-in LLM provider