ollama-local

Local LLM inference with Ollama. Use when setting up local models for development, CI pipelines, or cost reduction. Covers model selection, LangChain integration, and performance tuning.

NPX Install

npx skill4agent add yonatangross/orchestkit ollama-local

Ollama Local Inference

Run LLMs locally for cost savings, privacy, and offline development.

Quick Start

bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull models
ollama pull deepseek-r1:70b      # Reasoning (GPT-4 level)
ollama pull qwen2.5-coder:32b    # Coding
ollama pull nomic-embed-text     # Embeddings

# Start server
ollama serve
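
To confirm the server responds before wiring up any tooling, a quick smoke test against the HTTP API (a minimal sketch assuming the requests package is installed; qwen2.5-coder:32b is just one of the models pulled above, use whichever you have locally):

python
import requests

# Minimal smoke test against the local Ollama server
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen2.5-coder:32b", "prompt": "Say hello", "stream": False},
    timeout=120,
)
r.raise_for_status()
print(r.json()["response"])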

Recommended Models (M4 Max 256GB)

| Task | Model | Size | Notes |
|------|-------|------|-------|
| Reasoning | deepseek-r1:70b | ~42GB | GPT-4 level |
| Coding | qwen2.5-coder:32b | ~35GB | 73.7% Aider benchmark |
| Embeddings | nomic-embed-text | ~0.5GB | 768 dims, fast |
| General | llama3.3:70b | ~40GB | Good all-around |

LangChain Integration

python
from langchain_ollama import ChatOllama, OllamaEmbeddings

# Chat model
llm = ChatOllama(
    model="deepseek-r1:70b",
    base_url="http://localhost:11434",
    temperature=0.0,
    num_ctx=32768,      # Context window
    keep_alive="5m",    # Keep model loaded
)

# Embeddings
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434",
)

# Generate
response = await llm.ainvoke("Explain async/await")
vector = await embeddings.aembed_query("search text")

Tool Calling with Ollama

python
from langchain_core.tools import tool

@tool
def search_docs(query: str) -> str:
    """Search the document database."""
    return f"Found results for: {query}"

# Bind tools
llm_with_tools = llm.bind_tools([search_docs])
response = await llm_with_tools.ainvoke("Search for Python patterns")
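
bind_tools only exposes the tool schema to the model; any requested calls come back on response.tool_calls, and you still execute them yourself. A minimal dispatch sketch (whether a local model emits tool calls at all depends on the model, e.g. llama3.3 and qwen2.5 support tool use; reuses response and search_docs from the block above):

python
# Execute whatever tool calls the model requested
tools_by_name = {"search_docs": search_docs}

for tool_call in response.tool_calls:
    tool = tools_by_name[tool_call["name"]]
    result = tool.invoke(tool_call["args"])  # run the tool locally
    print(f"{tool_call['name']} -> {result}")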

Structured Output

python
from pydantic import BaseModel, Field

class CodeAnalysis(BaseModel):
    language: str = Field(description="Programming language")
    complexity: int = Field(ge=1, le=10)
    issues: list[str] = Field(description="Found issues")

structured_llm = llm.with_structured_output(CodeAnalysis)
result = await structured_llm.ainvoke("Analyze this code: ...")
# result is typed CodeAnalysis object

Provider Factory Pattern

python
import os

from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI

def get_llm_provider(task_type: str = "general"):
    """Auto-switch between Ollama and cloud APIs."""
    if os.getenv("OLLAMA_ENABLED") == "true":
        models = {
            "reasoning": "deepseek-r1:70b",
            "coding": "qwen2.5-coder:32b",
            "general": "llama3.3:70b",
        }
        return ChatOllama(
            model=models.get(task_type, "llama3.3:70b"),
            keep_alive="5m"
        )
    else:
        # Fall back to cloud API
        return ChatOpenAI(model="gpt-5.2")

# Usage
llm = get_llm_provider(task_type="coding")

Environment Configuration

bash
# .env.local
OLLAMA_ENABLED=true
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL_REASONING=deepseek-r1:70b
OLLAMA_MODEL_CODING=qwen2.5-coder:32b
OLLAMA_MODEL_EMBED=nomic-embed-text

# Performance tuning (Apple Silicon)
OLLAMA_MAX_LOADED_MODELS=3    # Keep 3 models in memory
OLLAMA_KEEP_ALIVE=5m          # 5 minute keep-alive
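
The same variables can drive the provider factory above instead of hard-coded model names. A minimal sketch (assumes the .env.local values are already loaded into the process environment, e.g. via python-dotenv):

python
import os

from langchain_ollama import ChatOllama

def get_ollama_llm(task_type: str = "general") -> ChatOllama:
    """Resolve the model for a task from the OLLAMA_MODEL_* variables."""
    env_keys = {
        "reasoning": "OLLAMA_MODEL_REASONING",
        "coding": "OLLAMA_MODEL_CODING",
    }
    # Fall back to a general-purpose model if the variable is unset
    model = os.getenv(env_keys.get(task_type, ""), "llama3.3:70b")
    return ChatOllama(
        model=model,
        base_url=os.getenv("OLLAMA_HOST", "http://localhost:11434"),
        keep_alive=os.getenv("OLLAMA_KEEP_ALIVE", "5m"),
    )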

CI Integration

yaml
# GitHub Actions (self-hosted runner)
jobs:
  test:
    runs-on: self-hosted  # M4 Max runner
    env:
      OLLAMA_ENABLED: "true"
    steps:
      - name: Pre-warm models
        run: |
          curl -s http://localhost:11434/api/embeddings \
            -d '{"model":"nomic-embed-text","prompt":"warmup"}' > /dev/null

      - name: Run tests
        run: pytest tests/

Cost Comparison

| Provider | Monthly Cost | Latency |
|----------|--------------|---------|
| Cloud APIs | ~$675/month | 200-500ms |
| Ollama Local | ~$50 (electricity) | 50-200ms |
| Savings | 93% | 2-3x faster |

Best Practices

  • DO use keep_alive="5m" in CI (avoids cold starts)
  • DO pre-warm models before the first call (see the sketch below)
  • DO set num_ctx=32768 on Apple Silicon
  • DO use the provider factory for cloud/local switching
  • DON'T use keep_alive=-1 (wastes memory)
  • DON'T skip pre-warming in CI (30-60s cold start)
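
For the pre-warming point above, one way to load a model ahead of the first real request is an empty-prompt call to /api/generate, which Ollama treats as a load request. A minimal sketch:

python
import requests

def prewarm(model: str, host: str = "http://localhost:11434") -> None:
    """Ask Ollama to load `model` into memory before the first real call."""
    # An empty prompt tells /api/generate to just load the model
    requests.post(
        f"{host}/api/generate",
        json={"model": model, "keep_alive": "5m"},
        timeout=600,
    ).raise_for_status()

# e.g. before the test suite starts
prewarm("deepseek-r1:70b")
prewarm("qwen2.5-coder:32b")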

Troubleshooting

bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# List loaded models
ollama list

# Check model memory usage
ollama ps

# Pull specific version
ollama pull deepseek-r1:70b-q4_K_M

Related Skills

  • embeddings - Embedding patterns (works with nomic-embed-text)
  • llm-evaluation - Testing with local models
  • cost-optimization - Broader cost strategies

Capability Details

setup

Keywords: setup, install, configure, ollama
Solves:
  • Set up Ollama locally
  • Configure for development
  • Install models

model-selection

Keywords: model, llama, mistral, qwen, selection
Solves:
  • Choose appropriate model
  • Compare model capabilities
  • Balance speed vs quality

provider-template

Keywords: provider, template, python, implementation
Solves:
  • Ollama provider template
  • Python implementation
  • Drop-in LLM provider
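
As a rough illustration of what such a drop-in provider might look like (class and method names below are hypothetical, not from an existing package), built on the same ChatOllama settings used earlier:

python
from langchain_ollama import ChatOllama

class OllamaProvider:
    """Hypothetical drop-in provider wrapping ChatOllama."""

    def __init__(self, model: str = "llama3.3:70b",
                 base_url: str = "http://localhost:11434"):
        self._llm = ChatOllama(model=model, base_url=base_url,
                               temperature=0.0, keep_alive="5m")

    async def complete(self, prompt: str) -> str:
        # Return just the text content of the reply
        response = await self._llm.ainvoke(prompt)
        return response.content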