llama.cpp

Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.

When to use llama.cpp


Use llama.cpp when:
  • Running on CPU-only machines
  • Deploying on Apple Silicon (M1/M2/M3/M4)
  • Using AMD or Intel GPUs (no CUDA)
  • Edge deployment (Raspberry Pi, embedded systems)
  • Need simple deployment without Docker/Python
Use TensorRT-LLM instead when:
  • Have NVIDIA GPUs (A100/H100)
  • Need maximum throughput (100K+ tok/s)
  • Running in datacenter with CUDA
Use vLLM instead when:
  • Have NVIDIA GPUs
  • Need Python-first API
  • Want PagedAttention

Quick start


Installation


```bash
# macOS/Linux
brew install llama.cpp

# Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# With Metal (Apple Silicon)
make LLAMA_METAL=1

# With CUDA (NVIDIA)
make LLAMA_CUDA=1

# With ROCm (AMD)
make LLAMA_HIP=1
```

Download model

```bash
# Download from HuggingFace (GGUF format)
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/

# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
```

Run inference

```bash
# Simple chat
./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -p "Explain quantum computing" \
  -n 256   # Max tokens

# Interactive chat
./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --interactive
```

Server mode

```bash
# Start OpenAI-compatible server
./llama-server \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 32   # Offload 32 layers to GPU

# Client request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```

Quantization formats


GGUF format overview

| Format | Bits | Size (7B) | Speed   | Quality   | Use case            |
|--------|------|-----------|---------|-----------|---------------------|
| Q4_K_M | 4.5  | 4.1 GB    | Fast    | Good      | Recommended default |
| Q4_K_S | 4.3  | 3.9 GB    | Faster  | Lower     | Speed critical      |
| Q5_K_M | 5.5  | 4.8 GB    | Medium  | Better    | Quality critical    |
| Q6_K   | 6.5  | 5.5 GB    | Slower  | Best      | Maximum quality     |
| Q8_0   | 8.0  | 7.0 GB    | Slow    | Excellent | Minimal degradation |
| Q2_K   | 2.5  | 2.7 GB    | Fastest | Poor      | Testing only        |
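The Bits column largely determines file size: a back-of-the-envelope estimate is parameters (in billions) × bits-per-weight ÷ 8 gigabytes, plus a little metadata overhead. A small shell sketch (the `estimate_gb` helper is illustrative, not part of llama.cpp):

```shell
# Rough GGUF size: params (billions) x bits-per-weight / 8 -> GB.
# Ignores metadata/tokenizer overhead, so real files run slightly larger.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 7 4.5    # 7B at Q4_K_M: prints 3.9 (table shows 4.1 GB with overhead)
estimate_gb 70 4.5   # 70B at Q4_K_M: prints 39.4
```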

Choosing quantization

```bash
# General use (balanced)
Q4_K_M             # 4-bit, medium quality

# Maximum speed (more degradation)
Q2_K or Q3_K_M

# Maximum quality (slower)
Q6_K or Q8_0

# Very large models (70B, 405B)
Q3_K_M or Q4_K_S   # Lower bits to fit in memory
```

Hardware acceleration


Apple Silicon (Metal)

```bash
# Build with Metal
make LLAMA_METAL=1

# Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999   # Offload all layers
```

Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)


NVIDIA GPUs (CUDA)

```bash
# Build with CUDA
make LLAMA_CUDA=1

# Offload layers to GPU
./llama-cli -m model.gguf -ngl 35   # Offload 35/40 layers

# Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20   # GPU: 20 layers, CPU: rest
```
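Picking the `-ngl` value by hand gets tedious. A rough way to guess it is to divide the usable VRAM by the per-layer size (model file size ÷ layer count). The helper below is an illustrative sketch, not a llama.cpp tool; it assumes roughly uniform layer sizes and reserves ~15% of VRAM for the KV cache and scratch buffers:

```shell
# Guess -ngl: usable VRAM / per-layer size. Cap the result at the model's
# actual layer count (or just pass -ngl 999 to offload everything).
layers_for_vram() {
  # $1 = model file size (GB), $2 = layer count, $3 = VRAM (GB)
  awk -v m="$1" -v n="$2" -v v="$3" \
    'BEGIN { printf "%d\n", (v * 0.85) / (m / n) }'
}

layers_for_vram 40 80 12   # 70B Q4_K_M (~40 GB, 80 layers), 12 GB card: prints 20
layers_for_vram 40 80 24   # same model, 24 GB card: prints 40
```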

AMD GPUs (ROCm)

```bash
# Build with ROCm
make LLAMA_HIP=1

# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
```

Common patterns


Batch processing

```bash
# Process multiple prompts from file
cat prompts.txt | ./llama-cli \
  -m model.gguf \
  --batch-size 512 \
  -n 100
```

Constrained generation

```bash
# JSON output with grammar
./llama-cli \
  -m model.gguf \
  -p "Generate a person: " \
  --grammar-file grammars/json.gbnf
# Outputs valid JSON only
```
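Grammar files use llama.cpp's GBNF notation: named rules built from quoted literals, character classes, and repetition. As a flavor of the format, here is a minimal hand-written sketch that only admits a one-key JSON object; the shipped `grammars/json.gbnf` is the full, general version:

```gbnf
root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z ]* "\""
ws     ::= [ \t\n]*
```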

Context size

```bash
# Increase context (default 512)
./llama-cli \
  -m model.gguf \
  -c 4096   # 4K context window

# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768   # 32K context
```
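Larger `-c` values cost memory as well as compute. With an f16 KV cache and full attention (no grouped-query attention, as in Llama 2-7B), the cache occupies 2 (K and V) × layers × context × embedding dim × 2 bytes. A quick illustrative calculation (`kv_cache_gib` is not a llama.cpp command):

```shell
# f16 KV-cache size: 2 (K and V) x layers x ctx x n_embd x 2 bytes.
# Assumes full attention; GQA models need proportionally less.
kv_cache_gib() {
  # $1 = layer count, $2 = context length, $3 = embedding dim
  awk -v l="$1" -v c="$2" -v e="$3" \
    'BEGIN { printf "%.2f\n", 2 * l * c * e * 2 / (1024 ^ 3) }'
}

kv_cache_gib 32 4096 4096    # Llama 2-7B at -c 4096:  prints 2.00 (GiB)
kv_cache_gib 32 32768 4096   # same shape at -c 32768: prints 16.00
```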

Performance benchmarks


CPU performance (Llama 2-7B Q4_K_M)

| CPU               | Threads | Speed    | Cost       |
|-------------------|---------|----------|------------|
| Apple M3 Max      | 16      | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32      | 35 tok/s | $0.50/hour |
| Intel i9-13900K   | 32      | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge  | 64      | 40 tok/s | $2.88/hour |
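Throughput and hourly price combine into cost per token: $/hour ÷ (tok/s × 3600 s/hour), scaled to a million tokens. A quick sketch using the table's numbers (the helper name is illustrative):

```shell
# Cost per million tokens = hourly price / (tok/s * 3600) * 1e6.
cost_per_mtok() {
  # $1 = USD per hour, $2 = tokens per second
  awk -v c="$1" -v t="$2" 'BEGIN { printf "%.2f\n", c / (t * 3600) * 1000000 }'
}

cost_per_mtok 2.88 40   # AWS c7i.16xlarge: prints 20.00 (USD per million tokens)
cost_per_mtok 0.50 35   # Ryzen 9 7950X rental: prints 3.97
```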

GPU acceleration (Llama 2-7B Q4_K_M)

| GPU                   | Speed     | vs CPU | Cost       |
|-----------------------|-----------|--------|------------|
| NVIDIA RTX 4090       | 120 tok/s | 3-4×   | $0 (local) |
| NVIDIA A10            | 80 tok/s  | 2-3×   | $1.00/hour |
| AMD MI250             | 70 tok/s  | 2×     | $2.00/hour |
| Apple M3 Max (Metal)  | 50 tok/s  | ~Same  | $0 (local) |

Supported models


LLaMA family:
  • Llama 2 (7B, 13B, 70B)
  • Llama 3 (8B, 70B, 405B)
  • Code Llama
Mistral family:
  • Mistral 7B
  • Mixtral 8x7B, 8x22B
Other:
  • Falcon, BLOOM, GPT-J
  • Phi-3, Gemma, Qwen
  • LLaVA (vision), Whisper (audio)

References


  • Quantization Guide - GGUF formats, conversion, quality comparison
  • Server Deployment - API endpoints, Docker, monitoring
  • Optimization - Performance tuning, hybrid CPU+GPU

Resources
