llama.cpp
Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
When to use llama.cpp
Use llama.cpp when:
- Running on CPU-only machines
- Deploying on Apple Silicon (M1/M2/M3/M4)
- Using AMD or Intel GPUs (no CUDA)
- Edge deployment (Raspberry Pi, embedded systems)
- Need simple deployment without Docker/Python
Use TensorRT-LLM instead when:
- Have NVIDIA GPUs (A100/H100)
- Need maximum throughput (100K+ tok/s)
- Running in datacenter with CUDA
Use vLLM instead when:
- Have NVIDIA GPUs
- Need Python-first API
- Want PagedAttention
Quick start
Installation
```bash
# macOS/Linux
brew install llama.cpp

# Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# With Metal (Apple Silicon)
make LLAMA_METAL=1

# With CUDA (NVIDIA)
make LLAMA_CUDA=1

# With ROCm (AMD)
make LLAMA_HIP=1
```
Download model
```bash
# Download from HuggingFace (GGUF format)
huggingface-cli download \
  TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/

# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
```
Run inference
```bash
# Simple chat
./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -p "Explain quantum computing" \
  -n 256  # Max tokens

# Interactive chat
./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --interactive
```
Server mode
```bash
# Start OpenAI-compatible server
./llama-server \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 32  # Offload 32 layers to GPU

# Client request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "llama-2-7b-chat", "messages": [{"role": "user", "content": "Hello!"}], "temperature": 0.7, "max_tokens": 100 }'
```
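Because the server speaks the OpenAI chat-completions wire format, any HTTP client works. A minimal Python sketch using only the standard library (function names `build_chat_request` and `chat` are illustrative, not part of llama.cpp; it assumes the server above is running on localhost:8080):

```python
import json
import urllib.request


def build_chat_request(prompt: str, temperature: float = 0.7,
                       max_tokens: int = 100) -> dict:
    """Build an OpenAI-style chat completion payload for llama-server."""
    return {
        "model": "llama-2-7b-chat",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The `model` field is largely informational here: llama-server serves whichever model it was started with.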
Quantization formats
GGUF format overview
| Format | Bits | Size (7B) | Speed | Quality | Use Case |
|---|---|---|---|---|---|
| Q4_K_M | 4.5 | 4.1 GB | Fast | Good | Recommended default |
| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
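The sizes in the table follow directly from the effective bits per weight. A back-of-the-envelope sketch (real GGUF files run slightly larger because of metadata and mixed-precision layers, so treat these as lower-bound estimates):

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in GB (1 GB = 1e9 bytes): params x bits / 8."""
    return n_params * bits_per_weight / 8 / 1e9


# Effective bit widths from the table, for a 7B-parameter model:
for fmt, bpw in [("Q4_K_M", 4.5), ("Q5_K_M", 5.5), ("Q8_0", 8.0)]:
    print(f"{fmt}: ~{gguf_size_gb(7e9, bpw):.1f} GB")
```

For Q4_K_M this gives ~3.9 GB against the table's 4.1 GB; the gap is the overhead noted above.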
Choosing quantization
- General use (balanced): Q4_K_M (4-bit, medium quality)
- Maximum speed (more degradation): Q2_K or Q3_K_M
- Maximum quality (slower): Q6_K or Q8_0
- Very large models (70B, 405B): Q3_K_M or Q4_K_S (lower bits to fit in memory)
Hardware acceleration
Apple Silicon (Metal)
```bash
# Build with Metal
make LLAMA_METAL=1

# Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999  # Offload all layers
```

Performance: an M3 Max reaches roughly 40-60 tokens/sec (Llama 2-7B Q4_K_M).
NVIDIA GPUs (CUDA)
```bash
# Build with CUDA
make LLAMA_CUDA=1

# Offload layers to GPU
./llama-cli -m model.gguf -ngl 35  # Offload 35/40 layers

# Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20  # GPU: 20 layers, CPU: rest
```
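Picking `-ngl` is simple arithmetic: divide the quantized weight size by the layer count. A rough sketch (the helper name is illustrative; it ignores the KV cache and per-layer size variation, so leave headroom and tune down from the estimate):

```python
def layers_that_fit(model_size_gb: float, n_layers: int, vram_gb: float) -> int:
    """Estimate how many transformer layers fit in VRAM, assuming equal-size layers."""
    per_layer_gb = model_size_gb / n_layers
    return min(n_layers, int(vram_gb / per_layer_gb))


# Llama 2-70B Q4_K_M is roughly 40 GB across 80 layers; with 12 GB of VRAM:
print(layers_that_fit(40.0, 80, 12.0))  # → 24, so -ngl 24 is a starting point
```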
AMD GPUs (ROCm)
```bash
# Build with ROCm
make LLAMA_HIP=1

# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
```
Common patterns
Batch processing
```bash
# Process multiple prompts from file
cat prompts.txt | ./llama-cli \
  -m model.gguf \
  --batch-size 512 \
  -n 100
```
Constrained generation
```bash
# JSON output with grammar
./llama-cli \
  -m model.gguf \
  -p "Generate a person: " \
  --grammar-file grammars/json.gbnf

# Outputs valid JSON only
```
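Grammar files use llama.cpp's GBNF syntax (BNF-style rules with regex-like repetition); the repo ships `grammars/json.gbnf` and others. As an illustrative sketch, a grammar that restricts output to a bare yes/no answer could be as small as:

```
root ::= "yes" | "no"
```

Saved under any name (say `yesno.gbnf`, hypothetical here) and passed via `--grammar-file`, it forces the sampler to emit only tokens that keep the output derivable from `root`.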
Context size
```bash
# Increase context (default 512)
./llama-cli \
  -m model.gguf \
  -c 4096  # 4K context window

# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768  # 32K context
```
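Longer contexts cost memory: the KV cache grows linearly with `-c`. A sketch of the usual estimate, using Llama 2-7B's shape (32 layers, 32 KV heads of dimension 128, fp16 cache) as an assumed example:

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 (K and V) x layers x context x heads x head_dim x element size."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


GIB = 1024 ** 3
# Llama 2-7B shape: 32 layers, 32 KV heads, head_dim 128, fp16 cache
print(kv_cache_bytes(32, 4096, 32, 128) / GIB)   # → 2.0 GiB at -c 4096
print(kv_cache_bytes(32, 32768, 32, 128) / GIB)  # → 16.0 GiB at -c 32768
```

Models with grouped-query attention have fewer KV heads and correspondingly smaller caches.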
Performance benchmarks
CPU performance (Llama 2-7B Q4_K_M)
| CPU | Threads | Speed | Cost |
|---|---|---|---|
| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
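Hourly price and throughput combine into a cost per token. A quick sketch using the table's numbers (single-stream generation; batched serving amortizes much better):

```python
def usd_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    """Cost to generate 1M tokens at a sustained single-stream rate."""
    return usd_per_hour / (tokens_per_sec * 3600) * 1_000_000


# AWS c7i.16xlarge from the table: $2.88/hour at 40 tok/s
print(round(usd_per_million_tokens(2.88, 40), 2))  # → 20.0 ($20 per 1M tokens)
```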
GPU acceleration (Llama 2-7B Q4_K_M)
| GPU | Speed | vs CPU | Cost |
|---|---|---|---|
| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
| AMD MI250 | 70 tok/s | 2× | $2.00/hour |
| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
Supported models
LLaMA family:
- Llama 2 (7B, 13B, 70B)
- Llama 3 (8B, 70B, 405B)
- Code Llama
Mistral family:
- Mistral 7B
- Mixtral 8x7B, 8x22B
Other:
- Falcon, BLOOM, GPT-J
- Phi-3, Gemma, Qwen
- LLaVA (vision), Whisper (audio)
Find models: https://huggingface.co/models?library=gguf
References
- Quantization Guide - GGUF formats, conversion, quality comparison
- Server Deployment - API endpoints, Docker, monitoring
- Optimization - Performance tuning, hybrid CPU+GPU