llama.cpp

Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.

When to use llama.cpp


Use llama.cpp when:
  • Running on CPU-only machines
  • Deploying on Apple Silicon (M1/M2/M3/M4)
  • Using AMD or Intel GPUs (no CUDA)
  • Edge deployment (Raspberry Pi, embedded systems)
  • Need simple deployment without Docker/Python
Use TensorRT-LLM instead when:
  • Have NVIDIA GPUs (A100/H100)
  • Need maximum throughput (100K+ tok/s)
  • Running in datacenter with CUDA
Use vLLM instead when:
  • Have NVIDIA GPUs
  • Need Python-first API
  • Want PagedAttention

Quick start


Installation


```bash
# macOS/Linux
brew install llama.cpp

# Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# With Metal (Apple Silicon)
make LLAMA_METAL=1

# With CUDA (NVIDIA)
make LLAMA_CUDA=1

# With ROCm (AMD)
make LLAMA_HIP=1
```

Download model

```bash
# Download from HuggingFace (GGUF format)
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/

# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
```

Run inference

```bash
# Simple chat
./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -p "Explain quantum computing" \
  -n 256   # Max tokens

# Interactive chat
./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --interactive
```

Server mode

```bash
# Start OpenAI-compatible server
./llama-server \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 32   # Offload 32 layers to GPU

# Client request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```

Quantization formats


GGUF format overview

| Format | Bits | Size (7B) | Speed   | Quality   | Use case            |
|--------|------|-----------|---------|-----------|---------------------|
| Q4_K_M | 4.5  | 4.1 GB    | Fast    | Good      | Recommended default |
| Q4_K_S | 4.3  | 3.9 GB    | Faster  | Lower     | Speed critical      |
| Q5_K_M | 5.5  | 4.8 GB    | Medium  | Better    | Quality critical    |
| Q6_K   | 6.5  | 5.5 GB    | Slower  | Best      | Maximum quality     |
| Q8_0   | 8.0  | 7.0 GB    | Slow    | Excellent | Minimal degradation |
| Q2_K   | 2.5  | 2.7 GB    | Fastest | Poor      | Testing only        |
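The Bits column largely determines file size: a back-of-the-envelope estimate is parameters (in billions) × bits-per-weight ÷ 8 gigabytes, plus a little metadata overhead. A small shell sketch (the `estimate_gb` helper is illustrative, not part of llama.cpp):

```shell
# Rough GGUF size: params (billions) x bits-per-weight / 8 -> GB.
# Ignores metadata/tokenizer overhead, so real files run slightly larger.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 7 4.5    # 7B at Q4_K_M: prints 3.9 (table shows 4.1 GB with overhead)
estimate_gb 70 4.5   # 70B at Q4_K_M: prints 39.4
```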

Choosing quantization

```bash
# General use (balanced)
Q4_K_M             # 4-bit, medium quality

# Maximum speed (more degradation)
Q2_K or Q3_K_M

# Maximum quality (slower)
Q6_K or Q8_0

# Very large models (70B, 405B)
Q3_K_M or Q4_K_S   # Lower bits to fit in memory
```

Hardware acceleration


Apple Silicon (Metal)

```bash
# Build with Metal
make LLAMA_METAL=1

# Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999   # Offload all layers
```

Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)


NVIDIA GPUs (CUDA)

```bash
# Build with CUDA
make LLAMA_CUDA=1

# Offload layers to GPU
./llama-cli -m model.gguf -ngl 35   # Offload 35/40 layers

# Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20   # GPU: 20 layers, CPU: rest
```
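Picking the `-ngl` value by hand gets tedious. A rough way to guess it is to divide the usable VRAM by the per-layer size (model file size ÷ layer count). The helper below is an illustrative sketch, not a llama.cpp tool; it assumes roughly uniform layer sizes and reserves ~15% of VRAM for the KV cache and scratch buffers:

```shell
# Guess -ngl: usable VRAM / per-layer size. Cap the result at the model's
# actual layer count (or just pass -ngl 999 to offload everything).
layers_for_vram() {
  # $1 = model file size (GB), $2 = layer count, $3 = VRAM (GB)
  awk -v m="$1" -v n="$2" -v v="$3" \
    'BEGIN { printf "%d\n", (v * 0.85) / (m / n) }'
}

layers_for_vram 40 80 12   # 70B Q4_K_M (~40 GB, 80 layers), 12 GB card: prints 20
layers_for_vram 40 80 24   # same model, 24 GB card: prints 40
```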

AMD GPUs (ROCm)

```bash
# Build with ROCm
make LLAMA_HIP=1

# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
```

Common patterns


Batch processing

```bash
# Process multiple prompts from file
cat prompts.txt | ./llama-cli \
  -m model.gguf \
  --batch-size 512 \
  -n 100
```

Constrained generation

```bash
# JSON output with grammar
./llama-cli \
  -m model.gguf \
  -p "Generate a person: " \
  --grammar-file grammars/json.gbnf
# Outputs valid JSON only
```
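Grammar files use llama.cpp's GBNF notation: named rules built from quoted literals, character classes, and repetition. As a flavor of the format, here is a minimal hand-written sketch that only admits a one-key JSON object; the shipped `grammars/json.gbnf` is the full, general version:

```gbnf
root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z ]* "\""
ws     ::= [ \t\n]*
```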

Context size

```bash
# Increase context (default 512)
./llama-cli \
  -m model.gguf \
  -c 4096   # 4K context window

# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768   # 32K context
```
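Larger `-c` values cost memory as well as compute. With an f16 KV cache and full attention (no grouped-query attention, as in Llama 2-7B), the cache occupies 2 (K and V) × layers × context × embedding dim × 2 bytes. A quick illustrative calculation (`kv_cache_gib` is not a llama.cpp command):

```shell
# f16 KV-cache size: 2 (K and V) x layers x ctx x n_embd x 2 bytes.
# Assumes full attention; GQA models need proportionally less.
kv_cache_gib() {
  # $1 = layer count, $2 = context length, $3 = embedding dim
  awk -v l="$1" -v c="$2" -v e="$3" \
    'BEGIN { printf "%.2f\n", 2 * l * c * e * 2 / (1024 ^ 3) }'
}

kv_cache_gib 32 4096 4096    # Llama 2-7B at -c 4096:  prints 2.00 (GiB)
kv_cache_gib 32 32768 4096   # same shape at -c 32768: prints 16.00
```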

Performance benchmarks


CPU performance (Llama 2-7B Q4_K_M)

| CPU               | Threads | Speed    | Cost       |
|-------------------|---------|----------|------------|
| Apple M3 Max      | 16      | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32      | 35 tok/s | $0.50/hour |
| Intel i9-13900K   | 32      | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge  | 64      | 40 tok/s | $2.88/hour |
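Throughput and hourly price combine into cost per token: $/hour ÷ (tok/s × 3600 s/hour), scaled to a million tokens. A quick sketch using the table's numbers (the helper name is illustrative):

```shell
# Cost per million tokens = hourly price / (tok/s * 3600) * 1e6.
cost_per_mtok() {
  # $1 = USD per hour, $2 = tokens per second
  awk -v c="$1" -v t="$2" 'BEGIN { printf "%.2f\n", c / (t * 3600) * 1000000 }'
}

cost_per_mtok 2.88 40   # AWS c7i.16xlarge: prints 20.00 (USD per million tokens)
cost_per_mtok 0.50 35   # Ryzen 9 7950X rental: prints 3.97
```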

GPU acceleration (Llama 2-7B Q4_K_M)

| GPU                   | Speed     | vs CPU | Cost       |
|-----------------------|-----------|--------|------------|
| NVIDIA RTX 4090       | 120 tok/s | 3-4×   | $0 (local) |
| NVIDIA A10            | 80 tok/s  | 2-3×   | $1.00/hour |
| AMD MI250             | 70 tok/s  | 2×     | $2.00/hour |
| Apple M3 Max (Metal)  | 50 tok/s  | ~Same  | $0 (local) |

Supported models


LLaMA family:
  • Llama 2 (7B, 13B, 70B)
  • Llama 3 (8B, 70B, 405B)
  • Code Llama
Mistral family:
  • Mistral 7B
  • Mixtral 8x7B, 8x22B
Other:
  • Falcon, BLOOM, GPT-J
  • Phi-3, Gemma, Qwen
  • LLaVA (vision), Whisper (audio)

References


  • Quantization Guide - GGUF formats, conversion, quality comparison
  • Server Deployment - API endpoints, Docker, monitoring
  • Optimization - Performance tuning, hybrid CPU+GPU

Resources
