hf-mem

`hf_mem` estimates the memory required for inference, including model weights and an optional KV cache, for Safetensors and GGUF models on the Hugging Face Hub. It relies on HTTP Range requests, i.e., it does not download or load any weights locally.
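To make the Range-request idea concrete, here is a minimal sketch of how weight size can be derived from a Safetensors header alone. It relies on the public Safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names `header_length` and `weight_bytes` are hypothetical, the dtype table is not exhaustive, and this is not necessarily how `hf-mem` implements it. An in-memory blob stands in for the two ranged reads so the example runs offline.

```python
import json
import struct

# Bytes per element for a few common Safetensors dtype tags (not exhaustive).
DTYPE_BYTES = {"F64": 8, "F32": 4, "F16": 2, "BF16": 2, "I64": 8, "I32": 4, "I8": 1, "U8": 1}


def header_length(first_eight_bytes: bytes) -> int:
    """Decode the little-endian u64 that prefixes every Safetensors file."""
    return struct.unpack("<Q", first_eight_bytes)[0]


def weight_bytes(header_json: bytes) -> int:
    """Sum tensor sizes from the JSON header alone, without reading tensor data."""
    header = json.loads(header_json)
    total = 0
    for name, entry in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        count = 1
        for dim in entry["shape"]:
            count *= dim
        total += count * DTYPE_BYTES[entry["dtype"]]
    return total


# Tiny in-memory "file" standing in for a remote one (hypothetical tensors).
header = json.dumps({
    "__metadata__": {"format": "pt"},
    "w": {"dtype": "BF16", "shape": [4, 8], "data_offsets": [0, 64]},
    "b": {"dtype": "F32", "shape": [8], "data_offsets": [64, 96]},
}).encode()
blob = struct.pack("<Q", len(header)) + header + bytes(96)

n = header_length(blob[:8])               # would be "Range: bytes=0-7"
estimate = weight_bytes(blob[8:8 + n])    # would be "Range: bytes=8-{7+n}"
print(estimate)
```

Only the header bytes are ever read, which is why the estimate does not require fetching the tensor data itself.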

When to use?

User asks how much VRAM or memory a model needs to run
User wants to know if a model fits on their GPU or a given instance
User references a Hugging Face model ID or URL and asks about inference requirements

What are the requirements?

`uv` installed (for `uvx`)
`HF_TOKEN` env var or `--hf-token` flag (for gated or private models only)
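As a rough illustration of the token requirement, the sketch below resolves a token and builds the bearer header that the Hugging Face Hub accepts. The helper name `auth_headers` is hypothetical, and the precedence (explicit flag value over environment variable) is an assumption rather than documented `hf-mem` behavior.

```python
import os
from typing import Dict, Optional


def auth_headers(cli_token: Optional[str] = None) -> Dict[str, str]:
    """Return request headers, adding a bearer token only when one is available."""
    # Assumption: a token passed explicitly wins over the HF_TOKEN env var.
    token = cli_token or os.environ.get("HF_TOKEN")
    return {"Authorization": f"Bearer {token}"} if token else {}
```

For public models the function returns an empty dictionary, which matches the note that a token is needed for gated or private models only.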

How to run?

Run with `--model-id` pointing to the Hugging Face Hub repository. `hf-mem` will check that the repository contains either Safetensors weights (via `model.safetensors`, `model.safetensors.index.json` if sharded, or `model_index.json` for Diffusers) or GGUF model weights.

```bash
uvx hf-mem --model-id <model-id> --json-output
```
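The repository check described above can be pictured as a lookup over the repository's file listing. The following is a sketch only: `detect_weights` is a hypothetical helper, the order of the checks is an assumption, and the real tool may inspect nested paths or apply different rules.

```python
from typing import List


def detect_weights(filenames: List[str]) -> str:
    """Classify a repository file listing by the marker files named above."""
    names = set(filenames)
    if "model_index.json" in names:
        return "diffusers"
    if "model.safetensors.index.json" in names:
        return "safetensors-sharded"
    if "model.safetensors" in names:
        return "safetensors"
    if any(name.endswith(".gguf") for name in names):
        return "gguf"
    return "unsupported"


print(detect_weights(["config.json", "model.safetensors"]))
```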

If the repository contains GGUF model weights in multiple precisions or quantizations, the estimates are reported per file, even though for inference you would load only a single precision. For GGUF you may therefore need to provide `--gguf-file` to target the specific file (or path, if sharded) you want to run.

```bash
uvx hf-mem --model-id <model-id> --gguf-file <file-or-path> --json-output
```
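To see why a single repository can yield several per-file estimates, the sketch below groups a file listing into one entry per GGUF variant, folding shards together. It assumes the common `-00001-of-00003.gguf` shard suffix; the file names are made up for illustration, and `group_gguf` is a hypothetical helper rather than part of `hf-mem`.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Shard suffix commonly used for split GGUF files, e.g. "-00001-of-00003.gguf".
SHARD_SUFFIX = re.compile(r"-\d{5}-of-\d{5}\.gguf$")


def group_gguf(filenames: List[str]) -> Dict[str, List[str]]:
    """Group GGUF files so that all shards of one variant share a key."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        if not name.endswith(".gguf"):
            continue  # skip READMEs, configs, and other non-weight files
        key = SHARD_SUFFIX.sub("", name)
        if key == name:  # not sharded: just drop the extension
            key = name[: -len(".gguf")]
        groups[key].append(name)
    return dict(groups)


files = [
    "README.md",
    "model-Q4_K_M.gguf",
    "Q8_0/model-Q8_0-00001-of-00002.gguf",
    "Q8_0/model-Q8_0-00002-of-00002.gguf",
]
print(group_gguf(files))
```

Each key corresponds to one candidate you could select with `--gguf-file`; only one of them is loaded at inference time.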

Additionally, `hf-mem` comes with an `--experimental` flag that also calculates the KV cache memory requirements. It applies to LLMs (`...ForCausalLM`), VLMs (`...ForConditionalGeneration`), and GGUF models.

The context window is read from the model's default unless overridden with `--max-model-len`, a la vLLM. Likewise, the KV cache precision defaults to the model precision unless set manually via `--kv-cache-dtype`, also a la vLLM.
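For intuition about how context length, batch size, and cache precision interact, here is a minimal sketch of the standard KV cache estimate for a plain attention stack. The exact formula used by `hf-mem` is not stated here, so treat this as an assumption; it ignores sliding-window attention, MLA-style compression, and block-quantized cache types. The function name and the dimensions in the usage line are illustrative.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_model_len: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # 2 for bfloat16/float16, 1 for fp8
) -> int:
    """Estimate KV cache size: one key and one value tensor per layer."""
    return (
        2  # keys and values
        * num_layers
        * num_kv_heads
        * head_dim
        * max_model_len
        * batch_size
        * bytes_per_element
    )


# Illustrative dimensions: 32 layers, 8 KV heads of size 128, 32768-token context.
estimate = kv_cache_bytes(32, 8, 128, 32768)
print(estimate / 2**30)  # size in GiB
```

Because every factor is multiplicative, halving the cache precision or the context length halves the estimate, which is what the `--kv-cache-dtype` and `--max-model-len` overrides let you explore.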

For Safetensors, use:

```bash
uvx hf-mem --model-id <model-id> --experimental [--max-model-len N] [--batch-size N] [--kv-cache-dtype auto|bfloat16|fp8|fp8_ds_mla|fp8_e4m3|fp8_e5m2|fp8_inc] --json-output
```

For GGUF, use:

```bash
uvx hf-mem --model-id <model-id> --gguf-file <file-or-path> --experimental [--max-model-len N] [--batch-size N] [--kv-cache-dtype auto|F32|F16|Q4_0|Q4_1|Q5_0|Q5_1|Q8_0|Q8_1|Q2_K|Q3_K|Q4_K|Q5_K|Q6_K|Q8_K|IQ2_XXS|IQ2_XS|IQ3_XXS|IQ1_S|IQ4_NL|IQ3_S|IQ2_S|IQ4_XS|I8|I16|I32|I64|F64|IQ1_M|BF16|TQ1_0|TQ2_0|MXFP4] --json-output
```
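When calling the tool from a script, it can help to assemble the argument list programmatically. The sketch below uses only the flags shown above; `build_command` is a hypothetical helper, and emitting the KV cache options only when `experimental` is set is an assumption based on how the usage lines are written.

```python
import shlex
from typing import List, Optional


def build_command(
    model_id: str,
    gguf_file: Optional[str] = None,
    experimental: bool = False,
    max_model_len: Optional[int] = None,
    batch_size: Optional[int] = None,
    kv_cache_dtype: Optional[str] = None,
) -> List[str]:
    """Assemble an hf-mem invocation as an argv list."""
    cmd = ["uvx", "hf-mem", "--model-id", model_id]
    if gguf_file is not None:
        cmd += ["--gguf-file", gguf_file]
    if experimental:
        cmd.append("--experimental")
        # Assumption: these options are only meaningful with --experimental.
        if max_model_len is not None:
            cmd += ["--max-model-len", str(max_model_len)]
        if batch_size is not None:
            cmd += ["--batch-size", str(batch_size)]
        if kv_cache_dtype is not None:
            cmd += ["--kv-cache-dtype", kv_cache_dtype]
    cmd.append("--json-output")
    return cmd


cmd = build_command("mistralai/Mistral-7B-v0.1", experimental=True, max_model_len=8192)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` and the standard output parsed with `json.loads`; the shape of that JSON is not described here, so inspect it before depending on specific keys.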

Examples

For Transformers with Safetensors weights:

```bash
uvx hf-mem --model-id MiniMaxAI/MiniMax-M2 --json-output
```

For Diffusers with Safetensors weights:

```bash
uvx hf-mem --model-id Qwen/Qwen-Image --json-output
```

For Sentence Transformers with Safetensors weights:

```bash
uvx hf-mem --model-id google/embeddinggemma-300m --json-output
```

With `--experimental` to include the KV cache estimation for LLMs and VLMs:

```bash
uvx hf-mem --model-id mistralai/Mistral-7B-v0.1 --experimental --json-output
```

For LLMs or VLMs with GGUF weights:

```bash
uvx hf-mem --model-id unsloth/Qwen3.5-397B-A17B-GGUF --gguf-file Q4_K_M --experimental --json-output
```
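Once an estimate is available, answering "does it fit on my GPU?" is a simple comparison. The sketch below is an assumption-laden illustration: `fits` is a hypothetical helper, and the 0.9 headroom factor is an arbitrary safety margin for activations and runtime overhead, not something `hf-mem` reports.

```python
def fits(required_bytes: int, device_gib: float, headroom: float = 0.9) -> bool:
    """Return True if the estimate fits within a fraction of device memory."""
    budget = device_gib * 2**30 * headroom  # usable bytes after the safety margin
    return required_bytes <= budget


# Hypothetical numbers: a 20 GiB estimate against a 24 GiB GPU.
print(fits(20 * 2**30, device_gib=24))
```

Treat a positive answer as a lower bound on feasibility, since the estimate covers weights and, optionally, the KV cache only.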
