huggingface-zerogpu

Hugging Face ZeroGPU

Rules and patterns for ML demos on Hugging Face Spaces with ZeroGPU hardware. Covers `@spaces.GPU`, duration and quota tuning, process isolation, the CUDA availability model, concurrency safety, and CUDA build constraints.


Scope


This skill is for Gradio SDK Spaces using ZeroGPU hardware. Docker and Static Spaces cannot schedule onto ZeroGPU, and Streamlit apps now run as Docker Spaces, so this skill applies only to Gradio. For general Gradio coding (components, layouts, event listeners), see the `huggingface-gradio` skill in this repo. The authoritative ZeroGPU docs live at https://huggingface.co/docs/hub/spaces-zerogpu; refer to them for the current backing GPU, runtime version lists, and tier thresholds, all of which change over time.


Reference Files


| Reference | When to read |
| --- | --- |
| `references/concurrency.md` | Always read alongside SKILL.md when writing ZeroGPU code; handlers run in parallel by default |
| `references/how-zerogpu-works.md` | When reasoning about cold-starts, worker reuse, why module-scope warmup does not carry to requests, or why returning CUDA tensors hangs |
| `references/how-quota-works.md` | When choosing `duration` values, debugging `illegal duration` vs `quota exceeded` errors, or explaining why default 60s blocks short tasks |
| `references/cuda-and-deps.md` | When installing CUDA-dependent packages (e.g. `flash-attn`), pinning torch side-cars, or reading wheel filename tags |


Hardware


ZeroGPU exposes two GPU sizes that map to a fraction of the backing card:

| `size` | Slice of backing GPU | Quota cost |
| --- | --- | --- |
| `large` (default) | Half | 1x |
| `xlarge` | Full | 2x |

Default `large` gives half a physical GPU, so memory bandwidth and compute are significantly lower than the full card's specs. Use `xlarge` only when the workload genuinely needs the extra memory or compute.

Backing GPU changes without notice. ZeroGPU has already migrated across GPU generations several times; older write-ups may name A100 or H200, but those are outdated. For the current backing GPU and exact per-size VRAM, always check the ZeroGPU docs before sizing workloads.


Basic Pattern


```python
import spaces
import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="...", device="cuda")

@spaces.GPU
def generate(prompt: str) -> str:
    return pipe(prompt, max_new_tokens=100)[0]["generated_text"]
```

Key rules:

- Instantiate models at module scope and call `.to("cuda")` eagerly. ZeroGPU handles the actual device mapping transparently (see CUDA availability model below).
- Decorate GPU functions with `@spaces.GPU`. The decorator is a no-op outside ZeroGPU, so it is safe to keep in all environments.
- Set `duration` to match the realistic worst-case workload (default 60s). The platform pre-checks `requested duration` against the user's `remaining quota`, not against the actual run time, so a 10-second task left at the 60s default fails with `quota exceeded` as soon as the user's remaining quota drops below 60s. Smaller declared `duration` also ranks higher in the node-level queue. See "Duration and Quota" below.
- `torch.compile` is NOT supported. Use PyTorch ahead-of-time compilation (AoTI) (torch 2.8+) instead.
- Use `size="xlarge"` sparingly. It allocates the full backing GPU, but costs 2x quota and tends to queue longer.

```python
@spaces.GPU(duration=120)
def generate_image(prompt: str):
    return pipe(prompt).images[0]
```


CUDA Availability Model


Real GPU access is only available inside `@spaces.GPU`-decorated functions. Outside those functions, the GPU is not attached to the process.

However, `import spaces` monkey-patches `torch` so that:

- `torch.cuda.is_available()` returns `True` globally.
- `.to("cuda")` / `device="cuda"` calls at module scope succeed without error.

This is intentional. Module-scope `model.to("cuda")` calls register tensors with the ZeroGPU backend, which writes them to a disk offload directory at a startup "pack" step and frees the corresponding RAM. When a `@spaces.GPU` call lands, a forked GPU worker process streams those weights from disk into VRAM via a pinned-memory pipeline. Warm workers (reused across requests on the same GPU slot) keep weights resident on the GPU and skip the disk → VRAM step. The user-facing rule: write `device="cuda"` at module scope and it works. See `references/how-zerogpu-works.md` for the full lifecycle.

| Action | Where | Why |
| --- | --- | --- |
| `model.to("cuda")` / `pipe(..., device="cuda")` | Module scope | ZeroGPU registers the tensor and manages device migration |
| Actual CUDA computation (inference, etc.) | Inside `@spaces.GPU` | Real GPU is only attached during the decorated call |
| Branching on `torch.cuda.is_available()` | Avoid relying on it | Always returns `True` due to the monkey-patch |

Do not run inference or CUDA kernels at module scope — the real GPU is not attached, so operations either silently run on CPU or fail.


Device selection idiom still works


The standard idiom remains correct under ZeroGPU:

```python
import torch
from transformers import AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("...").to(device)
```

- ZeroGPU: `is_available()` is `True` (monkey-patched), so the model is registered for automatic device migration.
- Dedicated GPU Spaces / local GPU: `is_available()` is genuinely `True`.
- CPU Spaces / local CPU: resolves to `"cpu"`.

Do not hardcode `device="cuda"`; it breaks on CPU-only environments.


Eager loading is the right default


Load models at module scope, not lazily on first request. The Space process starts before any user arrives, so cold-start cost is paid once. Lazy loading (`global model; if model is None: ...`, `@lru_cache` wrappers, factory functions instantiating on first call) just pushes that cost onto the first user.


Local Development: Just Install `spaces`


Do not wrap `import spaces` in `try/except` and redefine `spaces.GPU` as a no-op fallback for local runs. Off-ZeroGPU, the `spaces` package is already a true no-op:

- Heavyweight behavior (CUDA monkey-patching, client init, startup hooks) is gated on the `SPACES_ZERO_GPU` env var, set only on ZeroGPU.
- `@spaces.GPU` returns the undecorated function unchanged off-ZeroGPU.
- Top-level `import spaces` performs only lightweight imports.

The Gradio SDK base image installs `spaces` on every hardware tier. So even after duplicating a Space onto a dedicated GPU (T4, L4, A10G, etc.) or CPU basic, no code changes are needed: `import spaces` still succeeds and `@spaces.GPU` becomes a transparent passthrough.


Anti-pattern


```python
try:
    import spaces
except ImportError:
    class spaces:  # type: ignore
        @staticmethod
        def GPU(func=None, **kwargs):
            return func if func else (lambda f: f)
```

Problems:

- The fallback must mimic every `@spaces.GPU` call shape (bare decorator, `duration=...`, `size=...`, generators, `aoti_*` helpers) and drifts as the `spaces` API grows.
- It hides `spaces` from `requirements.txt`, even though the Space needs it at deploy time.
- It solves a non-problem: the real package is already a no-op locally.


Do this instead


Add `spaces` to dependencies and import it unconditionally:

```python
import spaces

@spaces.GPU
def generate(prompt: str) -> str:
    ...
```


Duration and Quota


Three things happen when you declare `@spaces.GPU(duration=N)`:

1. Tier-max check: each visitor tier has a per-call `duration` cap. Declaring `duration` larger than the cap fails immediately with `ZeroGPU illegal duration`, regardless of remaining quota. (Tier numbers change over time; see the ZeroGPU docs.)
2. Quota pre-check: the platform compares `requested duration` against the user's `remaining quota`. If `remaining < requested`, the call fails with `ZeroGPU quota exceeded`, even if the actual work would have fit. The error message shows the explicit numbers, e.g. `"60s requested vs. 30s left"`. A 10-second task left at the default 60s therefore blocks the user once their remaining quota drops below 60s.
3. Queue priority: the queue is node-level (requests from all Spaces on the same node compete for GPU slots), and shorter declared `duration` ranks higher.

All three favor declaring the smallest realistic `duration`, including for short tasks. Explicit `@spaces.GPU(duration=15)` on a 10-second task avoids premature `quota exceeded` rejections and ranks higher in the queue.

`xlarge` doubles the request. `requested = N * 2` when `size="xlarge"`, both for the tier-max check and the quota pre-check. So `@spaces.GPU(duration=60, size="xlarge")` is internally a 120s request.
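
The two admission checks and the `xlarge` doubling can be sketched as a small pure-Python model. This is a toy illustration of the rules described above, not the platform's code: the function name `precheck`, the `tier_cap` value of 120, and the remaining-quota numbers are all assumptions made up for the example.

```python
def precheck(duration: int, size: str, tier_cap: int, remaining: int) -> str:
    """Toy model of the tier-max check followed by the quota pre-check."""
    # xlarge doubles the requested duration for both checks.
    requested = duration * 2 if size == "xlarge" else duration
    if requested > tier_cap:
        return "illegal duration"
    if remaining < requested:
        return "quota exceeded"
    return "ok"

# tier_cap=120 and the remaining-quota values are made-up numbers.
default_call = precheck(60, "large", tier_cap=120, remaining=30)
tight_call = precheck(15, "large", tier_cap=120, remaining=30)
xlarge_call = precheck(90, "xlarge", tier_cap=120, remaining=500)
```

With 30s of quota left, the call at the 60s default is rejected while the same work declared at 15s is admitted. The `xlarge` call is rejected on the tier cap even though plenty of quota remains, because 90s counts as 180s.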


Dynamic duration for variable workloads


For workloads whose runtime depends on inputs, pass a callable that estimates per request. A static high `duration` locks out low-tier users (whose tier cap may be smaller than the static value) and unnecessarily reserves quota for light inputs.

```python
def estimate_duration(prompt, steps):
    return int(steps * 3.5)

@spaces.GPU(duration=estimate_duration)
def generate(prompt, steps):
    return pipe(prompt, num_inference_steps=steps).images[0]
```
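
An estimator can also add a fixed overhead, round up, and cap the result. The sketch below is one possible shape, shown without the decorator so it runs anywhere; the overhead and ceiling constants are hypothetical, and the per-step cost is simply reused from the example above.

```python
import math

SECONDS_PER_STEP = 3.5   # per-step cost reused from the example above
FIXED_OVERHEAD_S = 5     # hypothetical allowance for setup and decoding
MAX_DURATION_S = 120     # hypothetical ceiling; real tier caps are in the ZeroGPU docs

def estimate_duration(prompt: str, steps: int) -> int:
    """Return a per-request duration in whole seconds, rounded up and capped."""
    raw = FIXED_OVERHEAD_S + steps * SECONDS_PER_STEP
    return min(MAX_DURATION_S, math.ceil(raw))

short_job = estimate_duration("a cat", steps=4)    # 5 + 14 = 19
long_job = estimate_duration("a cat", steps=100)   # 355, capped to 120
```

Capping keeps the request under an assumed tier limit, but a job that really needs longer than the capped value may not finish. Whether to cap or to reject such inputs up front is a design choice for the app.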

For the full distinction between `illegal duration` and `quota exceeded`, runs-per-day limits, the 24h quota window, and pay-as-you-go billing, see `references/how-quota-works.md`.


Process Isolation and Pickle


`@spaces.GPU`-decorated functions run in a separate process managed by the ZeroGPU scheduler. Arguments and return values cross the process boundary via pickle serialization.

Consequences:

- Only picklable objects can be passed in or returned. Open file handles, database connections, locks, lambdas, and closures over unpicklable state will raise `PicklingError`.
- Do NOT return CUDA tensors directly. Unpickling a CUDA tensor in the main process triggers `torch.cuda._lazy_init()`, which ZeroGPU blocks. Convert to CPU first: return `tensor.cpu()` or `tensor.cpu().numpy()`.
- CPU tensors, numpy arrays, PIL Images, and plain Python objects work fine.
- Large objects incur serialization overhead. Prefer lightweight returns (tensors, arrays, file paths, base64 strings) over complex object graphs.
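
A quick local way to see which values would survive the boundary is to try pickling them with the standard library. This is a minimal sketch and assumes the standard `pickle` module approximates what happens on ZeroGPU; the helper name `is_picklable` is made up for the example. Depending on the object type, the standard pickler reports failure as `PicklingError`, `TypeError`, or `AttributeError`, so the helper catches all three.

```python
import pickle
import threading

def is_picklable(obj) -> bool:
    """Return True if obj survives pickle.dumps with the standard pickler."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False

plain_result = {"label": "cat", "score": 0.93}   # plain Python data
lock = threading.Lock()                           # OS-level resource
callback = lambda x: x + 1                        # anonymous function

checks = {
    "dict": is_picklable(plain_result),
    "lock": is_picklable(lock),
    "lambda": is_picklable(callback),
}
```

Running a check like this on a handler's arguments and return values before deploying can surface serialization problems without a GPU.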


`gr.State` semantics across the boundary


Because handlers run in a separate process, `gr.State` values are pickled on every yield; they are NOT shared by reference.

- The generator receives a copy of the state (`id()` differs from the caller's).
- In-place mutations inside the generator are invisible to other handlers until the mutated state is explicitly yielded back.
- Yielding `gr.update()` for a `gr.State` slot skips the update; other handlers continue to see the pre-yield value.
- Each yield that returns the state object creates a new copy via pickle.

Practical guidance:

- Do NOT assume reference semantics for `gr.State` on ZeroGPU. Code that mutates state in a generator and expects another handler to see those mutations will silently use stale data.
- Every yield including a `gr.State` value triggers a full pickle round-trip. For large state (model sessions, frame buffers), minimize how often you yield it, ideally once at the end. Use `gr.update()` for the state slot on intermediate yields.
- CUDA tensors inside state must be moved to CPU before yielding (same `torch.cuda._lazy_init()` issue as above).
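
The copy semantics can be reproduced with a plain pickle round-trip. The sketch below is a stdlib-only model of the behavior, not Gradio's or ZeroGPU's actual code path; `cross_boundary` is a hypothetical helper standing in for one trip across the process boundary.

```python
import pickle

def cross_boundary(state):
    """Model one trip across the process boundary as a pickle round-trip."""
    return pickle.loads(pickle.dumps(state))

caller_state = {"frames": []}

# The handler works on a copy, so in-place edits stay local to it.
handler_state = cross_boundary(caller_state)
handler_state["frames"].append("frame-0")
seen_before_yield = list(caller_state["frames"])

# Only yielding the mutated state back makes the change visible to the caller.
caller_state = cross_boundary(handler_state)
seen_after_yield = list(caller_state["frames"])
```

The caller sees an empty list until the mutated state is sent back, and even then it receives a fresh copy, never the handler's own object.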


Concurrency


Handlers run concurrently by default on ZeroGPU. This is not opt-in. Code that worked in single-user testing can silently corrupt or leak data in production.

Three rules. Full treatment with examples in `references/concurrency.md`.

1. No mutable global state. Concurrent requests overwrite each other.
2. No fixed file paths for outputs. Concurrent requests clobber the same file. Use `tempfile` for unique paths.
3. Read-only globals are safe. Model objects, tokenizers, configs loaded once at startup and only read during requests are safe and encouraged.
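
The unique-path rule can be sketched with `tempfile`. This is a minimal illustration in which threads stand in for concurrent handlers (an assumption; the real handlers run under Gradio and ZeroGPU), and `save_output` is a hypothetical helper name.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def save_output(payload: bytes) -> str:
    """Write payload to a fresh temp file and return its path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so the path can be returned to the caller.
    with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as handle:
        handle.write(payload)
        return handle.name

payloads = [f"request-{i}".encode() for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(save_output, payloads))

contents = [Path(p).read_bytes() for p in paths]
```

Each call gets its own path, so no request can overwrite another's output. A fixed name such as `output.bin` would leave only the last writer's data.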


Call Granularity


Each entry into a `@spaces.GPU` function carries non-trivial cost: pickle round-trip across the process boundary, worker warm-up, CUDA re-attach, and a fresh pass through the node-level queue. Calling a decorated function from inside a hot loop multiplies these costs and adds a new failure mode: a later iteration may fail to acquire a GPU slot, stalling the whole job mid-way.

Decorate the outer function that owns the loop, not the per-iteration worker:


Avoid: N GPU entries for N frames

```python
def process_video(frames):
    return [process_frame(f) for f in frames]

@spaces.GPU(duration=...)
def process_frame(frame):
    ...
```

Prefer: one GPU entry for the whole video
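
A minimal sketch of the preferred shape, assuming a hypothetical stand-in decorator `counted_gpu` in place of `@spaces.GPU` so that the number of GPU entries can be counted locally. The per-frame work (`frame * 2`) is a placeholder, not real inference.

```python
import functools

def counted_gpu(func):
    """Hypothetical stand-in for @spaces.GPU that only counts entries."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.entries += 1
        return func(*args, **kwargs)
    wrapper.entries = 0
    return wrapper

def process_frame(frame):
    # Undecorated helper: it runs inside the caller's single GPU entry.
    return frame * 2  # placeholder for real per-frame work

@counted_gpu
def process_video(frames):
    # The loop lives inside the decorated function.
    return [process_frame(f) for f in frames]

result = process_video([1, 2, 3])
entries = process_video.entries
```

Three frames are processed with a single entry. On ZeroGPU the same structure would use `@spaces.GPU(duration=...)` on `process_video`, with `duration` sized for the whole video.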

README.md frontmatter:

```yaml
python_version: "3.12"
```

Both `"3.12"` and `"3.12.12"` forms are accepted.

Do not pin `spaces` in `requirements.txt`


The Space platform pins its own `spaces` version. A conflicting pin in `requirements.txt` causes pip resolution to fail at build time.

Rule: Do not include `spaces` in `requirements.txt`.

How to achieve this depends on your tooling:

- Hand-written `requirements.txt`: simply omit `spaces`.
- uv (`pyproject.toml`-managed): declare `spaces` in `pyproject.toml` so uv co-resolves transitive constraints (notably `psutil`, which `spaces` pins), then exclude it from the export:

  ```bash
  uv export --no-hashes --no-dev --no-emit-package spaces -o requirements.txt
  ```

  Without `spaces` in `pyproject.toml`, uv cannot see its transitive constraints and may resolve incompatible versions at build time.
- pip-tools (`pip-compile`) / Poetry: use the equivalent exclude mechanism.
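
Whatever the tooling, the exported file can be checked before pushing. The sketch below is one possible pre-deploy check, not part of any official tooling; the helper name `lists_spaces`, the regex, and the sample version numbers are all assumptions for illustration.

```python
import re

# Matches a requirement line whose package name is exactly "spaces",
# optionally followed by extras, a version specifier, a marker, or a URL.
SPACES_LINE = re.compile(r"^spaces\s*(\[[^\]]*\])?\s*($|[=<>!~;@])", re.IGNORECASE)

def lists_spaces(requirements_text: str) -> bool:
    """Return True if any non-comment line requires the spaces package."""
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if SPACES_LINE.match(line):
            return True
    return False

# Sample contents with made-up version numbers.
bad = lists_spaces("torch==2.8.0\nspaces==0.1.0  # pinned by hand\n")
good = lists_spaces("torch==2.8.0\ngradio\n# spaces is provided by the platform\n")
lookalike = lists_spaces("spaces-helper==1.0\n")
```

Only the first sample is flagged. Comment lines and packages whose names merely start with `spaces` are ignored.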


Pin `torch` to match wheel tags


If you install a CUDA-dependent wheel via direct URL, the wheel filename encodes the `torch` major.minor it was built against (e.g. `cu12torch2.8`). Pin `torch==X.Y.Z` in `requirements.txt` to match; otherwise pip may resolve `torch` to a different version and the Space fails on first import. Details and the kernels-community alternative are in `references/cuda-and-deps.md`.
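
The tag can be read mechanically. The sketch below assumes the tag has the form `cu<digits>torch<major>.<minor>`, as in the `cu12torch2.8` example; the wheel filename and both helper names are hypothetical.

```python
import re
from typing import Optional

# Assumes the tag looks like "cu<digits>torch<major>.<minor>", e.g. "cu12torch2.8".
WHEEL_TAG = re.compile(r"cu(\d+)torch(\d+)\.(\d+)")

def torch_series_from_wheel(filename: str) -> Optional[str]:
    """Return the torch "major.minor" encoded in a wheel filename, if any."""
    match = WHEEL_TAG.search(filename)
    if match is None:
        return None
    return f"{match.group(2)}.{match.group(3)}"

def pin_matches(torch_pin: str, filename: str) -> bool:
    """Check that a torch==X.Y.Z pin falls in the series the wheel targets."""
    series = torch_series_from_wheel(filename)
    version = torch_pin.split("==", 1)[-1]
    return series is not None and version.startswith(series + ".")

# Hypothetical filename, for illustration only.
wheel = "example_pkg-1.0.0+cu12torch2.8-cp312-cp312-linux_x86_64.whl"
series = torch_series_from_wheel(wheel)
ok = pin_matches("torch==2.8.0", wheel)
mismatch = pin_matches("torch==2.9.1", wheel)
```

Here the wheel targets the 2.8 series, so a `torch==2.8.0` pin matches and `torch==2.9.1` does not.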

