huggingface-zerogpu

Hugging Face ZeroGPU

Rules and patterns for ML demos on Hugging Face Spaces with ZeroGPU hardware. Covers `@spaces.GPU`, duration and quota tuning, process isolation, the CUDA availability model, concurrency safety, and CUDA build constraints.

Scope

This skill is for Gradio SDK Spaces using ZeroGPU hardware. Docker and Static Spaces cannot schedule onto ZeroGPU, and Streamlit apps now run as Docker Spaces, so this skill applies only to Gradio. For general Gradio coding (components, layouts, event listeners), see the `huggingface-gradio` skill in this repo. The authoritative ZeroGPU docs live at https://huggingface.co/docs/hub/spaces-zerogpu. Refer to them for the current backing GPU, runtime version lists, and tier thresholds, all of which change over time.

Reference Files

| Reference | When to read |
| --- | --- |
| `references/concurrency.md` | Always read alongside SKILL.md when writing ZeroGPU code, because handlers run in parallel by default |
| `references/how-zerogpu-works.md` | When reasoning about cold-starts, worker reuse, why module-scope warmup does not carry to requests, or why returning CUDA tensors hangs |
| `references/how-quota-works.md` | When choosing `duration` values, debugging `illegal duration` vs `quota exceeded` errors, or explaining why default 60s blocks short tasks |
| `references/cuda-and-deps.md` | When installing CUDA-dependent packages (e.g. `flash-attn`), pinning torch side-cars, or reading wheel filename tags |

Hardware

ZeroGPU exposes two GPU sizes that map to a fraction of the backing card:

| `size` | Slice of backing GPU | Quota cost |
| --- | --- | --- |
| `large` (default) | Half | 1x |
| `xlarge` | Full | 2x |

Default `large` gives half a physical GPU, so memory bandwidth and compute are significantly lower than the full card's specs. Use `xlarge` only when the workload genuinely needs the extra memory or compute.
Backing GPU changes without notice. ZeroGPU has already migrated across GPU generations several times; older write-ups may name A100 or H200, but those are outdated. For the current backing GPU and exact per-size VRAM, always check the ZeroGPU docs before sizing workloads.

Basic Pattern

python
import spaces
import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="...", device="cuda")

@spaces.GPU
def generate(prompt: str) -> str:
    return pipe(prompt, max_new_tokens=100)[0]["generated_text"]
Key rules:
  1. Instantiate models at module scope and call `.to("cuda")` eagerly. ZeroGPU handles the actual device mapping transparently (see CUDA availability model below).
  2. Decorate GPU functions with `@spaces.GPU`. The decorator is a no-op outside ZeroGPU, so it is safe to keep in all environments.
  3. Set `duration` to match the realistic worst-case workload (default 60s). The platform pre-checks requested duration against the user's remaining quota, not against the actual run time, so a 10-second task left at the 60s default fails with `quota exceeded` as soon as the user's remaining quota drops below 60s. Smaller declared `duration` also ranks higher in the node-level queue. See "Duration and Quota" below.
  4. `torch.compile` is NOT supported. Use PyTorch ahead-of-time compilation (AoTI) (torch 2.8+) instead.
  5. Use `size="xlarge"` sparingly. It allocates the full backing GPU, but costs 2x quota and tends to queue longer.
python
@spaces.GPU(duration=120)
def generate_image(prompt: str):
    return pipe(prompt).images[0]

CUDA Availability Model

Real GPU access is only available inside `@spaces.GPU`-decorated functions. Outside those functions, the GPU is not attached to the process.
However, `import spaces` monkey-patches `torch` so that:
  • `torch.cuda.is_available()` returns `True` globally.
  • `.to("cuda")` / `device="cuda"` calls at module scope succeed without error.
This is intentional. Module-scope `model.to("cuda")` calls register tensors with the ZeroGPU backend, which writes them to a disk offload directory at a startup "pack" step and frees the corresponding RAM. When a `@spaces.GPU` call lands, a forked GPU worker process streams those weights from disk into VRAM via a pinned-memory pipeline. Warm workers (reused across requests on the same GPU slot) keep weights resident on the GPU and skip the disk-to-VRAM step. The user-facing rule: write `device="cuda"` at module scope and it works. See `references/how-zerogpu-works.md` for the full lifecycle.
| Action | Where | Why |
| --- | --- | --- |
| `model.to("cuda")` / `pipe(..., device="cuda")` | Module scope | ZeroGPU registers the tensor and manages device migration |
| Actual CUDA computation (inference, etc.) | Inside `@spaces.GPU` | Real GPU is only attached during the decorated call |
| Branching on `torch.cuda.is_available()` | Avoid relying on it | Always returns `True` due to the monkey-patch |
Do not run inference or CUDA kernels at module scope — the real GPU is not attached, so operations either silently run on CPU or fail.

Device selection idiom still works

The standard idiom remains correct under ZeroGPU:
python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("...").to(device)
  • ZeroGPU: `is_available()` is `True` (monkey-patched), so the model is registered for automatic device migration.
  • Dedicated GPU Spaces / local GPU: `is_available()` is genuinely `True`.
  • CPU Spaces / local CPU: resolves to `"cpu"`.
Do not hardcode `device="cuda"`, because it breaks on CPU-only environments.

Eager loading is the right default

Load models at module scope, not lazily on first request. The Space process starts before any user arrives, so cold-start cost is paid once. Lazy loading (`global model; if model is None: ...`, `@lru_cache` wrappers, factory functions instantiating on first call) just pushes that cost onto the first user.

Local Development: Just Install `spaces`

Do not wrap `import spaces` in `try/except` and redefine `spaces.GPU` as a no-op fallback for local runs. Off-ZeroGPU, the `spaces` package is already a true no-op:
  • Heavyweight behavior (CUDA monkey-patching, client init, startup hooks) is gated on the `SPACES_ZERO_GPU` env var, set only on ZeroGPU.
  • `@spaces.GPU` returns the undecorated function unchanged off-ZeroGPU.
  • Top-level `import spaces` performs only lightweight imports.
The Gradio SDK base image installs `spaces` on every hardware tier. So even after duplicating a Space onto a dedicated GPU (T4, L4, A10G, etc.) or CPU basic, no code changes are needed: `import spaces` still succeeds and `@spaces.GPU` becomes a transparent passthrough.

Anti-pattern

python
try:
    import spaces
except ImportError:
    class spaces:  # type: ignore
        @staticmethod
        def GPU(func=None, **kwargs):
            return func if func else (lambda f: f)
Problems:
  1. The fallback must mimic every `@spaces.GPU` call shape (bare decorator, `duration=...`, `size=...`, generators, `aoti_*` helpers) and drifts as the `spaces` API grows.
  2. It hides `spaces` from `requirements.txt`, even though the Space needs it at deploy time.
  3. It solves a non-problem: the real package is already a no-op locally.

Do this instead

Add `spaces` to dependencies and import it unconditionally:
python
import spaces

@spaces.GPU
def generate(prompt: str) -> str:
    ...

Duration and Quota

Three things happen when you declare `@spaces.GPU(duration=N)`:
  1. Tier-max check: each visitor tier has a per-call `duration` cap. Declaring `duration` larger than the cap fails immediately with `ZeroGPU illegal duration`, regardless of remaining quota. (Tier numbers change over time; see the ZeroGPU docs.)
  2. Quota pre-check: the platform compares requested duration against the user's remaining quota. If `remaining < requested`, the call fails with `ZeroGPU quota exceeded`, even if the actual work would have fit. The error message shows the explicit numbers, e.g. `"60s requested vs. 30s left"`. A 10-second task left at the default 60s therefore blocks the user once their remaining quota drops below 60s.
  3. Queue priority: the queue is node-level (requests from all Spaces on the same node compete for GPU slots), and shorter declared `duration` ranks higher.
All three favor declaring the smallest realistic `duration`, including for short tasks. Explicit `@spaces.GPU(duration=15)` on a 10-second task avoids premature `quota exceeded` rejections and ranks higher in the queue.
`xlarge` doubles the request. `requested = N * 2` when `size="xlarge"`, both for the tier-max check and the quota pre-check. So `@spaces.GPU(duration=60, size="xlarge")` is internally a 120s request.
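The first two checks can be sketched as a toy model. This is not platform code: `precheck`, `tier_cap_s`, and `remaining_s` are hypothetical names, and the cap and quota numbers in the usage lines are made up for illustration. Only the check order, the two error names, and the 2x multiplier for `xlarge` come from the description above.

```python
def precheck(duration_s: int, size: str, tier_cap_s: int, remaining_s: int) -> str:
    """Toy model of the pre-checks described above (hypothetical, not the real scheduler)."""
    # xlarge doubles the requested duration for both checks.
    requested = duration_s * 2 if size == "xlarge" else duration_s
    # 1. Tier-max check: fails regardless of remaining quota.
    if requested > tier_cap_s:
        return "illegal duration"
    # 2. Quota pre-check: compares the declared request, not the real run time.
    if remaining_s < requested:
        return "quota exceeded"
    return "ok"


# A 10-second task left at the 60s default is rejected with 30s of quota left.
print(precheck(60, "large", tier_cap_s=120, remaining_s=30))    # quota exceeded
# The same task passes once the declared duration is realistic.
print(precheck(15, "large", tier_cap_s=120, remaining_s=30))    # ok
# duration=90 on xlarge is a 180s request, above the assumed 120s cap.
print(precheck(90, "xlarge", tier_cap_s=120, remaining_s=600))  # illegal duration
```

Working through a few cases like this before picking `duration` makes it easier to see which error a given user tier would hit.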

Dynamic duration for variable workloads

For workloads whose runtime depends on inputs, pass a callable that estimates per request. A static high `duration` locks out low-tier users (whose tier cap may be smaller than the static value) and unnecessarily reserves quota for light inputs.
python
def estimate_duration(prompt, steps):
    return int(steps * 3.5)

@spaces.GPU(duration=estimate_duration)
def generate(prompt, steps):
    return pipe(prompt, num_inference_steps=steps).images[0]
For the full distinction between `illegal duration` vs `quota exceeded`, runs-per-day limits, the 24h quota window, and pay-as-you-go billing, see `references/how-quota-works.md`.
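One possible refinement of the estimator is to clamp its result, so that tiny inputs still reserve enough time for fixed overhead and large inputs never request more than a chosen ceiling. The sketch below is only an illustration: the 3.5 seconds per step comes from the example above, while `MIN_DURATION_S` and `MAX_DURATION_S` are assumed values that would need to be measured and checked against the current tier caps.

```python
import math

PER_STEP_S = 3.5      # per-step cost, taken from the example above
MIN_DURATION_S = 5    # assumed floor for fixed overhead (hypothetical)
MAX_DURATION_S = 120  # assumed ceiling (hypothetical; check the current tier caps)


def estimate_duration(prompt: str, steps: int) -> int:
    """Estimate GPU seconds from the same arguments the handler receives."""
    raw = math.ceil(steps * PER_STEP_S)
    # Clamp into [MIN_DURATION_S, MAX_DURATION_S].
    return max(MIN_DURATION_S, min(raw, MAX_DURATION_S))


print(estimate_duration("a cat", 1))    # 5
print(estimate_duration("a cat", 20))   # 70
print(estimate_duration("a cat", 100))  # 120
```

A ceiling only makes sense when the UI also limits `steps`, since a request clamped below its real run time would not have enough reserved time to finish.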
Process Isolation and Pickle

`@spaces.GPU`-decorated functions run in a separate process managed by the ZeroGPU scheduler. Arguments and return values cross the process boundary via pickle serialization.
Consequences:
  • Only picklable objects can be passed in or returned. Open file handles, database connections, locks, lambdas, and closures over unpicklable state will raise `PicklingError`.
  • Do NOT return CUDA tensors directly. Unpickling a CUDA tensor in the main process triggers `torch.cuda._lazy_init()`, which ZeroGPU blocks. Convert to CPU first: return `tensor.cpu()` or `tensor.cpu().numpy()`.
  • CPU tensors, numpy arrays, PIL Images, and plain Python objects work fine.
  • Large objects incur serialization overhead. Prefer lightweight returns (tensors, arrays, file paths, base64 strings) over complex object graphs.
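A quick local check for the first point is to try pickling a value before wiring it into a handler. The helper below is a hypothetical convenience (`is_picklable` is not part of `spaces`) and uses only the standard library. In CPython some unpicklable objects, such as locks, surface as `TypeError` rather than `PicklingError`, so the sketch catches both, plus `AttributeError`.

```python
import pickle
import threading


def is_picklable(obj) -> bool:
    """Return True if obj survives pickle.dumps, the step the process boundary needs."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


print(is_picklable({"prompt": "a cat", "steps": 20}))  # True
print(is_picklable(lambda x: x))                       # False
print(is_picklable(threading.Lock()))                  # False
```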

`gr.State` semantics across the boundary

Because handlers run in a separate process, `gr.State` values are pickled on every yield. They are NOT shared by reference.
  • The generator receives a copy of the state (`id()` differs from the caller's).
  • In-place mutations inside the generator are invisible to other handlers until the mutated state is explicitly yielded back.
  • Yielding `gr.update()` for a `gr.State` slot skips the update, so other handlers continue to see the pre-yield value.
  • Each yield that returns the state object creates a new copy via pickle.
Practical guidance:
  • Do NOT assume reference semantics for `gr.State` on ZeroGPU. Code that mutates state in a generator and expects another handler to see those mutations will silently use stale data.
  • Every yield including a `gr.State` value triggers a full pickle round-trip. For large state (model sessions, frame buffers), minimize how often you yield it, ideally once at the end. Use `gr.update()` for the state slot on intermediate yields.
  • CUDA tensors inside state must be moved to CPU before yielding, because of the same `torch.cuda._lazy_init()` issue as above.
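The copy semantics can be reproduced with plain `pickle`, without Gradio or ZeroGPU. The sketch below only simulates the boundary: `caller_state`, `worker_state`, and `returned_state` are illustrative names, and the explicit `pickle.loads(pickle.dumps(...))` calls stand in for what happens when state enters the handler and when it is yielded back.

```python
import pickle

caller_state = {"frames": [1, 2, 3]}

# Entering the handler: the worker gets a pickled copy, not the same object.
worker_state = pickle.loads(pickle.dumps(caller_state))
worker_state["frames"].append(4)  # in-place mutation on the worker side

print(worker_state is caller_state)  # False
print(caller_state["frames"])        # [1, 2, 3]

# Yielding the state back: another pickle round-trip produces a new copy.
returned_state = pickle.loads(pickle.dumps(worker_state))
print(returned_state["frames"])      # [1, 2, 3, 4]
```

The caller only sees the appended frame through `returned_state`, which is why intermediate mutations that are never yielded stay invisible.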

Concurrency

Handlers run concurrently by default on ZeroGPU. This is not opt-in. Code that worked in single-user testing can silently corrupt or leak data in production.
Three rules. Full treatment with examples in `references/concurrency.md`.
  1. No mutable global state. Concurrent requests overwrite each other.
  2. No fixed file paths for outputs. Concurrent requests clobber the same file. Use `tempfile` for unique paths.
  3. Read-only globals are safe. Model objects, tokenizers, configs loaded once at startup and only read during requests are safe and encouraged.
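For rule 2, a minimal sketch of a per-request output path using only the standard library is shown below. `save_output` is a hypothetical helper name; the point is that each call gets its own file, so two concurrent requests never write to the same path.

```python
import tempfile
from pathlib import Path


def save_output(data: bytes, suffix: str = ".bin") -> str:
    """Write data to a unique temp file and return its path."""
    # delete=False keeps the file on disk after the handle closes,
    # so the path can be handed back to the caller.
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as f:
        f.write(data)
        return f.name


path_a = save_output(b"request A")
path_b = save_output(b"request B")
print(path_a != path_b)             # True
print(Path(path_a).read_bytes())    # b'request A'
```

Files created this way are not cleaned up automatically, so a long-running app may want to delete them once they have been served.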

Call Granularity

Each entry into a `@spaces.GPU` function carries non-trivial cost: a pickle round-trip across the process boundary, worker warm-up, CUDA re-attach, and a fresh pass through the node-level queue. Calling a decorated function from inside a hot loop multiplies these costs and adds a new failure mode: a later iteration may fail to acquire a GPU slot, stalling the whole job mid-way.
Decorate the outer function that owns the loop, not the per-iteration worker:
python
# Avoid: N GPU entries for N frames
def process_video(frames):
    return [process_frame(f) for f in frames]

@spaces.GPU(duration=...)
def process_frame(frame):
    ...

# Prefer: one GPU entry for the whole video
@spaces.GPU(duration=...)
def process_video(frames):
    return [process_frame(f) for f in frames]

def process_frame(frame):
    ...
If the loop mixes heavy CPU work with GPU work, wrapping the whole loop charges that CPU time against the user's quota. When that cost is material, batching the GPU work so CPU pre/post-processing stays outside the decorator is a situational optimization, not the default.
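The difference in entry counts can be made concrete with a stand-in decorator. The sketch below does not use the real `@spaces.GPU`: `EntryCounter` is a hypothetical class that only counts how often a decorated function is entered, and the frame processing is a placeholder.

```python
import functools


class EntryCounter:
    """Stand-in for a GPU decorator that only counts entries (hypothetical)."""

    def __init__(self) -> None:
        self.count = 0

    def __call__(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            self.count += 1  # one "GPU entry" per call
            return func(*args, **kwargs)
        return wrapper


per_frame_gpu = EntryCounter()
whole_video_gpu = EntryCounter()


def process_frame(frame: int) -> int:
    return frame * 2  # placeholder for real per-frame work


@per_frame_gpu
def process_frame_on_gpu(frame: int) -> int:
    return process_frame(frame)


@whole_video_gpu
def process_video(frames: list) -> list:
    return [process_frame(f) for f in frames]


frames = list(range(8))
per_frame_result = [process_frame_on_gpu(f) for f in frames]
whole_video_result = process_video(frames)

print(per_frame_gpu.count)    # 8
print(whole_video_gpu.count)  # 1
```

Both variants produce the same output; only the number of entries, and with it the per-entry overhead described above, differs.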

CUDA Build Constraints

HF Spaces builds Docker images in a CPU-only environment. On ZeroGPU, the build phase has no `nvcc` because the base image is `python:3.13` (dedicated-GPU Spaces use `nvidia/cuda:*-devel-*` and have `nvcc` at build time). A CUDA-dependent package whose only distribution is sdist (e.g. bare `flash-attn`) therefore cannot be installed via `requirements.txt` on ZeroGPU. Only pre-built wheels work.
ZeroGPU runtime does have `nvcc` available, mounted from a CUDA devel image at `/cuda-image` since 2025-07 (originally added for AoTI support). This is what makes `torch.export` / AoTI workflows possible inside `@spaces.GPU` calls.
Bottom line: install every CUDA-dependent package from a pre-built wheel. If no wheel is available on PyPI, build one externally (e.g. host on HF Hub) and pin the URL. For `flash-attn`, the upstream releases page ships a fairly complete wheel matrix covering most Python × CUDA × torch combinations.
For wheel-tag reading (cxx11 ABI, `cu12torch2.X`, `cp3XX`), torch-family side-car drift, and the kernels-community fallback, see `references/cuda-and-deps.md`.
Example Caching

`gr.Examples` behavior is environment-dependent. On ZeroGPU specifically:
  • `cache_examples` defaults to `True` (Spaces sets `GRADIO_CACHE_EXAMPLES=true`).
  • `cache_mode` defaults to `"lazy"` (Spaces sets `GRADIO_CACHE_MODE=lazy` only on ZeroGPU).
ZeroGPU defaults to `lazy` because eager caching pre-runs every example at app startup, but ZeroGPU has no GPU attached at startup, only during request handling. Eager caching of GPU-bound examples would fail there.
When `cache_examples=True`, the `run_on_click` / `run_examples_on_click` parameter is silently ignored. If your app relies on click-populates-only behavior, set `cache_examples=False` explicitly to preserve it.
To reproduce ZeroGPU example-caching behavior locally:
bash
GRADIO_CACHE_EXAMPLES=true GRADIO_CACHE_MODE=lazy python app.py

Dependency Management

`python_version` pin in README frontmatter

Pinning `python_version` is effectively required for ZeroGPU. The runtime default is currently Python 3.10, so a local environment using 3.11+ will fail to install on the Space without an explicit pin. Pin to a ZeroGPU-supported version (3.12 is a reasonable default). The authoritative supported list lives in the ZeroGPU docs, so refer to the docs rather than hardcoding the full list.
yaml
# README.md frontmatter
python_version: "3.12"
Both `"3.12"` and `"3.12.12"` forms are accepted.
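To catch a mismatch before pushing, a local script can compare the running interpreter with the pinned value. The sketch below is one way to do that under stated assumptions: `matches_pin` and `PINNED` are hypothetical names, and the pin string has to be kept in sync with the README frontmatter by hand.

```python
import sys


def matches_pin(pin: str, version: tuple = sys.version_info[:2]) -> bool:
    """Compare a pin such as "3.12" or "3.12.12" with a (major, minor) tuple."""
    major, minor = (int(part) for part in pin.split(".")[:2])
    return (major, minor) == tuple(version)


PINNED = "3.12"  # assumed to mirror python_version in the README frontmatter

if not matches_pin(PINNED):
    local = f"{sys.version_info[0]}.{sys.version_info[1]}"
    print(f"warning: local Python {local} differs from pinned {PINNED}")
```

Only major and minor are compared, so a patch-level pin like `"3.12.12"` still matches any local 3.12 interpreter.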

Do not pin `spaces` in `requirements.txt`

The Space platform pins its own `spaces` version. A conflicting pin in `requirements.txt` causes pip resolution to fail at build time.
Rule: Do not include `spaces` in `requirements.txt`.
How to achieve this depends on your tooling:
  • Hand-written `requirements.txt`: simply omit `spaces`.
  • uv (`pyproject.toml`-managed): declare `spaces` in `pyproject.toml` so uv co-resolves transitive constraints (notably `psutil`, which `spaces` pins), then exclude it from the export:
    bash
    uv export --no-hashes --no-dev --no-emit-package spaces -o requirements.txt
    Without `spaces` in `pyproject.toml`, uv cannot see its transitive constraints and may resolve incompatible versions at build time.
  • pip-tools (`pip-compile`) / Poetry: use the equivalent exclude mechanism.

Pin `torch` to match wheel tags

If you install a CUDA-dependent wheel via direct URL, the wheel filename encodes the `torch` major.minor it was built against (e.g. `cu12torch2.8`). Pin `torch==X.Y.Z` in `requirements.txt` to match. Otherwise pip may resolve `torch` to a different version and the Space fails on first import. Details and the kernels-community alternative are in `references/cuda-and-deps.md`.
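The match between a wheel filename and a `torch` pin can be checked mechanically. The sketch below is a rough illustration, not a general wheel parser: `wheel_tags` and `torch_pin_matches` are hypothetical helpers, the filename is made up, and the regular expressions assume the tag layout mentioned above (`cu12torch2.X` in the version part, `cp3XX` as the Python tag).

```python
import re

# Matches a tag such as "cu12torch2.8" inside the wheel's version string.
TAG_RE = re.compile(r"cu(?P<cuda>\d+)torch(?P<torch>\d+\.\d+)")
# Matches the first Python tag such as "-cp312-".
PY_RE = re.compile(r"-(?P<py>cp\d+)-")


def wheel_tags(filename: str) -> dict:
    """Extract CUDA, torch major.minor, and Python tags from a wheel filename."""
    tag = TAG_RE.search(filename)
    py = PY_RE.search(filename)
    if tag is None or py is None:
        raise ValueError(f"unrecognized wheel filename: {filename}")
    return {"cuda": tag["cuda"], "torch": tag["torch"], "python": py["py"]}


def torch_pin_matches(filename: str, torch_pin: str) -> bool:
    """True if a torch X.Y.Z pin shares major.minor with the wheel's tag."""
    major_minor = ".".join(torch_pin.split(".")[:2])
    return wheel_tags(filename)["torch"] == major_minor


# Illustrative filename, not a real release artifact.
name = "example_pkg-1.0.0+cu12torch2.8cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"
print(wheel_tags(name))
print(torch_pin_matches(name, "2.8.0"))  # True
print(torch_pin_matches(name, "2.7.1"))  # False
```

The same `cp3XX` tag can be compared against the `python_version` pin, since a wheel built for one Python minor version will not install on another.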