huggingface-zerogpu
Hugging Face ZeroGPU
Rules and patterns for ML demos on Hugging Face Spaces with ZeroGPU hardware. Covers `@spaces.GPU`, duration and quota tuning, process isolation, the CUDA availability model, concurrency safety, and CUDA build constraints.

Scope
This skill is for Gradio SDK Spaces using ZeroGPU hardware. Docker and Static Spaces cannot schedule onto ZeroGPU, and Streamlit apps now run as Docker Spaces, so this skill applies only to Gradio. For general Gradio coding (components, layouts, event listeners), see the `huggingface-gradio` skill in this repo. The authoritative ZeroGPU docs live at https://huggingface.co/docs/hub/spaces-zerogpu. Refer to them for the current backing GPU, runtime version lists, and tier thresholds, all of which change over time.

Reference Files
| Reference | When to read |
|---|---|
| `references/concurrency.md` | Always read alongside SKILL.md when writing ZeroGPU code: handlers run in parallel by default |
| `references/how-zerogpu-works.md` | When reasoning about cold-starts, worker reuse, why module-scope warmup does not carry to requests, or why returning CUDA tensors hangs |
| `references/how-quota-works.md` | When choosing … |
| `references/cuda-and-deps.md` | When installing CUDA-dependent packages (e.g. …) |
Hardware
ZeroGPU exposes two GPU sizes that map to a fraction of the backing card:
| `size` | Slice of backing GPU | Quota cost |
|---|---|---|
| `large` (default) | Half | 1x |
| `xlarge` | Full | 2x |

The default `large` size gives half a physical GPU, so memory bandwidth and compute are significantly lower than the full card's specs. Use `xlarge` only when the workload genuinely needs the extra memory or compute.

Backing GPU changes without notice. ZeroGPU has already migrated across GPU generations several times; older write-ups may name A100 or H200, but those are outdated. For the current backing GPU and exact per-size VRAM, always check the ZeroGPU docs before sizing workloads.
Basic Pattern
```python
import spaces
import torch
from transformers import pipeline

# Instantiate at module scope; ZeroGPU handles the actual device mapping.
pipe = pipeline("text-generation", model="...", device="cuda")

@spaces.GPU
def generate(prompt: str) -> str:
    return pipe(prompt, max_new_tokens=100)[0]["generated_text"]
```

Key rules:
- Instantiate models at module scope and call `.to("cuda")` eagerly. ZeroGPU handles the actual device mapping transparently (see CUDA availability model below).
- Decorate GPU functions with `@spaces.GPU`. The decorator is a no-op outside ZeroGPU, so it is safe to keep in all environments.
- Set `duration` to match the realistic worst-case workload (default 60s). The platform pre-checks `requested duration` against the user's `remaining quota`, not against the actual run time, so a 10-second task left at the 60s default fails with `quota exceeded` as soon as the user's remaining quota drops below 60s. Smaller declared `duration` also ranks higher in the node-level queue. See "Duration and Quota" below.
- `torch.compile` is NOT supported. Use PyTorch ahead-of-time compilation (AoTI) (torch 2.8+) instead.
- Use `size="xlarge"` sparingly. It allocates the full backing GPU, but costs 2x quota and tends to queue longer.
```python
@spaces.GPU(duration=120)
def generate_image(prompt: str):
    return pipe(prompt).images[0]
```

CUDA Availability Model
Real GPU access is only available inside `@spaces.GPU`-decorated functions. Outside those functions, the GPU is not attached to the process.

However, `import spaces` monkey-patches `torch` so that:
- `torch.cuda.is_available()` returns `True` globally.
- `.to("cuda")` / `device="cuda"` calls at module scope succeed without error.

This is intentional. Module-scope `model.to("cuda")` calls register tensors with the ZeroGPU backend, which writes them to a disk offload directory at a startup "pack" step and frees the corresponding RAM. When a `@spaces.GPU` call lands, a forked GPU worker process streams those weights from disk into VRAM via a pinned-memory pipeline. Warm workers (reused across requests on the same GPU slot) keep weights resident on the GPU and skip the disk → VRAM step. The user-facing rule: write `device="cuda"` at module scope and it works. See `references/how-zerogpu-works.md` for the full lifecycle.

| Action | Where | Why |
|---|---|---|
| `.to("cuda")` / `device="cuda"` | Module scope | ZeroGPU registers the tensor and manages device migration |
| Actual CUDA computation (inference, etc.) | Inside `@spaces.GPU` | Real GPU is only attached during the decorated call |
| Branching on `torch.cuda.is_available()` | Avoid relying on it | Always returns `True` because of the monkey-patch |
Do not run inference or CUDA kernels at module scope — the real GPU is not attached, so operations either silently run on CPU or fail.
Device selection idiom still works
The standard idiom remains correct under ZeroGPU:
```python
import torch
from transformers import AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("...").to(device)
```

- ZeroGPU: `is_available()` is `True` (monkey-patched), so the model is registered for automatic device migration.
- Dedicated GPU Spaces / local GPU: `is_available()` is genuinely `True`.
- CPU Spaces / local CPU: resolves to `"cpu"`.

Do not hardcode `device="cuda"`; it breaks on CPU-only environments.

Eager loading is the right default
Load models at module scope, not lazily on first request. The Space process starts before any user arrives, so cold-start cost is paid once. Lazy loading (`global model; if model is None: ...`, `@lru_cache` wrappers, factory functions instantiating on first call) just pushes that cost onto the first user.

Local Development: Just Install `spaces`

Do not wrap `import spaces` in `try/except` and redefine `spaces.GPU` as a no-op fallback for local runs. Off-ZeroGPU, the `spaces` package is already a true no-op:
- Heavyweight behavior (CUDA monkey-patching, client init, startup hooks) is gated on the `SPACES_ZERO_GPU` env var, set only on ZeroGPU.
- `@spaces.GPU` returns the undecorated function unchanged off-ZeroGPU.
- Top-level `import spaces` performs only lightweight imports.

The Gradio SDK base image installs `spaces` on every hardware tier. So even after duplicating a Space onto a dedicated GPU (T4, L4, A10G, etc.) or CPU basic, no code changes are needed: `import spaces` still succeeds and `@spaces.GPU` becomes a transparent passthrough.

Anti-pattern
```python
try:
    import spaces
except ImportError:
    class spaces:  # type: ignore
        @staticmethod
        def GPU(func=None, **kwargs):
            return func if func else (lambda f: f)
```

Problems:
- The fallback must mimic every `@spaces.GPU` call shape (bare decorator, `duration=...`, `size=...`, generators, `aoti_*` helpers) and drifts as the `spaces` API grows.
- It hides `spaces` from `requirements.txt`, even though the Space needs it at deploy time.
- It solves a non-problem: the real package is already a no-op locally.
Do this instead
Add `spaces` to dependencies and import it unconditionally:

```python
import spaces

@spaces.GPU
def generate(prompt: str) -> str:
    ...
```

Duration and Quota
Three things happen when you declare `@spaces.GPU(duration=N)`:
- Tier-max check: each visitor tier has a per-call `duration` cap. Declaring `duration` larger than the cap fails immediately with `ZeroGPU illegal duration`, regardless of remaining quota. (Tier numbers change over time; see the ZeroGPU docs.)
- Quota pre-check: the platform compares `requested duration` against the user's `remaining quota`. If `remaining < requested`, the call fails with `ZeroGPU quota exceeded`, even if the actual work would have fit. The error message shows the explicit numbers, e.g. `"60s requested vs. 30s left"`. A 10-second task left at the default 60s therefore blocks the user once their remaining quota drops below 60s.
- Queue priority: the queue is node-level (requests from all Spaces on the same node compete for GPU slots), and shorter declared `duration` ranks higher.

All three favor declaring the smallest realistic `duration`, including for short tasks. Explicit `@spaces.GPU(duration=15)` on a 10-second task avoids premature `quota exceeded` rejections and ranks higher in the queue.

`xlarge` doubles the request. `requested = N * 2` when `size="xlarge"`, both for the tier-max check and the quota pre-check. So `@spaces.GPU(duration=60, size="xlarge")` is internally a 120s request.
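
The checks above can be modeled in a few lines of plain Python. This is an illustrative sketch, not the platform's implementation: `requested_seconds`, `precheck`, the returned strings, and the `tier_cap=120` figure are hypothetical names and numbers chosen for the example. Only the `N * 2` doubling for `xlarge` and the fact that the tier check ignores remaining quota come from the rules above.

```python
def requested_seconds(duration: int, size: str = "large") -> int:
    """Seconds used by both checks: xlarge doubles the declared duration."""
    return duration * 2 if size == "xlarge" else duration


def precheck(duration: int, remaining: int, tier_cap: int, size: str = "large") -> str:
    """Hypothetical model of the tier-max check followed by the quota pre-check."""
    requested = requested_seconds(duration, size)
    if requested > tier_cap:
        return "illegal duration"  # fails regardless of remaining quota
    if remaining < requested:
        return "quota exceeded"    # the actual run time is never consulted
    return "ok"


# A 10-second task left at the 60s default is rejected with 30s of quota left,
# while the same task declared at 15s passes. tier_cap=120 is a made-up number.
print(precheck(60, remaining=30, tier_cap=120))  # quota exceeded
print(precheck(15, remaining=30, tier_cap=120))  # ok
```

Running the same arithmetic by hand before choosing `duration` shows how much headroom a declared value leaves for users who are close to their quota limit.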
Dynamic duration for variable workloads
For workloads whose runtime depends on inputs, pass a callable that estimates `duration` per request. A static high `duration` locks out low-tier users (whose tier cap may be smaller than the static value) and unnecessarily reserves quota for light inputs.

```python
def estimate_duration(prompt, steps):
    return int(steps * 3.5)

@spaces.GPU(duration=estimate_duration)
def generate(prompt, steps):
    return pipe(prompt, num_inference_steps=steps).images[0]
```

For the full distinction between `illegal duration` vs `quota exceeded`, runs-per-day limits, the 24h quota window, and pay-as-you-go billing, see `references/how-quota-works.md`.
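
A duration callable can also clamp its estimate, so tiny inputs still get a sane floor and large inputs stay under a ceiling. The sketch below is one possible shape, under stated assumptions: the 3.5 seconds-per-step factor is taken from the example above, while `MIN_SECONDS`, `MAX_SECONDS`, and the clamping itself are hypothetical choices, not documented ZeroGPU behavior.

```python
MIN_SECONDS = 10   # hypothetical floor covering fixed per-call overhead
MAX_SECONDS = 120  # hypothetical ceiling; real tier caps are in the ZeroGPU docs


def estimate_duration(prompt: str, steps: int) -> int:
    """Estimate GPU seconds from the step count, clamped to [MIN_SECONDS, MAX_SECONDS]."""
    estimate = int(steps * 3.5)
    return max(MIN_SECONDS, min(estimate, MAX_SECONDS))


print(estimate_duration("a cat", steps=20))  # 70
```

On a Space, this function would be passed as `duration=estimate_duration`, taking the same arguments as the decorated handler.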

Process Isolation and Pickle
`@spaces.GPU` functions run in a separate worker process, so arguments and return values cross the process boundary via pickle. Consequences:
- Only picklable objects can be passed in or returned. Open file handles, database connections, locks, lambdas, and closures over unpicklable state will raise `PicklingError`.
- Do NOT return CUDA tensors directly. Unpickling a CUDA tensor in the main process triggers `torch.cuda._lazy_init()`, which ZeroGPU blocks. Convert to CPU first: return `tensor.cpu()` or `tensor.cpu().numpy()`.
- CPU tensors, numpy arrays, PIL Images, and plain Python objects work fine.
- Large objects incur serialization overhead. Prefer lightweight returns (tensors, arrays, file paths, base64 strings) over complex object graphs.
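
One way to catch these failures before deploying is to push candidate arguments and return values through `pickle` locally. The helper below is a minimal standard-library sketch. `is_picklable` is a hypothetical name, and passing this check is necessary but does not guarantee that a value behaves well across the ZeroGPU process boundary.

```python
import pickle
import threading


def is_picklable(obj) -> bool:
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        # CPython reports some unpicklable objects (e.g. locks) as TypeError.
        return False


print(is_picklable({"prompt": "a cat", "steps": 20}))  # True
print(is_picklable(lambda x: x + 1))                   # False
print(is_picklable(threading.Lock()))                  # False
```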
`gr.State` semantics across the boundary

Because handlers run in a separate process, `gr.State` values are pickled on every yield; they are NOT shared by reference.
- The generator receives a copy of the state (`id()` differs from the caller's).
- In-place mutations inside the generator are invisible to other handlers until the mutated state is explicitly yielded back.
- Yielding `gr.update()` for a `gr.State` slot skips the update; other handlers continue to see the pre-yield value.
- Each yield that returns the state object creates a new copy via pickle.

Practical guidance:
- Do NOT assume reference semantics for `gr.State` on ZeroGPU. Code that mutates state in a generator and expects another handler to see those mutations will silently use stale data.
- Every yield including a `gr.State` value triggers a full pickle round-trip. For large state (model sessions, frame buffers), minimize how often you yield it, ideally once at the end. Use `gr.update()` for the state slot on intermediate yields.
- CUDA tensors inside state must be moved to CPU before yielding: same `torch.cuda._lazy_init()` issue as above.
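
The copy semantics can be reproduced without Gradio or ZeroGPU by sending a state object through a pickle round-trip, which is what the process boundary does to it. This is a standard-library sketch. `cross_boundary` is a hypothetical stand-in for the boundary, not a ZeroGPU or Gradio API.

```python
import pickle


def cross_boundary(value):
    """Stand-in for the process boundary: serialize, then deserialize."""
    return pickle.loads(pickle.dumps(value))


caller_state = {"history": ["hello"]}
worker_state = cross_boundary(caller_state)

# The worker holds a copy, so in-place mutation does not reach the caller.
worker_state["history"].append("world")
print(worker_state is caller_state)  # False
print(caller_state["history"])       # ['hello']

# Only an explicit hand-back (a yield, in handler terms) makes the change visible.
caller_state = cross_boundary(worker_state)
print(caller_state["history"])       # ['hello', 'world']
```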
Concurrency
Handlers run concurrently by default on ZeroGPU. This is not opt-in. Code that worked in single-user testing can silently corrupt or leak data in production.
Three rules. Full treatment with examples in `references/concurrency.md`.
- No mutable global state. Concurrent requests overwrite each other.
- No fixed file paths for outputs. Concurrent requests clobber the same file. Use `tempfile` for unique paths.
- Read-only globals are safe. Model objects, tokenizers, configs loaded once at startup and only read during requests are safe and encouraged.
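
For the second rule, the standard library's `tempfile` module is enough to give every request its own output path. A minimal sketch, assuming the handler writes bytes to disk and returns the path. `save_output` and the `.png` suffix are illustrative names, not part of any ZeroGPU or Gradio API.

```python
import tempfile


def save_output(data: bytes, suffix: str = ".png") -> str:
    """Write data to a uniquely named temp file and return its path."""
    # delete=False keeps the file after the handle closes so the caller can read it.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as f:
        f.write(data)
        return f.name


path_a = save_output(b"request A")
path_b = save_output(b"request B")
print(path_a != path_b)  # True
```

Because each call gets a fresh name, two concurrent requests can never clobber each other's file the way a fixed `output.png` would.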
Call Granularity
Each entry into a `@spaces.GPU` function carries non-trivial cost: pickle round-trip across the process boundary, worker warm-up, CUDA re-attach, and a fresh pass through the node-level queue. Calling a decorated function from inside a hot loop multiplies these costs and adds a new failure mode: a later iteration may fail to acquire a GPU slot, stalling the whole job mid-way.

Decorate the outer function that owns the loop, not the per-iteration worker:

```python
# Avoid: N GPU entries for N frames
def process_video(frames):
    return [process_frame(f) for f in frames]

@spaces.GPU(duration=...)
def process_frame(frame):
    ...

# Prefer: one GPU entry for the whole video
@spaces.GPU(duration=...)
def process_video(frames):
    return [process_frame(f) for f in frames]

def process_frame(frame):
    ...
```

If the loop mixes heavy CPU work with GPU work, wrapping the whole loop charges that CPU time against the user's quota. When that cost is material, batching the GPU work so CPU pre/post-processing stays outside the decorator is a situational optimization, not the default.
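
When CPU time inside the decorated call is material, the batched shape looks roughly like the sketch below. To keep it runnable without the `spaces` package, `gpu_entry` is a stand-in decorator that only counts calls; on a real Space it would be `@spaces.GPU(duration=...)`. `preprocess`, `run_model_batch`, and `postprocess` are hypothetical placeholders for real work.

```python
import functools


def gpu_entry(func):
    """Stand-in for @spaces.GPU: counts entries instead of acquiring a GPU."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


def preprocess(frame):        # CPU work, stays outside the decorator
    return frame * 2


@gpu_entry
def run_model_batch(batch):   # the only GPU-bound step
    return [x + 1 for x in batch]


def postprocess(output):      # CPU work, stays outside the decorator
    return str(output)


def process_video(frames):
    inputs = [preprocess(f) for f in frames]
    outputs = run_model_batch(inputs)  # one entry for all frames
    return [postprocess(o) for o in outputs]


print(process_video([1, 2, 3]), run_model_batch.calls)  # ['3', '5', '7'] 1
```

The trade-off is that every preprocessed input and every output crosses the process boundary at once, so this shape suits frames that are small enough to pickle cheaply.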

CUDA Build Constraints
HF Spaces builds Docker images in a CPU-only environment. On ZeroGPU, the build phase has no `nvcc` because the base image is `python:3.13` (dedicated-GPU Spaces use `nvidia/cuda:*-devel-*` and have `nvcc` at build time). A CUDA-dependent package whose only distribution is sdist (e.g. bare `flash-attn`) therefore cannot be installed via `requirements.txt` on ZeroGPU. Only pre-built wheels work.

ZeroGPU runtime does have `nvcc` available, mounted from a CUDA devel image at `/cuda-image` since 2025-07 (originally added for AoTI support). This is what makes `torch.export` / AoTI workflows possible inside `@spaces.GPU` calls.

Bottom line: install every CUDA-dependent package from a pre-built wheel. If no wheel is available on PyPI, build one externally (e.g. host on HF Hub) and pin the URL. For `flash-attn`, the upstream releases page ships a fairly complete wheel matrix covering most Python × CUDA × torch combinations.

For wheel-tag reading (cxx11 ABI, `cu12torch2.X`, `cp3XX`), torch-family side-car drift, and the kernels-community fallback, see `references/cuda-and-deps.md`.

Example Caching
On Spaces, `gr.Examples` picks up the following defaults:
- `cache_examples` defaults to `True` (Spaces sets `GRADIO_CACHE_EXAMPLES=true`).
- `cache_mode` defaults to `"lazy"` (Spaces sets `GRADIO_CACHE_MODE=lazy` only on ZeroGPU).

ZeroGPU defaults to `lazy` because eager caching pre-runs every example at app startup, but ZeroGPU has no GPU attached at startup, only during request handling. Eager caching of GPU-bound examples would fail there.

When `cache_examples=True`, the `run_on_click` / `run_examples_on_click` parameter is silently ignored. If your app relies on click-populates-only behavior, set `cache_examples=False` explicitly to preserve it.

To reproduce ZeroGPU example-caching behavior locally:

```bash
GRADIO_CACHE_EXAMPLES=true GRADIO_CACHE_MODE=lazy python app.py
```

Dependency Management
`python_version` pin in README frontmatter

Pinning `python_version` is effectively required for ZeroGPU. The runtime default is currently Python 3.10, so a local environment using 3.11+ will fail to install on the Space without an explicit pin. Pin to a ZeroGPU-supported version (3.12 is a reasonable default). The authoritative supported list lives in the ZeroGPU docs; do not hardcode the full list, refer to the docs.

```yaml
# README.md frontmatter
python_version: "3.12"
```

Both `"3.12"` and `"3.12.12"` forms are accepted.

Do not pin `spaces` in `requirements.txt`

The Space platform pins its own `spaces` version. A conflicting pin in `requirements.txt` causes pip resolution to fail at build time.

Rule: Do not include `spaces` in `requirements.txt`.
How to achieve this depends on your tooling:
- Hand-written `requirements.txt`: simply omit `spaces`.
- uv (`pyproject.toml`-managed): declare `spaces` in `pyproject.toml` so uv co-resolves transitive constraints (notably `psutil`, which `spaces` pins), then exclude it from the export:
  ```bash
  uv export --no-hashes --no-dev --no-emit-package spaces -o requirements.txt
  ```
  Without `spaces` in `pyproject.toml`, uv cannot see its transitive constraints and may resolve incompatible versions at build time.
- pip-tools (`pip-compile`) / Poetry: use the equivalent exclude mechanism.
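
Whatever tooling produces the file, a small pre-deploy check can confirm that `spaces` did not slip into `requirements.txt`. The script below is a hedged sketch. `find_spaces_lines` is a hypothetical helper, the version in `sample` is made up, and the regular expression only handles plain requirement lines (a name followed by optional extras, a version specifier, or a marker), not every form pip accepts.

```python
import re

# Matches a requirement whose distribution name is exactly "spaces".
_SPACES_LINE = re.compile(r"^\s*spaces\s*(\[|[<>=!~;@]|$)", re.IGNORECASE)


def find_spaces_lines(requirements_text: str) -> list[str]:
    """Return the requirement lines that name the `spaces` package."""
    hits = []
    for line in requirements_text.splitlines():
        stripped = line.split("#", 1)[0].strip()  # drop trailing comments
        if stripped and _SPACES_LINE.match(stripped):
            hits.append(stripped)
    return hits


# Made-up file contents, used only to exercise the check.
sample = "torch==2.8.0\nspaces==0.40.0  # pinned by mistake\ngradio\n"
print(find_spaces_lines(sample))  # ['spaces==0.40.0']
```

An empty result means the file follows the rule. Any hit is a line to remove before pushing to the Space.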
Pin `torch` to match wheel tags

If you install a CUDA-dependent wheel via direct URL, the wheel filename encodes the `torch` major.minor it was built against (e.g. `cu12torch2.8`). Pin `torch==X.Y.Z` in `requirements.txt` to match; otherwise pip may resolve to a different `torch` version and the Space fails on first import. Details and the kernels-community alternative are in `references/cuda-and-deps.md`.
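
Reading the tag can be automated. The sketch below extracts the CUDA digits and the `torch` major.minor from a tag of the `cu12torch2.8` form. The filename is a made-up example in that style and `parse_wheel_tag` is a hypothetical helper, so check it against the actual filenames of the wheels you install.

```python
import re

# Looks for a tag such as "cu12torch2.8" anywhere in the wheel filename.
_TAG = re.compile(r"cu(?P<cuda>\d+)torch(?P<torch>\d+\.\d+)")


def parse_wheel_tag(filename: str):
    """Return (cuda_digits, torch_major_minor), or None if no tag is found."""
    match = _TAG.search(filename)
    if match is None:
        return None
    return match.group("cuda"), match.group("torch")


# Hypothetical filename, used only to exercise the parser.
name = "example_pkg-1.0.0+cu12torch2.8cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"
print(parse_wheel_tag(name))  # ('12', '2.8')
```

The second element is the `torch` series to pin: a result of `2.8` means the `torch` pin in `requirements.txt` should be a 2.8.x release.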