Running Workloads on Hugging Face Jobs

Overview


Run any workload on fully managed Hugging Face infrastructure. No local setup required—jobs run on cloud CPUs, GPUs, or TPUs and can persist results to the Hugging Face Hub.
Common use cases:
  • Data Processing - Transform, filter, or analyze large datasets
  • Batch Inference - Run inference on thousands of samples
  • Experiments & Benchmarks - Reproducible ML experiments
  • Model Training - Fine-tune models (see the `model-trainer` skill for TRL-specific training)
  • Synthetic Data Generation - Generate datasets using LLMs
  • Development & Testing - Test code without local GPU setup
  • Scheduled Jobs - Automate recurring tasks
For model training specifically: See the `model-trainer` skill for TRL-based training workflows.

When to Use This Skill


Use this skill when users want to:
  • Run Python workloads on cloud infrastructure
  • Execute jobs without local GPU/TPU setup
  • Process data at scale
  • Run batch inference or experiments
  • Schedule recurring tasks
  • Use GPUs/TPUs for any workload
  • Persist results to the Hugging Face Hub

Key Directives


When assisting with jobs:
  1. ALWAYS use the `hf_jobs()` MCP tool - Submit jobs using `hf_jobs("uv", {...})` or `hf_jobs("run", {...})`. The `script` parameter accepts Python code directly. Do NOT save to local files unless the user explicitly requests it. Pass the script content as a string to `hf_jobs()`.
  2. Always handle authentication - Jobs that interact with the Hub require `HF_TOKEN` via secrets. See the Token Usage section below.
  3. Provide job details after submission - After submitting, provide the job ID, monitoring URL, and estimated time, and note that the user can request status checks later.
  4. Set appropriate timeouts - The default 30 minutes may be insufficient for long-running tasks.

Prerequisites Checklist


Before starting any job, verify:

Account & Authentication


  • Hugging Face Account with a Pro, Team, or Enterprise plan (Jobs require a paid plan)
  • Authenticated login: check with `hf_whoami()`
  • HF_TOKEN for Hub access ⚠️ CRITICAL - Required for any Hub operations (pushing models/datasets, downloading private repos, etc.)
  • Token must have appropriate permissions (read for downloads, write for uploads)

Token Usage (See Token Usage section for details)


When tokens are required:
  • Pushing models/datasets to Hub
  • Accessing private repositories
  • Using Hub APIs in scripts
  • Any authenticated Hub operations
How to provide tokens:

**hf_jobs MCP tool — `$HF_TOKEN` is auto-replaced with your real token:**

```python
{"secrets": {"HF_TOKEN": "$HF_TOKEN"}}
```

**HfApi().run_uv_job() — MUST pass the actual token:**

```python
from huggingface_hub import get_token

secrets = {"HF_TOKEN": get_token()}
```

**⚠️ CRITICAL:** The `$HF_TOKEN` placeholder is ONLY auto-replaced by the `hf_jobs` MCP tool. When using `HfApi().run_uv_job()`, you MUST pass the real token via `get_token()`. Passing the literal string `"$HF_TOKEN"` results in a 9-character invalid token and 401 errors.

Token Usage Guide


Understanding Tokens


What are HF Tokens?
  • Authentication credentials for Hugging Face Hub
  • Required for authenticated operations (push, private repos, API access)
  • Stored securely on your machine after `hf auth login`
Token Types:
  • Read Token - Can download models/datasets, read private repos
  • Write Token - Can push models/datasets, create repos, modify content
  • Organization Token - Can act on behalf of an organization

When Tokens Are Required


Always Required:
  • Pushing models/datasets to Hub
  • Accessing private repositories
  • Creating new repositories
  • Modifying existing repositories
  • Using Hub APIs programmatically
Not Required:
  • Downloading public models/datasets
  • Running jobs that don't interact with Hub
  • Reading public repository information
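The distinction above can be encoded directly in how a job config is built. The sketch below is a hypothetical helper (the name `job_config` and the `needs_hub` flag are illustrative, not part of any API): it only attaches `HF_TOKEN` when the job actually touches the Hub, so public-only jobs carry no credentials.

```python
def job_config(script: str, needs_hub: bool, flavor: str = "cpu-basic") -> dict:
    """Build an hf_jobs-style config dict; attach HF_TOKEN only when needed."""
    config = {"script": script, "flavor": flavor}
    if needs_hub:  # pushing, private repos, programmatic Hub API use
        config["secrets"] = {"HF_TOKEN": "$HF_TOKEN"}
    return config
```

A job that only downloads public datasets gets no `secrets` entry, which keeps the credential surface as small as possible.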

How to Provide Tokens to Jobs


Method 1: Automatic Token (Recommended)


```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Automatic replacement
})
```
How it works:
  • `$HF_TOKEN` is a placeholder that gets replaced with your actual token
  • Uses the token from your logged-in session (`hf auth login`)
  • Most secure and convenient method
  • Token is encrypted server-side when passed as a secret
Benefits:
  • No token exposure in code
  • Uses your current login session
  • Automatically updated if you re-login
  • Works seamlessly with MCP tools

Method 2: Explicit Token (Not Recommended)


```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Hardcoded token
})
```
When to use:
  • Only if the automatic token doesn't work
  • Testing with a specific token
  • Organization tokens (use with caution)
Security concerns:
  • Token visible in code/logs
  • Must manually update if token rotates
  • Risk of token exposure

Method 3: Environment Variable (Less Secure)


```python
hf_jobs("uv", {
    "script": "your_script.py",
    "env": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Less secure than secrets
})
```
Difference from secrets:
  • `env` variables are visible in job logs
  • `secrets` are encrypted server-side
  • Always prefer `secrets` for tokens

Using Tokens in Scripts


In your Python script, tokens are available as environment variables:

```python
# /// script
# dependencies = ["huggingface-hub"]
# ///
import os
from huggingface_hub import HfApi

# Token is automatically available if passed via secrets
token = os.environ.get("HF_TOKEN")

# Use with Hub API
api = HfApi(token=token)

# Or let huggingface_hub auto-detect
api = HfApi()  # Automatically uses HF_TOKEN env var
```
**Best practices:**
- Don't hardcode tokens in scripts
- Use `os.environ.get("HF_TOKEN")` to access
- Let `huggingface_hub` auto-detect when possible
- Verify token exists before Hub operations
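The best practices above can be bundled into one guard called at the top of a job script. This is a sketch (the helper name `require_hf_token` is an assumption, not an existing API): it fails fast with an actionable message instead of a confusing 401 deep inside a Hub call.

```python
import os


def require_hf_token() -> str:
    """Validate HF_TOKEN early and return it, failing fast with a clear message."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN not set; pass it via secrets={'HF_TOKEN': '$HF_TOKEN'}"
        )
    if not token.startswith("hf_"):
        # Catches the literal "$HF_TOKEN" placeholder slipping through unreplaced
        raise RuntimeError("HF_TOKEN looks malformed (expected 'hf_' prefix)")
    return token
```

Calling `require_hf_token()` before any Hub operation turns a late, opaque authentication failure into an immediate, self-explanatory one.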

Token Verification


Check if you're logged in:
```python
from huggingface_hub import whoami
user_info = whoami()  # Returns your username if authenticated
```
Verify token in job:
```python
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN not found!"
token = os.environ["HF_TOKEN"]
print(f"Token starts with: {token[:7]}...")  # Should start with "hf_"
```

Common Token Issues


Error: 401 Unauthorized
  • Cause: Token missing or invalid
  • Fix: Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to the job config
  • Verify: Check that `hf_whoami()` works locally
Error: 403 Forbidden
Error: Token not found in environment
  • Cause: `secrets` not passed or wrong key name
  • Fix: Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
  • Verify: Script checks `os.environ.get("HF_TOKEN")`
Error: Repository access denied
  • Cause: Token doesn't have access to the private repo
  • Fix: Use a token from an account with access
  • Check: Verify repo visibility and your permissions
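For the 401 case, a quick in-job diagnostic (a sketch, not an official pattern) is to print the token's length and prefix without ever printing the token itself; a 9-character value that doesn't start with `hf_` is the signature of the `$HF_TOKEN` placeholder having been passed literally instead of being replaced.

```python
import os

token = os.environ.get("HF_TOKEN", "")
# Never print the token itself; length and prefix are enough to diagnose
print(f"HF_TOKEN length: {len(token)}, prefix ok: {token.startswith('hf_')}")
# length == 9 with prefix ok == False usually means the literal "$HF_TOKEN"
# placeholder reached the job unreplaced
```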

Token Security Best Practices


  1. Never commit tokens - Use the `$HF_TOKEN` placeholder or environment variables
  2. Use secrets, not env - Secrets are encrypted server-side
  3. Rotate tokens regularly - Generate new tokens periodically
  4. Use minimal permissions - Create tokens with only the needed permissions
  5. Don't share tokens - Each user should use their own token
  6. Monitor token usage - Check token activity in Hub settings
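One concrete way to reduce accidental exposure (supporting practices 1 and 2) is to scrub token-shaped strings from anything your script logs. The sketch below is illustrative; the function name and the exact regex are assumptions, not part of any library.

```python
import re


def redact_tokens(text: str) -> str:
    """Mask anything that looks like an HF token before it reaches logs."""
    return re.sub(r"hf_[A-Za-z0-9]{4,}", "hf_***", text)
```

Routing log messages through such a filter means a stray `print` of a config dict or error message can't leak the credential verbatim.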

Complete Token Example


```python
# Example: Push results to Hub
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["huggingface-hub", "datasets"]
# ///
import os
from huggingface_hub import HfApi
from datasets import Dataset

# Verify token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Use token for Hub operations
api = HfApi(token=os.environ["HF_TOKEN"])

# Create and push dataset
data = {"text": ["Hello", "World"]}
dataset = Dataset.from_dict(data)
dataset.push_to_hub("username/my-dataset", token=os.environ["HF_TOKEN"])
print("✅ Dataset pushed successfully!")
""",
    "flavor": "cpu-basic",
    "timeout": "30m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Token provided securely
})
```

Quick Start: Two Approaches


Approach 1: UV Scripts (Recommended)


UV scripts use PEP 723 inline dependencies for clean, self-contained workloads.
**MCP Tool:**
```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["transformers", "torch"]
# ///
from transformers import pipeline
import torch

# Your workload here
classifier = pipeline("sentiment-analysis")
result = classifier("I love Hugging Face!")
print(result)
""",
    "flavor": "cpu-basic",
    "timeout": "30m"
})
```

**CLI Equivalent:**
```bash
hf jobs uv run my_script.py --flavor cpu-basic --timeout 30m
```

**Python API:**
```python
from huggingface_hub import run_uv_job
run_uv_job("my_script.py", flavor="cpu-basic", timeout="30m")
```

**Benefits:** Direct MCP tool usage, clean code, dependencies declared inline, no file saving required
**When to use:** Default choice for all workloads, custom logic, any scenario requiring `hf_jobs()`

Custom Docker Images for UV Scripts


By default, UV scripts use `ghcr.io/astral-sh/uv:python3.12-bookworm-slim`. For ML workloads with complex dependencies, use pre-built images:
```python
hf_jobs("uv", {
    "script": "inference.py",
    "image": "vllm/vllm-openai:latest",  # Pre-built image with vLLM
    "flavor": "a10g-large"
})
```
**CLI:**
```bash
hf jobs uv run --image vllm/vllm-openai:latest --flavor a10g-large inference.py
```
**Benefits:** Faster startup, pre-installed dependencies, optimized for specific frameworks

Python Version


By default, UV scripts use Python 3.12. Specify a different version:
```python
hf_jobs("uv", {
    "script": "my_script.py",
    "python": "3.11",  # Use Python 3.11
    "flavor": "cpu-basic"
})
```
**Python API:**
```python
from huggingface_hub import run_uv_job
run_uv_job("my_script.py", python="3.11")
```

Working with Scripts


⚠️ Important: There are two "script path" stories depending on how you run Jobs:
  • Using the `hf_jobs()` MCP tool (recommended in this repo): the `script` value must be inline code (a string) or a URL. A local filesystem path (like `"./scripts/foo.py"`) won't exist inside the remote container.
  • Using the `hf jobs uv run` CLI: local file paths do work (the CLI uploads your script).
Common mistake with the `hf_jobs()` MCP tool:

```python
# ❌ Will fail (remote container can't see your local path)
hf_jobs("uv", {"script": "./scripts/foo.py"})
```

**Correct patterns with `hf_jobs()` MCP tool:**

```python
# ✅ Inline: read the local script file and pass its contents
from pathlib import Path
script = Path("hf-jobs/scripts/foo.py").read_text()
hf_jobs("uv", {"script": script})

# ✅ URL: host the script somewhere reachable (e.g., a URL from GitHub)
```

**CLI equivalent (local paths supported):**

```bash
hf jobs uv run ./scripts/foo.py -- --your --args
```

Adding Dependencies at Runtime


Add extra dependencies beyond what's in the PEP 723 header:
```python
hf_jobs("uv", {
    "script": "inference.py",
    "dependencies": ["transformers", "torch>=2.0"],  # Extra deps
    "flavor": "a10g-small"
})
```
**Python API:**
```python
from huggingface_hub import run_uv_job
run_uv_job("inference.py", dependencies=["transformers", "torch>=2.0"])
```

Approach 2: Docker-Based Jobs


Run jobs with custom Docker images and commands.
**MCP Tool:**
```python
hf_jobs("run", {
    "image": "python:3.12",
    "command": ["python", "-c", "print('Hello from HF Jobs!')"],
    "flavor": "cpu-basic",
    "timeout": "30m"
})
```
**CLI Equivalent:**
```bash
hf jobs run python:3.12 python -c "print('Hello from HF Jobs!')"
```
**Python API:**
```python
from huggingface_hub import run_job
run_job(image="python:3.12", command=["python", "-c", "print('Hello!')"], flavor="cpu-basic")
```
**Benefits:** Full Docker control, use pre-built images, run any command
**When to use:** Need specific Docker images, non-Python workloads, complex environments

**Example with GPU:**
```python
hf_jobs("run", {
    "image": "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
    "command": ["python", "-c", "import torch; print(torch.cuda.get_device_name())"],
    "flavor": "a10g-small",
    "timeout": "1h"
})
```
**Using Hugging Face Spaces as Images:**
You can use Docker images from HF Spaces:
```python
hf_jobs("run", {
    "image": "hf.co/spaces/lhoestq/duckdb",  # Space as Docker image
    "command": ["duckdb", "-c", "SELECT 'Hello from DuckDB!'"],
    "flavor": "cpu-basic"
})
```
**CLI:**
```bash
hf jobs run hf.co/spaces/lhoestq/duckdb duckdb -c "SELECT 'Hello!'"
```

Finding More UV Scripts on Hub


The `uv-scripts` organization provides ready-to-use UV scripts stored as datasets on the Hugging Face Hub:

```python
# Discover available UV script collections
dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})

# Explore a specific collection
hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)
```

**Popular collections:** OCR, classification, synthetic-data, vLLM, dataset-creation

Hardware Selection


Reference: HF Jobs Hardware Docs (updated 07/2025)

| Workload Type | Recommended Hardware | Use Case |
|---|---|---|
| Data processing, testing | `cpu-basic`, `cpu-upgrade` | Lightweight tasks |
| Small models, demos | `t4-small` | <1B models, quick tests |
| Medium models | `t4-medium`, `l4x1` | 1-7B models |
| Large models, production | `a10g-small`, `a10g-large` | 7-13B models |
| Very large models | `a100-large` | 13B+ models |
| Batch inference | `a10g-large`, `a100-large` | High-throughput |
| Multi-GPU workloads | `l4x4`, `a10g-largex2`, `a10g-largex4` | Parallel/large models |
| TPU workloads | `v5e-1x1`, `v5e-2x2`, `v5e-2x4` | JAX/Flax, TPU-optimized |

All Available Flavors:
  • CPU: `cpu-basic`, `cpu-upgrade`
  • GPU: `t4-small`, `t4-medium`, `l4x1`, `l4x4`, `a10g-small`, `a10g-large`, `a10g-largex2`, `a10g-largex4`, `a100-large`
  • TPU: `v5e-1x1`, `v5e-2x2`, `v5e-2x4`
Guidelines:
  • Start with smaller hardware for testing
  • Scale up based on actual needs
  • Use multi-GPU for parallel workloads or large models
  • Use TPUs for JAX/Flax workloads
  • See `references/hardware_guide.md` for detailed specifications
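The table's model-size rows can be collapsed into a small lookup. This is a hypothetical helper (the name `suggest_flavor` and the single-flavor-per-tier choice are assumptions; the table often lists two options per tier) that returns a reasonable starting flavor from a parameter count in billions.

```python
def suggest_flavor(model_size_b: float) -> str:
    """Map model size (billions of params) to a starting flavor from the table."""
    if model_size_b < 1:
        return "t4-small"    # <1B models, quick tests
    if model_size_b <= 7:
        return "t4-medium"   # 1-7B models (l4x1 also fits this tier)
    if model_size_b <= 13:
        return "a10g-small"  # 7-13B models (a10g-large for headroom)
    return "a100-large"      # 13B+ models
```

Treat the result as a starting point and scale up only if the job actually runs out of memory or throughput.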

Critical: Saving Results


⚠️ EPHEMERAL ENVIRONMENT—MUST PERSIST RESULTS
The Jobs environment is temporary. All files are deleted when the job ends. If results aren't persisted, ALL WORK IS LOST.

Persistence Options


**1. Push to Hugging Face Hub (Recommended)**

```python
# Push models
model.push_to_hub("username/model-name", token=os.environ["HF_TOKEN"])

# Push datasets
dataset.push_to_hub("username/dataset-name", token=os.environ["HF_TOKEN"])

# Push artifacts
api.upload_file(
    path_or_fileobj="results.json",
    path_in_repo="results.json",
    repo_id="username/results",
    token=os.environ["HF_TOKEN"]
)
```

**2. Use External Storage**

```python
# Upload to S3, GCS, etc.
import boto3
s3 = boto3.client('s3')
s3.upload_file('results.json', 'my-bucket', 'results.json')
```

**3. Send Results via API**

```python
# POST results to your API
import requests
requests.post("https://your-api.com/results", json=results)
```

Required Configuration for Hub Push


**In job submission:**

```python
# hf_jobs MCP tool — placeholder is auto-replaced:
{"secrets": {"HF_TOKEN": "$HF_TOKEN"}}

# HfApi().run_uv_job() — must pass the real token:
from huggingface_hub import get_token
secrets = {"HF_TOKEN": get_token()}
```

**In script:**
```python
import os
from huggingface_hub import HfApi

# Token automatically available from secrets
api = HfApi(token=os.environ.get("HF_TOKEN"))

# Push your results
api.upload_file(...)
```

Verification Checklist


Before submitting:
  • Results persistence method chosen
  • Token in secrets if using Hub (MCP: `"$HF_TOKEN"`, Python API: `get_token()`)
  • Script handles a missing token gracefully
  • Test that the persistence path works
See: `references/hub_saving.md` for a detailed Hub persistence guide

Timeout Management


⚠️ DEFAULT: 30 MINUTES
Jobs automatically stop after the timeout. For long-running tasks like training, always set a custom timeout.

Setting Timeouts


**MCP Tool:**
```python
{
    "timeout": "2h"   # 2 hours
}
```
Supported formats:
  • Integer/float: seconds (e.g., `300` = 5 minutes)
  • String with suffix: `"5m"` (minutes), `"2h"` (hours), `"1d"` (days)
  • Examples: `"90m"`, `"2h"`, `"1.5h"`, `300`, `"1d"`
**Python API:**
```python
from huggingface_hub import run_job, run_uv_job

run_job(image="python:3.12", command=[...], timeout="2h")
run_uv_job("script.py", timeout=7200)  # 2 hours in seconds
```
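The supported formats can be mirrored in a small normalizer, which is handy when computing budgets or comparing timeouts programmatically. This is a sketch (the function is hypothetical, not part of `huggingface_hub`) assuming the formats listed above: bare numbers are seconds, and `s`/`m`/`h`/`d` suffixes scale accordingly.

```python
def timeout_to_seconds(value) -> float:
    """Normalize the documented timeout formats to seconds."""
    if isinstance(value, (int, float)):
        return float(value)  # bare numbers are already seconds
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    suffix = value[-1].lower()
    if suffix in units:
        return float(value[:-1]) * units[suffix]  # "90m", "2h", "1.5h", "1d"
    return float(value)  # plain numeric string
```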

Timeout Guidelines


| Scenario | Recommended | Notes |
|---|---|---|
| Quick test | 10-30 min | Verify setup |
| Data processing | 1-2 hours | Depends on data size |
| Batch inference | 2-4 hours | Large batches |
| Experiments | 4-8 hours | Multiple runs |
| Long-running | 8-24 hours | Production workloads |

Always add a 20-30% buffer for setup, network delays, and cleanup.
**On timeout:** Job killed immediately, all unsaved progress lost

Cost Estimation


General guidelines:
Total Cost = (Hours of runtime) × (Cost per hour)
Example calculations:
Quick test:
  • Hardware: cpu-basic ($0.10/hour)
  • Time: 15 minutes (0.25 hours)
  • Cost: $0.03
Data processing:
  • Hardware: l4x1 ($2.50/hour)
  • Time: 2 hours
  • Cost: $5.00
Batch inference:
  • Hardware: a10g-large ($5/hour)
  • Time: 4 hours
  • Cost: $20.00
Cost optimization tips:
  1. Start small - Test on cpu-basic or t4-small
  2. Monitor runtime - Set appropriate timeouts
  3. Use checkpoints - Resume if job fails
  4. Optimize code - Reduce unnecessary compute
  5. Choose right hardware - Don't over-provision
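The formula above is trivially mechanizable; the sketch below (a hypothetical helper, and the hourly rates in the examples are illustrative, not official pricing) also takes an optional `buffer` fraction for setup and retry headroom.

```python
def estimate_cost(runtime_hours: float, rate_per_hour: float, buffer: float = 0.0) -> float:
    """Total Cost = hours × rate, with optional fractional headroom, rounded to cents."""
    return round(runtime_hours * (1 + buffer) * rate_per_hour, 2)
```

For instance, the data-processing example (2 hours on a $2.50/hour flavor) comes to $5.00, and adding a 25% buffer raises it to $6.25.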

Monitoring and Tracking


Check Job Status


**MCP Tool:**
```python
# List all jobs
hf_jobs("ps")

# Inspect specific job
hf_jobs("inspect", {"job_id": "your-job-id"})

# View logs
hf_jobs("logs", {"job_id": "your-job-id"})

# Cancel a job
hf_jobs("cancel", {"job_id": "your-job-id"})
```

**Python API:**
```python
from huggingface_hub import list_jobs, inspect_job, fetch_job_logs, cancel_job

# List your jobs
jobs = list_jobs()

# List running jobs only
running = [j for j in list_jobs() if j.status.stage == "RUNNING"]

# Inspect specific job
job_info = inspect_job(job_id="your-job-id")

# View logs
for log in fetch_job_logs(job_id="your-job-id"):
    print(log)

# Cancel a job
cancel_job(job_id="your-job-id")
```

**CLI:**
```bash
hf jobs ps                    # List jobs
hf jobs logs <job-id>         # View logs
hf jobs cancel <job-id>       # Cancel job
```
**Remember:** Wait for the user to request status checks. Avoid polling repeatedly.

Job URLs


After submission, jobs have monitoring URLs:
https://huggingface.co/jobs/username/job-id
View logs, status, and details in the browser.
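When reporting job details back to a user (per the Key Directives), the monitoring URL can be assembled from the username and job ID. A minimal sketch, assuming the URL pattern shown above:

```python
def job_url(username: str, job_id: str) -> str:
    """Build the browser monitoring URL for a submitted job."""
    return f"https://huggingface.co/jobs/{username}/{job_id}"
```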

Wait for Multiple Jobs


```python
import time
from huggingface_hub import inspect_job, run_job

# Run multiple jobs
jobs = [run_job(image=img, command=cmd) for img, cmd in workloads]

# Wait for all to complete
for job in jobs:
    while inspect_job(job_id=job.id).status.stage not in ("COMPLETED", "ERROR"):
        time.sleep(10)
```

Scheduled Jobs


Run jobs on a schedule using CRON expressions or predefined schedules.

使用CRON表达式或预定义计划定时运行作业。

**MCP Tool:**

```python
# Schedule a UV script that runs every hour
hf_jobs("scheduled uv", {
    "script": "your_script.py",
    "schedule": "@hourly",
    "flavor": "cpu-basic"
})

# Schedule with CRON syntax
hf_jobs("scheduled uv", {
    "script": "your_script.py",
    "schedule": "0 9 * * 1",  # 9 AM every Monday
    "flavor": "cpu-basic"
})

# Schedule a Docker-based job
hf_jobs("scheduled run", {
    "image": "python:3.12",
    "command": ["python", "-c", "print('Scheduled!')"],
    "schedule": "@daily",
    "flavor": "cpu-basic"
})
```

**MCP工具:**

```python
# 定时运行UV脚本,每小时执行一次
hf_jobs("scheduled uv", {
    "script": "your_script.py",
    "schedule": "@hourly",
    "flavor": "cpu-basic"
})

# 使用CRON语法定时
hf_jobs("scheduled uv", {
    "script": "your_script.py",
    "schedule": "0 9 * * 1",  # 每周一上午9点
    "flavor": "cpu-basic"
})

# 定时运行基于Docker的作业
hf_jobs("scheduled run", {
    "image": "python:3.12",
    "command": ["python", "-c", "print('Scheduled!')"],
    "schedule": "@daily",
    "flavor": "cpu-basic"
})
```

**Python API:**

```python
from huggingface_hub import create_scheduled_job, create_scheduled_uv_job

# Schedule a Docker job
create_scheduled_job(
    image="python:3.12",
    command=["python", "-c", "print('Running on schedule!')"],
    schedule="@hourly"
)

# Schedule a UV script
create_scheduled_uv_job("my_script.py", schedule="@daily", flavor="cpu-basic")

# Schedule with GPU
create_scheduled_uv_job(
    "ml_inference.py",
    schedule="0 */6 * * *",  # Every 6 hours
    flavor="a10g-small"
)
```

**Python API:**

```python
from huggingface_hub import create_scheduled_job, create_scheduled_uv_job

# 定时运行Docker作业
create_scheduled_job(
    image="python:3.12",
    command=["python", "-c", "print('Running on schedule!')"],
    schedule="@hourly"
)

# 定时运行UV脚本
create_scheduled_uv_job("my_script.py", schedule="@daily", flavor="cpu-basic")

# 使用GPU定时运行
create_scheduled_uv_job(
    "ml_inference.py",
    schedule="0 */6 * * *",  # 每6小时一次
    flavor="a10g-small"
)
```

**Available schedules:**
- `@annually`, `@yearly` - Once per year
- `@monthly` - Once per month
- `@weekly` - Once per week
- `@daily` - Once per day
- `@hourly` - Once per hour
- CRON expression - Custom schedule (e.g., `"*/5 * * * *"` for every 5 minutes)

**可用的计划:**
- `@annually`, `@yearly` - 每年一次
- `@monthly` - 每月一次
- `@weekly` - 每周一次
- `@daily` - 每天一次
- `@hourly` - 每小时一次
- CRON表达式 - 自定义计划(如`"*/5 * * * *"`表示每5分钟一次)

**Manage scheduled jobs (MCP Tool):**

```python
hf_jobs("scheduled ps")                           # List scheduled jobs
hf_jobs("scheduled inspect", {"job_id": "..."})   # Inspect details
hf_jobs("scheduled suspend", {"job_id": "..."})   # Pause
hf_jobs("scheduled resume", {"job_id": "..."})    # Resume
hf_jobs("scheduled delete", {"job_id": "..."})    # Delete
```

**管理定时作业(MCP工具):**

```python
hf_jobs("scheduled ps")                           # 列出定时作业
hf_jobs("scheduled inspect", {"job_id": "..."})   # 查看详情
hf_jobs("scheduled suspend", {"job_id": "..."})   # 暂停
hf_jobs("scheduled resume", {"job_id": "..."})    # 恢复
hf_jobs("scheduled delete", {"job_id": "..."})    # 删除
```

**Python API for management:**

```python
from huggingface_hub import (
    list_scheduled_jobs,
    inspect_scheduled_job,
    suspend_scheduled_job,
    resume_scheduled_job,
    delete_scheduled_job
)

# List all scheduled jobs
scheduled = list_scheduled_jobs()

# Inspect a scheduled job
info = inspect_scheduled_job(scheduled_job_id)

# Suspend (pause) a scheduled job
suspend_scheduled_job(scheduled_job_id)

# Resume a scheduled job
resume_scheduled_job(scheduled_job_id)

# Delete a scheduled job
delete_scheduled_job(scheduled_job_id)
```

**用于管理的Python API:**

```python
from huggingface_hub import (
    list_scheduled_jobs,
    inspect_scheduled_job,
    suspend_scheduled_job,
    resume_scheduled_job,
    delete_scheduled_job
)

# 列出所有定时作业
scheduled = list_scheduled_jobs()

# 查看定时作业详情
info = inspect_scheduled_job(scheduled_job_id)

# 暂停定时作业
suspend_scheduled_job(scheduled_job_id)

# 恢复定时作业
resume_scheduled_job(scheduled_job_id)

# 删除定时作业
delete_scheduled_job(scheduled_job_id)
```
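Before submitting, it can save a failed run to sanity-check the schedule string locally. The helper below is illustrative only (it is not part of `huggingface_hub` or the Jobs API); it accepts the predefined `@` aliases and basic 5-field CRON expressions:

```python
# Minimal local sanity check for schedule strings; illustrative only,
# not part of huggingface_hub.
PREDEFINED = {"@annually", "@yearly", "@monthly", "@weekly", "@daily", "@hourly"}

# (min, max) for minute, hour, day-of-month, month, day-of-week
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]

def is_valid_schedule(schedule: str) -> bool:
    """Accept a predefined alias or a 5-field CRON expression."""
    if schedule in PREDEFINED:
        return True
    fields = schedule.split()
    if len(fields) != 5:
        return False
    for field, (lo, hi) in zip(fields, FIELD_RANGES):
        # Each field may be "*", "*/step", a value, a range "a-b", or a list "a,b".
        for part in field.split(","):
            body, _, step = part.partition("/")
            if step and not step.isdigit():
                return False
            if body == "*":
                continue
            bounds = body.split("-")
            if len(bounds) > 2 or not all(b.isdigit() for b in bounds):
                return False
            if not all(lo <= int(b) <= hi for b in bounds):
                return False
    return True

print(is_valid_schedule("@hourly"))      # True
print(is_valid_schedule("0 9 * * 1"))    # True: 9 AM every Monday
print(is_valid_schedule("0 25 * * *"))   # False: hour 25 is out of range
```

This only catches shape and range errors; the Jobs backend remains the authority on what schedules it accepts.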

Webhooks: Trigger Jobs on Events

Webhooks:事件触发作业

Trigger jobs automatically when changes happen in Hugging Face repositories.

当Hugging Face仓库发生变更时,自动触发作业运行。

**Python API:**

```python
from huggingface_hub import create_webhook

# Create webhook that triggers a job when a repo changes
webhook = create_webhook(
    job_id=job.id,
    watched=[
        {"type": "user", "name": "your-username"},
        {"type": "org", "name": "your-org-name"}
    ],
    domains=["repo", "discussion"],
    secret="your-secret"
)
```

**How it works:**
1. The webhook listens for changes in watched repositories
2. When triggered, the job runs with the `WEBHOOK_PAYLOAD` environment variable set
3. Your script can parse the payload to understand what changed

**Use cases:**
- Auto-process new datasets when uploaded
- Trigger inference when models are updated
- Run tests when code changes
- Generate reports on repository activity

**Access webhook payload in script:**

```python
import os
import json

payload = json.loads(os.environ.get("WEBHOOK_PAYLOAD", "{}"))
print(f"Event type: {payload.get('event', {}).get('action')}")
```

See the Webhooks documentation for more details.

**Python API:**

```python
from huggingface_hub import create_webhook

# 创建Webhook,当仓库变更时触发作业
webhook = create_webhook(
    job_id=job.id,
    watched=[
        {"type": "user", "name": "your-username"},
        {"type": "org", "name": "your-org-name"}
    ],
    domains=["repo", "discussion"],
    secret="your-secret"
)
```

**工作原理:**
1. Webhook监听指定仓库的变更
2. 触发时,作业运行并通过环境变量`WEBHOOK_PAYLOAD`获取相关信息
3. 你的脚本可以解析该负载,了解具体发生了什么变更

**适用场景:**
- 上传新数据集时自动处理
- 模型更新时触发推理
- 代码变更时运行测试
- 针对仓库活动生成报告

**在脚本中访问Webhook负载:**

```python
import os
import json

payload = json.loads(os.environ.get("WEBHOOK_PAYLOAD", "{}"))
print(f"事件类型: {payload.get('event', {}).get('action')}")
```

更多详情请查看Webhooks文档。
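To make payload handling concrete, here is a small sketch of a job script that branches on the event. The payload keys used (`event.action`, `repo.name`) follow the documented webhook payload shape, but treat the exact structure as an assumption to verify against a real payload; the branch bodies are placeholders:

```python
import json
import os

# Example payload of the kind a webhook-triggered job receives; in a real
# job this JSON arrives via the WEBHOOK_PAYLOAD environment variable.
os.environ.setdefault("WEBHOOK_PAYLOAD", json.dumps({
    "event": {"action": "update", "scope": "repo.content"},
    "repo": {"type": "dataset", "name": "username/my-dataset"},
}))

payload = json.loads(os.environ.get("WEBHOOK_PAYLOAD", "{}"))
action = payload.get("event", {}).get("action")
repo = payload.get("repo", {}).get("name")

# Dispatch on what changed; replace the print calls with real work.
if action == "update" and repo:
    print(f"Reprocessing {repo} after update")
elif action == "create":
    print(f"New repo {repo}: running initial processing")
else:
    print("Ignoring event:", action)
```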

Common Workload Patterns

常见工作负载模式

This repository ships ready-to-run UV scripts in `hf-jobs/scripts/`. Prefer using them instead of inventing new templates.

本仓库在`hf-jobs/scripts/`中提供了现成可用的UV脚本。优先使用这些脚本,而非自行编写新模板。

Pattern 1: Dataset → Model Responses (vLLM) — `scripts/generate-responses.py`

模式1:数据集→模型响应(vLLM)——`scripts/generate-responses.py`

**What it does:** loads a Hub dataset (a chat `messages` or a `prompt` column), applies the model's chat template, generates responses with vLLM, and pushes the output dataset + dataset card back to the Hub.
**Requires:** GPU + write token (it pushes a dataset).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/generate-responses.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "username/input-dataset",
        "username/output-dataset",
        "--messages-column", "messages",
        "--model-id", "Qwen/Qwen3-30B-A3B-Instruct-2507",
        "--temperature", "0.7",
        "--top-p", "0.8",
        "--max-tokens", "2048",
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```
**功能:** 加载Hub中的数据集(聊天`messages`或`prompt`列),应用模型聊天模板,使用vLLM生成响应,并将输出数据集和数据集卡片推送回Hub。
**要求:** GPU + 写权限令牌(需要推送数据集)。

```python
from pathlib import Path

script = Path("hf-jobs/scripts/generate-responses.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "username/input-dataset",
        "username/output-dataset",
        "--messages-column", "messages",
        "--model-id", "Qwen/Qwen3-30B-A3B-Instruct-2507",
        "--temperature", "0.7",
        "--top-p", "0.8",
        "--max-tokens", "2048",
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

Pattern 2: CoT Self-Instruct Synthetic Data — `scripts/cot-self-instruct.py`

模式2:思维链自指令合成数据——`scripts/cot-self-instruct.py`

**What it does:** generates synthetic prompts/answers via CoT Self-Instruct, optionally filters outputs (answer-consistency / RIP), then pushes the generated dataset + dataset card to the Hub.
**Requires:** GPU + write token (it pushes a dataset).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/cot-self-instruct.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "--seed-dataset", "davanstrien/s1k-reasoning",
        "--output-dataset", "username/synthetic-math",
        "--task-type", "reasoning",
        "--num-samples", "5000",
        "--filter-method", "answer-consistency",
    ],
    "flavor": "l4x4",
    "timeout": "8h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```
**功能:** 通过思维链自指令生成合成提示/答案,可选择过滤输出(答案一致性/RIP),然后将生成的数据集和数据集卡片推送到Hub。
**要求:** GPU + 写权限令牌(需要推送数据集)。

```python
from pathlib import Path

script = Path("hf-jobs/scripts/cot-self-instruct.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "--seed-dataset", "davanstrien/s1k-reasoning",
        "--output-dataset", "username/synthetic-math",
        "--task-type", "reasoning",
        "--num-samples", "5000",
        "--filter-method", "answer-consistency",
    ],
    "flavor": "l4x4",
    "timeout": "8h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

Pattern 3: Streaming Dataset Stats (Polars + HF Hub) — `scripts/finepdfs-stats.py`

模式3:流式数据集统计(Polars + HF Hub)——`scripts/finepdfs-stats.py`

**What it does:** scans Parquet directly from the Hub (no 300GB download), computes temporal stats, and optionally uploads results to a Hub dataset repo.
**Requires:** CPU is often enough; a token is needed only if you pass `--output-repo` (upload).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/finepdfs-stats.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "--limit", "10000",
        "--show-plan",
        "--output-repo", "username/finepdfs-temporal-stats",
    ],
    "flavor": "cpu-upgrade",
    "timeout": "2h",
    "env": {"HF_XET_HIGH_PERFORMANCE": "1"},
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```
**功能:** 直接从Hub扫描Parquet文件(无需下载300GB数据),计算时间统计信息,并可选地将结果上传到Hub数据集仓库。
**要求:** 通常CPU即可;仅当传入`--output-repo`(上传)时需要令牌。

```python
from pathlib import Path

script = Path("hf-jobs/scripts/finepdfs-stats.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "--limit", "10000",
        "--show-plan",
        "--output-repo", "username/finepdfs-temporal-stats",
    ],
    "flavor": "cpu-upgrade",
    "timeout": "2h",
    "env": {"HF_XET_HIGH_PERFORMANCE": "1"},
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

Common Failure Modes

常见失败模式

Out of Memory (OOM)

内存不足(OOM)

Fix:
  1. Reduce batch size or data chunk size
  2. Process data in smaller batches
  3. Upgrade hardware: cpu → t4 → a10g → a100
解决方法:
  1. 减小批量大小或数据块大小
  2. 分小批量处理数据
  3. 升级硬件:cpu → t4 → a10g → a100
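Steps 1 and 2 amount to restructuring the loop so only one chunk is in memory at a time. A minimal pure-Python sketch, with a placeholder `process` function standing in for your model or transform:

```python
def chunked(items, size):
    """Yield successive fixed-size chunks so peak memory stays bounded."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def process(batch):
    # Placeholder for the real work (inference, transform, etc.).
    return [x * 2 for x in batch]

data = list(range(10))
results = []
for batch in chunked(data, size=4):   # lower `size` if you hit OOM
    results.extend(process(batch))

print(results)  # doubled values 0..18, in order
```

For Hub datasets the same idea applies via streaming/iterable loading, so the full dataset never has to fit in RAM.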

Job Timeout

作业超时

Fix:
  1. Check logs for actual runtime
  2. Increase timeout with buffer: `"timeout": "3h"`
  3. Optimize code for faster execution
  4. Process data in chunks
解决方法:
  1. 查看日志了解实际运行时长
  2. 增加超时时间并预留缓冲:`"timeout": "3h"`
  3. 优化代码以提升执行速度
  4. 分块处理数据
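One way to pick the timeout is to read the runtime from a previous run's logs and pad it. A small sketch; the 1.5× factor and rounding up are arbitrary choices made here, not a Jobs requirement:

```python
import math

def timeout_with_buffer(observed_seconds: float, factor: float = 1.5) -> str:
    """Scale an observed runtime and round up to a whole-hour/minute string."""
    padded = observed_seconds * factor
    if padded >= 3600:
        return f"{math.ceil(padded / 3600)}h"
    return f"{math.ceil(padded / 60)}m"

# A run that took ~2h05m gets a 4h timeout (2h05m * 1.5 ≈ 3h08m, rounded up).
print(timeout_with_buffer(2 * 3600 + 5 * 60))  # 4h
```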

Hub Push Failures

Hub推送失败

Fix:
  1. Add token to secrets: MCP uses `"$HF_TOKEN"` (auto-replaced), Python API uses `get_token()` (must pass a real token)
  2. Verify the token in the script: `assert "HF_TOKEN" in os.environ`
  3. Check token permissions
  4. Verify the repo exists or can be created
解决方法:
  1. 在secrets中添加令牌:MCP工具使用`"$HF_TOKEN"`(自动替换),Python API使用`get_token()`(必须传入实际令牌)
  2. 在脚本中验证令牌:`assert "HF_TOKEN" in os.environ`
  3. 检查令牌权限
  4. 验证仓库是否存在或可创建
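A defensive variant of step 2: fail fast at the top of the script if the token never reached the environment, rather than failing at push time after hours of compute. The helper name here is hypothetical, not part of any library:

```python
import os

def require_hf_token() -> str:
    """Abort early with a clear message if HF_TOKEN is missing or empty."""
    token = os.environ.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Pass it via secrets={'HF_TOKEN': ...} "
            "when submitting the job."
        )
    return token

# Example: simulate a job environment that did receive the secret.
os.environ["HF_TOKEN"] = "hf_example_token"  # placeholder value
print(bool(require_hf_token()))  # True
```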

Missing Dependencies

依赖缺失

Fix: Add the missing packages to the PEP 723 header:

```python
# /// script
# dependencies = ["package1", "package2>=1.0.0"]
# ///
```

解决方法: 添加到PEP 723头中:

```python
# /// script
# dependencies = ["package1", "package2>=1.0.0"]
# ///
```

Authentication Errors

认证错误

Fix:
  1. Check `hf_whoami()` works locally
  2. Verify the token in secrets — MCP: `"$HF_TOKEN"`, Python API: `get_token()` (NOT `"$HF_TOKEN"`)
  3. Re-login: `hf auth login`
  4. Check the token has the required permissions
解决方法:
  1. 确认本地`hf_whoami()`可正常运行
  2. 验证secrets中的令牌——MCP工具使用`"$HF_TOKEN"`,Python API使用`get_token()`(而非`"$HF_TOKEN"`)
  3. 重新登录:`hf auth login`
  4. 检查令牌是否具备所需权限

Troubleshooting

故障排除

Common issues:
  • Job times out → Increase timeout, optimize code
  • Results not saved → Check persistence method, verify HF_TOKEN
  • Out of Memory → Reduce batch size, upgrade hardware
  • Import errors → Add dependencies to PEP 723 header
  • Authentication errors → Check token, verify secrets parameter
See `references/troubleshooting.md` for the complete troubleshooting guide.
常见问题:
  • 作业超时 → 增加超时时间、优化代码
  • 结果未保存 → 检查持久化方式、验证HF_TOKEN
  • 内存不足 → 减小批量大小、升级硬件
  • 导入错误 → 在PEP 723头中添加依赖
  • 认证错误 → 检查令牌、验证secrets参数
参考`references/troubleshooting.md`获取完整的故障排除指南。

Resources

资源

References (In This Skill)

本技能内的参考文档

  • `references/token_usage.md` - Complete token usage guide
  • `references/hardware_guide.md` - Hardware specs and selection
  • `references/hub_saving.md` - Hub persistence guide
  • `references/troubleshooting.md` - Common issues and solutions
  • `references/token_usage.md` - 完整的令牌使用指南
  • `references/hardware_guide.md` - 硬件规格与选择
  • `references/hub_saving.md` - Hub持久化指南
  • `references/troubleshooting.md` - 常见问题与解决方案

Scripts (In This Skill)

本技能内的脚本

  • `scripts/generate-responses.py` - vLLM batch generation: dataset → responses → push to Hub
  • `scripts/cot-self-instruct.py` - CoT Self-Instruct synthetic data generation + filtering → push to Hub
  • `scripts/finepdfs-stats.py` - Polars streaming stats over `finepdfs-edu` parquet on Hub (optional push)
  • `scripts/generate-responses.py` - vLLM批量生成:数据集→响应→推送到Hub
  • `scripts/cot-self-instruct.py` - 思维链自指令合成数据生成+过滤→推送到Hub
  • `scripts/finepdfs-stats.py` - 对Hub上的`finepdfs-edu` parquet文件进行Polars流式统计(可选推送)

External Links

外部链接

Official Documentation:
Related Tools:
官方文档:
相关工具:

Key Takeaways

核心要点

  1. Submit scripts inline - The
    script
    parameter accepts Python code directly; no file saving required unless user requests
  2. Jobs are asynchronous - Don't wait/poll; let user check when ready
  3. Always set timeout - Default 30 min may be insufficient; set appropriate timeout
  4. Always persist results - Environment is ephemeral; without persistence, all work is lost
  5. Use tokens securely - MCP: `secrets={"HF_TOKEN": "$HF_TOKEN"}`, Python API: `secrets={"HF_TOKEN": get_token()}`; `"$HF_TOKEN"` only works with the MCP tool
  6. Choose appropriate hardware - Start small, scale up based on needs (see hardware guide)
  7. Use UV scripts - Default to
    hf_jobs("uv", {...})
    with inline scripts for Python workloads
  8. Handle authentication - Verify tokens are available before Hub operations
  9. Monitor jobs - Provide job URLs and status check commands
  10. Optimize costs - Choose right hardware, set appropriate timeouts
  1. 内联提交脚本 -
    script
    参数可直接接收Python代码;除非用户要求,否则无需保存文件
  2. 作业为异步执行 - 不要等待/轮询;让用户在需要时自行检查
  3. 务必设置超时时间 - 默认30分钟可能不足;设置合适的超时时间
  4. 务必持久化结果 - 环境为临时状态;不持久化的话所有工作都会丢失
  5. 安全使用令牌 - MCP工具使用`secrets={"HF_TOKEN": "$HF_TOKEN"}`,Python API使用`secrets={"HF_TOKEN": get_token()}`;`"$HF_TOKEN"`仅适用于MCP工具
  6. 选择合适的硬件 - 从小规模开始,根据需求扩容(查看硬件指南)
  7. 使用UV脚本 - Python工作负载默认使用
    hf_jobs("uv", {...})
    和内联脚本
  8. 处理认证 - 在执行Hub操作前验证令牌是否可用
  9. 监控作业 - 提供作业URL和状态检查命令
  10. 优化成本 - 选择合适的硬件、设置合适的超时时间

Quick Reference: MCP Tool vs CLI vs Python API

快速参考:MCP工具 vs CLI vs Python API

| Operation | MCP Tool | CLI | Python API |
|---|---|---|---|
| Run UV script | `hf_jobs("uv", {...})` | `hf jobs uv run script.py` | `run_uv_job("script.py")` |
| Run Docker job | `hf_jobs("run", {...})` | `hf jobs run image cmd` | `run_job(image, command)` |
| List jobs | `hf_jobs("ps")` | `hf jobs ps` | `list_jobs()` |
| View logs | `hf_jobs("logs", {...})` | `hf jobs logs <id>` | `fetch_job_logs(job_id)` |
| Cancel job | `hf_jobs("cancel", {...})` | `hf jobs cancel <id>` | `cancel_job(job_id)` |
| Schedule UV | `hf_jobs("scheduled uv", {...})` | `hf jobs scheduled uv run SCHEDULE script.py` | `create_scheduled_uv_job()` |
| Schedule Docker | `hf_jobs("scheduled run", {...})` | `hf jobs scheduled run SCHEDULE image cmd` | `create_scheduled_job()` |
| List scheduled | `hf_jobs("scheduled ps")` | `hf jobs scheduled ps` | `list_scheduled_jobs()` |
| Delete scheduled | `hf_jobs("scheduled delete", {...})` | `hf jobs scheduled delete <id>` | `delete_scheduled_job()` |
| 操作 | MCP工具 | CLI | Python API |
|---|---|---|---|
| 运行UV脚本 | `hf_jobs("uv", {...})` | `hf jobs uv run script.py` | `run_uv_job("script.py")` |
| 运行Docker作业 | `hf_jobs("run", {...})` | `hf jobs run image cmd` | `run_job(image, command)` |
| 列出作业 | `hf_jobs("ps")` | `hf jobs ps` | `list_jobs()` |
| 查看日志 | `hf_jobs("logs", {...})` | `hf jobs logs <id>` | `fetch_job_logs(job_id)` |
| 取消作业 | `hf_jobs("cancel", {...})` | `hf jobs cancel <id>` | `cancel_job(job_id)` |
| 定时运行UV脚本 | `hf_jobs("scheduled uv", {...})` | `hf jobs scheduled uv run SCHEDULE script.py` | `create_scheduled_uv_job()` |
| 定时运行Docker作业 | `hf_jobs("scheduled run", {...})` | `hf jobs scheduled run SCHEDULE image cmd` | `create_scheduled_job()` |
| 列出定时作业 | `hf_jobs("scheduled ps")` | `hf jobs scheduled ps` | `list_scheduled_jobs()` |
| 删除定时作业 | `hf_jobs("scheduled delete", {...})` | `hf jobs scheduled delete <id>` | `delete_scheduled_job()` |