Cloudflare Workers AI


**Status:** Production Ready ✅
**Last Updated:** 2026-01-21
**Dependencies:** cloudflare-worker-base (for Worker setup)
**Latest Versions:** wrangler@4.58.0, @cloudflare/workers-types@4.20260109.0, workers-ai-provider@3.0.2
Recent Updates (2025):
  • April 2025 - Performance: Llama 3.3 70B 2-4x faster (speculative decoding, prefix caching), BGE embeddings 2x faster
  • April 2025 - Breaking Changes: max_tokens now correctly defaults to 256 (was not respected), BGE pooling parameter (cls NOT backwards compatible with mean)
  • 2025 - New Models (14): Mistral 3.1 24B (vision+tools), Gemma 3 12B (128K context), EmbeddingGemma 300M, Llama 4 Scout, GPT-OSS 120B/20B, Qwen models (QwQ 32B, Coder 32B), Leonardo image gen, Deepgram Aura 2, Whisper v3 Turbo, IBM Granite, Nova 3
  • 2025 - Platform: Context windows API change (tokens not chars), unit-based pricing with per-model granularity, workers-ai-provider v3.0.2 (AI SDK v5), LoRA rank up to 32 (was 8), 100 adapters per account
  • October 2025: Model deprecations (use Llama 4, GPT-OSS instead)


Quick Start (5 Minutes)


typescript
// 1. Add AI binding to wrangler.jsonc
{ "ai": { "binding": "AI" } }

// 2. Run model with streaming (recommended)
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [{ role: 'user', content: 'Tell me a story' }],
      stream: true, // Always stream for text generation!
    });

    return new Response(stream, {
      headers: { 'content-type': 'text/event-stream' },
    });
  },
};
Why streaming? Prevents buffering in memory, faster time-to-first-token, avoids Worker timeout issues.
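On the consuming side, the stream arrives as server-sent events. A minimal sketch of a client-side parser, assuming each event line has the shape `data: {"response":"..."}` with a `data: [DONE]` sentinel at the end (the format Workers AI text models emit); a production client would also buffer partial JSON across chunk boundaries:

```typescript
// Extract generated text tokens from a chunk of Workers AI SSE output.
function extractTokens(sseChunk: string): string[] {
  const tokens: string[] = [];
  for (const line of sseChunk.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed.startsWith('data:')) continue; // skip blank lines and comments
    const payload = trimmed.slice('data:'.length).trim();
    if (payload === '[DONE]') break; // end-of-stream sentinel
    try {
      const parsed = JSON.parse(payload);
      if (typeof parsed.response === 'string') tokens.push(parsed.response);
    } catch {
      // partial JSON split across chunks - a real client should buffer and retry
    }
  }
  return tokens;
}
```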

typescript
// 1. 在wrangler.jsonc中添加AI绑定
{ "ai": { "binding": "AI" } }

// 2. 以流式传输方式运行模型(推荐)
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [{ role: 'user', content: '给我讲个故事' }],
      stream: true, // 文本生成建议始终启用流式传输!
    });

    return new Response(stream, {
      headers: { 'content-type': 'text/event-stream' },
    });
  },
};
为什么选择流式传输? 避免内存缓冲,缩短首令牌响应时间,防止Worker超时问题。

Known Issues Prevention


This skill prevents 7 documented issues:

Issue #1: Context Window Validation Changed to Tokens (February 2025)


**Error:** `"Exceeded character limit"` despite the model supporting a larger context

**Source:** Cloudflare Changelog

**Why It Happens:** Before February 2025, Workers AI validated prompts against a hard 6144-character limit, even for models with larger token-based context windows (e.g., Mistral with 32K tokens). After the update, validation switched to token-based counting.

**Prevention:** Calculate tokens (not characters) when checking context-window limits.
typescript
import { encode } from 'gpt-tokenizer'; // or model-specific tokenizer

const tokens = encode(prompt);
const contextWindow = 32768; // Model's max tokens (check docs)
const maxResponseTokens = 2048;

if (tokens.length + maxResponseTokens > contextWindow) {
  throw new Error(`Prompt exceeds context window: ${tokens.length} tokens`);
}

const response = await env.AI.run('@cf/mistral/mistral-7b-instruct-v0.2', {
  messages: [{ role: 'user', content: prompt }],
  max_tokens: maxResponseTokens,
});

Issue #2: Neuron Consumption Discrepancies in Dashboard


**Error:** Dashboard neuron usage significantly exceeds expected token-based calculations

**Source:** Cloudflare Community Discussion

**Why It Happens:** Users report the dashboard showing neuron consumption in the hundreds of millions for token usage in the thousands, particularly with AutoRAG features and certain models. The discrepancy between expected neuron consumption (based on the pricing docs) and actual dashboard metrics is not fully documented.

**Prevention:** Monitor neuron usage via AI Gateway logs and correlate it with requests. File a support ticket if consumption significantly exceeds expectations.
typescript
// Use AI Gateway for detailed request logging
const response = await env.AI.run(
  '@cf/meta/llama-3.1-8b-instruct',
  { messages: [{ role: 'user', content: query }] },
  { gateway: { id: 'my-gateway' } }
);

// Monitor dashboard at: https://dash.cloudflare.com → AI → Workers AI
// Compare neuron usage with token counts
// File support ticket with details if discrepancy persists

Issue #3: AI Binding Requires Remote or Latest Tooling in Local Dev


**Error:** `"MiniflareCoreError: wrapped binding module can't be resolved (internal modules only)"`

**Source:** GitHub Issue #6796

**Why It Happens:** When using Workers AI bindings with Miniflare in local development (particularly with custom Vite plugins), the AI binding requires external workers that aren't properly exposed by older `unstable_getMiniflareWorkerOptions`. The error occurs when Miniflare can't resolve the internal AI worker module.

**Prevention:** Use remote bindings for AI in local dev, or update to the latest @cloudflare/vite-plugin.
jsonc
// wrangler.jsonc - Option 1: Use remote AI binding in local dev
{
  "ai": { "binding": "AI" },
  "dev": {
    "remote": true // Use production AI binding locally
  }
}

Option 2: Update to latest tooling


npm install -D @cloudflare/vite-plugin@latest

Option 3: Use wrangler dev instead of custom Miniflare


npm run dev

Issue #4: Flux Image Generation NSFW Filter False Positives


**Error:** `"AiError: Input prompt contains NSFW content (code 3030)"` for innocent prompts

**Source:** Cloudflare Community Discussion

**Why It Happens:** Flux image generation models (`@cf/black-forest-labs/flux-1-schnell`) sometimes trigger false-positive NSFW content errors even with innocent single-word prompts like "hamburger". The NSFW filter can be overly sensitive without context.

**Prevention:** Add descriptive context around potential trigger words instead of using single-word prompts.
typescript
// ❌ May trigger error 3030
const response = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'hamburger', // Single word triggers filter
});

// ✅ Add context to avoid false positives
const response = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'A photo of a delicious large hamburger on a plate with lettuce and tomato',
  num_steps: 4,
});

Issue #5: Image Generation Error 1000 - Missing num_steps Parameter


**Error:** `"Error: unexpected type 'int32' with value 'undefined' (code 1000)"`

**Source:** Cloudflare Community Discussion

**Why It Happens:** Image generation API calls return error code 1000 when the `num_steps` parameter is not provided, even though the documentation suggests it is optional. The parameter is actually required for most Flux models.

**Prevention:** Always include `num_steps: 4` for image generation models (typically 4 for Flux Schnell).
typescript
// ✅ Always include num_steps for image generation
const image = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'A beautiful sunset over mountains',
  num_steps: 4, // Required - typically 4 for Flux Schnell
});

// Note: FLUX.2 [klein] 4B has fixed steps=4 (cannot be adjusted)

Issue #6: Zod v4 Incompatibility with Structured Output Tools

问题6:Zod v4与结构化输出工具不兼容

**Error:** Syntax errors and failed transpilation when using Stagehand with Zod v4

**Source:** GitHub Issue #10798

**Why It Happens:** Stagehand (browser automation) and some structured output examples in Workers AI fail with Zod v4 (now the default). The underlying `zod-to-json-schema` library doesn't yet support Zod v4, causing transpilation failures.

**Prevention:** Pin Zod to v3 until zod-to-json-schema supports v4.

Install Zod v3 specifically


npm install zod@3

Or pin in package.json


jsonc
{
  "dependencies": {
    "zod": "~3.23.8" // Pin to v3 for compatibility
  }
}

Issue #7: AI Gateway Cache Headers for Per-Request Control


**Not an error, but an important feature:** AI Gateway supports per-request cache control via HTTP headers for custom TTLs, cache bypass, and custom cache keys beyond the dashboard defaults.

**Source:** AI Gateway Caching Documentation

**Use When:** You need different caching behavior for different requests (e.g., a 1-hour TTL for expensive queries, skipping the cache for real-time data).

**Implementation:** See the AI Gateway Integration section below for header usage.


API Reference


typescript
env.AI.run(
  model: string,
  inputs: ModelInputs,
  options?: { gateway?: { id: string; skipCache?: boolean } }
): Promise<ModelOutput | ReadableStream>


Model Selection Guide (Updated 2025)


Text Generation (LLMs)


| Model | Best For | Rate Limit | Size | Notes |
| --- | --- | --- | --- | --- |
| **2025 Models** | | | | |
| `@cf/meta/llama-4-scout-17b-16e-instruct` | Latest Llama, general purpose | 300/min | 17B | NEW 2025 |
| `@cf/openai/gpt-oss-120b` | Largest open-source GPT | 300/min | 120B | NEW 2025 |
| `@cf/openai/gpt-oss-20b` | Smaller open-source GPT | 300/min | 20B | NEW 2025 |
| `@cf/google/gemma-3-12b-it` | 128K context, 140+ languages | 300/min | 12B | NEW 2025, vision |
| `@cf/mistralai/mistral-small-3.1-24b-instruct` | Vision + tool calling | 300/min | 24B | NEW 2025 |
| `@cf/qwen/qwq-32b` | Reasoning, complex tasks | 300/min | 32B | NEW 2025 |
| `@cf/qwen/qwen2.5-coder-32b-instruct` | Coding specialist | 300/min | 32B | NEW 2025 |
| `@cf/qwen/qwen3-30b-a3b-fp8` | Fast quantized | 300/min | 30B | NEW 2025 |
| `@cf/ibm-granite/granite-4.0-h-micro` | Small, efficient | 300/min | Micro | NEW 2025 |
| **Performance (2025)** | | | | |
| `@cf/meta/llama-3.3-70b-instruct-fp8-fast` | 2-4x faster (2025 update) | 300/min | 70B | Speculative decoding |
| `@cf/meta/llama-3.1-8b-instruct-fp8-fast` | Fast 8B variant | 300/min | 8B | - |
| **Standard Models** | | | | |
| `@cf/meta/llama-3.1-8b-instruct` | General purpose | 300/min | 8B | - |
| `@cf/meta/llama-3.2-1b-instruct` | Ultra-fast, simple tasks | 300/min | 1B | - |
| `@cf/deepseek-ai/deepseek-r1-distill-qwen-32b` | Coding, technical | 300/min | 32B | - |

Text Embeddings (2x Faster - 2025)


| Model | Dimensions | Best For | Rate Limit | Notes |
| --- | --- | --- | --- | --- |
| `@cf/google/embeddinggemma-300m` | 768 | Best-in-class RAG | 3000/min | NEW 2025 |
| `@cf/baai/bge-base-en-v1.5` | 768 | General RAG (2x faster) | 3000/min | `pooling: "cls"` recommended |
| `@cf/baai/bge-large-en-v1.5` | 1024 | High accuracy (2x faster) | 1500/min | `pooling: "cls"` recommended |
| `@cf/baai/bge-small-en-v1.5` | 384 | Fast, low storage (2x faster) | 3000/min | `pooling: "cls"` recommended |
| `@cf/qwen/qwen3-embedding-0.6b` | 768 | Qwen embeddings | 3000/min | NEW 2025 |

**CRITICAL (2025):** BGE models now support the `pooling: "cls"` parameter (recommended), but it is NOT backwards compatible with `pooling: "mean"` (the default).
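One practical consequence: vectors embedded with `pooling: "cls"` live in a different space than vectors embedded with `pooling: "mean"`, so similarity scores across the two are meaningless and the corpus must be re-embedded with one setting. A small cosine-similarity helper (a generic sketch, not a Workers AI API) for comparing embedding vectors such as the 768-dim BGE output:

```typescript
// Cosine similarity between two embedding vectors of the same dimension.
// Only compare vectors produced with the SAME pooling setting.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```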

Image Generation


| Model | Best For | Rate Limit | Notes |
| --- | --- | --- | --- |
| `@cf/black-forest-labs/flux-1-schnell` | High quality, photorealistic | 720/min | ⚠️ See warnings below |
| `@cf/leonardo/lucid-origin` | Leonardo AI style | 720/min | NEW 2025, requires `num_steps` |
| `@cf/leonardo/phoenix-1.0` | Leonardo AI variant | 720/min | NEW 2025, requires `num_steps` |
| `@cf/stabilityai/stable-diffusion-xl-base-1.0` | General purpose | 720/min | Requires `num_steps` |

⚠️ Common Image Generation Issues:
  • Error 1000: Always include the `num_steps: 4` parameter (required despite docs suggesting it is optional)
  • Error 3030 (NSFW filter): Single words like "hamburger" may trigger false positives - add descriptive context to prompts
typescript
// ✅ Correct pattern for image generation
const image = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'A photo of a delicious hamburger on a plate with fresh vegetables',
  num_steps: 4, // Required to avoid error 1000
});
// Descriptive context helps avoid NSFW false positives (error 3030)

Vision Models


| Model | Best For | Rate Limit | Notes |
| --- | --- | --- | --- |
| `@cf/meta/llama-3.2-11b-vision-instruct` | Image understanding | 720/min | - |
| `@cf/google/gemma-3-12b-it` | Vision + text (128K context) | 300/min | NEW 2025 |

Audio Models (2025)


| Model | Type | Rate Limit | Notes |
| --- | --- | --- | --- |
| `@cf/deepgram/aura-2-en` | Text-to-speech (English) | 720/min | NEW 2025 |
| `@cf/deepgram/aura-2-es` | Text-to-speech (Spanish) | 720/min | NEW 2025 |
| `@cf/deepgram/nova-3` | Speech-to-text (+ WebSocket) | 720/min | NEW 2025 |
| `@cf/openai/whisper-large-v3-turbo` | Speech-to-text (faster) | 720/min | NEW 2025 |


Common Patterns


RAG (Retrieval Augmented Generation)


typescript
// 1. Generate embeddings
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [userQuery] });

// 2. Search Vectorize
const matches = await env.VECTORIZE.query(embeddings.data[0], { topK: 3 });
const context = matches.matches.map((m) => m.metadata.text).join('\n\n');

// 3. Generate with context
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: `Answer using this context:\n${context}` },
    { role: 'user', content: userQuery },
  ],
  stream: true,
});


Structured Output with Zod


typescript
import { z } from 'zod';

const Schema = z.object({ name: z.string(), items: z.array(z.string()) });

const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{
    role: 'user',
    content: `Generate JSON matching: ${JSON.stringify(Schema.shape)}`
  }],
});

const validated = Schema.parse(JSON.parse(response.response));


AI Gateway Integration


Provides caching, logging, cost tracking, and analytics for AI requests.

Basic Gateway Usage


typescript
const response = await env.AI.run(
  '@cf/meta/llama-3.1-8b-instruct',
  { prompt: 'Hello' },
  { gateway: { id: 'my-gateway', skipCache: false } }
);

// Access logs and send feedback
const gateway = env.AI.gateway('my-gateway');
await gateway.patchLog(env.AI.aiGatewayLogId, {
  feedback: { rating: 1, comment: 'Great response' },
});

Per-Request Cache Control (Advanced)


Override default cache behavior with HTTP headers for fine-grained control:
typescript
// Custom cache TTL (1 hour for expensive queries)
const response = await fetch(
  `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/workers-ai/@cf/meta/llama-3.1-8b-instruct`,
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${env.CLOUDFLARE_API_KEY}`,
      'Content-Type': 'application/json',
      'cf-aig-cache-ttl': '3600', // 1 hour in seconds (min: 60, max: 2592000)
    },
    body: JSON.stringify({
      messages: [{ role: 'user', content: prompt }],
    }),
  }
);

// Skip cache for real-time data (separate request; renamed to avoid redeclaring `response`)
const realtimeResponse = await fetch(gatewayUrl, {
  headers: {
    'cf-aig-skip-cache': 'true', // Bypass cache entirely
  },
  // ...
});

// Check if response was cached
const cacheStatus = response.headers.get('cf-aig-cache-status'); // "HIT" or "MISS"
Available Cache Headers:
  • `cf-aig-cache-ttl`: Set a custom TTL in seconds (60 s to 1 month)
  • `cf-aig-skip-cache`: Bypass the cache entirely (`'true'`)
  • `cf-aig-cache-key`: Custom cache key for granular control
  • `cf-aig-cache-status`: Response header showing `"HIT"` or `"MISS"`
Benefits: Cost tracking, caching (reduces duplicate inference), logging, rate limiting, analytics, per-request cache customization.
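Since out-of-range TTLs are a likely source of surprises, a small helper can build these headers while clamping the TTL to the documented 60 s–2,592,000 s bounds. This is a sketch, not an official API; the header names are the `cf-aig-*` request headers listed above:

```typescript
// Build per-request AI Gateway cache-control headers, clamping the TTL
// to the documented bounds (min 60 s, max 2,592,000 s = 30 days).
function buildCacheHeaders(opts: {
  ttlSeconds?: number;
  skipCache?: boolean;
  cacheKey?: string;
}): Record<string, string> {
  const headers: Record<string, string> = {};
  if (opts.skipCache) {
    headers['cf-aig-skip-cache'] = 'true';
    return headers; // TTL is irrelevant when the cache is bypassed
  }
  if (opts.ttlSeconds !== undefined) {
    const clamped = Math.min(Math.max(opts.ttlSeconds, 60), 2_592_000);
    headers['cf-aig-cache-ttl'] = String(clamped);
  }
  if (opts.cacheKey) headers['cf-aig-cache-key'] = opts.cacheKey;
  return headers;
}
```

Spread the result into the `headers` object of the gateway `fetch` call shown above.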


Rate Limits & Pricing (Updated 2025)


Rate Limits (per minute)


| Task Type | Default Limit | Notes |
| --- | --- | --- |
| Text Generation | 300/min | Some fast models: 400-1500/min |
| Text Embeddings | 3000/min | BGE-large: 1500/min |
| Image Generation | 720/min | All image models |
| Vision Models | 720/min | Image understanding |
| Audio (TTS/STT) | 720/min | Deepgram, Whisper |
| Translation | 720/min | M2M100, Opus MT |
| Classification | 2000/min | Text classification |
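To avoid tripping these limits from a busy Worker, a local sliding-window guard can shed requests before they hit the API. A minimal sketch (the class name and shape are hypothetical; the authoritative limits are still enforced server-side by Workers AI):

```typescript
// Client-side sliding-window limiter for a per-minute quota (e.g. 300/min).
class MinuteRateLimiter {
  private timestamps: number[] = [];
  constructor(private limit: number, private windowMs = 60_000) {}

  // Returns true if the request may proceed, false if the quota is exhausted.
  tryAcquire(now = Date.now()): boolean {
    // Drop timestamps that have aged out of the window.
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(now);
    return true;
  }
}
```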

Pricing (Unit-Based, Billed in Neurons - 2025)


Free Tier:
  • 10,000 neurons per day
  • Resets daily at 00:00 UTC
Paid Tier ($0.011 per 1,000 neurons):
  • 10,000 neurons/day included
  • Unlimited usage above free allocation
2025 Model Costs (per 1M tokens):
| Model | Input | Output | Notes |
| --- | --- | --- | --- |
| **2025 Models** | | | |
| Llama 4 Scout 17B | $0.270 | $0.850 | NEW 2025 |
| GPT-OSS 120B | $0.350 | $0.750 | NEW 2025 |
| GPT-OSS 20B | $0.200 | $0.300 | NEW 2025 |
| Gemma 3 12B | $0.345 | $0.556 | NEW 2025 |
| Mistral 3.1 24B | $0.351 | $0.555 | NEW 2025 |
| Qwen QwQ 32B | $0.660 | $1.000 | NEW 2025 |
| Qwen Coder 32B | $0.660 | $1.000 | NEW 2025 |
| IBM Granite Micro | $0.017 | $0.112 | NEW 2025 |
| EmbeddingGemma 300M | $0.012 | N/A | NEW 2025 |
| Qwen3 Embedding 0.6B | $0.012 | N/A | NEW 2025 |
| **Performance (2025)** | | | |
| Llama 3.3 70B Fast | $0.293 | $2.253 | 2-4x faster |
| Llama 3.1 8B FP8 Fast | $0.045 | $0.384 | Fast variant |
| **Standard Models** | | | |
| Llama 3.2 1B | $0.027 | $0.201 | - |
| Llama 3.1 8B | $0.282 | $0.827 | - |
| Deepseek R1 32B | $0.497 | $4.881 | - |
| BGE-base (2x faster) | $0.067 | N/A | 2025 speedup |
| BGE-large (2x faster) | $0.204 | N/A | 2025 speedup |
| **Image Models (2025)** | | | |
| Flux 1 Schnell | $0.0000528 per 512x512 tile | | - |
| Leonardo Lucid | $0.006996 per 512x512 tile | | NEW 2025 |
| Leonardo Phoenix | $0.005830 per 512x512 tile | | NEW 2025 |
| **Audio Models (2025)** | | | |
| Deepgram Aura 2 | $0.030 per 1k chars | | NEW 2025 |
| Deepgram Nova 3 | $0.0052 per audio min | | NEW 2025 |
| Whisper v3 Turbo | $0.0005 per audio min | | NEW 2025 |
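For capacity planning, the per-1M-token prices above convert directly to a rough dollar estimate. A sketch of the arithmetic (an approximation only, since actual billing is in neurons and neuron-per-token rates vary by model):

```typescript
// Estimate USD cost from per-1M-token prices (e.g. Llama 3.1 8B:
// $0.282 input / $0.827 output per 1M tokens).
function estimateCostUSD(
  inputTokens: number,
  outputTokens: number,
  inputPerMTokens: number,
  outputPerMTokens: number
): number {
  return (
    (inputTokens / 1_000_000) * inputPerMTokens +
    (outputTokens / 1_000_000) * outputPerMTokens
  );
}
```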


Error Handling with Retry


typescript
async function runAIWithRetry(
  env: Env,
  model: string,
  inputs: any,
  maxRetries = 3
): Promise<any> {
  let lastError: Error;

  for (let i = 0; i < maxRetries; i++) {
    try {
      return await env.AI.run(model, inputs);
    } catch (error) {
      lastError = error as Error;

      // Rate limit - retry with exponential backoff
      if (lastError.message.toLowerCase().includes('rate limit')) {
        await new Promise((resolve) => setTimeout(resolve, Math.pow(2, i) * 1000));
        continue;
      }

      throw error; // Other errors - fail immediately
    }
  }

  throw lastError!;
}


OpenAI Compatibility


typescript
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: env.CLOUDFLARE_API_KEY,
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.ACCOUNT_ID}/ai/v1`,
});

// Chat completions
await openai.chat.completions.create({
  model: '@cf/meta/llama-3.1-8b-instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
});
Endpoints: `/v1/chat/completions`, `/v1/embeddings`


Vercel AI SDK Integration (workers-ai-provider v3.0.2)


typescript
import { createWorkersAI } from 'workers-ai-provider'; // v3.0.2 with AI SDK v5
import { generateText, streamText } from 'ai';

const workersai = createWorkersAI({ binding: env.AI });

// Generate or stream
await generateText({
  model: workersai('@cf/meta/llama-3.1-8b-instruct'),
  prompt: 'Write a poem',
});


Community Tips


Note: These tips come from community discussions and production experience.

Hono Framework Streaming Pattern


When using Workers AI streaming with Hono, return the stream directly as a Response (not through Hono's streaming utilities):
typescript
import { Hono } from 'hono';

type Bindings = { AI: Ai };
const app = new Hono<{ Bindings: Bindings }>();

app.post('/chat', async (c) => {
  const { prompt } = await c.req.json();

  const stream = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  // Return stream directly (not c.stream())
  return new Response(stream, {
    headers: {
      'content-type': 'text/event-stream',
      'cache-control': 'no-cache',
      'connection': 'keep-alive',
    },
  });
});

Troubleshooting Unexplained AI Binding Failures


If experiencing unexplained Workers AI failures:

1. Check wrangler version


npx wrangler --version

2. Clear wrangler cache


rm -rf ~/.wrangler

3. Update to latest stable


npm install -D wrangler@latest

4. Check local network/firewall settings


Some corporate firewalls block Workers AI endpoints



**Note**: Most "version incompatibility" issues turn out to be network configuration problems.

---


References
