groq-inference
<objective>
Enable ultra-fast LLM inference (10-100x faster than standard providers) using GROQ API for real-time applications including chat, vision, audio (STT/TTS), tool use, and reasoning models. Critical for voice agents and low-latency AI.
</objective>
<quick_start>
Basic chat with GROQ:
```python
import os
from groq import Groq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # Best all-around
    messages=[{"role": "user", "content": prompt}],
)
```
Model selection:
| Use Case | Model |
|---|---|
| General chat | `llama-3.3-70b-versatile` |
| Vision/OCR | `meta-llama/llama-4-scout-17b-16e-instruct` |
| STT | `whisper-large-v3` |
| TTS | `playai-tts` |

</quick_start>
<success_criteria>
GROQ integration is successful when:
- Correct model selected for use case (see model table)
- API key in environment variable (`GROQ_API_KEY`)
- Retry logic with tenacity for rate limits
- Streaming enabled for real-time applications
- Async patterns used for parallel queries
- NOT using OpenAI (constraint: NO OPENAI)
</success_criteria>
<core_content>
Ultra-fast LLM inference for real-time applications. GROQ delivers 10-100x faster inference than standard providers.
Quick Reference: Model Selection
| Use Case | Model ID | Context | Notes |
|---|---|---|---|
| General Chat | `llama-3.3-70b-versatile` | 128K | Best all-around |
| Fast Chat | | 128K | Simple tasks, fastest |
| Vision/OCR | `meta-llama/llama-4-scout-17b-16e-instruct` | 128K | Up to 5 images |
| STT | `whisper-large-v3` | 448 | GROQ-hosted (NOT OpenAI API) |
| TTS | `playai-tts` | - | Fritz-PlayAI voice |
| Reasoning | `meta-llama/llama-4-maverick-17b-128e-instruct` | 128K | Thinking models |
| Tool Use | `compound-beta` | - | Built-in web search, code exec |
Core Patterns
1. Chat Completion (Basic + Streaming)
```python
import os
from groq import Groq, AsyncGroq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

def chat(prompt: str, system: str = "You are helpful.") -> str:
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_completion_tokens=1024,
    )
    return response.choices[0].message.content
```
Streaming
```python
def stream_chat(prompt: str):
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```
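For real-time display, the generator can be consumed token by token. A minimal consumer sketch (the helper name is ours; it accepts any iterable of text fragments, so it also works offline):

```python
def print_and_collect(token_iter) -> str:
    # Echo each streamed fragment as it arrives, then return the full reply.
    parts = []
    for token in token_iter:
        print(token, end="", flush=True)  # flush so tokens appear immediately
        parts.append(token)
    print()
    return "".join(parts)

# Usage sketch: reply = print_and_collect(stream_chat("Explain LPUs in one line"))
```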
2. Vision / Multimodal
```python
import base64

def analyze_image(image_path: str, prompt: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="meta-llama/llama-4-scout-17b-16e-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}}
            ]
        }],
    )
    return response.choices[0].message.content
```
URL-based: just pass `{"url": "https://..."}` instead of a base64 data URL.
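The message payload has the same shape either way; a small sketch of a helper (the name is ours) that builds the URL-based variant:

```python
def build_vision_message(prompt: str, image_url: str) -> dict:
    # Same content layout as the base64 variant; only the url value differs.
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
```

Pass the result as `messages=[build_vision_message(...)]` to the same `chat.completions.create` call shown above.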
3. Audio: Speech-to-Text (GROQ-Hosted Whisper)
Note: Whisper on GROQ runs on GROQ hardware - NOT calling OpenAI's API. Whisper is an open-source model that GROQ hosts for fast inference.
```python
def transcribe(audio_path: str, language: str = "en") -> str:
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=f,
            model="whisper-large-v3",  # GROQ-hosted, not OpenAI API
            language=language,
            response_format="verbose_json",  # Includes timestamps
        )
    return result.text

def translate_to_english(audio_path: str) -> str:
    with open(audio_path, "rb") as f:
        result = client.audio.translations.create(file=f, model="whisper-large-v3")
    return result.text
```
Alternative STT Providers (if you prefer non-Whisper options):
- Deepgram - Real-time streaming, lowest latency (`pip install deepgram-sdk`)
- AssemblyAI - High accuracy, speaker diarization (`pip install assemblyai`)
- See `voice-ai-skill` for Deepgram/AssemblyAI integration patterns
4. Audio: Text-to-Speech (PlayAI)
```python
def text_to_speech(text: str, output_path: str = "output.wav"):
    response = client.audio.speech.create(
        model="playai-tts",
        voice="Fritz-PlayAI",  # Also: Arista-PlayAI
        input=text,
        response_format="wav",
    )
    response.write_to_file(output_path)
```
Streaming TTS
```python
def stream_tts(text: str):
    with client.audio.speech.with_streaming_response.create(
        model="playai-tts", voice="Fritz-PlayAI", input=text, response_format="wav"
    ) as response:
        for chunk in response.iter_bytes(1024):
            yield chunk
```
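The streamed chunks are raw audio bytes; a small sketch (helper name is ours) for writing them to disk as they arrive instead of buffering the whole file:

```python
def save_stream(chunks, path: str) -> int:
    # Write audio bytes incrementally; returns total bytes written.
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    return total

# Usage sketch: save_stream(stream_tts("Hello from GROQ"), "hello.wav")
```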
**Alternative TTS Providers** (beyond GROQ's PlayAI):
- **Cartesia** - Ultra-low latency, emotional control (`pip install cartesia`)
- **ElevenLabs** - Most natural voices, voice cloning (`pip install elevenlabs`)
- **Deepgram** - Fast, cost-effective (`pip install deepgram-sdk`)
- See `voice-ai-skill` for Cartesia/ElevenLabs/Deepgram TTS integration patterns
5. Tool Use / Function Calling
```python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
        }
    }
}]

def chat_with_tools(prompt: str):
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile", messages=messages, tools=tools, tool_choice="auto"
    )
    msg = response.choices[0].message
    if msg.tool_calls:
        for tc in msg.tool_calls:
            # execute_function is user-supplied: dispatch the call to your own handler
            result = execute_function(tc.function.name, json.loads(tc.function.arguments))
            messages.extend([msg, {"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)}])
        return client.chat.completions.create(
            model="llama-3.3-70b-versatile", messages=messages, tools=tools
        ).choices[0].message.content
    return msg.content
```
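The snippet above assumes an `execute_function` dispatcher. A minimal sketch (the `get_weather` handler here is a stub, not a real weather API):

```python
def get_weather(location: str) -> dict:
    # Stub handler; replace with a real weather lookup.
    return {"location": location, "forecast": "sunny", "temp_c": 22}

HANDLERS = {"get_weather": get_weather}

def execute_function(name: str, arguments: dict):
    # Route a model tool call to a local handler; unknown names return an
    # error payload the model can read, rather than raising.
    handler = HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    return handler(**arguments)
```

Returning errors as data lets the follow-up completion explain the failure instead of crashing the loop.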
6. Compound Beta (Built-in Web Search + Code Exec)
```python
def compound_query(prompt: str):
    """Built-in tools: web_search, code_execution."""
    response = client.chat.completions.create(
        model="compound-beta",
        messages=[{"role": "user", "content": prompt}],
    )
    msg = response.choices[0].message
    # Access msg.executed_tools for tool results
    return msg.content
```
7. Reasoning Models
```python
def reasoning_query(prompt: str, format: str = "parsed"):
    """format: 'parsed' (structured), 'raw' (visible), 'hidden' (no thinking)"""
    response = client.chat.completions.create(
        model="meta-llama/llama-4-maverick-17b-128e-instruct",
        messages=[{"role": "user", "content": prompt}],
        reasoning_format=format,
    )
    msg = response.choices[0].message
    if format == "parsed" and hasattr(msg, 'reasoning'):
        return {"thinking": msg.reasoning, "answer": msg.content}
    return msg.content
```
8. Async Patterns
```python
import asyncio

async_client = AsyncGroq(api_key=os.environ.get("GROQ_API_KEY"))

async def async_chat(prompt: str) -> str:
    response = await async_client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def parallel_queries(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*[async_chat(p) for p in prompts])
```
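The fan-out pattern itself can be exercised offline; a sketch with a stub coroutine standing in for `async_chat` (names are ours):

```python
import asyncio

async def fake_chat(prompt: str) -> str:
    # Stand-in for async_chat so the pattern runs without an API key.
    await asyncio.sleep(0)
    return prompt.upper()

async def fan_out(prompts: list[str]) -> list[str]:
    # gather preserves input order regardless of completion order
    return await asyncio.gather(*[fake_chat(p) for p in prompts])

results = asyncio.run(fan_out(["alpha", "beta"]))
```

Order preservation matters when you zip results back to their prompts.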
Rate Limits
| Tier | Requests/min | Tokens/min | Tokens/day |
|---|---|---|---|
| Free | 30 | 15,000 | 500,000 |
| Paid | 100+ | 100,000+ | Unlimited |
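Beyond retrying after a 429, a client-side guard can avoid hitting the free tier's 30 requests/minute in the first place. A sliding-window sketch (the class name is ours; the API still enforces its own limits):

```python
import time
from collections import deque

class RequestLimiter:
    """Block just long enough to stay under max_requests per window."""

    def __init__(self, max_requests: int = 30, window_s: float = 60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self.sent = deque()  # monotonic timestamps of recent requests

    def wait_if_needed(self) -> float:
        # Drop timestamps that have aged out of the window.
        now = time.monotonic()
        while self.sent and now - self.sent[0] > self.window_s:
            self.sent.popleft()
        delay = 0.0
        if len(self.sent) >= self.max_requests:
            delay = self.window_s - (now - self.sent[0])
            time.sleep(max(delay, 0.0))
        self.sent.append(time.monotonic())
        return delay
```

Call `wait_if_needed()` before each `client.chat.completions.create(...)`; combine with the tenacity retry below for 429s that slip through.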
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def reliable_chat(prompt: str) -> str:
    return chat(prompt)
```
Integration Notes
- Pairs with: voice-ai-skill (Whisper STT + PlayAI TTS), langgraph-agents-skill
- Complements: trading-signals-skill (fast analysis), data-analysis-skill
- Projects: VozLux (voice agents), FieldVault-AI (document processing)
- Constraint: NO OPENAI - GROQ is the fast inference layer
Environment Variables
```bash
GROQ_API_KEY=gsk_...   # Required - get from console.groq.com
```
Optional multi-provider:
```bash
ANTHROPIC_API_KEY=     # Claude for complex reasoning
GOOGLE_API_KEY=        # Gemini fallback
```
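A fail-fast check at startup gives a clearer error than a late authentication failure; a small sketch (the function name is ours):

```python
import os

def require_groq_key() -> str:
    # Validate the key is present before constructing the client.
    key = os.environ.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; get one from console.groq.com")
    return key
```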
Reference Files
- `reference/models-catalog.md` - Complete model catalog with specs
- `reference/audio-speech.md` - Whisper STT and PlayAI TTS deep dive
- `reference/vision-multimodal.md` - Multimodal and image processing
- `reference/tool-use-patterns.md` - Function calling and Compound Beta
- `reference/reasoning-models.md` - Thinking models and reasoning_format
- `reference/cost-optimization.md` - Batch API, caching, provider routing
</core_content>