groq-inference

<objective> Enable ultra-fast LLM inference (10-100x faster than standard providers) using GROQ API for real-time applications including chat, vision, audio (STT/TTS), tool use, and reasoning models. Critical for voice agents and low-latency AI. </objective>
<quick_start> Basic chat with GROQ:
```python
import os
from groq import Groq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # Best all-around
    messages=[{"role": "user", "content": prompt}],
)
```
Model selection:

| Use Case | Model |
|---|---|
| General chat | `llama-3.3-70b-versatile` |
| Vision/OCR | `meta-llama/llama-4-scout-17b-16e-instruct` |
| STT | `whisper-large-v3` (GROQ-hosted, NOT OpenAI) |
| TTS | `playai-tts` |
</quick_start>
<success_criteria> GROQ integration is successful when:
  • Correct model selected for use case (see model table)
  • API key in environment variable (`GROQ_API_KEY`)
  • Retry logic with tenacity for rate limits
  • Streaming enabled for real-time applications
  • Async patterns used for parallel queries
  • NOT using OpenAI (constraint: NO OPENAI) </success_criteria>
<core_content> Ultra-fast LLM inference for real-time applications. GROQ delivers 10-100x faster inference than standard providers.

Quick Reference: Model Selection


| Use Case | Model ID | Context | Notes |
|---|---|---|---|
| General Chat | `llama-3.3-70b-versatile` | 128K | Best all-around |
| Fast Chat | `llama-3.1-8b-instant` | 128K | Simple tasks, fastest |
| Vision/OCR | `meta-llama/llama-4-scout-17b-16e-instruct` | 128K | Up to 5 images |
| STT | `whisper-large-v3` | 448 | GROQ-hosted (NOT OpenAI API) |
| TTS | `playai-tts` | - | Fritz-PlayAI voice |
| Reasoning | `meta-llama/llama-4-maverick-17b-128e-instruct` | 128K | Thinking models |
| Tool Use | `compound-beta` | - | Built-in web search, code exec |
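The routing in the table above can be captured in a small helper; a sketch with hypothetical names (`MODELS`, `pick_model` are illustrative, the model IDs come from the table):

```python
# Hypothetical helper mapping use cases from the table above to model IDs.
MODELS = {
    "chat": "llama-3.3-70b-versatile",
    "fast_chat": "llama-3.1-8b-instant",
    "vision": "meta-llama/llama-4-scout-17b-16e-instruct",
    "stt": "whisper-large-v3",
    "tts": "playai-tts",
    "reasoning": "meta-llama/llama-4-maverick-17b-128e-instruct",
    "tools": "compound-beta",
}

def pick_model(use_case: str) -> str:
    """Return the model ID for a use case, defaulting to general chat."""
    return MODELS.get(use_case, MODELS["chat"])
```

Centralizing model selection this way keeps model IDs in one place when GROQ deprecates or renames them.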

Core Patterns


1. Chat Completion (Basic + Streaming)


```python
import os
from groq import Groq, AsyncGroq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

def chat(prompt: str, system: str = "You are helpful.") -> str:
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_completion_tokens=1024,
    )
    return response.choices[0].message.content
```

Streaming


```python
def stream_chat(prompt: str):
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```
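The delta-handling loop can be exercised offline by stubbing the chunk objects; a sketch (the `SimpleNamespace` stubs stand in for real stream chunks, which only arrive with a live API key):

```python
from types import SimpleNamespace

# Stub chunks shaped like the streaming response objects used above.
def make_chunk(text):
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

def collect(stream) -> str:
    """Accumulate streamed deltas into the full reply, skipping empty chunks."""
    parts = []
    for chunk in stream:
        if chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)

reply = collect([make_chunk("Hel"), make_chunk(None), make_chunk("lo")])  # "Hello"
```

The `None`-delta check matters: the final chunk of a stream typically carries no content.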

2. Vision / Multimodal


```python
import base64

def analyze_image(image_path: str, prompt: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="meta-llama/llama-4-scout-17b-16e-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}}
            ]
        }],
    )
    return response.choices[0].message.content
```

URL-based: just pass {"url": "https://..."} instead of base64

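A minimal sketch of the URL-based variant, shown as payload construction only (the URL is a placeholder and no API call is made):

```python
def build_image_message(prompt: str, image_url: str) -> dict:
    """Build the multimodal user message with a remote URL instead of base64."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_image_message("Describe this image.", "https://example.com/photo.jpg")
```

The resulting dict drops into `messages=[msg]` exactly as in `analyze_image` above.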

3. Audio: Speech-to-Text (GROQ-Hosted Whisper)


Note: Whisper on GROQ runs on GROQ hardware - NOT calling OpenAI's API. Whisper is an open-source model that GROQ hosts for fast inference.
```python
def transcribe(audio_path: str, language: str = "en") -> str:
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=f,
            model="whisper-large-v3",  # GROQ-hosted, not OpenAI API
            language=language,
            response_format="verbose_json",  # Includes timestamps
        )
    return result.text

def translate_to_english(audio_path: str) -> str:
    with open(audio_path, "rb") as f:
        result = client.audio.translations.create(file=f, model="whisper-large-v3")
    return result.text
```
Alternative STT Providers (if you prefer non-Whisper options):
  • Deepgram - Real-time streaming, lowest latency (`pip install deepgram-sdk`)
  • AssemblyAI - High accuracy, speaker diarization (`pip install assemblyai`)
  • See voice-ai-skill for Deepgram/AssemblyAI integration patterns
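Since `response_format="verbose_json"` includes timestamps, segments can be rendered as subtitle-style lines; a sketch assuming the usual Whisper verbose_json segment fields (`start`, `end`, `text`):

```python
# Assumes Whisper verbose_json segments: dicts with "start", "end" (seconds)
# and "text" fields. Verify against your actual response shape.
def format_segments(segments: list[dict]) -> list[str]:
    """Render transcript segments as '[start-end] text' lines."""
    return [f"[{s['start']:.1f}-{s['end']:.1f}] {s['text'].strip()}" for s in segments]

lines = format_segments([
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
])  # ["[0.0-2.5] Hello there."]
```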

4. Audio: Text-to-Speech (PlayAI)


```python
def text_to_speech(text: str, output_path: str = "output.wav"):
    response = client.audio.speech.create(
        model="playai-tts",
        voice="Fritz-PlayAI",  # Also: Arista-PlayAI
        input=text,
        response_format="wav",
    )
    response.write_to_file(output_path)
```

Streaming TTS


```python
def stream_tts(text: str):
    with client.audio.speech.with_streaming_response.create(
        model="playai-tts",
        voice="Fritz-PlayAI",
        input=text,
        response_format="wav",
    ) as response:
        for chunk in response.iter_bytes(1024):
            yield chunk
```

**Alternative TTS Providers** (beyond GROQ's PlayAI):
- **Cartesia** - Ultra-low latency, emotional control (`pip install cartesia`)
- **ElevenLabs** - Most natural voices, voice cloning (`pip install elevenlabs`)
- **Deepgram** - Fast, cost-effective (`pip install deepgram-sdk`)
- See `voice-ai-skill` for Cartesia/ElevenLabs/Deepgram TTS integration patterns

5. Tool Use / Function Calling


```python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
        }
    }
}]

def chat_with_tools(prompt: str):
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile", messages=messages, tools=tools, tool_choice="auto"
    )
    msg = response.choices[0].message

    if msg.tool_calls:
        messages.append(msg)  # append the assistant turn once, then one tool message per call
        for tc in msg.tool_calls:
            # execute_function: your own dispatcher mapping tool names to implementations
            result = execute_function(tc.function.name, json.loads(tc.function.arguments))
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)})
        return client.chat.completions.create(
            model="llama-3.3-70b-versatile", messages=messages, tools=tools
        ).choices[0].message.content
    return msg.content
```
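`execute_function` is left to the caller; one possible dispatcher sketch (the weather stub is illustrative and does not call a real API):

```python
# Hypothetical dispatcher for the execute_function call used in chat_with_tools.
def get_weather(location: str) -> dict:
    """Stub implementation; a real version would call a weather API."""
    return {"location": location, "temp_c": 21}

FUNCTIONS = {"get_weather": get_weather}

def execute_function(name: str, arguments: dict):
    """Look up a registered tool by name and invoke it with parsed arguments."""
    if name not in FUNCTIONS:
        raise ValueError(f"Unknown tool: {name}")
    return FUNCTIONS[name](**arguments)
```

Raising on unknown names (rather than returning `None`) surfaces model hallucinations of tool names early.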

6. Compound Beta (Built-in Web Search + Code Exec)


```python
def compound_query(prompt: str):
    """Built-in tools: web_search, code_execution."""
    response = client.chat.completions.create(
        model="compound-beta",
        messages=[{"role": "user", "content": prompt}],
    )
    msg = response.choices[0].message
    # Access msg.executed_tools for tool results
    return msg.content
```
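The shape of `executed_tools` isn't shown above; as a defensive sketch, a summary helper (the `type` attribute name is an assumption, check your actual response objects):

```python
from types import SimpleNamespace

def summarize_tools(executed_tools) -> list[str]:
    """List the type of each executed tool; tolerate a missing or empty attribute."""
    if not executed_tools:
        return []
    return [getattr(t, "type", "unknown") for t in executed_tools]

# Illustrative stand-in for real executed-tool entries:
tools_run = summarize_tools([SimpleNamespace(type="search")])  # ["search"]
```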

7. Reasoning Models


```python
def reasoning_query(prompt: str, format: str = "parsed"):
    """format: 'parsed' (structured), 'raw' (visible), 'hidden' (no thinking)"""
    response = client.chat.completions.create(
        model="meta-llama/llama-4-maverick-17b-128e-instruct",
        messages=[{"role": "user", "content": prompt}],
        reasoning_format=format,
    )
    msg = response.choices[0].message
    if format == "parsed" and hasattr(msg, 'reasoning'):
        return {"thinking": msg.reasoning, "answer": msg.content}
    return msg.content
```
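With `reasoning_format="raw"`, the thinking is commonly inlined in `content` wrapped in `<think>` tags; a sketch for separating it (the tag convention is an assumption, verify against actual model output):

```python
import re

def split_thinking(content: str) -> tuple[str, str]:
    """Split raw content into (thinking, answer); assumes <think>...</think> wrapping."""
    m = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
    if not m:
        return "", content.strip()
    return m.group(1).strip(), content[m.end():].strip()
```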

8. Async Patterns


```python
import asyncio
import os
from groq import AsyncGroq

async_client = AsyncGroq(api_key=os.environ.get("GROQ_API_KEY"))

async def async_chat(prompt: str) -> str:
    response = await async_client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def parallel_queries(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*[async_chat(p) for p in prompts])
```
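The fan-out pattern can be verified offline by substituting a stub for the network call:

```python
import asyncio

async def fake_chat(prompt: str) -> str:
    """Stand-in for async_chat so the gather pattern runs without an API key."""
    await asyncio.sleep(0)
    return f"echo: {prompt}"

async def parallel(prompts: list[str]) -> list[str]:
    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*[fake_chat(p) for p in prompts])

results = asyncio.run(parallel(["a", "b"]))  # ["echo: a", "echo: b"]
```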

Rate Limits


| Tier | Requests/min | Tokens/min | Tokens/day |
|---|---|---|---|
| Free | 30 | 15,000 | 500,000 |
| Paid | 100+ | 100,000+ | Unlimited |
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def reliable_chat(prompt: str) -> str:
    return chat(prompt)
```
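Beyond reactive retries, a client-side sliding-window limiter can keep a free-tier client under the 30 req/min cap proactively; a sketch (not part of the groq SDK):

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter, e.g. for the free tier's 30 requests/min cap."""

    def __init__(self, max_requests: int = 30, window_s: float = 60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self.timestamps = deque()

    def acquire(self) -> float:
        """Record a request if allowed; return seconds to wait before retrying."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_s:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return 0.0
        return max(self.window_s - (now - self.timestamps[0]), 0.0)
```

Call `acquire()` before each request and `time.sleep()` for the returned duration when it is nonzero; tenacity then only handles the server-side 429s that slip through.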

Integration Notes


  • Pairs with: voice-ai-skill (Whisper STT + PlayAI TTS), langgraph-agents-skill
  • Complements: trading-signals-skill (fast analysis), data-analysis-skill
  • Projects: VozLux (voice agents), FieldVault-AI (document processing)
  • Constraint: NO OPENAI - GROQ is the fast inference layer

Environment Variables


```bash
GROQ_API_KEY=gsk_...  # Required - get from console.groq.com
```

Optional multi-provider


```bash
ANTHROPIC_API_KEY=  # Claude for complex reasoning
GOOGLE_API_KEY=     # Gemini fallback
```

Reference Files


  • reference/models-catalog.md - Complete model catalog with specs
  • reference/audio-speech.md - Whisper STT and PlayAI TTS deep dive
  • reference/vision-multimodal.md - Multimodal and image processing
  • reference/tool-use-patterns.md - Function calling and Compound Beta
  • reference/reasoning-models.md - Thinking models and reasoning_format
  • reference/cost-optimization.md - Batch API, caching, provider routing