Loading...
Loading...
Found 16 Skills
Use this skill when building applications with Gemini models, Gemini API, working with multimodal content (text, images, audio, video), implementing function calling, using structured outputs, or needing current model specifications. Covers SDK usage (google-genai for Python, @google/genai for JavaScript/TypeScript), model selection, and API capabilities.
Guides the usage of Gemini API on Google Cloud Vertex AI with the Gen AI SDK. Use when the user asks about using Gemini in an enterprise environment or explicitly mentions Vertex AI. Covers SDK usage (Python, JS/TS, Go, Java, C#), capabilities like Live API, tools, multimedia generation, caching, and batch prediction.
This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.
Minimal multimodal omni smoke test for Model Studio Qwen Omni.
Integrate Gemini API with @google/genai SDK (NOT deprecated @google/generative-ai). Text generation, multimodal (images/video/audio/PDFs), function calling, thinking mode, streaming. 1M input tokens. Prevents 14 documented errors. Use when: Gemini integration, multimodal AI, reasoning with thinking mode. Troubleshoot: SDK deprecation, model not found, context window, function calling errors, streaming corruption, safety settings, rate limits.
Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.
Build production-ready LLM applications, advanced RAG systems, and intelligent agents. Implements vector search, multimodal AI, agent orchestration, and enterprise AI integrations. Use PROACTIVELY for LLM features, chatbots, AI agents, or AI-powered applications.
Expert guidance for working with Hugging Face Transformers library for NLP, computer vision, and multimodal AI tasks.
This skill should be used when the user asks to "generate video prompts", "create Seedance prompts", "write video descriptions", mentions "Seedance", "seedance", "Jimeng", "Jimeng Platform", "video prompts", "video generation", "AI video", "short drama", "advertising video", "video extension", or discusses video prompt engineering, AI video generation, or Seedance 2.0 workflows.
Understand images with Alibaba Cloud Model Studio Qwen VL models (qwen3-vl-plus/qwen3-vl-flash and latest aliases). Use when building image Q&A, visual analysis, OCR-like extraction, chart/table reading, or screenshot understanding workflows.
Multimodal AI processing via Google Gemini API (2M tokens context). Capabilities: audio (transcription, 9.5hr max, summarization, music analysis), images (captioning, OCR, object detection, segmentation, visual Q&A), video (scene detection, 6hr max, YouTube URLs, temporal analysis), documents (PDF extraction, tables, forms, charts), image generation (text-to-image, editing). Actions: transcribe, analyze, extract, caption, detect, segment, generate from media. Keywords: Gemini API, audio transcription, image captioning, OCR, object detection, video analysis, PDF extraction, text-to-image, multimodal, speech recognition, visual Q&A, scene detection, YouTube transcription, table extraction, form processing, image generation, Imagen. Use when: transcribing audio/video, analyzing images/screenshots, extracting data from PDFs, processing YouTube videos, generating images from text, implementing multimodal AI features.
Use when visual reasoning is needed with Alibaba Cloud Model Studio QVQ models, including step-by-step image reasoning, chart analysis, and visually grounded problem solving.