Found 21 Skills
vLLM Ascend plugin for LLM inference serving on Huawei Ascend NPUs. Use for offline batch inference, API server deployment, quantized inference (with msmodelslim-quantized models), tensor/pipeline parallelism for distributed serving, and OpenAI-compatible API endpoints. Supports Qwen, DeepSeek, GLM, and LLaMA models with Ascend-optimized kernels.
Access Telnyx LLM inference APIs, embeddings, and AI analytics for call insights and summaries. This skill provides REST API (curl) examples.
Manage Databricks Model Serving endpoints via CLI. Use when asked to create, configure, query, or manage model serving endpoints for LLM inference, custom models, or external models.
Cloudflare Workers AI for serverless GPU inference. Use for LLMs, text/image generation, or embeddings, or when encountering AI_ERROR, rate-limit, or token-exceeded errors.
Run 397B-parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming.
Build with Surf pay-per-use APIs at surf.cascade.fyi. Twitter data, Reddit data, web search/crawl, and LLM inference - no signup, no API keys, just pay per call. Use when working with Surf endpoints, fetching Twitter/X data, Reddit data, web crawling/search, pay-per-request LLM inference, setting up x402-proxy or @x402/fetch with Surf, or any mention of surf.cascade.fyi. Triggers on surf, surf.cascade.fyi, surf API, twitter data, reddit data, web crawl, surf inference, x402 endpoints, MCP surf tools.
Connect to local LLM endpoints (Ollama, llama.cpp, vLLM) with automatic provider fallback. Use when: (1) you need to run LLM inference locally for privacy/cost, (2) you want to use models not available via cloud APIs, (3) you need offline capability, (4) you want automatic fallback to cloud providers when local fails.
Command-line interface for Ollama: local LLM inference and model management via the Ollama REST API. Designed for AI agents and power users who need to manage models, generate text, chat, and create embeddings without a GUI.
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.