Search Results: llm-inference

Found 21 Skills

AI & Machine Learning | ascend-ai-coding/awesome-...

vllm-ascend

vLLM Ascend plugin for LLM inference serving on Huawei Ascend NPUs. Use for offline batch inference, API server deployment, quantized inference (with msmodelslim-quantized models), tensor/pipeline parallelism for distributed serving, and OpenAI-compatible API endpoints. Supports Qwen, DeepSeek, GLM, and LLaMA models with Ascend-optimized kernels.

3 scripts/Attention

AI & Machine Learning | team-telnyx/skills

telnyx-ai-inference-curl

Access Telnyx LLM inference APIs, embeddings, and AI analytics for call insights and summaries. This skill provides REST API (curl) examples.

AI & Machine Learning | databricks/databricks-age...

databricks-model-serving

Manage Databricks Model Serving endpoints via CLI. Use when asked to create, configure, query, or manage model serving endpoints for LLM inference, custom models, or external models.

AI & Machine Learning | secondsky/claude-skills

cloudflare-workers-ai

Cloudflare Workers AI for serverless GPU inference. Use for LLMs, text/image generation, and embeddings, or when encountering AI_ERROR, rate-limit, or token-exceeded errors.

5 scripts/Attention

AI & Machine Learning | aradotso/trending-skills

flash-moe-inference

Run 397B-parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming.

Platform Services | cascade-protocol/surf

surf

Build with Surf pay-per-use APIs at surf.cascade.fyi: Twitter data, Reddit data, web search/crawl, and LLM inference, with no signup and no API keys; just pay per call. Use when working with Surf endpoints, fetching Twitter/X data, Reddit data, web crawling/search, pay-per-request LLM inference, setting up x402-proxy or @x402/fetch with Surf, or any mention of surf.cascade.fyi. Triggers on surf, surf.cascade.fyi, surf API, twitter data, reddit data, web crawl, surf inference, x402 endpoints, MCP surf tools.

AI & Machine Learning | winsorllc/upgraded-carniv...

local-llm-provider

Connect to local LLM endpoints (Ollama, llama.cpp, vLLM) with automatic provider fallback. Use when: (1) you need to run LLM inference locally for privacy/cost, (2) you want to use models not available via cloud APIs, (3) you need offline capability, (4) you want automatic fallback to cloud providers when local fails.

2 scripts/Checked

AI & Machine Learning | hkuds/cli-anything

cli-anything-ollama

Command-line interface for Ollama: local LLM inference and model management via the Ollama REST API. Designed for AI agents and power users who need to manage models, generate text, chat, and create embeddings without a GUI.

AI & Machine Learning | kiterlin/intelligent-dete...

tensorrt-llm

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
