Build and deploy applications on inference.sh. Use when getting started, understanding the platform, creating apps, configuring resources, or needing an overview of inference.sh app development. Supports both Python and Node.js. Triggers: inference.sh app, belt app, inf.yml, inference.py, inference.js, deploy app, app development, build app, create app, GPU app, VRAM, app resources, app secrets, app integrations, multi-function app
## Install this skill

```bash
npx skill4agent add inference-sh-skills/skills building-inferencesh-apps
```

## Key gotchas

- Every app is defined by `inf.yml` (configuration) plus `inference.py` or `inference.js` (logic); `belt app init` scaffolds them, along with `__init__.py` / `package.json`. See `PROVIDER_STRUCTURE.md` for multi-app providers.
- `output_meta` requires your output class to extend `BaseAppOutput`; a plain pydantic `BaseModel` will not work.
- Always `cd` into the app directory before running `belt` commands; the working directory does not persist between commands.
- Log progress with `self.logger.info(...)` inside `run()`.

## Provider structure

- Each app directory needs an `__init__.py` containing `from .inference import App`.
- Put shared code in `provider/shared_helper.py` and import it with `from .shared_helper import func`.
- Symlink it into each app: `provider/app-name/shared_helper.py -> ../shared_helper.py`.
- Keep `provider/app-name/__init__.py` in place so the package imports resolve.

## Install the belt CLI

```bash
curl -fsSL https://cli.inference.sh | sh
belt update   # Update CLI
belt login # Authenticate
belt me       # Check current user
```

## Create an app

`belt app init` scaffolds `inf.yml` and the language entrypoint. Node.js apps get `"type": "module"` in `package.json`.

```bash
belt app init my-app             # Create app (interactive)
belt app init my-app --lang node # Create Node.js app
```

`belt app init my-app` generates `inference.py` (or `inference.js` for Node.js), `inf.yml`, and `requirements.txt` (or `package.json`).

## Test locally

```bash
cd my-app                 # ALWAYS cd into app dir first
belt app test --save-example # Generate sample input from schema
belt app test # Run with input.json
belt app test --input '{"prompt": "hello"}'  # Or inline JSON
```

## Deploy

```bash
cd my-app                 # cd again — cwd doesn't persist
belt app deploy --dry-run # Validate first
belt app deploy           # Deploy for real
```

To see `output_meta` in the response, run the deployed app with `--json`:

```bash
belt app run user/app --json --input '{"prompt": "hello"}'
```

`output_meta` is only emitted when the output class extends `BaseAppOutput` rather than a plain `BaseModel`.

```bash
# Other useful commands
belt app run user/app --input input.json
belt app sample user/app
belt app sample user/app --save input.json
```
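A minimal `input.json` for the example app below; `belt app sample user/app --save input.json` generates one matching your app's actual schema:

```json
{
  "prompt": "hello"
}
```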
## Python app structure

```python
from inferencesh import BaseApp, BaseAppInput, BaseAppOutput
from pydantic import Field


class AppSetup(BaseAppInput):
    """Setup parameters — triggers re-init when changed"""
    model_id: str = Field(default="gpt2", description="Model to load")


class AppInput(BaseAppInput):
    prompt: str = Field(description="Input prompt")


class AppOutput(BaseAppOutput):
    result: str = Field(description="Output result")


class App(BaseApp):
    async def setup(self, config: AppSetup):
        """Runs once when worker starts or config changes"""
        self.model = load_model(config.model_id)

    async def run(self, input_data: AppInput) -> AppOutput:
        """Default function — runs for each request"""
        self.logger.info(f"Processing prompt: {input_data.prompt[:50]}")
        result = self.model.generate(input_data.prompt)
        self.logger.info("Generation complete")
        return AppOutput(result=result)

    async def unload(self):
        """Cleanup on shutdown"""
        pass

    async def on_cancel(self):
        """Called when user cancels — for long-running tasks"""
        return True
```
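`on_cancel` only helps if `run()` cooperates. A minimal sketch of cooperative cancellation; the `self._cancelled` flag and the stepwise `generate_step` call are illustrative, not part of the SDK:

```python
class App(BaseApp):
    async def setup(self, config: AppSetup):
        self._cancelled = False
        self.model = load_model(config.model_id)

    async def run(self, input_data: AppInput) -> AppOutput:
        self._cancelled = False
        parts = []
        for step in range(100):
            if self._cancelled:
                self.logger.info(f"Cancelled at step {step}")
                break
            # hypothetical stepwise API standing in for incremental generation
            parts.append(self.model.generate_step(input_data.prompt))
        return AppOutput(result="".join(parts))

    async def on_cancel(self):
        self._cancelled = True  # run() checks this between steps
        return True
```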
## Node.js app structure

```js
import { z } from "zod";

export const AppSetup = z.object({
  modelId: z.string().default("gpt2").describe("Model to load"),
});

export const RunInput = z.object({
  prompt: z.string().describe("Input prompt"),
});

export const RunOutput = z.object({
  result: z.string().describe("Output result"),
});

export class App {
  async setup(config) {
    /** Runs once when worker starts or config changes */
    this.model = loadModel(config.modelId);
  }

  async run(inputData) {
    /** Default function — runs for each request */
    return { result: "done" };
  }

  async unload() {
    /** Cleanup on shutdown */
  }

  async onCancel() {
    /** Called when user cancels — for long-running tasks */
    return true;
  }
}
```

## Multi-function apps

Extra public methods become callable functions, each with schemas named `{PascalName}Input` / `{PascalName}Output`. Methods prefixed with `_`, the lifecycle hooks (`setup`, `unload`, `on_cancel` / `onCancel`), and the `constructor` are not exposed. Callers select a function with `"function": "method_name"`; `default_function` in `inf.yml` sets which method runs when none is specified (default: `run`).
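A sketch of a second function following that convention; `upscale` is a hypothetical function name, and its schemas must be called `UpscaleInput` / `UpscaleOutput`:

```python
class UpscaleInput(BaseAppInput):
    image_url: str = Field(description="Image to upscale")


class UpscaleOutput(BaseAppOutput):
    image_url: str = Field(description="Upscaled image URL")


class App(BaseApp):
    async def run(self, input_data: AppInput) -> AppOutput:
        ...  # default function

    async def upscale(self, input_data: UpscaleInput) -> UpscaleOutput:
        ...  # invoked with "function": "upscale"
```

With `default_function: upscale` in `inf.yml`, requests that name no function would run `upscale` instead of `run`.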
## Calling an external API

```python
import os

import httpx
from inferencesh import BaseApp, BaseAppInput, BaseAppOutput, File
from inferencesh.models.usage import OutputMeta, ImageMeta  # or TextMeta, AudioMeta, etc.
from pydantic import Field


class AppInput(BaseAppInput):
    prompt: str = Field(description="Input prompt")


class AppOutput(BaseAppOutput):  # NOT BaseModel — output_meta requires this
    image: File = Field(description="Generated image")


class App(BaseApp):
    async def setup(self, config):
        self.api_key = os.environ["API_KEY"]
        self.client = httpx.AsyncClient(timeout=120)

    async def run(self, input_data: AppInput) -> AppOutput:
        self.logger.info(f"Calling API with prompt: {input_data.prompt[:80]}")
        response = await self.client.post(
            "https://api.example.com/generate",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"prompt": input_data.prompt},
        )
        response.raise_for_status()

        # Write output file
        output_path = "/tmp/output.png"
        with open(output_path, "wb") as f:
            f.write(response.content)

        # Read actual dimensions (don't hardcode!)
        from PIL import Image
        with Image.open(output_path) as img:
            width, height = img.size

        self.logger.info(f"Generated {width}x{height} image")
        return AppOutput(
            image=File(path=output_path),
            output_meta=OutputMeta(
                outputs=[ImageMeta(width=width, height=height, count=1)]
            ),
        )

    async def unload(self):
        await self.client.aclose()
```

## Project structure

Python:

```
my-app/
├── inf.yml # Configuration
├── inference.py # App logic
├── requirements.txt # Python packages (pip)
└── packages.txt     # System packages (apt) — optional
```

Node.js:

```
my-app/
├── inf.yml # Configuration
├── src/
│ └── inference.js # App logic
├── package.json # Node.js packages (npm/pnpm)
└── packages.txt     # System packages (apt) — optional
```

## inf.yml

```yaml
name: my-app
description: What my app does
category: image
kernel: python-3.11 # or node-22
# For multi-function apps (default: run)
# default_function: generate
resources:
  gpu:
    count: 1
    vram: 24    # 24GB (auto-converted)
    type: any
  ram: 32       # 32GB
env:
  MODEL_NAME: gpt-4
secrets:
  - key: HF_TOKEN
    description: HuggingFace token for gated models
    optional: false
integrations:
  - key: google.sheets
    description: Access to Google Sheets
    optional: true
```

`vram` and `ram` are plain GB values (e.g. `24` or `80`) and are auto-converted. `gpu.type` is one of `any`, `nvidia`, `amd`, `apple`, or `none`. Note: Currently only NVIDIA CUDA GPUs are supported.
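Secrets are read inside the app the same way `API_KEY` is read in the external-API example above; this sketch assumes declared secrets are injected as environment variables under their `key` (`load_gated_model` is a hypothetical loader):

```python
import os


class App(BaseApp):
    async def setup(self, config):
        # HF_TOKEN is declared under secrets: in inf.yml with optional: false,
        # so we expect it to be set here (assumption: secrets arrive as env vars)
        token = os.environ["HF_TOKEN"]
        self.model = load_gated_model("org/gated-model", token=token)
```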
`category` is one of `image`, `video`, `audio`, `text`, `chat`, `3d`, or `other`.

CPU-only apps:

```yaml
resources:
  gpu:
    count: 0
    type: none
  ram: 4
```

## Dependencies

`requirements.txt` (Python packages, pip):

```
torch>=2.0
transformers
accelerate
```

`package.json` (Node.js packages, npm/pnpm):

```json
{
"type": "module",
"dependencies": {
"zod": "^3.23.0",
"sharp": "^0.33.0"
}
}packages.txtffmpeg
libgl1-mesa-glx
```
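Packages listed there are installed with apt and available on the worker's `PATH`. A sketch of shelling out to `ffmpeg` from `run()` (file paths and flags are illustrative):

```python
import asyncio


async def extract_audio(src: str, dst: str) -> None:
    # ffmpeg comes from packages.txt (apt), not from pip
    proc = await asyncio.create_subprocess_exec(
        "ffmpeg", "-y", "-i", src, "-vn", "-acodec", "libmp3lame", dst,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,
    )
    _, stderr = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"ffmpeg failed: {stderr.decode()[-500:]}")
```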
## Base images

| Type | Image |
|---|---|
| GPU | |
| CPU | |
## Device selection (GPU apps)

Use `accelerate` for device selection rather than calling `torch.cuda.is_available()` by hand:

```python
from accelerate import Accelerator
accelerator = Accelerator()
self.device = accelerator.device
```

Move models to that device with `.to(device)` (or use `device_map` when loading large models):

```python
self.model = SomeModel.from_pretrained("org/model")
self.model = self.model.to(device=self.device, dtype=torch.float16)
```

Remember to add `accelerate` to `requirements.txt`.
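Putting those pieces together in `setup()`; `SomeModel` stands in for your actual model class:

```python
import torch
from accelerate import Accelerator
from inferencesh import BaseApp


class App(BaseApp):
    async def setup(self, config):
        # accelerate picks cuda / mps / cpu without manual capability checks
        self.device = Accelerator().device
        self.model = SomeModel.from_pretrained("org/model").to(
            device=self.device, dtype=torch.float16
        )
```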