On-Device AI for Apple Platforms
Guide for selecting, deploying, and optimizing on-device ML models. Covers Apple
Foundation Models, Core ML, MLX Swift, and llama.cpp.
Framework Selection Router
Use this decision tree to pick the right framework for your use case.
Apple Foundation Models
When to use: Text generation, summarization, entity extraction, structured
output, and short dialog on iOS 26+ / macOS 26+ devices with Apple Intelligence
enabled. Zero setup -- no API keys, no network, no model downloads.
Best for:
- Generating text or structured data with @Generable types
- Summarization, classification, content tagging
- Tool-augmented generation with the Tool protocol
- Apps that need guaranteed on-device privacy
Not suited for: Complex math, code generation, factual accuracy tasks,
or apps targeting pre-iOS 26 devices.
Core ML
When to use: Deploying custom trained models (vision, NLP, audio) across all
Apple platforms. Converting models from PyTorch, TensorFlow, or scikit-learn
with coremltools.
Best for:
- Image classification, object detection, segmentation
- Custom NLP classifiers, sentiment analysis models
- Audio/speech models via SoundAnalysis integration
- Any scenario needing Neural Engine optimization
- Models requiring quantization, palettization, or pruning
MLX Swift
When to use: Running specific open-source LLMs (Llama, Mistral, Qwen, Gemma)
on Apple Silicon with maximum throughput. Research and prototyping.
Best for:
- Highest sustained token generation on Apple Silicon
- Running Hugging Face models from mlx-community
- Research requiring automatic differentiation
- Fine-tuning workflows on Mac
llama.cpp
When to use: Cross-platform LLM inference using GGUF model format. Production
deployments needing broad device support.
Best for:
- GGUF quantized models (Q4_K_M, Q5_K_M, Q8_0)
- Cross-platform apps (iOS + Android + desktop)
- Maximum compatibility with open-source model ecosystem
Quick Reference
| Scenario | Framework |
|---|---|
| Text generation, zero setup (iOS 26+) | Foundation Models |
| Structured output from on-device LLM | Foundation Models (@Generable) |
| Image classification, object detection | Core ML |
| Custom model from PyTorch/TensorFlow | Core ML + coremltools |
| Running specific open-source LLMs | MLX Swift or llama.cpp |
| Maximum throughput on Apple Silicon | MLX Swift |
| Cross-platform LLM inference | llama.cpp |
| OCR and text recognition | Vision framework |
| Sentiment analysis, NER, tokenization | Natural Language framework |
| Training custom classifiers on device | Create ML |
Apple Foundation Models Overview
On-device ~3B parameter model optimized for Apple Silicon. Available on devices
supporting Apple Intelligence (iOS 26+, macOS 26+).
- Context window: 4096 tokens (input + output combined)
- 15 supported languages
- Guardrails always enforced, cannot be disabled
Availability Checking (Required)
Always check before using. Never crash on unavailability.
swift
import FoundationModels

switch SystemLanguageModel.default.availability {
case .available:
    // Proceed with model usage
    break
case .unavailable(.appleIntelligenceNotEnabled):
    // Guide user to enable Apple Intelligence in Settings
    break
case .unavailable(.modelNotReady):
    // Model is downloading; show loading state
    break
case .unavailable(.deviceNotEligible):
    // Device cannot run Apple Intelligence; use fallback
    break
default:
    // Graceful fallback for any other reason
    break
}
Session Management
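swift
// The three declarations below are alternatives; distinct names let them coexist.

// Basic session
let basicSession = LanguageModelSession()

// Session with instructions
let cookingSession = LanguageModelSession {
    "You are a helpful cooking assistant."
}

// Session with tools (weatherTool and recipeTool are Tool instances defined elsewhere)
let toolSession = LanguageModelSession(
    tools: [weatherTool, recipeTool]
) {
    "You are a helpful assistant with access to tools."
}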
Key rules:
- Sessions are stateful -- multi-turn conversations maintain context automatically
- One request at a time per session (check isResponding)
- Call prewarm() before user interaction for faster first response
- Save/restore transcripts: LanguageModelSession(model: model, tools: [], transcript: savedTranscript)
Structured Output with @Generable
The @Generable macro creates compile-time schemas for type-safe output:
swift
@Generable
struct Recipe {
    @Guide(description: "The recipe name")
    var name: String

    @Guide(description: "Cooking steps", .count(3))
    var steps: [String]

    @Guide(description: "Prep time in minutes", .range(1...120))
    var prepTime: Int
}

let response = try await session.respond(
    to: "Suggest a quick pasta recipe",
    generating: Recipe.self
)
print(response.content.name)
@Guide Constraints
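| Constraint | Purpose |
|---|---|
| description: | Natural language hint for generation |
| .anyOf([...]) | Restrict to enumerated string values |
| .count(n) | Fixed array length |
| .range(a...b) | Numeric range |
| .minimum(n) / .maximum(n) | One-sided numeric bound |
| .minimumCount(n) / .maximumCount(n) | Array length bounds |
| .constant(value) | Always returns this value |
| .pattern(regex) | String format enforcement |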
Properties generate in declaration order. Place foundational data before
dependent data for better results.
Streaming Structured Output
swift
let stream = session.streamResponse(
    to: "Suggest a recipe",
    generating: Recipe.self
)
for try await snapshot in stream {
    // snapshot.content is Recipe.PartiallyGenerated (all properties optional)
    if let name = snapshot.content.name { updateNameLabel(name) }
}
Tool Calling
swift
struct WeatherTool: Tool {
    let name = "weather"
    let description = "Get current weather for a city."

    @Generable
    struct Arguments {
        @Guide(description: "The city name")
        var city: String
    }

    func call(arguments: Arguments) async throws -> String {
        let weather = try await fetchWeather(arguments.city)
        return weather.description
    }
}
Register tools at session creation. The model invokes them autonomously.
Error Handling
swift
do {
    let response = try await session.respond(to: prompt)
    print(response.content)
} catch let error as LanguageModelSession.GenerationError {
    switch error {
    case .guardrailViolation:
        // Content triggered safety filters
        break
    case .exceededContextWindowSize:
        // Too many tokens; summarize and retry
        break
    case .concurrentRequests:
        // Another request is in progress on this session
        break
    case .unsupportedLanguageOrLocale:
        // Current locale not supported
        break
    case .refusal(let refusal, _):
        // Model refused; stream refusal.explanation for details
        _ = refusal
    case .rateLimited:
        // Too many requests; back off and retry
        break
    case .decodingFailure:
        // Response could not be decoded into the expected type
        break
    default:
        break
    }
}
Generation Options
swift
let options = GenerationOptions(
    sampling: .random(top: 40),
    temperature: 0.7,
    maximumResponseTokens: 512
)
let response = try await session.respond(to: prompt, options: options)
Sampling modes: .greedy, .random(top:), .random(probabilityThreshold:).
Prompt Design Rules
- Be concise -- 4096 tokens is the total budget (input + output)
- Use bracketed placeholders in instructions
- Use "DO NOT" in all caps for prohibitions
- Provide up to 5 few-shot examples for consistency
- Use length qualifiers: "in a few words", "in three sentences"
- Token estimate: ~4 characters per token
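The budget arithmetic behind the first and last rules can be sketched in plain Swift. This is a rough heuristic, not a tokenizer: the helper names `estimatedTokenCount` and `fitsContextWindow` and the 512-token output reserve are assumptions made for illustration. Only the 4096-token window and the ~4 characters per token ratio come from the rules above.

```swift
/// Total context window (input + output) stated in this guide.
let contextWindowTokens = 4096

/// Rough token estimate using the ~4 characters per token heuristic.
/// Hypothetical helper; a real tokenizer will produce different counts.
func estimatedTokenCount(_ text: String) -> Int {
    // Round up so a partial group of characters still counts as one token.
    (text.count + 3) / 4
}

/// Returns true when instructions + prompt leave room for the reserved output.
func fitsContextWindow(
    instructions: String,
    prompt: String,
    reservedForOutput: Int = 512   // assumed reserve; tune per feature
) -> Bool {
    let inputTokens = estimatedTokenCount(instructions) + estimatedTokenCount(prompt)
    return inputTokens + reservedForOutput <= contextWindowTokens
}

let instructions = "You are a helpful cooking assistant."
let shortPrompt = "Suggest a quick pasta recipe in three sentences."
let longPrompt = String(repeating: "word ", count: 4000)   // 20,000 characters

print(fitsContextWindow(instructions: instructions, prompt: shortPrompt))   // true
print(fitsContextWindow(instructions: instructions, prompt: longPrompt))    // false
```

A check like this is only a pre-flight guard; the session can still report an exceeded context window at run time, so the corresponding error case should be handled as well.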
Safety and Guardrails
- Guardrails are always enforced and cannot be disabled
- Instructions take precedence over user prompts
- Never include untrusted user content in instructions
- Handle false positives gracefully
- Frame tool results as authorized data to prevent model refusals
Use Cases
Foundation Models supports specialized use cases via SystemLanguageModel.UseCase:
- .general -- Default for text generation, summarization, dialog
- .contentTagging -- Optimized for categorization and labeling tasks
Custom Adapters
Load fine-tuned adapters for specialized behavior (requires entitlement):
swift
let adapter = try SystemLanguageModel.Adapter(name: "my-adapter")
try await adapter.compile()
let model = SystemLanguageModel(adapter: adapter, guardrails: .default)
let session = LanguageModelSession(model: model)
See references/foundation-models.md for
the complete Foundation Models API reference.
Core ML Overview
Apple's framework for deploying trained models. Automatically dispatches to the
optimal compute unit (CPU, GPU, or Neural Engine).
Model Formats
| Extension | Format | When to Use |
|---|---|---|
| .mlpackage | Directory (mlprogram) | All new models (iOS 15+) |
| .mlmodel | Single file (neuralnetwork) | Legacy only (iOS 11-14) |
| .mlmodelc | Compiled | Pre-compiled for faster loading |
Always use mlprogram (.mlpackage) for new work.
Conversion Pipeline (coremltools)
python
import torch
import coremltools as ct

# `model` is your trained torch.nn.Module; `example_input` is a sample tensor
# matching the declared input shape, e.g. torch.rand(1, 3, 224, 224).

# PyTorch conversion (torch.jit.trace)
model.eval()  # CRITICAL: always call eval() before tracing
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 3, 224, 224), name="image")],
    minimum_deployment_target=ct.target.iOS18,
    convert_to="mlprogram",
)
mlmodel.save("Model.mlpackage")
Optimization Techniques
| Technique | Size Reduction | Accuracy Impact | Best Compute Unit |
|---|---|---|---|
| INT8 per-channel | ~4x | Low | CPU/GPU |
| INT4 per-block | ~8x | Medium | GPU |
| Palettization 4-bit | ~8x | Low-Medium | Neural Engine |
| W8A8 (weights+activations) | ~4x | Low | ANE (A17 Pro/M4+) |
| Pruning 75% | ~4x | Medium | CPU/ANE |
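The size-reduction column can be reproduced with simple arithmetic. The sketch below assumes the factors are measured against Float32 weights (the table does not state its baseline) and ignores per-block scales, palette lookup tables, and sparse-index overhead. The 25M-parameter model and the `weightBytes` helper are hypothetical.

```swift
/// Approximate weight storage in bytes at a given precision.
/// Ignores quantization metadata such as scales and lookup tables.
func weightBytes(parameterCount: Double, bitsPerWeight: Double) -> Double {
    parameterCount * bitsPerWeight / 8
}

let parameters = 25_000_000.0   // hypothetical 25M-parameter vision model

let float32 = weightBytes(parameterCount: parameters, bitsPerWeight: 32)
let int8 = weightBytes(parameterCount: parameters, bitsPerWeight: 8)
let int4 = weightBytes(parameterCount: parameters, bitsPerWeight: 4)
let pruned75 = float32 * (1 - 0.75)   // keep 25% of the Float32 weights

print(float32 / int8)       // 4.0
print(float32 / int4)       // 8.0
print(float32 / pruned75)   // 4.0
```

Measured against a Float16 baseline the same techniques would show roughly half these factors, so it is worth confirming which baseline a published number uses.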
Swift Integration
swift
let config = MLModelConfiguration()
config.computeUnits = .all
let model = try MLModel(contentsOf: modelURL, configuration: config)
// Async prediction (iOS 17+)
let output = try await model.prediction(from: input)
MLTensor (iOS 18+)
Swift type for multidimensional array operations:
swift
import CoreML
let tensor = MLTensor([1.0, 2.0, 3.0, 4.0])
let reshaped = tensor.reshaped(to: [2, 2])
let result = tensor.softmax()
See references/coreml-conversion.md for the
full conversion pipeline and references/coreml-optimization.md
for optimization techniques.
MLX Swift Overview
Apple's ML framework for Swift. Highest sustained generation throughput on
Apple Silicon via unified memory architecture.
Loading and Running LLMs
swift
import MLX
import MLXLLM

let config = ModelConfiguration(id: "mlx-community/Mistral-7B-Instruct-v0.3-4bit")
let model = try await LLMModelFactory.shared.loadContainer(configuration: config)

try await model.perform { context in
    let input = try await context.processor.prepare(
        input: UserInput(prompt: "Hello")
    )
    let stream = try generate(
        input: input,
        parameters: GenerateParameters(temperature: 0.0),
        context: context
    )
    for await part in stream {
        print(part.chunk ?? "", terminator: "")
    }
}
Model Selection by Device
| Device | RAM | Recommended Model | RAM Usage |
|---|---|---|---|
| iPhone 12-14 | 4-6 GB | SmolLM2-135M or Qwen 2.5 0.5B | ~0.3 GB |
| iPhone 15 Pro+ | 8 GB | Gemma 3n E4B 4-bit | ~3.5 GB |
| Mac 8 GB | 8 GB | Llama 3.2 3B 4-bit | ~3 GB |
| Mac 16 GB+ | 16 GB+ | Mistral 7B 4-bit | ~6 GB |
Memory Management
- Never exceed 60% of total RAM on iOS
- Set GPU cache limits: MLX.GPU.set(cacheLimit: 512 * 1024 * 1024)
- Unload models on app backgrounding
- Use "Increased Memory Limit" entitlement for larger models
- Physical device required (no simulator support for Metal GPU)
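The 60% rule can be checked before loading a model. This is a minimal sketch: `fitsMemoryBudget` is a hypothetical helper, the footprints are the approximate figures from the table above, and `ProcessInfo.processInfo.physicalMemory` reports installed RAM rather than the amount iOS will actually grant the app.

```swift
import Foundation

/// Fraction of physical RAM this guide treats as the ceiling on iOS.
let maxRAMFraction = 0.60

/// Returns true when a model's estimated footprint stays within the budget.
/// Hypothetical helper; pass the footprint you measured for your model.
func fitsMemoryBudget(
    modelBytes: UInt64,
    totalRAMBytes: UInt64 = ProcessInfo.processInfo.physicalMemory
) -> Bool {
    Double(modelBytes) <= Double(totalRAMBytes) * maxRAMFraction
}

let gib: UInt64 = 1024 * 1024 * 1024

// ~3.5 GB model on an 8 GB device: 3.5 / 8 = 43.75% of RAM.
print(fitsMemoryBudget(modelBytes: 7 * gib / 2, totalRAMBytes: 8 * gib))   // true

// ~6 GB model on an 8 GB device: 6 / 8 = 75% of RAM.
print(fitsMemoryBudget(modelBytes: 6 * gib, totalRAMBytes: 8 * gib))       // false
```

Omitting `totalRAMBytes` uses the current device's installed memory, which makes the same check usable at launch to pick between a small and a large model.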
See references/mlx-swift.md for full MLX Swift
patterns and llama.cpp integration.
Multi-Backend Architecture
When an app needs multiple AI backends (e.g., Foundation Models + MLX fallback):
swift
func respond(to prompt: String) async throws -> String {
    if SystemLanguageModel.default.isAvailable {
        return try await foundationModelsRespond(prompt)
    } else if canLoadMLXModel() {
        return try await mlxRespond(prompt)
    } else {
        throw AIError.noBackendAvailable
    }
}
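When more than two backends are involved, the same availability-first fallback can be expressed as an ordered list. The sketch below is framework-agnostic: the `TextBackend` protocol, `NoBackendAvailable` error, and `StubBackend` type are hypothetical names introduced here, and the stubs stand in for real Foundation Models or MLX wrappers.

```swift
/// Hypothetical abstraction over one inference backend.
protocol TextBackend: Sendable {
    var name: String { get }
    func isAvailable() async -> Bool
    func respond(to prompt: String) async throws -> String
}

struct NoBackendAvailable: Error {}

/// Uses the first backend, in priority order, that reports itself available.
func respond(
    to prompt: String,
    using backends: [any TextBackend]
) async throws -> (backend: String, text: String) {
    for backend in backends {
        guard await backend.isAvailable() else { continue }
        let text = try await backend.respond(to: prompt)
        return (backend.name, text)
    }
    throw NoBackendAvailable()
}

/// Stub used only to make the sketch runnable.
struct StubBackend: TextBackend {
    let name: String
    let available: Bool
    func isAvailable() async -> Bool { available }
    func respond(to prompt: String) async throws -> String { "[\(name)] \(prompt)" }
}

let backends: [any TextBackend] = [
    StubBackend(name: "foundation-models", available: false),
    StubBackend(name: "mlx", available: true),
]
let result = try await respond(to: "Hello", using: backends)
print(result.backend, result.text)
```

Errors from the selected backend propagate to the caller rather than triggering a silent switch to the next one, which keeps a guardrail refusal from being retried on a different model.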
Serialize all model access through a coordinator actor to prevent contention:
swift
actor ModelCoordinator {
    func withExclusiveAccess<T>(_ work: () async throws -> T) async rethrows -> T {
        try await work()
    }
}
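One caveat worth keeping in mind: Swift actors are reentrant, so while one call is suspended at an `await`, the actor can start another. If strict one-at-a-time execution is required, the coordinator needs an explicit queue. The following is a minimal sketch of that idea, not a drop-in replacement; `SerialGate` is a hypothetical name, and it does not handle task cancellation.

```swift
/// Hypothetical non-reentrant gate: callers run one at a time, in arrival order.
actor SerialGate {
    private var isBusy = false
    private var waiters: [CheckedContinuation<Void, Never>] = []

    private func acquire() async {
        if isBusy {
            // Park this caller until the current owner releases the gate.
            await withCheckedContinuation { waiters.append($0) }
        } else {
            isBusy = true
        }
    }

    private func release() {
        if waiters.isEmpty {
            isBusy = false
        } else {
            // Hand ownership directly to the next waiter; isBusy stays true.
            waiters.removeFirst().resume()
        }
    }

    func withExclusiveAccess<T: Sendable>(
        _ work: @Sendable () async throws -> T
    ) async rethrows -> T {
        await acquire()
        defer { release() }
        return try await work()
    }
}
```

Because ownership is handed straight to the next waiter, a newly arriving caller cannot jump ahead of callers that are already queued.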
Performance Best Practices
- Run outside debugger for accurate benchmarks (Xcode: Cmd-Opt-R, uncheck
"Debug Executable")
- Call prewarm() on Foundation Models sessions before user interaction
- Pre-compile Core ML models to .mlmodelc for faster loading
- Use EnumeratedShapes over RangeDim for Neural Engine optimization
- Use 4-bit palettization for best Neural Engine memory/latency gains
- Batch Vision framework requests in a single call
- Use async prediction (iOS 17+) in Swift concurrency contexts
- Neural Engine (Core ML) is most energy-efficient for compatible operations
Common Mistakes
- No availability check. Calling LanguageModelSession() without checking
SystemLanguageModel.default.availability crashes on unsupported devices.
- No fallback UI. Users on pre-iOS 26 or devices without Apple Intelligence
see nothing. Always provide a graceful degradation path.
- Exceeding the context window. Foundation Models has a 4096 token total
budget (input + output). Long prompts or multi-turn sessions hit this fast.
Monitor token usage and summarize when needed.
- Concurrent requests on one session. LanguageModelSession supports one
request at a time. Check isResponding or serialize access.
- Untrusted content in instructions. User input placed in the instructions
parameter bypasses guardrail boundaries. Keep user content in the prompt.
- Forgetting model.eval() before Core ML tracing. PyTorch models must be
in eval mode before torch.jit.trace. Training-mode artifacts corrupt output.
- Using neuralnetwork format. Always use mlprogram (.mlpackage) for new
Core ML models. The legacy neuralnetwork format is deprecated.
- Exceeding 60% RAM on iOS (MLX Swift). Large models cause OOM kills. Check
device RAM and select appropriate model sizes.
- Running MLX in simulator. MLX requires Metal GPU -- use physical devices.
- Not unloading models on background. iOS reclaims memory aggressively.
Unload MLX/llama.cpp models when scenePhase == .background.
Reference Files
- Foundation Models API -- Complete
LanguageModelSession, @Generable, tool calling, and prompt design reference
- Core ML Conversion -- Model conversion
pipeline from PyTorch, TensorFlow, and other frameworks
- Core ML Optimization -- Quantization,
palettization, pruning, and performance tuning
- MLX Swift & llama.cpp -- MLX Swift patterns,
llama.cpp integration, and memory management