A vLLM-style inference server for Apple Silicon with a native MLX backend. It exposes OpenAI- and Anthropic-compatible APIs in a single process and provides unified multimodal serving, continuous batching, a paged KV cache, and SSD-tiered caching.
## Core Positioning

vllm-mlx is an LLM inference serving framework built exclusively for Apple Silicon (M1/M2/M3/M4), Metal GPUs, and macOS. Inspired by vLLM but fully reimplemented on Apple's MLX framework, it addresses the lack of high-throughput, production-grade LLM inference servers in the Apple ecosystem.
## API Compatibility

- OpenAI-compatible endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/rerank`, `/v1/responses` (example requests below)
- Anthropic-compatible endpoint: `/v1/messages` (full streaming, tool use, and system prompt support)
- MCP Tool Calling: 12 built-in tool parsers covering OpenAI, Anthropic, Gemini, Qwen, DeepSeek, Gemma, and more
- Structured Output: JSON Schema via `response_format` (based on lm-format-enforcer)
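As a quick illustration of the two API surfaces, the sketch below sends the same prompt to the OpenAI-style and Anthropic-style endpoints. The request bodies follow the standard OpenAI and Anthropic wire formats; the port and model name are taken from the Quick Start below rather than from separate vllm-mlx documentation.

```bash
# OpenAI-style chat completion (assumes the Quick Start server on port 8000)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'

# Anthropic-style messages request against the same server
curl -s http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
        "max_tokens": 64,
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```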
## Throughput & Memory Optimization

- Continuous Batching: high throughput for concurrent requests
- Paged KV Cache: memory-efficient KV cache with prefix sharing
- SSD-tiered KV Cache: prefix cache overflow to disk (`--ssd-cache-dir`) for long-context agents (launch sketch below)
- Warm Prompts: pre-load popular prefixes at startup (`--warm-prompts`), 1.3–2.25x TTFT improvement
- Prefix Cache: trie-based cross-request prefix sharing
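A hypothetical launch command combining the caching features above. The flag names (`--continuous-batching`, `--ssd-cache-dir`, `--warm-prompts`) appear in this feature list; the value formats (a cache directory, a warm-prompt file) are assumptions for illustration.

```bash
# Sketch: continuous batching plus SSD-tiered KV cache and warm prompts.
# Flag names come from the feature list; the value formats are assumed.
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit \
  --port 8000 \
  --continuous-batching \
  --ssd-cache-dir ~/.cache/vllm-mlx/kv \
  --warm-prompts ./warm_prompts.json
```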
## Multimodal Processing

- Unified serving: text + image + video + audio
- Vision models: Gemma 3/4, Qwen3-VL, Pixtral, Llama vision
- Audio input via `audio_url` content blocks (request sketch below)
- Native TTS: 11 voices, 15+ languages (Kokoro, Chatterbox, VibeVoice, VoxCPM)
- Native STT: Whisper series, real-time factor (RTF) up to 197x on M4 Max
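A sketch of a mixed image + audio chat request. The `image_url` block follows the common OpenAI vision content-block format; the `audio_url` block is assumed to mirror it, since the feature list only names `audio_url` content blocks. The model name and media URLs are illustrative.

```bash
# Hypothetical multimodal request: one image plus one audio clip in a single turn.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mlx-community/Qwen3-VL-8B-Instruct-4bit",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe the image, then transcribe the audio clip."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "audio_url", "audio_url": {"url": "https://example.com/clip.wav"}}
          ]
        }]
      }'
```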
## Advanced Inference & Acceleration

- Reasoning Extraction: Qwen3, DeepSeek-R1 (`--reasoning-parser`); see the launch sketch below
- MoE Expert Reduction: `--moe-top-k`, 7–16% speedup on Qwen3-30B-A3B
- Speculative Decoding: `--mtp` (e.g., Qwen3-Next)
- Sparse Prefill: `--spec-prefill` to reduce TTFT
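A sketch of a launch that enables the acceleration flags above for a MoE reasoning model. The flag names come from the feature list; the model name and the specific values (parser name, expert count) are assumptions.

```bash
# Reasoning extraction + MoE expert reduction + sparse prefill (values assumed).
vllm-mlx serve mlx-community/Qwen3-30B-A3B-Instruct-4bit \
  --continuous-batching \
  --reasoning-parser qwen3 \
  --moe-top-k 6 \
  --spec-prefill

# For an MTP-capable model such as Qwen3-Next, speculative decoding would be
# enabled with --mtp instead.
```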
## Observability

- Prometheus metrics: `/metrics` endpoint (`--metrics`); example below
- Built-in benchmarking: `vllm-mlx bench-serve` with CSV/JSON/SQLite output
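A sketch of the observability workflow: start the server with metrics enabled, scrape the Prometheus endpoint, then run the built-in benchmark. The `--metrics` flag, `/metrics` endpoint, and `bench-serve` subcommand come from the list above; `bench-serve`'s own options are not shown because they are not documented here.

```bash
# Serve with Prometheus metrics enabled, then scrape the endpoint.
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --metrics &
sleep 5   # give the server a moment to start (timing is illustrative)
curl -s http://localhost:8000/metrics | head -n 20

# Built-in load benchmark (reports in CSV/JSON/SQLite per the feature list).
vllm-mlx bench-serve
```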
## Typical Use Cases

- Local replacement backend for Claude Code / OpenCode
- Local high-concurrency LLM serving
- Unified entry point for multimodal agents
- Ultra-long-context conversations (via disk overflow cache)
- Local TTS/STT processing
- Text embedding and reranking services
## Quick Start

```bash
uv tool install vllm-mlx
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching
```
## Architecture

Requests flow through four layers, top to bottom:

- API routing layer: OpenAI / Anthropic / rerank / metrics endpoints
- Scheduling layer: continuous batching, paged KV cache, prefix cache, SSD tiering
- Model execution layer: mlx-lm, mlx-vlm, mlx-audio, mlx-embeddings
- Foundation: MLX, Metal kernels, unified memory
## Limitations

- Apple Silicon + macOS only; no Windows/Linux support
- No standalone docs site; documentation lives in the repository's `docs/` directory
- No associated academic papers
- No publicly known production deployment cases