vllm-mlx

Added May 8, 2026
Model & Inference Framework
Open Source
Tags: Python · Large Language Models · Multimodal · vLLM · CLI · Model & Inference Framework · Model Training & Inference · Protocol, API & Integration · Computer Vision & Multimodal

A vLLM-style inference server for Apple Silicon with a native MLX backend. It exposes OpenAI- and Anthropic-compatible APIs from a single process and offers unified multimodal serving, continuous batching, a paged KV cache, and SSD-tiered caching.

Core Positioning

vllm-mlx is an LLM inference serving framework built exclusively for Apple Silicon (M1–M4) with Metal GPUs on macOS. Inspired by vLLM but reimplemented from the ground up on Apple's MLX framework, it addresses the lack of high-throughput, production-grade LLM inference servers in the Apple ecosystem.

API Compatibility

  • OpenAI compatible endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/rerank, /v1/responses
  • Anthropic compatible endpoint: /v1/messages (full support for streaming, tool use, and system prompts)
  • MCP Tool Calling: 12 built-in tool parsers covering OpenAI, Anthropic, Gemini, Qwen, DeepSeek, Gemma, etc.
  • Structured Output: JSON Schema via response_format (based on lm-format-enforcer)
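
As a quick illustration, here is a minimal request against each API flavor on a locally running server (port and model taken from the Quick Start below; a local server typically needs no API key, though exact header requirements may vary):

# OpenAI-style chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
       "messages": [{"role": "user", "content": "Hello"}]}'

# Anthropic-style messages call (max_tokens is required by the Anthropic schema)
curl http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
       "max_tokens": 256,
       "messages": [{"role": "user", "content": "Hello"}]}'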

Throughput & Memory Optimization

  • Continuous Batching: high-throughput handling of concurrent requests
  • Paged KV Cache: memory-efficient KV cache with prefix sharing
  • SSD-tiered KV Cache: spills prefix-cache overflow to disk (--ssd-cache-dir) for long-context agents
  • Warm Prompts: pre-loads popular prefixes at startup (--warm-prompts) for a 1.3–2.25x TTFT improvement
  • Prefix Cache: trie-based cross-request prefix sharing
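
Putting these flags together, a long-context launch might look like the sketch below; the cache directory and prompts.txt are hypothetical, and the exact argument format of --warm-prompts should be checked against the project docs:

# Spill prefix-cache overflow to SSD and pre-warm popular prefixes at startup.
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit \
  --port 8000 \
  --continuous-batching \
  --ssd-cache-dir /tmp/vllm-mlx-kv \
  --warm-prompts prompts.txt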

Multimodal Processing

  • Unified serving: text + image + video + audio
  • Vision models: Gemma 3/4, Qwen3-VL, Pixtral, Llama vision
  • Audio input via audio_url content blocks
  • Native TTS: 11 voices, 15+ languages (Kokoro, Chatterbox, VibeVoice, VoxCPM)
  • Native STT: Whisper series, real-time factor (RTF) up to 197x (M4 Max)
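
Image input presumably follows the OpenAI content-block convention (the source confirms audio_url blocks for audio, which take the same shape); a sketch, where <vision-model> stands in for any of the supported vision models:

# Ask a served vision model to describe a remote image.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<vision-model>",
       "messages": [{"role": "user", "content": [
         {"type": "text", "text": "Describe this image."},
         {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
       ]}]}'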

Advanced Inference & Acceleration

  • Reasoning Extraction: Qwen3, DeepSeek-R1 (--reasoning-parser)
  • MoE Expert Reduction: --moe-top-k, a 7–16% speedup on Qwen3-30B-A3B
  • Speculative Decoding: --mtp (e.g., Qwen3-Next)
  • Sparse Prefill: --spec-prefill to reduce TTFT
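
A combined launch using these acceleration knobs; the parser name, top-k value, and model repo below are illustrative assumptions, and whether the flags compose on a given model is not stated:

# Reasoning extraction plus MoE expert reduction on a Qwen3 MoE model.
vllm-mlx serve mlx-community/Qwen3-30B-A3B-4bit \
  --continuous-batching \
  --reasoning-parser qwen3 \
  --moe-top-k 6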

Observability

  • Prometheus metrics: /metrics endpoint (--metrics)
  • Built-in benchmarking: vllm-mlx bench-serve with CSV/JSON/SQLite output
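
A sketch of both in practice; bench-serve is shown bare, since its target and output options (CSV/JSON/SQLite) are documented in the repo rather than stated here:

# Serve with Prometheus metrics enabled, then scrape the endpoint
# (after the model finishes loading).
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --metrics &
curl http://localhost:8000/metrics

# Run the built-in load test against the running server.
vllm-mlx bench-serve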

Typical Use Cases

  • Local replacement backend for Claude Code / OpenCode (see the sketch after this list)
  • Local high-concurrency LLM serving
  • Multimodal agent unified entry point
  • Ultra-long context conversations (via the SSD-tiered overflow cache)
  • Local, on-device TTS/STT processing
  • Text embedding and reranking services
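
For the Claude Code use case, pointing the client at the local Anthropic-compatible endpoint is typically just an environment variable; the dummy key is an assumption (a local server usually ignores it, but the client may insist the variable be set):

# Route Claude Code's Anthropic API traffic to the local vllm-mlx server.
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=dummy
claude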

Quick Start

uv tool install vllm-mlx
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

Architecture

Requests flow through four layers:

  • API routing layer: OpenAI/Anthropic/rerank/metrics endpoints
  • Scheduling layer: continuous batching, paged KV cache, prefix cache, SSD tiering
  • Model execution layer: mlx-lm, mlx-vlm, mlx-audio, mlx-embeddings
  • Foundation: MLX, Metal kernels, unified memory

Limitations

  • Apple Silicon + macOS only; no Windows/Linux support
  • No standalone docs site; documentation in GitHub docs/ directory
  • No associated academic papers
  • No publicly known production deployment cases
