vllm-mlx

Added May 8, 2026
Model & Inference Framework
Open Source
Tags: Python · Large Language Models · Multimodal · vLLM · CLI · Model & Inference Framework · Model Training & Inference · Protocol, API & Integration · Computer Vision & Multimodal

A vLLM-style inference server for Apple Silicon with a native MLX backend. It exposes OpenAI- and Anthropic-compatible APIs from a single process and offers unified multimodal serving, continuous batching, a paged KV cache, and SSD-tiered caching.

Core Positioning

vllm-mlx is an LLM inference serving framework built exclusively for Apple Silicon (M1–M4) with Metal GPUs on macOS. Inspired by vLLM but reimplemented from the ground up on Apple's MLX framework, it addresses the lack of high-throughput, production-grade LLM inference servers in the Apple ecosystem.

API Compatibility

  • OpenAI compatible endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/rerank, /v1/responses
  • Anthropic compatible endpoint: /v1/messages (full support for streaming, tool use, and system prompts)
  • MCP Tool Calling: 12 built-in tool parsers covering OpenAI, Anthropic, Gemini, Qwen, DeepSeek, Gemma, etc.
  • Structured Output: JSON Schema via response_format (based on lm-format-enforcer)
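
As a quick illustration, here is a minimal request against each API flavor on a locally running server (port and model taken from the Quick Start below; a local server typically needs no API key, though exact header requirements may vary):

# OpenAI-style chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
       "messages": [{"role": "user", "content": "Hello"}]}'

# Anthropic-style messages call (max_tokens is required by the Anthropic schema)
curl http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
       "max_tokens": 256,
       "messages": [{"role": "user", "content": "Hello"}]}'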

Throughput & Memory Optimization

  • Continuous Batching: high-throughput handling of concurrent requests
  • Paged KV Cache: memory-efficient KV cache with prefix sharing
  • SSD-tiered KV Cache: spills prefix-cache overflow to disk (--ssd-cache-dir) for long-context agents
  • Warm Prompts: pre-loads popular prefixes at startup (--warm-prompts) for a 1.3–2.25x TTFT improvement
  • Prefix Cache: trie-based cross-request prefix sharing
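
Putting these flags together, a long-context launch might look like the sketch below; the cache directory and prompts.txt are hypothetical, and the exact argument format of --warm-prompts should be checked against the project docs:

# Spill prefix-cache overflow to SSD and pre-warm popular prefixes at startup.
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit \
  --port 8000 \
  --continuous-batching \
  --ssd-cache-dir /tmp/vllm-mlx-kv \
  --warm-prompts prompts.txt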

Multimodal Processing

  • Unified serving: text + image + video + audio
  • Vision models: Gemma 3/4, Qwen3-VL, Pixtral, Llama vision
  • Audio input via audio_url content blocks
  • Native TTS: 11 voices, 15+ languages (Kokoro, Chatterbox, VibeVoice, VoxCPM)
  • Native STT: Whisper series, real-time factor (RTF) up to 197x (M4 Max)
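
Image input presumably follows the OpenAI content-block convention (the source confirms audio_url blocks for audio, which take the same shape); a sketch, where <vision-model> stands in for any of the supported vision models:

# Ask a served vision model to describe a remote image.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<vision-model>",
       "messages": [{"role": "user", "content": [
         {"type": "text", "text": "Describe this image."},
         {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
       ]}]}'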

Advanced Inference & Acceleration

  • Reasoning Extraction: Qwen3, DeepSeek-R1 (--reasoning-parser)
  • MoE Expert Reduction: --moe-top-k, a 7–16% speedup on Qwen3-30B-A3B
  • Speculative Decoding: --mtp (e.g., Qwen3-Next)
  • Sparse Prefill: --spec-prefill to reduce TTFT
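
A combined launch using these acceleration knobs; the parser name, top-k value, and model repo below are illustrative assumptions, and whether the flags compose on a given model is not stated:

# Reasoning extraction plus MoE expert reduction on a Qwen3 MoE model.
vllm-mlx serve mlx-community/Qwen3-30B-A3B-4bit \
  --continuous-batching \
  --reasoning-parser qwen3 \
  --moe-top-k 6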

Observability

  • Prometheus metrics: /metrics endpoint (--metrics)
  • Built-in benchmarking: vllm-mlx bench-serve with CSV/JSON/SQLite output
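
A sketch of both in practice; bench-serve is shown bare, since its target and output options (CSV/JSON/SQLite) are documented in the repo rather than stated here:

# Serve with Prometheus metrics enabled, then scrape the endpoint
# (after the model finishes loading).
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --metrics &
curl http://localhost:8000/metrics

# Run the built-in load test against the running server.
vllm-mlx bench-serve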

Typical Use Cases

  • Local replacement backend for Claude Code / OpenCode (see the sketch after this list)
  • Local high-concurrency LLM serving
  • Multimodal agent unified entry point
  • Ultra-long context conversations (via the SSD-tiered overflow cache)
  • Local, on-device TTS/STT processing
  • Text embedding and reranking services
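
For the Claude Code use case, pointing the client at the local Anthropic-compatible endpoint is typically just an environment variable; the dummy key is an assumption (a local server usually ignores it, but the client may insist the variable be set):

# Route Claude Code's Anthropic API traffic to the local vllm-mlx server.
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=dummy
claude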

Quick Start

uv tool install vllm-mlx
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

Architecture

Requests flow through four layers:

  • API routing layer: OpenAI/Anthropic/rerank/metrics endpoints
  • Scheduling layer: continuous batching, paged KV cache, prefix cache, SSD tiering
  • Model execution layer: mlx-lm, mlx-vlm, mlx-audio, mlx-embeddings
  • Foundation: MLX, Metal kernels, unified memory

Limitations

  • Apple Silicon + macOS only; no Windows/Linux support
  • No standalone docs site; documentation in GitHub docs/ directory
  • No associated academic papers
  • No publicly known production deployment cases
