Rapid-MLX

Added May 4, 2026
Model & Inference Framework
Open Source
Python · Large Language Models · Model Context Protocol · Multimodal · AI Agents · Agent Framework · CLI · Model & Inference Framework · Model Training & Inference · Protocol, API & Integration · Computer Vision & Multimodal

A local AI inference engine for Apple Silicon with an OpenAI-compatible API, multimodal support, tool calling, and smart cloud routing.

Rapid-MLX is a local AI inference engine designed specifically for Apple Silicon (M1/M2/M3/M4). Built on Apple's MLX framework, it uses unified memory and native Metal compute kernels for high-performance inference. It exposes a complete OpenAI-compatible API (/v1/chat/completions, /v1/completions, /v1/messages, /v1/embeddings, audio endpoints, etc.) as a drop-in replacement, so it integrates directly with mainstream tools and frameworks including Cursor, Claude Code, Aider, PydanticAI, LangChain, and smolagents.
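Because the API surface matches OpenAI's, most of these tools only need a base-URL override. A minimal sketch for LangChain, assuming the langchain-openai package and the default port 8000 from the quick start below:

from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI chat wrapper at the local Rapid-MLX server.
llm = ChatOpenAI(
    model="default",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # the local server does not require a real key
)
print(llm.invoke("Say hello").content)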

Key features:

- Tool calling: 17 supported parser formats, with automatic recovery of corrupted tool-call output from quantized models (sketched below).
- Reasoning separation: isolates chain-of-thought output from models such as Qwen3 and DeepSeek-R1 into a dedicated reasoning_content field.
- Prompt cache: cross-request persistent caching for a 2-5x improvement in time to first token (TTFT), using RNN state snapshot recovery for hybrid models.
- Smart cloud routing: automatically routes large-context requests to cloud LLMs based on a new-token threshold.
- Multi-modal support: vision, audio, video understanding, and text embeddings.

Additional capabilities include KV cache quantization, continuous batching, logprobs, and structured JSON output.
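A minimal sketch of the tool-calling path, using the standard OpenAI function-calling format (the get_weather tool is hypothetical, and the reasoning_content access assumes the field described above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# A hypothetical tool, declared in the standard OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:  # the model decided to call the tool
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)

# For reasoning models, chain-of-thought should arrive in the separate
# reasoning_content field rather than in message.content.
print(getattr(message, "reasoning_content", None))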

The project is at version 0.6.9 in Beta status, licensed under Apache-2.0, and requires Python >= 3.10. It offers three installation methods (Homebrew, pip, one-line script), ships rapid-mlx doctor for environment diagnostics and rapid-mlx agents --test for compatibility testing, and maintains a suite of 2,100+ unit tests.
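Both checks are plain CLI invocations:

# Environment diagnostics
rapid-mlx doctor

# Agent/framework compatibility test
rapid-mlx agents --test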

Installation:

# Option 1: Homebrew
brew install raullenchai/rapid-mlx/rapid-mlx

# Option 2: pip
pip install rapid-mlx

# Option 3: one-line install script
curl -fsSL https://raullenchai.github.io/Rapid-MLX/install.sh | bash

Quick Start:

# Start the local server with a model
rapid-mlx serve gemma-4-26b

# Query the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Say hello"}]}'

Python SDK (using OpenAI SDK directly):

from openai import OpenAI

# Any OpenAI SDK client works; just point it at the local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="default",  # "default" refers to the model the server was launched with
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
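The structured JSON output listed under additional capabilities should be reachable the same way; a hedged sketch assuming the server honors OpenAI's response_format parameter:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": 'List three colors as a JSON object under the key "colors".'}],
    response_format={"type": "json_object"},  # assumed supported, per the OpenAI compatibility claim
)
print(response.choices[0].message.content)  # e.g. {"colors": ["red", "green", "blue"]}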

Unconfirmed: performance benchmarks (e.g., "4.2x faster than Ollama") are self-reported and lack independent verification; the long-term maintenance commitment behind Day-0 frontier-model support is unclear; and cloud routing requires user-configured API keys, with no detailed list of supported cloud models.
