Open-source digital human agent platform that creates real-time video-callable AI agents from a single photo, with RAG knowledge import, voice cloning, and modular plugin architecture.
CyberVerse is an open-source digital human agent platform designed to elevate AI interaction from text/voice chat to face-to-face real-time video communication. From just a single photo, the project generates digital humans with real-time facial animation, natural lip sync, and subtle breathing effects, and endows them with agent capabilities: they don't just converse, they can execute actual tasks.
Real-time video calls are built on WebRTC with P2P streaming and embedded TURN/NAT traversal, achieving roughly 1.5 s first-frame latency for unlimited-duration, low-latency conversations that are neither pre-recorded nor turn-based. The system supports mixed voice and text input within the same session, voice interruption, session pause/resume, and user-side camera input for understanding gestures and visual cues.
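Voice interruption (barge-in) and pause/resume imply a small per-call state machine: user speech arriving while the agent is speaking must cancel the in-flight reply. A minimal sketch of that logic, with hypothetical class and method names not taken from the codebase:

```python
from enum import Enum, auto


class SessionState(Enum):
    IDLE = auto()      # waiting for user input
    SPEAKING = auto()  # agent is rendering TTS/avatar output
    PAUSED = auto()    # session paused by the user


class CallSession:
    """Hypothetical call controller illustrating barge-in and pause/resume."""

    def __init__(self):
        self.state = SessionState.IDLE
        self.pending_reply = None

    def agent_speaks(self, text):
        # Agent starts rendering a reply (TTS audio + lip-synced video).
        self.state = SessionState.SPEAKING
        self.pending_reply = text

    def on_user_speech(self):
        # Barge-in: user speech during agent output cancels the playback
        # so the agent can listen to the new utterance instead.
        if self.state is SessionState.SPEAKING:
            self.pending_reply = None
            self.state = SessionState.IDLE
            return "interrupted"
        return "listening"

    def pause(self):
        self.state = SessionState.PAUSED

    def resume(self):
        self.state = SessionState.IDLE
```

In a real pipeline the cancellation would also flush the TTS and avatar-rendering queues, but the state transitions stay the same.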
The architecture employs a three-process design: a Python inference service, a Go API server, and a Vue/TypeScript frontend, communicating via gRPC. The core design is modular and pluggable: brain, face, voice, and ears are each replaceable plugins that can be mixed and matched via YAML configuration across different LLM, TTS, ASR, and avatar backends. Available avatar models include SoulX-FlashHead (1.3B, with pro/lite modes) and SoulX-LiveAct (18B), using wav2vec2 for audio feature extraction, with optional SageAttention and FlashAttention integration.
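The pluggable "organ" design might be wired through a YAML file along these lines (a hedged sketch: the field names and schema below are illustrative, not the project's actual configuration format):

```yaml
# Illustrative plugin wiring; each "organ" is an independently swappable backend.
agent:
  brain:                    # LLM plugin
    provider: doubao
  face:                     # avatar plugin
    provider: soulx-flashhead
    mode: lite              # pro | lite
  voice:                    # TTS plugin
    provider: doubao-tts
    clone_ref: ./voices/sample.wav   # hypothetical voice-clone reference
  ears:                     # ASR plugin
    provider: doubao-asr
```

The point of such a layout is that swapping one backend (say, a different LLM under `brain`) leaves the other three organs untouched.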
On the agent side, CyberVerse supports importing knowledge, documents, and biographical materials for RAG-based Q&A grounded in character settings. Chat history per character is persisted to disk and auto-loaded on startup. It also supports ByteDance Doubao voice cloning and live streaming output.
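RAG grounding of this kind typically embeds the imported materials, retrieves the passages most similar to the user's question, and prepends them to the LLM prompt. A toy sketch with bag-of-words cosine similarity standing in for a real embedding model (all function names here are hypothetical, not from the codebase):

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a neural model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, documents, k=2):
    """Return the k passages most similar to the query; in a RAG pipeline
    these would be prepended to the LLM prompt as character grounding."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]
```

For example, retrieving against a few biographical snippets returns the passage that shares the most vocabulary with the question, which the LLM then answers in character.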
Built with Python (70.5%), Go (17.1%), Vue (6.5%), and TypeScript (3.8%), licensed under GPL-3.0, and actively developed by maintainer dsd2077 on the main branch (59 commits, no formal release yet). Real-time video requires GPU acceleration: the minimum known real-time configuration is an RTX 4090 with FlashHead Lite mode, while the full Pro experience requires dual RTX 5090s.
Unconfirmed items: no formal release yet, so production readiness is unverified; support details for non-Doubao LLM/TTS/ASR plugins are TBD; LiveKit SFU mode, multi-agent collaboration, and an embedded SDK are on the roadmap but not yet implemented; lower-end GPU support is TBD; the implementation scope of user-side visual understanding and live streaming output is TBD.