CyberVerse

Added May 4, 2026
Agent & Tooling
Open Source
Python · Node.js · Knowledge Base · Multimodal · RAG · AI Agents · Web Application · Agent & Tooling · Docs, Tutorials & Resources · Knowledge Management, Retrieval & RAG · Computer Vision & Multimodal

Open-source digital human agent platform that creates real-time video-callable AI agents from a single photo, with RAG knowledge import, voice cloning, and a modular plugin architecture.

CyberVerse is an open-source digital human agent platform designed to elevate AI interaction from text/voice chat to face-to-face real-time video communication. From a single photo, the project generates digital humans with real-time facial animation, natural lip sync, and subtle breathing effects, and gives them agent capabilities: they don't just converse, they can execute actual tasks.

Real-time video calls are built on WebRTC with P2P streaming and embedded TURN/NAT traversal, achieving roughly 1.5s first-frame latency for unlimited-duration, low-latency conversations that are neither pre-recorded nor turn-based. The system supports mixed voice and text input within the same session, voice interruption, session pause/resume, and user-side camera input for understanding gestures and visual cues.
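
To make the connection flow concrete, here is a minimal receive-only client sketch using the aiortc library. The signaling callables, server URLs, and credentials are placeholders; this illustrates the general WebRTC offer/answer pattern with a TURN fallback, not CyberVerse's actual code.

```python
# Hypothetical WebRTC client sketch (aiortc) -- not CyberVerse's code.
from aiortc import RTCConfiguration, RTCIceServer, RTCPeerConnection

async def start_call(signaling_send, signaling_recv):
    # A TURN relay lets peers behind restrictive NATs still connect;
    # the URLs and credentials below are placeholders.
    config = RTCConfiguration(iceServers=[
        RTCIceServer(urls="stun:stun.example.com:3478"),
        RTCIceServer(urls="turn:turn.example.com:3478",
                     username="demo", credential="secret"),
    ])
    pc = RTCPeerConnection(configuration=config)

    # Receive-only: the avatar's audio/video is generated server-side
    # and streamed frame by frame, so playback starts at the first frame.
    pc.addTransceiver("video", direction="recvonly")
    pc.addTransceiver("audio", direction="recvonly")

    @pc.on("track")
    def on_track(track):
        print(f"receiving {track.kind} track")

    # Standard offer/answer exchange over an out-of-band signaling channel.
    offer = await pc.createOffer()
    await pc.setLocalDescription(offer)
    await signaling_send(pc.localDescription)
    answer = await signaling_recv()
    await pc.setRemoteDescription(answer)
    return pc
```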

The architecture employs a three-process design: a Python inference service, a Go API server, and a Vue/TypeScript frontend, communicating via gRPC. The core design is modular and pluggable: brain, face, voice, and ears are each replaceable plugins that can be mixed and matched via YAML configuration to pair different LLM, TTS, ASR, and avatar backends (a sketch of this pattern follows). Available avatar models include SoulX-FlashHead (1.3B, with pro/lite modes) and SoulX-LiveAct (18B), using wav2vec2 for audio feature extraction, with optional SageAttention and FlashAttention integration.
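
As a rough illustration of that plugin model, the sketch below (requires PyYAML) instantiates one backend per role from a YAML document. The plugin names, config keys, and registry are hypothetical, not CyberVerse's actual schema.

```python
# Hypothetical YAML-driven plugin registry -- illustrative schema only.
import yaml

CONFIG = """
brain: {plugin: doubao_llm, model: doubao-pro}
voice: {plugin: doubao_tts, voice_id: cloned-001}
ears:  {plugin: doubao_asr}
face:  {plugin: soulx_flashhead, mode: lite}  # or soulx_liveact
"""

# Each role maps to a family of interchangeable backends.
REGISTRY = {
    "doubao_llm": lambda cfg: f"LLM(model={cfg.get('model')})",
    "doubao_tts": lambda cfg: f"TTS(voice={cfg.get('voice_id')})",
    "doubao_asr": lambda cfg: "ASR()",
    "soulx_flashhead": lambda cfg: f"Avatar(flashhead/{cfg.get('mode')})",
}

def build_agent(config_text: str) -> dict:
    """Instantiate one plugin per role from the YAML config."""
    spec = yaml.safe_load(config_text)
    return {role: REGISTRY[entry.pop("plugin")](entry)
            for role, entry in spec.items()}

print(build_agent(CONFIG))
```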

On the agent side, CyberVerse supports importing knowledge, documents, and biographical materials for RAG-based Q&A grounded in each character's persona. Chat history is persisted to disk per character and auto-loaded on startup (see the sketch below). It also supports ByteDance Doubao voice cloning and live streaming output.
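
The following sketch shows the general shape of retrieval-grounded answering plus per-character history persistence. The word-overlap retriever stands in for a real embedding index, and the file layout is an assumption rather than CyberVerse's on-disk format.

```python
# Illustrative RAG retrieval + chat persistence -- a stand-in, not the
# project's implementation.
import json
from pathlib import Path

KNOWLEDGE = [
    "Ada was born in 1815 and wrote the first published algorithm",
    "Ada collaborated with Charles Babbage on the Analytical Engine",
]

def words(text: str) -> set[str]:
    """Crude tokenizer: lowercase, strip basic punctuation."""
    return set(text.lower().replace("?", "").replace(".", "").split())

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by naive word overlap with the question."""
    q = words(question)
    return sorted(docs, key=lambda d: -len(q & words(d)))[:k]

def load_history(character: str) -> list[dict]:
    path = Path(f"history/{character}.json")  # assumed layout
    return json.loads(path.read_text()) if path.exists() else []

def save_history(character: str, history: list[dict]) -> None:
    path = Path(f"history/{character}.json")
    path.parent.mkdir(exist_ok=True)
    path.write_text(json.dumps(history, indent=2))

history = load_history("ada")
question = "Who did Ada collaborate with?"
context = retrieve(question, KNOWLEDGE)
# A real system would pass `context` plus the persona prompt to the LLM here.
history.append({"role": "user", "content": question, "context": context})
save_history("ada", history)
```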

Built with Python (70.5%), Go (17.1%), Vue (6.5%), and TypeScript (3.8%), licensed under GPL-3.0, and actively developed by maintainer dsd2077 on the main branch (59 commits, no formal Release yet). Real-time video requires GPU acceleration: the minimum known real-time configuration is an RTX 4090 with FlashHead Lite mode, while the full Pro experience requires dual RTX 5090s.

Unconfirmed items: there is no formal Release yet and production readiness is unverified; support details for non-Doubao LLM/TTS/ASR plugins are TBD; LiveKit SFU mode, multi-agent collaboration, and an embedded SDK are on the roadmap but not yet implemented; lower-end GPU support is TBD; and the implementation scope of user-side visual understanding and live streaming output is TBD.
