Companion Tuning
A Voice-Driven Virtual Companion in VR and AR
Introduction
This project explores a voice-driven virtual companion powered by GPT-Neo and deployed within VR and AR environments. Users interact naturally with a digital entity through voice, enabled by real-time speech-to-text, natural language generation, and text-to-speech systems. The goal is to deliver emotionally balanced conversations within immersive scenarios, showcasing how lightweight generative AI can enhance social presence in virtual and augmented reality.
System Architecture
The solution integrates a Unity-based frontend for 3D environment management with a Python backend pipeline. Unity handles audio input, character animation, and scene rendering, while Python processes the audio using Whisper, generates responses using GPT-Neo 125M, and synthesizes replies using Google TTS. Communication between Unity and the backend is maintained through WebSocket, ensuring seamless real-time interaction.
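The project does not document its wire format, but the Unity-to-backend link described above can be sketched as a small JSON message protocol. The field names and message types below are illustrative assumptions, not the project's actual schema:

```python
import json

# Hypothetical message schema for the Unity <-> Python WebSocket link.
# The field names ("type", "payload", "tone") and the set of message
# types are illustrative assumptions, not the project's wire format.

VALID_TYPES = {"audio_chunk", "transcript", "reply", "tts_audio"}

def encode_message(msg_type: str, payload: str, tone: str = "balanced") -> str:
    """Serialize one message for transmission over the WebSocket."""
    return json.dumps({"type": msg_type, "payload": payload, "tone": tone})

def decode_message(raw: str) -> dict:
    """Parse an incoming message, rejecting unknown message types."""
    msg = json.loads(raw)
    if msg.get("type") not in VALID_TYPES:
        raise ValueError(f"unknown message type: {msg.get('type')}")
    return msg
```

A schema like this keeps both sides loosely coupled: Unity only needs to know the message types it renders, and the backend can evolve its pipeline without touching the client.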
Model Training and Data
GPT-Neo was fine-tuned using LoRA for efficient memory usage and rapid training on an Apple M3 system. Training spanned five epochs and employed custom datasets categorized into therapeutic, balanced, and casual dialogue. Tokenization and data formatting ensured tone adherence through prompt markers such as Assistant (therapeutic):. The result is a multi-tone dialogue system capable of generating responses tailored to different interaction contexts.
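The tone conditioning described above can be illustrated with a small prompt-building helper. The marker format Assistant (therapeutic): comes from the project; the function itself and the User: prefix are a sketch, not the actual training code:

```python
# Build tone-conditioned prompts using the project's marker convention,
# e.g. "Assistant (therapeutic):". This helper is an illustrative sketch;
# the "User:" prefix is an assumption, not the project's exact format.

TONES = {"therapeutic", "balanced", "casual"}  # the three dataset categories

def build_prompt(user_text: str, tone: str = "balanced") -> str:
    """Format one example with a tone marker for training or inference."""
    if tone not in TONES:
        raise ValueError(f"unknown tone: {tone}")
    return f"User: {user_text}\nAssistant ({tone}):"
```

At inference time, generation is stopped after the marker line, so the same fine-tuned weights can produce therapeutic, balanced, or casual replies simply by swapping the marker.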
Immersive Scene Design
The Unity platform hosts three distinct interaction scenarios: a seated VR chat, a walk-and-talk simulation with NavMeshAgent navigation, and an ARKit-driven real-world overlay experience on iOS. These scenes feature synchronized lip movement, gesture animation, and environmental interaction, driven by Unity’s Animator and Timeline tools. Each scene aims to elevate realism and user immersion through natural behavior and spatial awareness.
Real-Time Audio Pipeline
Users initiate conversations via voice input, which is transcribed using Whisper, processed by GPT-Neo, and transformed into speech via gTTS. The response is then streamed back and animated in Unity in under 5 seconds. This low-latency pipeline ensures conversational fluidity and includes fallback mechanisms for robustness under suboptimal conditions.
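The fallback behavior mentioned above might look like the following sketch. The latency budget, the canned reply, and the wrapper function are assumptions for illustration, not the project's actual implementation:

```python
import time

# Sketch of a fallback wrapper around the speech pipeline: if any stage
# fails, returns nothing, or exceeds the latency budget, reply with a
# canned message so the companion never goes silent. The 5-second budget
# mirrors the latency target stated above; the reply text is an assumption.

FALLBACK_REPLY = "Sorry, I didn't catch that. Could you say it again?"

def respond_with_fallback(transcribe, generate, audio, budget_s: float = 5.0) -> str:
    """Run transcription then generation, falling back on error or timeout."""
    start = time.monotonic()
    try:
        text = transcribe(audio)  # e.g. a Whisper call in the real pipeline
        if not text or time.monotonic() - start > budget_s:
            return FALLBACK_REPLY
        return generate(text)     # e.g. a GPT-Neo call in the real pipeline
    except Exception:
        return FALLBACK_REPLY
```

Wrapping each stage this way keeps the Unity side simple: it always receives some reply to animate, even when transcription or generation misbehaves.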
System Performance
The model’s training loss declined from 3.21 to 2.78 over five epochs, and evaluation loss dropped to 2.58, indicating convergence. In practice, all three scenes functioned effectively, with minor memory issues observed during heavy ARKit usage. Audio synchronization, latency, and fallback mechanisms performed well, validating the system's real-time viability.
Conclusion and Future Work
The companion system successfully bridges the gap between AI-powered dialogue and immersive spatial interaction. Future work could explore integrating emotion recognition, using expressive TTS systems like ElevenLabs, and expanding to Android platforms with ARCore. Enhancing model memory and integrating facial tracking could significantly boost realism and interactivity.
Project Demo
Watch a full demonstration of the VR/AR GPT companion system below:
Project Report
For a deep technical dive, read the full Companion Tuning research paper:
Developed by Aryan Singh. Explore the full implementation on GitHub.
