We propose a unified multimodal framework, Universal Multi-Modal Generation Enabling Any-to-Any Transformation, for converting between text, audio, and image inputs and outputs. The system integrates three core capabilities: speech understanding using Whisper, visual understanding through LLaVA, and speech synthesis via PyTorch-based text-to-speech models. All modules are deployed on-premise using Docker, which keeps data within a local, privacy-centric execution environment and reduces the operational overhead associated with cloud processing. The framework supports workflows including document/PDF-to-text extraction, text-to-speech conversion, and image-driven description generation, enabling accessible and interactive multimodal content pipelines. The implementation emphasizes efficient orchestration and inference to meet real-time constraints. Experimental results across multiple cross-modal tasks demonstrate robust accuracy and consistently low latency, suggesting that local, containerized multimodal systems can deliver scalable performance for practical applications. The approach is particularly relevant to accessibility, education, and content creation, where rapid modality conversion and data privacy are essential.
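The orchestration idea described above can be illustrated with a minimal routing sketch. This is not the authors' implementation: the stage functions are hypothetical placeholders that only return tagged strings, standing in for where a Whisper, LLaVA, or text-to-speech call might sit, and the names `ROUTES` and `convert` are assumptions introduced for illustration.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[str], str]


# Hypothetical placeholder stages. In a real deployment each would wrap a
# model call (speech recognition, image captioning, speech synthesis);
# here they only tag their input so the routing logic can be followed.
def transcribe_speech(audio_path: str) -> str:
    return f"<transcript of {audio_path}>"


def describe_image(image_path: str) -> str:
    return f"<description of {image_path}>"


def synthesize_speech(text: str) -> str:
    return f"<audio for: {text}>"


# Each (source, target) modality pair maps to an ordered chain of stages.
# Multi-hop conversions reuse single-hop stages with text as the pivot.
ROUTES: Dict[Tuple[str, str], Tuple[Handler, ...]] = {
    ("audio", "text"): (transcribe_speech,),
    ("image", "text"): (describe_image,),
    ("text", "audio"): (synthesize_speech,),
    ("image", "audio"): (describe_image, synthesize_speech),
}


def convert(source: str, target: str, payload: str) -> str:
    """Run the payload through the stage chain registered for the pair."""
    try:
        stages = ROUTES[(source, target)]
    except KeyError:
        raise ValueError(f"no route from {source!r} to {target!r}") from None
    for stage in stages:
        payload = stage(payload)
    return payload


if __name__ == "__main__":
    print(convert("image", "audio", "cat.png"))
```

Using text as the pivot modality keeps the route table small: supporting a new conversion means registering a chain of existing stages rather than writing a dedicated converter for every pair.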
Copyright © 2026. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY-NC-SA). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Corresponding Author: Ramakrishna Kolikipogu, krkrishna.cse@gmail.com
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
Conflict of interest: The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.