
AI-driven audio-to-video generation for dynamic content creation via stable diffusion and CNN-augmented transformers – Scientific Reports

References

  1. Zhou, P. et al. A survey on generative AI and LLM for video generation, understanding, and streaming. Preprint at https://arxiv.org/abs/2404.16038 (2024).

  2. Kim, D., Joo, D. & Kim, J. TiVGAN: text to image to video generation with step-by-step evolutionary generator. IEEE Access 8, 153113–153122 (2020).

  3. Vondrick, C., Pirsiavash, H. & Torralba, A. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, 613–621 (2016).

  4. Singh, P. & Reibman, A. R. Task-aware image quality estimators for face detection. EURASIP J. Image Video Process. 2024 (1). https://doi.org/10.1186/s13640-024-00660-1 (2024).

  5. Waseem, S. et al. Multiattention-based approach for deepfake face and expression swap detection and localization. EURASIP J. Image Video Process. 2023 (14). https://doi.org/10.1186/s13640-023-00614-z (2023).

  6. Bain, M., Nagrani, A., Varol, G. & Zisserman, A. Condensed movies: story-based video generation with sparse annotations. In ECCV (2022).

  7. Soomro, K., Zamir, A. R. & Shah, M. UCF101: a dataset of 101 human action classes from videos in the wild. Preprint at https://arxiv.org/abs/1212.0402 (2012).

  8. Xu, J., Mei, T., Yao, T. & Rui, Y. MSR-VTT: a large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5288–5296 (2016).

  9. Yu, S. et al. Generating videos with dynamics-aware implicit generative adversarial networks. Preprint at https://arxiv.org/abs/2202.10571 (2022).

  10. Mittal, G., Marwah, T. & Balasubramanian, V. N. Sync-DRAW: automatic video generation using deep recurrent attentive architectures. In Proceedings of the 25th ACM International Conference on Multimedia, 1096–1104 (2017).

  11. Goodfellow, I. J. et al. Generative adversarial networks. Preprint at https://arxiv.org/abs/1406.2661 (2014).

  12. Villegas, R., Yang, J., Hong, S., Lin, X. & Lee, H. Decomposing motion and content for natural video sequence prediction. Preprint at https://arxiv.org/abs/1706.08033 (2017).

  13. Höppe, T., Mehrjou, A., Bauer, S., Nielsen, D. & Dittadi, A. Diffusion models for video prediction and infilling. Preprint at https://arxiv.org/abs/2206.07696 (2022).

  14. Chen, B., Wang, W., Wang, J. & Chen, X. Video imagination from a single image with transformation generation. Preprint at https://arxiv.org/abs/1706.04124 (2017).

  15. Tulyakov, S., Liu, M.-Y., Yang, X. & Kautz, J. MoCoGAN: decomposing motion and content for video generation. Preprint at https://arxiv.org/abs/1707.04993 (2017).

  16. Yan, W., Zhang, Y., Abbeel, P. & Srinivas, A. VideoGPT: video generation using VQ-VAE and transformers. Preprint at https://arxiv.org/abs/2104.10157 (2021).

  17. Berg, T. L., Berg, A. C. & Shih, J. Automatic attribute discovery and characterization from noisy web data. In ECCV (2010).

  18. Huang, J. et al. Large language models can self-improve. Preprint at https://arxiv.org/abs/2210.11610 (2022).

  19. Mittal, A., Wang, Z. & Divakaran, A. Sync-Draw: synchronizing sketches with audio using LSTMs. In IEEE International Conference on Multimedia and Expo (ICME) (2017).

  20. Fan, L., Chen, Y. & Cheng, Y. Federated learning for mitigating bias in generative models. In International Conference on Artificial Intelligence and Statistics (AISTATS) (2024).

  21. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. Preprint at https://arxiv.org/abs/2112.10752 (2021).

  22. Satapathy, S. K. & Parmar, D. Video generation by summarizing the generated transcript. In 2023 3rd Asian Conference on Innovation in Technology (ASIANCON), Ravet, India, 1–5. https://doi.org/10.1109/ASIANCON58793.2023.10270304 (2023).

  23. Shankar, M. G. & Surendran, D. An effective video captioning based on language description using a novel graylag deep kookaburra reinforcement learning. EURASIP J. Image Video Process. 2025 (1). https://doi.org/10.1186/s13640-024-00662-z (2025).

  24. Saito, M., Matsumoto, E. & Saito, S. Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE International Conference on Computer Vision, 2830–2839 (2017).

  25. Abu Sufian. AI-generated videos and deepfakes: a technical primer. Preprint at TechRxiv. https://doi.org/10.36227/techrxiv.172348990.01007128/v1 (2024).

  26. Mansimov, E., Parisotto, E., Ba, J. L. & Salakhutdinov, R. Generating images from captions with attention. In ICLR (2016).

  27. Pan, Y., Mei, T., Yao, T., Li, H. & Rui, Y. Jointly modelling embedding and translation to bridge video and text. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).

  28. Zhu, X. et al. RMER-DT: robust multimodal emotion recognition in conversational contexts based on diffusion and transformers. Inf. Fusion 123, 103268. https://doi.org/10.1016/j.inffus.2025.103268 (2025).

  29. Esser, P., Chiu, J., Atighehchian, P., Granskog, J. & Germanidis, A. Structure and content-guided video synthesis with diffusion models. Preprint at https://arxiv.org/abs/2302.03011 (2023).

  30. Wang, R. et al. RAFT: robust adversarial fusion transformer for multimodal sentiment analysis. Array 100445. https://doi.org/10.1016/j.array.2025.100445 (2025).

  31. Wang, R. et al. CIME: contextual interaction-based multimodal emotion analysis with enhanced semantic information. IEEE Trans. Comput. Soc. Syst. 1–11. https://doi.org/10.1109/tcss.2025.3572495 (2025).

  32. Wang, R. et al. Contrastive-based removal of negative information in multimodal emotion analysis. Cogn. Comput. 17 (3). https://doi.org/10.1007/s12559-025-10463-9 (2025).

  33. Huang, Y., Zhu, X., Wang, R., Xie, Y. & Fong, S. A dynamic global–local spatiotemporal graph framework for multi-city PM2.5 long-term forecasting. Remote Sens. 17 (16), 2750. https://doi.org/10.3390/rs17162750 (2025).

  34. Wang, J. et al. Knowledge generation and distillation for road segmentation in intelligent transportation systems. IEEE Trans. Intell. Transp. Syst. 1–13. https://doi.org/10.1109/tits.2025.3577794 (2025).

  35. Ye, Y. et al. Advancing federated domain generalization in ophthalmology: vision enhancement and consistency assurance for multicenter fundus image segmentation. Pattern Recogn. 111993. https://doi.org/10.1016/j.patcog.2025.111993 (2025).

  36. Gao, M. et al. Towards trustworthy image super-resolution via symmetrical and recursive artificial neural network. Image Vis. Comput. 105519. https://doi.org/10.1016/j.imavis.2025.105519 (2025).

  37. Zhu, X. et al. A client-server based recognition system: non-contact single/multiple emotional and behavioral state assessment methods. Comput. Methods Programs Biomed. 260, 108564. https://doi.org/10.1016/j.cmpb.2024.108564 (2024).

  38. Guo, S., Li, Q., Gao, M., Zhu, X. & Rida, I. Generalizable deepfake detection via spatial kernel selection and halo attention network. Image Vis. Comput. 105582. https://doi.org/10.1016/j.imavis.2025.105582 (2025).

  39. Song, W. et al. Deepfake detection via feature refinement and enhancement network. Image Vis. Comput. 105663. https://doi.org/10.1016/j.imavis.2025.105663 (2025).

  40. Salimans, T. et al. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2234–2242 (2016).

  41. Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. Librispeech: an ASR corpus based on public domain audio books. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 5206–5210. https://doi.org/10.1109/ICASSP.2015.7178964 (2015).

  42. Balaji, Y. et al. eDiff-I: text-to-image diffusion models with an ensemble of expert denoisers. Preprint at https://arxiv.org/abs/2211.01324 (2022).

  43. Lin, T. Y. et al. Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014, Vol. 8693 (eds Fleet, D., Pajdla, T., Schiele, B. & Tuytelaars, T.) (Springer, 2014). https://doi.org/10.1007/978-3-319-10602-1_48

  44. Blattmann, A. et al. Stable video diffusion: scaling latent video diffusion models to large datasets. Preprint at https://arxiv.org/abs/2311.15127 (2023).

  45. Nyame, L. & Staphord, B. Generative artificial intelligence trend on video generation. Preprint at Preprints.org. https://doi.org/10.20944/preprints202409.0195.v1 (2024).

  46. Wu, J. et al. Tune-A-Video: one-shot tuning of text-to-video diffusion models. In CVPR (2023).

  47. Zhang, C., Zhang, C., Zhang, M. & Kweon, I. S. Text-to-image diffusion models in generative AI: a survey. Preprint at https://arxiv.org/abs/2303.07909 (2023).

  48. Fan, F., Luo, C., Gao, W. & Zhan, J. AIGCBench: comprehensive evaluation of image-to-video content generated by AI. Preprint at https://arxiv.org/abs/2401.01651 (2024).

  49. Weng, W. et al. ART-V: autoregressive text-to-video generation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7395–7405 (2024).

  50. Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Preprint at https://arxiv.org/abs/2006.11239 (2020).

  51. Blattmann, A. et al. Align your latents: high-resolution video synthesis with latent diffusion models. Preprint at https://arxiv.org/abs/2304.08818 (2023).

  52. Ma, X. et al. Latte: latent diffusion transformer for video generation. Preprint at https://arxiv.org/abs/2401.03048v1 (2024).

  53. Li, C. et al. A survey on long video generation: challenges, methods, and prospects. Preprint at https://arxiv.org/abs/2403.16407 (2024).

  54. Unterthiner, T., Nessler, B., Heigold, G., Aichbauer, M. & Hochreiter, S. Towards accurate generative models of video: a new metric & challenges. Preprint at https://arxiv.org/abs/1812.01717 (2018).

  55. Hessel, M. et al. AViTAR: adversarial video-to-audio retrieval. Preprint at https://arxiv.org/abs/2107.06818 (2021).

  56. Liu, Y. et al. EvalCrafter: benchmarking and evaluating large video generation models. Preprint at https://arxiv.org/abs/2310.11440 (2023).

  57. Patterson, D. et al. Carbon emissions and large neural network training. Preprint at https://arxiv.org/abs/2104.10350 (2021).
