
AI-driven audio-to-video generation for dynamic content creation via stable diffusion and CNN-augmented transformers – Scientific Reports

References

  1. Zhou, P. et al. A survey on generative AI and LLM for video generation, understanding, and streaming. Preprint at https://arxiv.org/abs/2404.16038 (2024).

  2. Kim, D., Joo, D. & Kim, J. TiVGAN: text to image to video generation with step-by-step evolutionary generator. IEEE Access 8, 153113–153122 (2020).

  3. Vondrick, C., Pirsiavash, H. & Torralba, A. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, 613–621 (2016).

  4. Singh, P. & Reibman, A. R. Task-aware image quality estimators for face detection. EURASIP J. Image Video Process. 2024 (1). https://doi.org/10.1186/s13640-024-00660-1 (2024).

  5. Waseem, S. et al. Multiattention-based approach for deepfake face and expression swap detection and localization. EURASIP J. Image Video Process. 2023 (14). https://doi.org/10.1186/s13640-023-00614-z (2023).

  6. Bain, M., Nagrani, A., Varol, G. & Zisserman, A. Condensed movies: story-based video generation with sparse annotations. In ECCV (2022).

  7. Soomro, K., Zamir, A. R. & Shah, M. UCF101: a dataset of 101 human action classes from videos in the wild. Preprint at https://arxiv.org/abs/1212.0402 (2012).

  8. Xu, J., Mei, T., Yao, T. & Rui, Y. MSR-VTT: a large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5288–5296 (2016).

  9. Yu, S. et al. Generating videos with dynamics-aware implicit generative adversarial networks. Preprint at https://arxiv.org/abs/2202.10571 (2022).

  10. Mittal, G., Marwah, T. & Balasubramanian, V. N. Sync-DRAW: automatic video generation using deep recurrent attentive architectures. In Proceedings of the 25th ACM International Conference on Multimedia, 1096–1104 (2017).

  11. Goodfellow, I. J. et al. Generative adversarial networks. Preprint at https://arxiv.org/abs/1406.2661 (2014).

  12. Villegas, R., Yang, J., Hong, S., Lin, X. & Lee, H. Decomposing motion and content for natural video sequence prediction. Preprint at https://arxiv.org/abs/1706.08033 (2017).

  13. Höppe, T., Mehrjou, A., Bauer, S., Nielsen, D. & Dittadi, A. Diffusion models for video prediction and infilling. Preprint at https://arxiv.org/abs/2206.07696 (2022).

  14. Chen, B., Wang, W., Wang, J. & Chen, X. Video imagination from a single image with transformation generation. Preprint at https://arxiv.org/abs/1706.04124 (2017).

  15. Tulyakov, S., Liu, M.-Y., Yang, X. & Kautz, J. MoCoGAN: decomposing motion and content for video generation. Preprint at https://arxiv.org/abs/1707.04993 (2017).

  16. Yan, W., Zhang, Y., Abbeel, P. & Srinivas, A. VideoGPT: video generation using VQ-VAE and transformers. Preprint at https://arxiv.org/abs/2104.10157 (2021).

  17. Berg, T. L., Berg, A. C. & Shih, J. Automatic attribute discovery and characterization from noisy web data. In ECCV (2010).

  18. Huang, J. et al. Large language models can self-improve. Preprint at https://arxiv.org/abs/2210.11610 (2022).

  19. Mittal, A., Wang, Z. & Divakaran, A. Sync-Draw: synchronizing sketches with audio using LSTMs. In IEEE International Conference on Multimedia and Expo (ICME) (2017).

  20. Fan, L., Chen, Y. & Cheng, Y. Federated learning for mitigating bias in generative models. In International Conference on Artificial Intelligence and Statistics (AISTATS) (2024).

  21. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. Preprint at https://arxiv.org/abs/2112.10752 (2021).

  22. Satapathy, S. K. & Parmar, D. Video generation by summarizing the generated transcript. In 2023 3rd Asian Conference on Innovation in Technology (ASIANCON), Ravet, India, 1–5. https://doi.org/10.1109/ASIANCON58793.2023.10270304 (2023).

  23. Shankar, M. G. & Surendran, D. An effective video captioning based on language description using a novel graylag deep kookaburra reinforcement learning. EURASIP J. Image Video Process. 2025 (1). https://doi.org/10.1186/s13640-024-00662-z (2025).

  24. Saito, M., Matsumoto, E. & Saito, S. Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE International Conference on Computer Vision, 2830–2839 (2017).

  25. Abu Sufian. AI-generated videos and deepfakes: a technical primer. Preprint at TechRxiv. https://doi.org/10.36227/techrxiv.172348990.01007128/v1 (2024).

  26. Mansimov, E., Parisotto, E., Ba, J. L. & Salakhutdinov, R. Generating images from captions with attention. In ICLR (2016).

  27. Pan, Y., Mei, T., Yao, T., Li, H. & Rui, Y. Jointly modelling embedding and translation to bridge video and text. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).

  28. Zhu, X. et al. RMER-DT: robust multimodal emotion recognition in conversational contexts based on diffusion and transformers. Inf. Fusion 123, 103268. https://doi.org/10.1016/j.inffus.2025.103268 (2025).

  29. Esser, P., Chiu, J., Atighehchian, P., Granskog, J. & Germanidis, A. Structure and content-guided video synthesis with diffusion models. Preprint at https://arxiv.org/abs/2302.03011 (2023).

  30. Wang, R. et al. RAFT: robust adversarial fusion transformer for multimodal sentiment analysis. Array 100445. https://doi.org/10.1016/j.array.2025.100445 (2025).

  31. Wang, R. et al. CIME: contextual interaction-based multimodal emotion analysis with enhanced semantic information. IEEE Trans. Comput. Soc. Syst. 1–11. https://doi.org/10.1109/tcss.2025.3572495 (2025).

  32. Wang, R. et al. Contrastive-based removal of negative information in multimodal emotion analysis. Cogn. Comput. 17 (3). https://doi.org/10.1007/s12559-025-10463-9 (2025).

  33. Huang, Y., Zhu, X., Wang, R., Xie, Y. & Fong, S. A dynamic global–local spatiotemporal graph framework for multi-city PM2.5 long-term forecasting. Remote Sens. 17 (16), 2750. https://doi.org/10.3390/rs17162750 (2025).

  34. Wang, J. et al. Knowledge generation and distillation for road segmentation in intelligent transportation systems. IEEE Trans. Intell. Transp. Syst. 1–13. https://doi.org/10.1109/tits.2025.3577794 (2025).

  35. Ye, Y. et al. Advancing federated domain generalization in ophthalmology: vision enhancement and consistency assurance for multicenter fundus image segmentation. Pattern Recogn. 111993. https://doi.org/10.1016/j.patcog.2025.111993 (2025).

  36. Gao, M. et al. Towards trustworthy image super-resolution via symmetrical and recursive artificial neural network. Image Vis. Comput. 105519. https://doi.org/10.1016/j.imavis.2025.105519 (2025).

  37. Zhu, X. et al. A client-server based recognition system: non-contact single/multiple emotional and behavioral state assessment methods. Comput. Methods Programs Biomed. 260, 108564. https://doi.org/10.1016/j.cmpb.2024.108564 (2024).

  38. Guo, S., Li, Q., Gao, M., Zhu, X. & Rida, I. Generalizable deepfake detection via spatial kernel selection and halo attention network. Image Vis. Comput. 105582. https://doi.org/10.1016/j.imavis.2025.105582 (2025).

  39. Song, W. et al. Deepfake detection via feature refinement and enhancement network. Image Vis. Comput. 105663. https://doi.org/10.1016/j.imavis.2025.105663 (2025).

  40. Salimans, T. et al. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2234–2242 (2016).

  41. Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. Librispeech: an ASR corpus based on public domain audio books. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 5206–5210. https://doi.org/10.1109/ICASSP.2015.7178964 (2015).

  42. Balaji, Y. et al. eDiff-I: text-to-image diffusion models with an ensemble of expert denoisers. Preprint at https://arxiv.org/abs/2211.01324 (2022).

  43. Lin, T. Y. et al. Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014, Vol. 8693 (eds Fleet, D., Pajdla, T., Schiele, B. & Tuytelaars, T.) (Springer, 2014). https://doi.org/10.1007/978-3-319-10602-1_48

  44. Blattmann, A. et al. Stable video diffusion: scaling latent video diffusion models to large datasets. Preprint at https://arxiv.org/abs/2311.15127 (2023).

  45. Nyame, L. & Staphord, B. Generative artificial intelligence trend on video generation. Preprint at Preprints.org. https://doi.org/10.20944/preprints202409.0195.v1 (2024).

  46. Wu, J. et al. Tune-A-Video: one-shot tuning of text-to-video diffusion models. In CVPR (2023).

  47. Zhang, C., Zhang, C., Zhang, M. & Kweon, I. S. Text-to-image diffusion models in generative AI: a survey. Preprint at https://arxiv.org/abs/2303.07909 (2023).

  48. Fan, F., Luo, C., Gao, W. & Zhan, J. AIGCBench: comprehensive evaluation of image-to-video content generated by AI. Preprint at https://arxiv.org/abs/2401.01651 (2024).

  49. Weng, W. et al. ART-V: autoregressive text-to-video generation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7395–7405 (2024).

  50. Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Preprint at https://arxiv.org/abs/2006.11239 (2020).

  51. Blattmann, A. et al. Align your latents: high-resolution video synthesis with latent diffusion models. Preprint at https://arxiv.org/abs/2304.08818 (2023).

  52. Ma, X. et al. Latte: latent diffusion transformer for video generation. Preprint at https://arxiv.org/abs/2401.03048v1 (2024).

  53. Li, C. et al. A survey on long video generation: challenges, methods, and prospects. Preprint at https://arxiv.org/abs/2403.16407 (2024).

  54. Unterthiner, T., Nessler, B., Heigold, G., Aichbauer, M. & Hochreiter, S. Towards accurate generative models of video: a new metric & challenges. Preprint at https://arxiv.org/abs/1812.01717 (2018).

  55. Hessel, M. et al. AViTAR: adversarial video-to-audio retrieval. Preprint at https://arxiv.org/abs/2107.06818 (2021).

  56. Liu, Y. et al. EvalCrafter: benchmarking and evaluating large video generation models. Preprint at https://arxiv.org/abs/2310.11440 (2023).

  57. Patterson, D. et al. Carbon emissions and large neural network training. Preprint at https://arxiv.org/abs/2104.10350 (2021).
