
The use of deep learning and artificial intelligence-based digital technologies in art education

Introduction

Research background and motivations

With the rapid progress of information technology, the application of digital technology in various fields has been continuously expanding, and the education sector is no exception. In recent years, advancements in artificial intelligence (AI) and deep learning (DL) technologies have brought about profound changes in educational models and teaching methods. In particular, in the field of art education, traditional teaching methods are facing an increasing number of challenges, such as insufficient innovation in teaching content, the monotony of teaching methods, and uneven student learning outcomes1,2,3. Therefore, how to use modern technologies to promote reform and innovation in art education has become an increasingly important issue of concern in the education field. The rise of DL and AI technologies provides new possibilities for solving these problems4. DL, by simulating the working principles of human brain neurons, can process massive amounts of data and conduct intelligent analysis and decision-making, and it has made significant progress in fields such as image recognition, image generation, and automatic creation. AI, through machine learning and pattern recognition, can assist teachers in evaluating student performance, recommending personalized learning paths, and automating teaching feedback, thus improving teaching efficiency and enhancing the student learning experience5. In art education, the application of DL and AI can break the time and space limitations of traditional education and help students gain access to richer and more diverse learning resources. With intelligent creative tools, students can not only engage in virtual painting creation but also create, display, and evaluate their works on digital platforms, further enhancing the interactivity and creativity of learning6,7,8.

However, despite the significant achievements AI and DL technologies have made in many fields, there are still many challenges in effectively applying these technologies in art education. These challenges include the operability of the technology, the degree of system intelligence, the accuracy of data, and the optimization of system performance. Therefore, the motivation of this work is to explore how to combine DL and AI technologies to design and optimize automated teaching systems for art education. The goal is to create systems that can better assist educators, provide more engaging learning experiences for students, and support the development of critical artistic skills. By addressing these challenges, this work intends to provide feasible technological solutions that are practical and scalable for educational practice. Ultimately, the purpose is to promote the digital transformation of art education, fostering innovation in teaching methods, and ensuring that art education can evolve alongside advances in technology.

Research objectives

The research objectives are as follows:

  1. Explore the potential applications of DL and AI in art education.

  2. Design and implement an automated art education system based on DL and AI.

  3. Optimize the automated teaching system in art education.

Literature review

In previous studies, Hutson and Cotroneo (2023) found that DL models had enormous potential in image generation and art creation. By using convolutional neural networks (CNNs) for style transfer and automatic generation of artworks, they demonstrated how DL technology could be used to automatically create art. This offered new ideas for creative teaching in art education9. Rangel and Duart (2023) suggested that DL could be applied in art education not only for art creation but also for style analysis and automatic grading. By training DL models to classify and evaluate students’ works, the system could effectively reduce teachers’ subjective biases and lead to a more objective and efficient evaluation system10. O’Dea and O’Dea (2023) argued that one of AI’s most important applications in art education was the construction of an automated grading system. By training AI algorithms to analyze students’ artworks, the system could provide reasonable scores based on factors such as color and composition, greatly improving the efficiency and accuracy of grading11. Alazzam et al. (2023) further explored AI’s application in personalized teaching. They stated that AI could analyze students’ learning paths and the characteristics of their works to provide personalized learning suggestions, better meeting students’ diverse learning needs and enhancing their art creation abilities12. Crescenzi-Lanna (2023) pointed out that the application of Virtual Reality (VR) technology in art education provided students with an immersive environment for artistic creation. In a VR environment, students could interact with virtual objects and experience the process of artistic creation, making this immersive learning approach effective in helping students understand and master art creation techniques13.

In summary, the applications of DL, AI, and VR technology in art education are distinctive, with significant progress made in areas such as art creation, intelligent evaluation, personalized learning, and immersive learning experiences. However, despite the many innovations these technologies have introduced, challenges remain, including the complexity of technology implementation, data quality issues, and the adaptability of teachers and students to new technologies.

Research methodology

Theoretical foundations of DL and AI

AI refers to the simulation of human cognitive processes, enabling machines to perform tasks that typically require human intelligence14,15,16. The development of AI encompasses several subfields, including machine learning, natural language processing, computer vision, and intelligent control. In art education, AI applications primarily focus on areas such as intelligent assessment, personalized learning, and artistic creation assistance17. DL is an important branch of machine learning. It is based on the theory of artificial neural networks, simulating the neural network structure of the human brain through multi-layered neuron models. DL is capable of performing complex pattern recognition, data classification, and prediction tasks. DL models exhibit powerful learning abilities, particularly excelling in tasks like large-scale data processing and image recognition18,19,20. The combination of DL and AI enables machines to possess enhanced intelligence capabilities, demonstrating tremendous potential in areas such as art creation, evaluation, and feedback21. Through DL, AI is not only capable of performing efficient art analysis but also generating art with human-like creative styles. Furthermore, DL can analyze vast amounts of historical data, and extract patterns that offer meaningful guidance for students’ creations, thus providing personalized learning recommendations and creative support22.

In summary, AI and DL have brought unprecedented opportunities to art education. By integrating AI’s intelligent assessment and personalized recommendations with DL’s powerful image processing and generation abilities, educators can offer students a more diverse and individualized learning experience. The introduction of these technologies not only breaks the limitations of traditional art education but also provides strong technological support for the innovation and development of future art education23,24,25.

Digital technology application framework in art education

With the rapid development of digital technologies, especially the applications of AI and DL, traditional art education is facing unprecedented opportunities for transformation. By introducing digital technologies into art education, it is possible to significantly improve teaching efficiency and effectiveness, while also providing students with richer and more personalized learning experiences26,27,28. To better understand how these technologies can serve art education, this work proposes a systematic framework for the application of digital technologies. This framework consists of three main modules: Intelligent Creative Assistance, Personalized Learning and Evaluation, and Virtual Interaction and Creative Display. Table 1 presents the details.

Table 1 Systematic digital technology application framework.


Through the design of these three modules, the application framework of digital technologies in art education demonstrates its immense potential. From intelligent creative assistance to personalized learning and evaluation, and to virtual interaction and creative display, digital technologies offer a brand-new teaching model for art education. The introduction of these technologies not only optimizes the teaching process and improves the quality of education but also provides students with richer, more diversified learning experiences and inspires their creativity and artistic potential29,30,31. With the continuous development and refinement of these technologies, digital technologies will play an increasingly important role in the future of art education32.

Optimization of the automated teaching system in art education

Based on deep learning and AI technologies, this work constructs and optimizes an automated teaching system for art education, named Creative Intelligence Cloud (CIC). The system’s overall architecture includes five core modules: the data processing module, the deep learning model module, the intelligent recommendation module, the interactive feedback module, and the user learning analysis module. The system aims to provide integrated support for art style learning, work creation, intelligent evaluation, and personalized teaching. Figure 1 illustrates the framework of the automated teaching system designed.

Fig. 1 Automated teaching system framework.

In the data processing module, the system first cleans and preprocesses the collected image data, including removing samples with missing labels, image blurriness, and inconsistent sizes. All images are resized to 512 × 512 pixels and normalized. Meanwhile, data augmentation techniques (such as rotation, cropping, flipping, and brightness perturbation) are used to expand data diversity and enhance the model’s ability to learn artistic style details. The data mainly comes from the WikiArt dataset and user-uploaded creations. The deep learning model module is the core of the system, mainly including three sub-models: the art style recognition model, the art style transfer model, and the art work scoring model. The art style recognition model uses a residual network as its backbone, based on the ResNet-50 architecture, which has the ability to extract deep features and is suitable for capturing complex artistic image features. On this basis, a transformer mechanism is introduced to enhance the model’s perception of global style features in images. The loss function is cross-entropy, and the AdamW optimizer is used for iterative updates.
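As a concrete illustration of the preprocessing and augmentation pipeline described above (resizing to 512 × 512, normalization, and rotation/cropping/flipping/brightness perturbation), the following is a minimal sketch using torchvision. The specific parameter values (rotation angle, crop scale, jitter strength) are illustrative assumptions rather than the system's reported settings.

```python
# Minimal sketch of the preprocessing/augmentation pipeline described above.
# Parameter values (rotation angle, crop scale, jitter strength) are assumptions.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((512, 512)),                        # unify image size
    transforms.RandomRotation(degrees=15),                # rotation
    transforms.RandomResizedCrop(512, scale=(0.8, 1.0)),  # random cropping
    transforms.RandomHorizontalFlip(p=0.5),               # flipping
    transforms.ColorJitter(brightness=0.2),               # brightness perturbation
    transforms.ToTensor(),                                # scales pixel values to [0, 1]
])

eval_transform = transforms.Compose([                     # no augmentation at evaluation time
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
])
```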

The art style transfer model uses a transfer method based on Generative Adversarial Network (GAN), specifically employing an improved Convolutional Neural Network (CNN) structure as the generator and introducing self-attention mechanisms to enhance style feature modeling capabilities33. The discriminator uses a Patch-based Generative Adversarial Network Discriminator (PatchGAN) structure, which can make judgments on local regions of the image, improving the naturalness and style detail fidelity of the generated images. The model training process incorporates three types of loss functions: content loss to maintain the original structure of the image, style loss to ensure consistency of style features, and adversarial loss to enhance the realism of image generation. This model performs particularly well in transferring styles such as oil painting and sketching. The art work scoring model combines CNN with Long Short-Term Memory (LSTM) networks. The convolutional part is based on the EfficientNet architecture, responsible for extracting spatial features of the image in dimensions such as color, composition, and texture. LSTM is used to model the temporal correlations between scoring criteria. This model is used to evaluate student works multidimensionally in terms of composition, color usage, and creativity. To improve the accuracy and adaptability of scoring, the system introduces reinforcement learning strategies to automatically adjust evaluation standards based on historical scoring data. The intelligent recommendation module provides personalized learning content and creative tasks based on students’ learning behaviors and style preferences. The recommendation mechanism integrates content-based recommendation and collaborative filtering algorithms, while using a transformer encoder to build user style profiles and combining K-means clustering to classify student works, achieving style matching with recommendation tasks. This module also introduces policy gradient algorithms to dynamically update personalized recommendations and optimize learning paths.
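To make the three-part training objective of the style transfer model concrete, the sketch below shows one common way of combining content, style, and adversarial losses for a generator trained against a PatchGAN discriminator. The Gram-matrix style term and the loss weights are standard choices assumed here for illustration; the paper does not specify its exact formulations.

```python
# Sketch of a combined generator objective: content loss + style loss + adversarial loss.
# The Gram-matrix style term and the weights are common assumptions, not the paper's exact setup.
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    # feat: (B, C, H, W) -> (B, C, C) channel correlations, a standard style statistic
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def generator_loss(content_feat, generated_feat, style_feat, disc_on_fake,
                   w_content=1.0, w_style=10.0, w_adv=0.1):
    content_loss = F.mse_loss(generated_feat, content_feat)            # preserve original structure
    style_loss = F.mse_loss(gram_matrix(generated_feat),
                            gram_matrix(style_feat))                   # match target style statistics
    adv_loss = F.binary_cross_entropy_with_logits(
        disc_on_fake, torch.ones_like(disc_on_fake))                   # fool the patch-wise discriminator
    return w_content * content_loss + w_style * style_loss + w_adv * adv_loss
```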

The interactive feedback module provides real-time teaching support for the system. This module integrates object detection algorithms to detect the structural integrity and color matching of works in real time and offers adjustment suggestions. During the scoring process, the system uses visualization methods to show students the basis for scoring, enhancing teaching transparency. For generating learning paths, a Markov decision process is used to create personalized learning plans for students, with Bayesian optimization adjusting the pace of learning goals. The user learning analysis module continuously collects students’ behavior data on the platform to build user profiles and assess their learning trajectories, supporting teachers’ instructional interventions and dynamic adjustments of system-recommended content.
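As a toy illustration of the Markov decision process idea behind the learning-path generation mentioned above, the sketch below treats skill states, candidate lessons, transition probabilities, and rewards as small random placeholders and applies plain value iteration; none of these quantities come from the actual system.

```python
# Toy value-iteration sketch of MDP-based learning-path selection.
# States, actions, transitions, and rewards are illustrative placeholders.
import numpy as np

n_states, n_actions, gamma = 4, 3, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
R = rng.random((n_states, n_actions))                              # expected learning gain of lesson a in state s

V = np.zeros(n_states)
for _ in range(100):                           # value iteration
    V = (R + gamma * P @ V).max(axis=1)

policy = (R + gamma * P @ V).argmax(axis=1)    # recommended next lesson for each skill state
print(policy)
```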

Overall, the system’s modules have clear functions and well-structured algorithms. They meet the needs of art education for style learning, creation, intelligent scoring, interaction feedback, and teaching management. They also provide a technical basis for system performance optimization. For the modules mentioned above, more detailed ablation experiments will be carried out next to clarify the contributions of each structure to the system’s overall performance.

Experimental design and performance evaluation

Datasets collection

The dataset used is the WikiArt dataset, which is widely employed in DL research for art style analysis and artwork generation. The dataset covers various art styles, genres, and artists. It is an ideal dataset for tasks such as art style transfer, automated art creation evaluation, and artwork classification. The dataset information is as follows:

  1. Data volume: approximately 80,000 artwork images, covering hundreds of artists.

  2. Categories: artworks are categorized by art style (such as Impressionism, Modern Art, and Renaissance), artist, theme, and more.

  3. Format: images are in JPG (Joint Photographic Experts Group) format.

  4. Content: the artworks include paintings, sketches, oil paintings, and sculptures, providing rich visual feature information.

The dataset can be downloaded from https://www.kaggle.com/datasets/akashdeepp/wikipedia-art. It is categorized into three types of data: paintings, sketches, and oil paintings, with each category containing 6,000 images. Therefore, this work is able to provide specialized training and evaluation for each art form, effectively improving the system’s accuracy, personalized recommendation ability, data balance for training, and the convenience of subsequent analysis. To ensure data quality, improve model training efficiency, and reduce data bias, this work preprocesses the WikiArt dataset. First, in the data cleaning stage, the dataset is filtered and deduplicated to remove missing labels, low-resolution images, and duplicate entries. Additionally, the Python OpenCV library is used for image grayscale detection to eliminate samples that are blurry, abnormally exposed, or contain excessive noise. Next, during image normalization, all images are resized to a uniform resolution of 512 × 512 to ensure data consistency and reduce model computational overhead. Pixel values of all images are normalized to the range [0,1] to meet the training requirements of neural networks and enhance model stability. To improve the model’s generalization ability, data augmentation techniques are applied, including random rotation, color jittering, mirroring, and random cropping. In particular, for artistic styles with limited samples, data augmentation is used to enhance the model’s ability to learn those styles effectively. Regarding data partitioning, a well-structured division into training, validation, and test sets is implemented to ensure data representativeness. The training set comprises 70% of the data for model learning, the validation set accounts for 15% for hyperparameter tuning and overfitting prevention, and the test set covers 15% for the final model evaluation. Additionally, to ensure the balance of artistic style categories in the dataset, this work uses a stratified sampling method to ensure that each artistic style maintains a similar proportion across the training, validation, and test sets. This can prevent certain style categories from being overlooked during training. For categories with fewer samples, oversampling techniques are applied to increase their weight in the training process, ensuring that the model can effectively learn the characteristics of all styles. Through the aforementioned data preprocessing and partitioning strategies, the optimized system maintains a good balance among data quality, style distribution, and computational efficiency, thereby enhancing the model’s applicability in art education and artistic style analysis.
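The stratified 70%/15%/15% partition described above can be implemented, for example, with scikit-learn; the sketch below uses placeholder variable names (image_paths, style_labels) and an assumed random seed.

```python
# Sketch of the stratified 70%/15%/15% train/validation/test split described above.
# `image_paths` and `style_labels` are placeholders for the actual dataset index.
from sklearn.model_selection import train_test_split

def stratified_split(image_paths, style_labels, seed=42):
    # Carve off 70% for training, stratified by art style.
    x_train, x_rest, y_train, y_rest = train_test_split(
        image_paths, style_labels, train_size=0.70,
        stratify=style_labels, random_state=seed)
    # Split the remaining 30% in half: 15% validation, 15% test.
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, train_size=0.50,
        stratify=y_rest, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```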

Experimental environment and parameters setting

The experimental environment is set up as follows:

  1. Processor: Intel Core i9-12900K.

  2. Graphics processor: NVIDIA GeForce RTX 3090.

  3. Memory: Corsair Vengeance LPX 64GB (2 × 32GB) DDR4-3200.

  4. Storage: Samsung 970 EVO Plus 1 TB NVMe M.2 SSD.

  5. Operating system: Ubuntu 20.04 LTS.

The work also adjusts the system parameters. In the parameter settings of the DL model, the system adopts a residual network as the feature extraction network to prevent the gradient vanishing problem and enhance training stability. By incorporating a transformer architecture with a self-attention mechanism, the system can capture a broader range of image features, improving the accuracy of artistic style recognition. For the style transfer task, a GAN-based approach is employed. The generator utilizes an improved CNN architecture with a self-attention mechanism to optimize style feature capture, while the discriminator adopts a patch-based discrimination method to ensure a more natural style transfer. During the style transfer process, content loss is introduced into the system to ensure that the transformed image can preserve the original composition and object structure. Style loss is employed to enhance the consistency of the target artistic style, and adversarial loss further optimizes the generator to ensure that the generated artwork can better meet artistic standards. In the artwork evaluation and scoring stage, the system integrates CNN with LSTM networks. By extracting spatial and temporal features, it enhances the rationality of the scoring process. Additionally, reinforcement learning is integrated to continuously optimize the scoring mechanism, ensuring the objectivity of the evaluation. During training, different learning rates are set for different tasks. The learning rate for the style recognition task is set to 0.001 to maintain training stability, while the learning rate for the style transfer task is set to 0.0002 to prevent mode collapse in the GAN. In terms of optimization algorithms, AdamW is used for the style recognition task to prevent model overfitting, while RMSProp is applied to the style transfer task to enhance training stability. Batch size selection is also optimized. A batch size of 32 is used for the style recognition task to balance computational stability and training speed. A batch size of 64 is set for the style transfer task to synchronize generator and discriminator optimization, improving the quality of generated images. The dataset is split 80:20 to ensure the smooth execution of the experiments.
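The per-task optimizer settings reported above (AdamW at a learning rate of 0.001 for style recognition, RMSProp at 0.0002 for style transfer, with batch sizes of 32 and 64) could be configured roughly as in the sketch below; the model objects and the weight-decay value are stand-ins for illustration only.

```python
# Sketch of the per-task optimizer configuration reported above.
# The model objects and weight decay are illustrative stand-ins.
import torch

recognition_model = torch.nn.Linear(2048, 10)                 # stand-in for the ResNet-50 + transformer classifier
generator = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)   # stand-in for the GAN generator

recognition_optimizer = torch.optim.AdamW(
    recognition_model.parameters(), lr=1e-3, weight_decay=1e-2)  # style recognition, batch size 32

transfer_optimizer = torch.optim.RMSprop(
    generator.parameters(), lr=2e-4)                              # style transfer, batch size 64
```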

The comparative systems selected for the experiment are Visual Perception Generative Adversarial Network (VPGAN), Artistic Recognition and Transfer Style Convolutional Neural Network (ARTS-CNN), and Deep Art Education and Recommendation Network (DEARNet).

VPGAN is a deep generative model based on GAN, designed to enhance image style transformation capabilities. It consists of two main components: a generator and a discriminator. The generator, typically based on the U-Net architecture, comprises multiple layers of convolution and deconvolution, converting input images into output images with specific styles. The discriminator evaluates the style consistency and structural fidelity of images in the visual feature space, combining perceptual loss with adversarial loss. The key feature of VPGAN is its incorporation of a visual perception mechanism: a pre-trained deep feature extraction network is used to calculate content and style losses, guiding the generator to produce images with greater artistic expressiveness. This mechanism ensures that the generated images not only match the desired style but also retain the structural information of the original images, demonstrating high artistic quality in texture expression, brushstroke simulation, and light and shadow gradation. While VPGAN excels at balancing local details and global structure, its complex training process and high hardware resource demands make it less suitable for large-scale deployment in educational settings.

ARTS-CNN is a multifunctional network system that integrates art style recognition with image style transfer. Its backbone structure is a CNN, often using architectures like AlexNet or a shallow ResNet for feature extraction. It first identifies the art style of an image through a classification task and then performs a predefined style transfer based on the recognition result. ARTS-CNN offers strong real-time performance and stability in style recognition, making it suitable for style categorization, image retrieval, and style label recommendations in teaching. Compared to VPGAN, ARTS-CNN’s style transfer capabilities are more streamlined, typically involving feature space mapping and image reconstruction. This approach, however, limits its detail representation and style diversity. Nevertheless, its simple computational structure and moderate parameter count make ARTS-CNN well suited for practical teaching applications such as instructional demonstrations, student practice, and basic creative guidance, and it serves as a crucial technical support for the transition of traditional teaching systems toward intelligent teaching.

DEARNet is a comprehensive deep learning recommendation system specifically designed for art education scenarios. It focuses on three core areas: artwork evaluation, student profile building, and creation task recommendation. The evaluation module uses a lightweight CNN to automatically score student works, quantitatively assessing dimensions such as composition integrity, color matching, and art style. Based on the analysis of student works and incorporating multi-source data such as past learning records, style preferences, and assignment quality, DEARNet constructs personalized learning profiles. It then employs a combination of collaborative filtering and content-based recommendation methods to intelligently push learning resources, creation tasks, and art style materials. In terms of interaction, DEARNet emphasizes human-computer collaborative teaching: the system dynamically adjusts recommendation content according to students’ creation trajectories, aligning with their learning progress and artistic expression tendencies. Compared to VPGAN and ARTS-CNN, DEARNet may have slightly weaker image generation capabilities. However, it has significant advantages in “teaching process assistance” and “creation feedback guidance,” making it particularly suitable for educational settings with a strong demand for personalized teaching support, such as primary and secondary schools and training institutions.

In summary, VPGAN, ARTS-CNN, and DEARNet respectively represent the technical approaches of the three mainstream systems for art style generation, style recognition, and art education recommendation. VPGAN emphasizes image generation quality and style fidelity, making it suitable for evaluating the generation advantages of the optimized system in this paper for style transfer tasks. ARTS-CNN represents the traditional style processing method based on classification and transformation, which can be used to compare the improvements of the system in this work in terms of style recognition accuracy and style transformation diversity. DEARNet provides a comprehensive user modeling and task recommendation system for art teaching, reflecting the overall performance of the system in this paper in terms of teaching intelligence, interactive feedback capability, and learning path design. By conducting experimental comparisons with these three systems across multiple dimensions, the advanced nature, completeness, and practical educational adaptability of the proposed optimized system can be comprehensively verified.

Performance evaluation

System performance comparison experiment

The performance evaluation experiment is performed. The comparison metrics are divided into four dimensions: model accuracy, generation quality, training efficiency, and style transfer effectiveness. Each dimension includes two comparison metrics. Figure 2 displays the accuracy evaluation results.

Fig. 2 Comparison of model accuracy evaluation results: (a) classification accuracy; (b) precision.

The results shown in Fig. 2 reveal that CIC (the optimized system) performs excellently in classification accuracy across all three art styles (painting, sketch, and oil painting). Specifically, CIC achieves an accuracy of 91.47% in the painting category, 90.56% in the sketch category, and 92.18% in the oil painting category. In comparison, VPGAN has accuracy rates of 84.65%, 82.43%, and 83.28% in painting, sketch, and oil painting, respectively. ARTS-CNN achieves accuracy rates of 88.23%, 86.74%, and 85.56%, while DEARNet’s accuracy rates are 89.56%, 88.21%, and 87.09%. In terms of precision, CIC also demonstrates strong performance. The precision for CIC is 89.02% for painting, 88.18% for sketch, and 89.85% for oil painting. In contrast, VPGAN’s precision is 82.51%, 80.39%, and 81.47%, respectively, for painting, sketch, and oil painting; ARTS-CNN’s precision is 85.29%, 83.11%, and 82.64%; and DEARNet’s precision is 87.15%, 85.72%, and 84.35%. Figure 3 shows the generation quality evaluation results.

Fig. 3 Comparison results from the generation quality dimension: (a) generative adversarial loss; (b) Fréchet inception distance (FID).

The results shown in Fig. 3 indicate that CIC excels in generative adversarial loss, especially in the oil painting category, with a generative adversarial loss of 0.23, compared to 0.27 for painting and 0.25 for sketch. In contrast, VPGAN has generative adversarial losses of 0.54, 0.61, and 0.58 for painting, sketch, and oil painting, respectively. ARTS-CNN’s losses are 0.38, 0.41, and 0.39, while DEARNet’s losses are 0.31, 0.33, and 0.30. CIC’s FID scores across all categories are significantly lower than those of the other systems. For painting, sketch, and oil painting, CIC’s FID scores are 11.23, 10.87, and 9.74, respectively. In comparison, VPGAN’s FID scores are 19.42, 21.13, and 20.58, ARTS-CNN’s scores are 15.88, 16.92, and 17.45, and DEARNet’s scores are 13.72, 14.12, and 14.09. Figure 4 shows the comparison of training efficiency results.
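For context, the FID metric reported in Fig. 3 is conventionally computed from the means and covariances of Inception-network features of real and generated images; the sketch below shows that computation, with the feature-extraction step assumed to have been done elsewhere.

```python
# Sketch of the standard FID computation from Inception-feature statistics.
# mu_* and sigma_* are the feature mean vectors and covariance matrices,
# assumed to be extracted beforehand (that step is not shown here).
import numpy as np
from scipy import linalg

def frechet_inception_distance(mu_real, sigma_real, mu_gen, sigma_gen):
    # FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})
    diff = mu_real - mu_gen
    covmean, _ = linalg.sqrtm(sigma_real @ sigma_gen, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real          # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma_real + sigma_gen - 2.0 * covmean))
```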

Fig. 4 Comparison results from the training efficiency dimension: (a) training time; (b) memory consumption.

The results shown in Fig. 4 demonstrate that CIC also exhibits higher training efficiency in terms of training time. Specifically, CIC’s training times for the painting, sketch, and oil painting categories are 1402.17 s, 1425.61 s, and 1389.48 s, respectively. They are significantly better than VPGAN’s 1800.57 s, 1850.72 s, and 1820.29 s, ARTS-CNN’s 1585.45 s, 1613.89 s, and 1597.32 s, and DEARNet’s 1523.21 s, 1571.68 s, and 1547.89 s. CIC also performs well in terms of memory consumption, with a memory usage of 4.72GB for painting, 4.56GB for sketch, and 4.60GB for oil painting. All of these are much lower than VPGAN’s 6.41GB, 6.58GB, and 6.49GB, ARTS-CNN’s 5.72GB, 5.94GB, and 5.81GB, and DEARNet’s 5.11GB, 5.28GB, and 5.17GB. Figure 5 shows the style transfer effectiveness evaluation results.

Fig. 5 Comparison results from the style transfer effectiveness dimension: (a) content loss; (b) style loss.

The results shown in Fig. 5 indicate that CIC also demonstrates excellent performance in terms of content loss, particularly in the sketch and oil painting categories. CIC’s content loss is 0.17, 0.14, and 0.13, significantly lower than VPGAN’s 0.36, 0.38, and 0.37, ARTS-CNN’s 0.29, 0.32, and 0.31, and DEARNet’s 0.23, 0.26, and 0.25. CIC’s style loss also shows a clear advantage, especially in the oil painting category. CIC’s style loss is 0.23, 0.21, and 0.19, much lower than VPGAN’s 0.42, 0.46, and 0.44, ARTS-CNN’s 0.35, 0.37, and 0.36, and DEARNet’s 0.29, 0.31, and 0.30.

System practical effectiveness evaluation

To further verify the advantages of the optimized system, a real-world performance evaluation experiment is set up. Figure 6 presents the experimental results.

Fig. 6 Results of the actual effect evaluation experiment: (a) image quality evaluation; (b) computational performance and resource consumption; (c) user experience and interactive feedback; (d) style transfer and artistic creation ability.

The results shown in Fig. 6 reveal that CIC excels in every evaluation metric for image quality. Specifically, CIC’s image clarity is 0.89, far higher than that of the other models. In terms of detail performance, CIC scores 0.85, significantly higher than VPGAN (0.71) and ARTS-CNN (0.77), indicating that it better preserves the artistic details in the image. For artistic style consistency, CIC scores 0.87, outperforming VPGAN (0.76) and ARTS-CNN (0.80), demonstrating its ability to more accurately capture the target artistic style. Regarding color accuracy, CIC scores 0.91, again showing better performance than the comparison models, with a higher color reproduction ability. In terms of training time, CIC only requires 1500 s, faster than VPGAN (1800 s) and ARTS-CNN (1700 s), significantly improving training efficiency. In memory usage, CIC uses 4.9GB, contributing to better hardware resource management. For computational efficiency, CIC can generate 2.3 images per second, clearly outperforming VPGAN (1.4 images/second) and ARTS-CNN (1.7 images/second). Additionally, CIC’s Graphics Processing Unit (GPU) utilization rate is 70%, lower than the other models. This further indicates that its computational efficiency is higher, and its demand for hardware resources is lower. In terms of user experience, CIC also stands out. The user satisfaction score for CIC is 4.6, while its operation smoothness score is 4.8. Its response time is 0.8 s, faster than other models, providing a better real-time interaction experience. The interaction complexity score for CIC is 4.8, as its interface design is simple and intuitive, with fewer steps, making it easier for users to operate. In the evaluation of style transfer and artistic creation capabilities, CIC also shows strong performance. For content fidelity, CIC scores 0.83, higher than VPGAN (0.72) and ARTS-CNN (0.76). For style transformation quality, CIC scores 0.81, outperforming the other models. In terms of creative expression, CIC scores 0.80, incorporating more innovative elements into the style transfer process. For artwork visibility, CIC scores 0.85, demonstrating its expressiveness and appeal in artistic creation. Figure 7 presents the system stability and learning effectiveness evaluation results.

Fig. 7 Actual effect evaluation: (a) system stability and scalability; (b) learning outcomes and teaching support.

The experimental results above clearly show that CIC demonstrates significant advantages in system stability, scalability, learning effectiveness, and teaching support, highlighting its broad application potential in art education.

Ablation experiments

To further verify the effectiveness of the optimized system in this paper across all modules, a set of ablation experiments is designed. By progressively removing key modules and observing changes in system performance, the contribution of each optimized component to the final experimental results is assessed.

The experiment removes four key components in turn and compares the resulting performance with that of the complete model (the optimized system in this paper). The ablation experiment models are as follows:

  1. Complete model (the optimized system in this work, with all components retained).

  2. Residual structure removed (ResNet replaced with a shallow VGG architecture).

  3. Generative adversarial network removed (traditional style transformation methods used, without a generator or discriminator).

  4. PatchGAN discriminator removed (a standard full-image discriminator used instead).

  5. CNN + LSTM architecture removed (the scoring model retains only the CNN, with the LSTM removed).

The ablation experiment results of style recognition accuracy are shown in Table 2:

Table 2 Ablation experiment results of style recognition accuracy (%).


In the style recognition task, the complete model performs the best, with an average accuracy of 91.43%. When the residual network is removed and replaced with a shallow VGG structure, the accuracy drops to 86.03%, indicating that the residual structure plays a key role in capturing deep style features. After removing the GAN, the recognition accuracy further decreases, possibly due to the reduced quality of the generated data, which affects the style distribution learned by the model. Removing PatchGAN also leads to a slight drop in accuracy, confirming its effective modeling of local style details. Removing LSTM has little impact on the recognition task, suggesting that the scoring module is relatively independent of the classification module. The results of the generated image quality assessment are shown in Table 3.

Table 3 Evaluation of image generation quality.


In terms of image generation quality, the complete model has the best performance with an average FID of 10.57, indicating that its generated images are closest to real artworks in terms of visual consistency and realism. After removing the GAN, the FID jumps to 18.53, showing a significant drop in quality and highlighting that the GAN is the core mechanism for maintaining generation quality. Removing PatchGAN increases the FID to 14.13, proving that PatchGAN is better at capturing local texture details compared to a full-image discriminator. Removing the residual network also leads to a decline in generation quality (FID of 13.03), indicating that reduced feature extraction depth affects the precision of the generated style. The results of scoring consistency are shown in Table 4.

Table 4 Scoring consistency.


In terms of scoring consistency, the complete model has an average correlation coefficient of 0.83, indicating that its scoring results are highly consistent with expert evaluations. After removing LSTM, the average correlation drops to 0.73, showing that temporal modeling is crucial for the coherence and interpretability of the scoring logic. Removing GAN and PatchGAN also leads to a decrease in scoring consistency, reflecting a strong correlation between image quality and scoring consistency. The scoring consistency after removing the residual network is 0.70, slightly lower than with PatchGAN but higher than the GAN version, further confirming the importance of high-quality feature extraction for the input quality of the scoring module. The model training time assessment results are shown in Table 5.
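As an illustration of how the scoring-consistency figures above can be obtained, the sketch below correlates system scores with expert scores; the score values are hypothetical, and Pearson's r is assumed since the paper does not name the exact correlation coefficient used.

```python
# Sketch of a scoring-consistency check: correlation between system and expert scores.
# The score lists are hypothetical, and Pearson's r is an assumed choice.
from scipy.stats import pearsonr

system_scores = [78, 85, 66, 92, 74, 88]   # hypothetical automated scores
expert_scores = [75, 88, 70, 90, 72, 85]   # hypothetical expert ratings

r, p_value = pearsonr(system_scores, expert_scores)
print(f"correlation = {r:.2f}, p = {p_value:.3f}")
```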

Table 5 Model training time evaluation (s).


In terms of training time, the complete model takes 1405 s to train, which is at a medium level among the ablation structures. Removing the GAN and discriminator significantly reduces the training time (to just 1178 s), reflecting the complexity and computational intensity of the GAN training process. Removing ResNet, LSTM, and PatchGAN each leads to a slight decrease in training time, indicating that while these modules slightly increase the computational burden, they play key roles in system performance and cannot be easily discarded. Overall, there is a significant trade-off between training time and model performance.

Discussion

The optimal experimental results achieved by the proposed system can primarily be attributed to several optimization strategies: improvements in network architecture, optimization of training methods, rational allocation of computational resources, enhancement of user experience, and strengthening of style transfer capabilities. Through these targeted optimizations, the system not only excels in image quality and style consistency but also surpasses other comparison systems in computational efficiency, resource consumption, interactive experience, and artistic creation capabilities.

First, in terms of network architecture improvement, the optimized system adopts a hybrid architecture combining CNN and transformers, with the introduction of GAN for optimization in the style transfer task. Compared to traditional CNN, the optimized network effectively extracts key features of artistic styles in the style classification task, and improves classification accuracy and precision. Particularly in the artistic style recognition task, the optimized system’s classification accuracy is notably higher than that of other models. This is due to the incorporation of residual networks and self-attention mechanisms, which allow the network to retain global style information while extracting deep features from images. This design provides the system with higher stability in various art style classification tasks, and also improves the overall style consistency and credibility of the style transfer task.

Next, regarding training method optimization, the optimized system adopts a new combination of loss functions for style transfer tasks, including content loss, style loss, and adversarial loss. This is intended to ensure that the generated artwork retains the content information of the original image during the style transformation process while accurately transferring the target artistic style. In contrast, VPGAN and ARTS-CNN often experience significant content structure loss or excessive style deviation during style transfer. The optimized system continually refines the style loss during the training of the generator and discriminator, making the generated images closer to real artwork. Moreover, the system employs an adaptive learning rate adjustment mechanism to dynamically adjust the learning rate during training, improving stability and convergence speed. This improvement significantly reduces adversarial loss and enhances the model’s generalization ability.

In terms of computational resource allocation, the optimized system adopts a more efficient model architecture and optimization strategies, significantly reducing computational resource consumption while maintaining high performance. Experimental results show that the optimized system performs excellently in both training time and memory consumption, making it more suitable for training on large-scale datasets and reducing reliance on high-performance computing resources. This advantage is primarily due to two optimizations. One is the use of a lightweight network architecture that reduces the model’s computational complexity. The other is the optimized batch processing strategies that allocate computational resources more efficiently during training, thereby improving computational efficiency. In contrast, VPGAN and ARTS-CNN exhibit relatively lower computational efficiency when handling large-scale artistic datasets, owing to the larger computational overhead of their generators and discriminators. In comparison, the optimized system shortens training time and uses computational resources more efficiently through its more efficient model design.

In terms of improving user experience, the optimized system makes significant optimizations in the interaction interface and user-friendliness. Experimental results show that user satisfaction with the optimized system is much higher than with the other comparative models. This is primarily due to the system’s ease of use, smooth interaction, and personalized learning path recommendations. In contrast, the interaction design of VPGAN and ARTS-CNN is relatively complex, with longer response times that affect the user experience. By enhancing the interaction interface, the optimized system enables users to better understand and adjust art style transformation parameters, while also reducing the number of steps involved and improving the system’s overall usability. Additionally, the optimized system’s real-time feedback function provides immediate suggestions during the user’s creative process, enhances learning efficiency, and allows users to master art creation skills more quickly, further increasing the system’s practical application value.

Finally, in terms of enhancing style transfer capabilities, experimental results show that the optimized system performs excellently in content fidelity and style consistency. The system employs a multi-scale feature extraction-based style transformation method, which allows the converted images not only to retain the main content of the original image but also to accurately simulate the target style’s brushstrokes, textures, and color features. In comparison, other models may experience excessive content loss, over-transformation of style, or insufficient detail representation during style conversion. The optimized system, by contrast, introduces adaptive style loss weight adjustment, making the style transformation of generated images smoother and more natural. Moreover, the system incorporates a degree of creativity during style transfer, making the generated works more expressive, whereas other models tend to have fixed style transformations that lack innovation, which also leads to higher ratings for the optimized system’s artistic creation capability.

Compared to Kirova et al. (2023), this work demonstrates more significant improvements in style transfer and artwork generation quality. Their study primarily used traditional CNN-based style transfer, which performed well in style consistency, but the generated works were somewhat simple in terms of detail and lacked the ability to accurately reproduce the unique brushstrokes and textures of the target style34. The optimized system, by combining multi-scale feature extraction from transformers with GAN, enhances the finesse of style transfer, making the generated works more natural and visually closer to real artwork. Furthermore, this work uses an adaptive style loss optimization strategy, which preserves content structure while enhancing style expression, resulting in generated works that have higher quality and consistency across multiple artistic styles. Compared to Paesano (2023), this work also optimizes computational efficiency and user interaction experience. Their study used deep neural networks combined with traditional style transfer methods. Although their style transfer performance improved, their model had longer training time and slower generation speeds due to the lack of optimization in computational resources, limiting its real-time application in art education35. This work significantly reduces the computational resource consumption of the model through lightweight network architecture and parallel computing strategies, and maintains high generation quality even on devices with low computational power. Moreover, the work optimizes user interaction experience through a real-time feedback system and intelligent recommendation mechanism, improving the system’s suitability in teaching environments and making it better support art education and creation practices.

In summary, the optimized system has undergone targeted optimization in network structure, training methods, computational resources, user experience, and style transfer capabilities, making it superior to comparative systems across multiple performance dimensions. These improvements not only enhance the quality of generated artworks but also increase the system’s applicability in art education, making it an efficient, intelligent, and convenient art creation and teaching assistant tool. The experimental results fully validate the effectiveness of the optimized system, providing important reference value for the application of AI technology in the field of art education.

Conclusion

Research contribution

This work conducts an in-depth analysis of DL and AI technologies, and proposes an innovative art creation system, CIC. This system demonstrates significant advantages in image quality, style transfer ability, and computational performance. Through optimization of this system, the work effectively enhances the detail representation, artistic style consistency, and computational efficiency of generated artworks. This addresses the bottlenecks in quality and efficiency that traditional generative art systems face. Additionally, a novel method based on optimizing deep neural networks is introduced for art education, particularly in style transfer tasks. By optimizing the generative network, CIC not only accurately restores the target artistic style but also incorporates innovative elements to enhance the artistic expressiveness and creativity of generated images. This method provides art education with more creative and personalized art creation tools, and helps to improve the innovation and aesthetic abilities of art learners.

Future works and research limitations

The optimized system has achieved good experimental results in the field of art education, but there are still some shortcomings, mainly in terms of dataset limitations, adaptability to new art styles, and computational resource constraints. Future research needs to address these issues to enhance the system’s generalization ability and practical application value.

First, the limitations of the dataset may affect the model’s generalization ability. The dataset used mainly comes from public art databases and some manually curated art pieces. While it covers common art styles such as painting, sketching, and oil painting, there are still data biases. The current data sources are relatively centralized, which may lead the model to perform more accurately for certain specific styles while struggling with emerging styles such as modern art and digital art. Additionally, the dataset’s support for unique styles like abstract, futurism, and pop art is still insufficient. Existing models may not be able to accurately capture the features of these styles, thus affecting the diversity of the generated results. On the other hand, although data augmentation techniques are applied during training, they mainly focus on basic transformations like color adjustments, rotation, and cropping, without considering the unique brushstrokes, textures, and other detailed features in artistic styles. Future research could expand the dataset’s scale and include artworks from different cultural backgrounds and artistic movements to reduce the impact of data bias. Furthermore, active learning mechanisms could be introduced, allowing the model to dynamically learn new art styles through user interaction and continuously optimize data collection and annotation, thereby improving the model’s applicability. Moreover, the optimized system still has some adaptability issues when dealing with new art styles. The current model’s training data and network architecture primarily target traditional art, making it less adaptable to digital art, mixed styles, and multi-style fusion artworks. When confronted with art pieces that have complex textures or dynamic features, the system’s generation quality may degrade. Additionally, the existing system’s style transfer uses fixed style templates for conversion. While it has optimized the balance between style loss and content loss, it still lacks adaptability. For example, on artworks with fuzzy or highly fused styles, the system may struggle to accurately discern style features, leading to unnatural generated results. Future research could introduce adaptive style modeling methods, such as using meta-learning or self-supervised learning to enhance the model’s generalization ability, allowing it to maintain high-quality generation results for new art styles. Additionally, multi-style fusion methods could be explored, enabling the system to learn and generate artworks with mixed styles, such as combining impressionism with modern abstract art, thus increasing the system’s creative diversity. Furthermore, computational resource constraints are a major challenge that affects the system’s practical application. Although this work’s optimized system has been improved in terms of computational efficiency and resource consumption, there are still certain computational constraints when it comes to large-scale applications and actual deployment. The current system runs efficiently at standard resolutions (such as 512 × 512 pixels). However, when generating ultra-high-resolution images (such as 4–8 K), the computational resource consumption increases significantly, and inference time also extends considerably, which could affect user experience in real-world art education applications. 
Additionally, the current model relies on GPU computing for style transfer and image generation, and the computing power of ordinary computers or mobile devices is limited, which restricts the system’s universality. Although this work optimizes computational resource allocation strategies, low-power devices may still experience slower operation speeds and increased response delays. Moreover, in large-scale educational applications, the system may need to support multiple users for style transfer and art generation tasks simultaneously, which poses higher demands on the server’s concurrent processing capacity. While the current model can operate stably in a single-machine environment, it may encounter computational bottlenecks under high-concurrency requests. Future research could explore model pruning and knowledge distillation techniques to optimize system running efficiency on low-power devices, enabling it to run on mobile devices or lightweight computing platforms. Additionally, cloud computing and edge computing technologies could be adopted to distribute some computational tasks and improve the system’s concurrency handling capabilities, adapting it to larger-scale art education application scenarios.

Lastly, this work lacks statistical significance testing, which may affect the robustness and reliability of the experimental results. Various performance metrics (such as classification accuracy, style loss, and computational resource consumption) are used to compare and analyze the optimized system. However, statistical methods like t-tests, ANOVA, or confidence intervals are not employed to verify the significance of the experimental results. This could lead to lower reproducibility of the results and difficulties in accurately assessing statistical differences between models. Future research could introduce more rigorous statistical analysis methods to perform significance tests on the results of multiple experimental groups, ensuring that the improvements of the optimized system are statistically reliable. Additionally, a larger test set could be used to minimize the influence of random factors on experimental conclusions, and improve the model’s stability across different datasets.
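The kind of significance testing suggested above could look like the sketch below, which runs a paired t-test and a one-way ANOVA over repeated-run accuracies and reports a 95% confidence interval for the mean difference; all numbers are hypothetical placeholders, not results from this work.

```python
# Sketch of significance testing across repeated runs: paired t-test, one-way ANOVA,
# and a 95% confidence interval. All accuracy values are hypothetical placeholders.
import numpy as np
from scipy import stats

cic_runs = np.array([91.2, 91.6, 91.4, 91.5, 91.3])       # hypothetical repeated-run accuracies
baseline_runs = np.array([89.4, 89.7, 89.5, 89.6, 89.3])

t_stat, p_paired = stats.ttest_rel(cic_runs, baseline_runs)   # paired t-test
f_stat, p_anova = stats.f_oneway(cic_runs, baseline_runs)     # one-way ANOVA

diff = cic_runs - baseline_runs
ci = stats.t.interval(0.95, len(diff) - 1,
                      loc=diff.mean(), scale=stats.sem(diff))  # 95% CI for the mean difference
print(p_paired, p_anova, ci)
```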

Overall, while the optimized system demonstrates high performance in art education applications, there are still issues such as dataset bias, insufficient adaptability to new art styles, and computational resource limitations. Future research can expand the dataset’s coverage, enhance the model’s adaptability, optimize computational efficiency and deployment strategies, and make the system more intelligent and efficient, capable of meeting more complex art creation demands. Additionally, exploring human-computer interactive art creation models and continuously optimizing the system based on user feedback will make the system more flexible in supporting art education and creative applications.
