
Optimized Multi-Hierarchical Feature Fusion with Multi-Kernel CNN and Spectral-Spatial Convolutions for Remote Sensing Image Classification

Introduction

Remote sensing image classification is a fundamental step in understanding and interpreting Earth observation data, underpinning numerous applications from land-use analysis to environmental monitoring [1]. Over the years, researchers have developed a variety of approaches to classify high-resolution aerial and satellite imagery. Early methods relied heavily on hand-crafted features and traditional classifiers, for example, texture, colour, or shape descriptors fed into support vector machines or decision trees [2]. While these conventional strategies achieved reasonable results, they often struggled with the complex patterns and high dimensionality inherent in remote sensing data. Manually designed features generally fail to capture multi-scale structures and the rich spectral information of such images, limiting classification accuracy, especially as image resolution and spectral diversity increase [3]. This limitation prompted a shift toward more automated, learning-based techniques.

In recent years, the advent of deep learning has dramatically advanced remote sensing image classification, mirroring its success in general computer vision [4]. Convolutional Neural Networks (CNNs) have demonstrated an exceptional ability to learn hierarchical feature representations directly from raw pixel data, eliminating the need for explicit feature engineering [5]. By stacking multiple convolutional layers, CNN-based models automatically extract low-level edges and textures in early layers and progressively build up to high-level semantic features in deeper layers. This hierarchical learning enables a more robust characterisation of complex scenes, leading to significantly improved classification performance on remote sensing datasets [6]. Numerous CNN architectures and training strategies have been explored for scene classification and object recognition in remote sensing imagery, consistently outperforming traditional methods [4].
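As a side note on how stacked layers progressively widen context: for stride-1 convolutions, each additional k×k layer grows the effective receptive field by k−1. The small calculation below illustrates this generic property of stacked convolutions; it is not taken from the paper itself.

```python
def effective_receptive_field(kernel_sizes):
    """Effective receptive field of a stack of stride-1 convolutions.

    Each k x k layer extends the receptive field by k - 1 pixels.
    """
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Two stacked 3x3 layers see the same 5x5 window as a single 5x5 filter,
# which is why deeper layers capture progressively broader structures.
print(effective_receptive_field([3, 3]))     # 5
print(effective_receptive_field([3, 3, 3]))  # 7
```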
Despite these gains, several challenges remain that modern deep networks must address to fully realise their potential in this domain. One major challenge is the diversity and complexity of remote sensing scenes [7]. Remote sensing images often exhibit high inter-class similarity and notable intra-class variability due to factors like perspective, seasonal changes, and varying spatial resolutions. For instance, different land cover types (e.g., forests, shrublands, and agriculture) can appear remarkably similar in certain spectral bands, while the same class (such as “urban”) can encompass a wide variety of structures and textures [8]. Standard CNNs may have difficulty distinguishing such subtle differences or generalising across scale variations. Multi-scale feature representation has thus become an important research focus. Recent works have shown that fusing features at multiple scales or hierarchical levels can provide more discriminative information for classification [9]. By integrating fine-grained local details with broader contextual cues, multi-scale feature fusion techniques help the model recognise both small objects and global scene layouts more effectively. However, naive implementations of multi-hierarchical fusion can introduce redundancy or excessive computational cost, indicating the need for an optimised fusion strategy. Another challenge arises from the rich spectral information available in modern remote sensing data, especially with hyperspectral or multispectral imagery. Such data consists of dozens or even hundreds of spectral bands, capturing subtle material features that are invaluable for classification [10]. Traditional two-dimensional CNNs process multi-band images by treating spectral bands as additional input channels, which can inadvertently mix spectral information without fully exploiting the relationships between bands.
To address this, researchers have developed spectral–spatial convolution techniques that explicitly account for both the spectral dimension and the spatial context [11]. For example, specialized framework modules may apply separate convolutions along the spectral dimension (to extract band-wise features) and along the spatial dimensions (to capture textural patterns), later combining them. These spectral–spatial approaches have proven effective at improving classification accuracy for hyperspectral images, as they preserve important spectral details while still leveraging spatial structure [12]. Yet, incorporating spectral–spatial operations into a deep framework can increase model complexity, and balancing this with efficiency is an ongoing concern. Given the above considerations, there is a clear need for a more efficient and accurate model that can jointly leverage multi-scale hierarchical features and spectral–spatial information for remote sensing image classification [13]. In this work, we propose an optimized multi-hierarchical feature fusion framework that integrates multi-kernel CNN modules with spectral–spatial depthwise convolutions to address the shortcomings of existing approaches. The multi-kernel design employs convolutional filters of varying sizes within each layer, enabling the framework to capture diverse-scale features and textures in parallel [14]. This design mimics the idea of combining multiple receptive fields, so both small details and broader patterns are learned concurrently. Meanwhile, the spectral–spatial convolutions decompose the feature extraction process into spectral and spatial components, ensuring that the model effectively utilizes the information across all bands without a bloated parameter count [11]. By fusing the outputs of these components across multiple hierarchical levels of the framework, the proposed model creates a rich, multi-level feature representation of the input image.
This comprehensive feature fusion strategy helps in distinguishing subtle class differences and enhances overall classification robustness.
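To make the multi-kernel idea concrete, the sketch below runs parallel convolutions with different kernel sizes over a single feature map and stacks the resulting maps. The function names and the fixed mean-filter weights are illustrative placeholders only; in the actual framework the filters are learned and the branches operate on multi-channel tensors.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same'-padded 2D cross-correlation for a single-channel map."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def multi_kernel_block(x, kernel_sizes=(1, 3, 5)):
    """Apply parallel branches with different receptive fields, then stack.

    Each branch uses a placeholder mean filter; a trained model would learn
    these weights. Stacking (or concatenating) the branch outputs lets later
    layers see fine detail and broader context side by side.
    """
    maps = []
    for ks in kernel_sizes:
        k = np.full((ks, ks), 1.0 / (ks * ks))
        maps.append(conv2d_same(x, k))
    return np.stack(maps, axis=0)  # shape: (num_branches, H, W)

x = np.random.rand(8, 8)
feats = multi_kernel_block(x)
print(feats.shape)  # (3, 8, 8)
```

Note that the 1×1 branch passes the input through unchanged here, while the 3×3 and 5×5 branches produce progressively smoother, wider-context views of the same map.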

While deep learning has undeniably advanced the state of remote sensing image classification, existing approaches still fall short in several areas. One critical issue is the incomplete exploitation of spectral-spatial relationships. Many models process spatial features through CNNs while treating spectral information as auxiliary or preprocessed input, missing the opportunity to jointly model the interactions between spatial structure and spectral variation. This often results in underperformance, particularly in scenes where class boundaries depend on subtle spectral differences. Another persistent challenge lies in the rigid structure of conventional CNNs, which rely on fixed kernel sizes. This limits their ability to adapt to diverse object scales and shapes commonly found in remote sensing imagery, ranging from small vegetation patches to large urban zones. As a result, features at certain scales may be overlooked, reducing classification accuracy in complex scenes. In addition, deep CNNs with large parameter counts can be computationally expensive, especially when processing high-resolution or multi-band images. This creates barriers for real-time or large-scale deployment, where efficiency is as important as accuracy. Compounding this issue is the lack of built-in mechanisms in many models to focus on the most relevant features or suppress noise, which is often present due to sensor limitations or environmental variation. Furthermore, the generalisation ability of existing models remains limited. Many deep learning models perform well on specific datasets or regions but fail to transfer effectively across different geographical areas or sensor types, reducing their practical value for broader Earth observation tasks.

To overcome these limitations, this study introduces an Optimised Multi-Hierarchical Feature Fusion framework. The originality and main contributions of this manuscript are as follows:

  • By integrating kernels of varying sizes within each convolutional layer, the model captures a broader range of spatial features, from fine textures to global structures, in a single pass.

  • The model decomposes feature extraction into two paths (spectral and spatial), allowing for efficient, independent learning of spectral features and spatial textures.

  • Features from different levels of the framework are selectively fused based on their contribution to classification performance. This optimised fusion mechanism ensures that both local details and global semantics are retained in the final representation.

  • By employing depthwise convolutions and avoiding redundant fusion paths, the proposed model significantly reduces parameter count and computational cost, enabling scalability to larger datasets and deployment on lower-power devices.
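The parameter savings from the depthwise design in the last contribution can be checked with simple arithmetic: a standard convolution needs C_in·C_out·k² weights, whereas a depthwise convolution followed by a 1×1 pointwise convolution needs only C_in·k² + C_in·C_out. The channel counts below are illustrative, not the paper's actual layer sizes.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise (one k x k filter per input channel) + 1x1 pointwise."""
    return c_in * k * k + c_in * c_out

c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)        # 64 * 128 * 9  = 73728
dws = depthwise_separable_params(c_in, c_out, k)  # 64 * 9 + 64 * 128 = 8768
print(std, dws, round(std / dws, 1))  # 73728 8768 8.4
```

For these example sizes the separable form uses roughly 8× fewer weights, which is the mechanism behind the reduced parameter count and computational cost claimed above.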

These innovations aim to deliver a more robust and efficient solution to the challenges of high-resolution, multi-spectral remote sensing image classification. The remainder of this paper is organised as follows. Section 3 details the materials and methodology employed in this manuscript, including the proposed model, the datasets, data preprocessing, and evaluation metrics. Section 4 presents the results and analysis, covering the implementation hyperparameters, the proposed model's results, a component analysis of the proposed model, a comparison with state-of-the-art models on the same datasets, a discussion of the results and limitations, and future work. Section 5 concludes the paper.


Related Works

The development of RSI classification has advanced rapidly with the integration of deep learning techniques, especially CNNs. These models are highly effective at learning hierarchical spatial features, but many still face challenges in fully capturing spectral information, adapting to varying feature scales, or maintaining computational efficiency. Earlier studies, such as Cheng et al. [15], utilised discriminative CNNs to improve class separability using metric learning, yet their approach

Materials and Methodology

The general flowchart of the experiment involved in this study involves several steps, as seen in Figure 1. First, the dataset is loaded, and then data pre-processing, such as resizing, augmentation, and normalisation, is carried out. The Data Loaders for the training set and the validation set are created before initialising the proposed model, including its loss function and optimiser. To encourage generalisation, scheduling and early stopping are employed. The training loop is executed by
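A minimal sketch of the early-stopping step in the loop above, assuming a patience-based criterion on validation loss; the class name, patience value, and tolerance are illustrative choices, not the paper's exact configuration.

```python
import math

class EarlyStopping:
    """Stop training when validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=5, min_delta=1e-4):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # minimum decrease that counts as improvement
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63]
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        print(f"early stop at epoch {epoch}")  # early stop at epoch 5
        break
```

The same criterion pairs naturally with a learning-rate scheduler: the scheduler reduces the rate when the loss plateaus, and early stopping ends training once reductions no longer help.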

Result and Analysis

This section presents the experimental results and analysis of this manuscript, including the backbone selection analysis and the proposed model component analysis. It concludes with a discussion of the results, identified limitations, and future work.

In the evaluation of pretrained models for remote sensing image classification using a Multi-Hierarchical Feature Fusion framework with Multi-Kernel CNN and Spectral-Spatial Convolutions, ResNet50 emerged as the top-performing

Conclusions

In this study, we introduced an Optimised Multi-Hierarchical Feature Fusion Framework for remote sensing image classification, integrating multi-kernel convolutions, spectral-spatial depthwise convolutions, and residual learning within a ResNet-50 backbone. This approach effectively addresses the challenges of capturing complex spatial patterns, multi-scale object variations, and rich spectral dependencies in high-resolution remote sensing imagery. Evaluated across six benchmark datasets, the

CRediT authorship contribution statement

Chiagoziem Ukwuoma: Writing – original draft, Software, Methodology, Data curation, Conceptualization. Dongsheng Cai: Supervision, Funding acquisition. Chidera Ukwuoma: Visualization, Validation, Project administration. Qi Huang: Supervision, Project administration, Funding acquisition. Oluwatoyosi Bamisile: Writing – review & editing, Formal analysis. Chibueze Ukwuoma: Visualization, Validation, Investigation. Chinedu Otuka: Software, Project administration. Nnadozie Anyanwu: Validation,

Informed Consent

All participants included in the study provided informed consent.

Declaration Of Interest Statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. All authors read and approved the final manuscript.

Data Availability

Ethical Approval

None of the authors conducted studies involving human participants or animals in this article.

Ethical Statement

We, the authors of the manuscript titled “Optimized Multi-Hierarchical Feature Fusion with Multi-Kernel CNN and Spectral-Spatial Convolutions for Remote Sensing Image Classification,” confirm that this work adheres to the highest ethical standards in research and publication. The study was conducted with integrity, transparency, and respect for scientific principles, as outlined below:

  1. Originality and Authorship: This manuscript represents original work conducted by the listed authors. All


Acknowledgement

The authors gratefully acknowledge the support of the National Natural Science Foundation of China (NSFC, Grant No. 52007025) and the Science and Technology Support Program of Sichuan Province (2022JDRC0025).

© 2025 Published by Elsevier B.V.
