
International Journal of Interdisciplinary Research

Enhanced content-based fashion recommendation system through deep ensemble classifier with transfer learning


With the rise of online shopping during the COVID-19 pandemic, Recommender Systems have become increasingly important for providing personalized product recommendations. They face the challenge of efficiently extracting relevant items from vast amounts of data. Numerous deep learning methods have been developed to classify fashion images. However, those methods rely on a single model, whose predictions may not be reliable. We propose a deep ensemble classifier that takes the probabilities obtained from five pre-trained models: MobileNet, DenseNet, Xception, and two varieties of VGG. These probabilities are passed as inputs to the deep ensemble classifier, which predicts the class of the given item. Several similarity measures were studied in this work, and the cosine similarity metric is used to recommend products for the class predicted by the deep ensemble classifier. The proposed method is trained and validated on benchmark datasets, namely the Fashion product images dataset and the Shoe dataset, and demonstrates superior accuracy compared to existing models. The results highlight the potential of leveraging transfer learning and deep ensemble techniques to enhance fashion recommendation systems. The proposed model achieves 96% accuracy, outperforming the existing models.


Several sectors and technology, including Recommender Systems (RS), have been profoundly affected by the COVID-19 pandemic. RS are comprised of algorithms that offer users tailored recommendations according to their preferences, actions, and other pertinent information. RS are vital in facilitating user exploration and discovery of similar items. In the realm of the fashion business, fashion recommendation has emerged as a prominent research area, focusing on discerning individual fashion preferences. To address this, recent advancements in visual search have empowered users to search for items using images captured by the camera or retrieved from the gallery (Dagan et al., 2023). The utilization of photographs is an integral component of online purchasing. Previous studies predominantly relied on human evaluations to establish metrics for analyzing the impact of images on consumer behavior, which imposes limitations on the range of variables and research samples that can be investigated (Wang et al., 2021). Visual perception plays a fundamental role in human comprehension of the world, encompassing aesthetic elements that influence consumer behavior, managerial decisions, employee actions, and investor choices. Visual information can be derived from various sources, including the physical environment, videos, and images. Product images are significant in online shopping as they provide buyers with essential visual information. However, internet shoppers face the challenge of making judgments without the ability to touch and feel the actual products physically.

Product images remain the primary method of presenting products and influencing purchases, even though methods like virtual changing rooms are being developed (Chaudhary et al., 2019). Multimedia artifacts, including photographs, movies, audio/speech, etc., have multiplied dramatically due to the web and multimedia technology’s phenomenal expansion and development over the past two decades. Image retrieval has developed into a problematic issue in the multimedia sector due to the large volume of data. As a result, search algorithms are becoming more complex to return photos most pertinent to user queries. Results from multimedia search engines still fall short of what users want. Digital images and other visual materials are now more widely available, particularly on the Web, the most significant image database. Yet, how simple it is to look for and manage multimedia content will determine its value. So, the need for effective picture indexing, storage, and retrieval is growing, especially on the Web (Tekli, 2022). RS has been used in a variety of fields, including music (Deldjoo et al., 2020; Hansen et al., 2020; He et al., 2018; Wundervald, 2021), movie recommendation (Davidson et al., 2010; Du et al., 2018; Ma et al., 2022), recommending fashion products (Heinrich et al., 2021; Sun et al., 2022; Sysko-Romańczuk et al., 2022; Zeng et al., 2019), novels, and medicine. The main objective of these systems is to help consumers rapidly and effectively in other areas. The RS acts like filters; they filter the information to assist consumers in discovering better goods, financial strategies, and additional related information by customizing the suggestions. Several online customer service providers, including social networking and e-commerce websites, use a recommender system as a vital feature to boost their audience and revenue.

RS are fundamental components in various domains and commonly rely on three primary approaches: content-based, collaborative filtering, and hybrid models. The content-based approach leverages item properties to make recommendations. In contrast, collaborative filtering focuses on establishing connections between items and users. A hybrid model combines content-based and collaborative filtering techniques (Suvarna & Padmaja, 2019). Notably, a content-based approach does not depend on user ratings. Instead, it extracts similarities by analyzing the features of new products. In this approach, a single user search query can be sufficient to generate item recommendations.

The collaborative filtering approach finds similar users and recommends the items they selected. For example, if user A selects items X, Y, and Z and user B selects items W, Y, and Z, then W can be recommended to A and X to B. This technique first finds similar users and then suggests items based on what the most similar user likes. Different similarity measures can be used, such as Jaccard, Cosine, and Centered Cosine or Pearson correlation (Gomez-Uribem & Hunt, 2015; Xu et al., 2021). The weakness of this approach is its dependence on user interactions, which gives rise to the cold-start and data sparsity problems (Zhang et al., 2019). The cold-start problem arises when little is known about a new user or a new item; in such cases, recommending items or finding similar users is difficult (Wang et al., 2018). Sparsity arises when the number of ratings obtained is far smaller than the number needed (Elahi et al., 2016), which also makes it hard to predict a user's rating for a particular product (Gupta & Gadge, 2015; Koohi & Kiani, 2016). A hybrid technique combines the two strategies mentioned above, and most RS use a hybrid approach (Sivaramakrishnan et al., 2021).
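The user A / user B example can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `jaccard` and `recommend` are hypothetical, Jaccard is picked from the measures listed above, and only the single most similar user is consulted.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of selected items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, selections):
    """Suggest items picked by the most similar other user but not by target."""
    others = {u: items for u, items in selections.items() if u != target}
    # Find the user whose selections overlap most with the target's.
    nearest = max(others, key=lambda u: jaccard(selections[target], others[u]))
    return sorted(others[nearest] - selections[target])


# Selections taken from the example in the text.
selections = {
    "A": {"X", "Y", "Z"},
    "B": {"W", "Y", "Z"},
}

print(recommend("A", selections))  # ['W']
print(recommend("B", selections))  # ['X']
```

With only two users the nearest neighbour is trivial. The cold-start problem shows up as soon as a new user has an empty selection set, because every Jaccard score is then 0 and the choice of neighbour carries no information.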

Considerable research has been done on RS. Initially, machine learning models were utilized in these systems (Balaji et al., 2021; Portugal et al., 2018). Nowadays, the majority of image classification techniques use supervised machine learning, in which models are trained to predict the query image class based on a single class level. Traditional computational models are one type of computer vision technology in which intricate, manually designed algorithms are used to extract important features from particular regions of an image. Yamamoto and Nakazawa (2019) used a Support Vector Machine and concatenated the features obtained from multiple CNNs to experiment with FashionStyle14. Sejal et al. (2016a; 2016b) proposed a Markov-related model to suggest images based on the search history. They performed clustering to group the images and applied cosine similarity to recommend similar users. Sejal et al. (2017) proposed an ANOVA cosine similarity for image recommendation based on the search. Their assumption is that the user is online and the system must give recommendations based on the search query. They conducted experiments on the Myntra dataset. Jayalakshmi et al. (2022) used machine learning algorithms such as clustering and Principal Component Analysis (PCA) for movie classification and recommendation.

However, as the field progressed, Convolutional Neural Networks (CNNs) and other deep learning models have emerged as powerful tools for improving recommendation accuracy (Tahmasebi et al., 2021). These models gain the capacity to automatically extract spatial characteristics from a large number of images by optimizing the parameters through a combination of convolutional and pooling layers. The last layer of the network predicts the class label as per the obtained features. Until an acceptable accuracy is reached, supervised learning algorithms optimize model parameters over several epochs using labeled data to learn spatial features. Sheikh Fathollahi and Razzazi (2021) proposed two CNN models: one for extracting features and another for classifying the music genre. They used two distance measures, Euclidean and Cosine similarity, to recommend the music. There they have not used any collaborative filtering. Indira et al. (2022) proposed a model that uses the CNN model for feature extraction. The features are passed to the residual network to get the recommendations. Nocentini et al. (2022) worked on different CNN models to see the performance of three datasets. Ullah et al. (2019) use a deep learning approach, a group of 5 convolution layers for feature extraction. The extracted features are passed to the Random Forest classifier to get the class label. They performed the task in two phases one was with a direct Random Forest Classifier to predict the model, and later used a deep learning model to improve the accuracy.

However, all the above-discussed models are based on a single model. Secondly, these models often recommended items without accurately identifying the specific product. In this research, we focused on developing a content-based recommendation system that eliminates the need for users to rate specific products. Instead, we aimed to classify new items based on the decisions obtained from multiple models and retrieve similar images within the same product category. To achieve this, we employed various pre-trained models leveraging Transfer Learning techniques. Furthermore, we introduced a novel deep ensemble classifier designed to classify fashion images. The retrieval of similar products was accomplished using cosine similarity.

The main contributions of this study are outlined as follows:

  • A new approach that combines multiple pre-trained models to create an ensemble classifier. This ensemble classifier improves the accuracy and robustness of the classification process.

  • A range of pre-trained models is employed with unique features and characteristics. By leveraging Transfer Learning, we extracted knowledge from these models and assigned appropriate weights to enhance the overall classification performance.

  • To assess the effectiveness of our approach, Fashion product images and Shoe datasets are used for the experimentation.

  • By testing our deep ensemble classifier on these datasets, we obtained empirical evidence of its performance and demonstrated its potential for practical applications in the fashion domain.

This research introduces a novel content-based recommendation system for fashion products, featuring a deep ensemble classifier and leveraging Transfer Learning techniques. Our approach demonstrates promising results through experiments on benchmark datasets, indicating its potential for real-world fashion recommendation scenarios.

The rest of the paper is organized as follows. Related work on recommendation systems is presented in Section “Literature review”. Section “Methods” presents our approach to enhancing the recommendation system. Section “Results” reports the experimental results and findings from evaluating our recommendation system. The final section provides a concluding summary of our work and its contributions to the field of recommendation systems.

Literature review

CNN related works for recommendation

CNNs and other deep learning models have emerged as powerful tools for improving recommendation accuracy. Hiriyannaiah et al. (2022) proposed a convolutional autoencoder for classification and combined different similarity metrics using a boosting approach to recommend similar products. The Cosine, Manhattan, Euclidean, Pearson Correlation, and Tanimoto coefficient similarity metrics are combined through boosting. They implemented the proposed model on four different datasets. Gharaei et al. (2021) created a DNN model for the gender and item classification of a given image. They first classify the gender in a given image and then classify the item. Recommendations are provided based on the gender and item category, and the cosine similarity metric is used to recommend products for a given image. Jo et al. (2020) proposed and developed a deep learning model for searching fashion products on the Amazon product dataset. Tuinhof et al. (2019) designed and developed a neural network for product classification and applied the model to a fashion product dataset. Elleuch et al. (2021) created a deep CNN model for image classification and conducted experiments on clothing datasets.

Suvarna and Balakrishna (2022a) have proposed and designed a novel deep learning-based ensemble classifier for recommending fashion products. Results from testing the algorithm on a dataset of fashion products are promising. This allows for product recommendations with an accuracy of 88.32%.

Suvarna and Balakrishna (2022b) have designed and implemented a deep CNN model that is both effective and efficient in categorizing the product in query. The results of the evaluation of the proposed model utilizing a dataset consisting of fashion products turned out to be satisfactory. Because of this, it is possible to give recommendations for the items that are both accurate and reliable, with an accuracy percentage of 89.09%.

Transfer learning

Transfer learning is the process of applying knowledge obtained for one task to a related task. Asiroglu et al. (2019) created two models, one based on Inception for prediction and the other for recommendation. Jang et al. (2019) extracted gender and clothing features from a pre-trained ResNet50 backbone network and extracted another set of features from a pre-trained VGG16 that had its final three fully connected layers removed. Zhang et al. (2023) created a shallow neural network that takes as inputs the outputs of four different pre-trained models: AlexNet, InceptionV3, ResNet50, and VGG16. Wakita et al. (2016) created a Deep Neural Network (DNN) for a fashion brand recommendation system. Choudhary et al. (2023) use a deep learning-based recommendation system built on backpropagation neural networks with a variety of nodes and numerous hidden layers to enable quick learning. Their paper includes a small number of representative deep learning architectures with varying numbers of hidden layers to enhance the model's learning capacity. Ay et al. (2019) and Seo and Shin (2019) created a Hierarchical Convolutional Neural Network (HCNN) for apparel classification. In that work, hierarchical categorization of clothing is performed with a CNN and a knowledge-embedded classifier that outputs hierarchical information. Additionally, condition-CNN learns the correlation between various class levels as conditional probabilities, which are then utilized to estimate class predictions in the scoring process. Condition-CNN requires fewer trainable parameters than the baseline CNN models but achieves a higher prediction accuracy by feeding the estimated higher-level class predictions as priors to the lower-level class prediction (Kolisnik et al., 2021). With this model, the article-type classification accuracy is 91% on the Fashion product images dataset.

However, our analysis of the existing literature revealed several limitations that need to be addressed. Firstly, the models discussed above relied predominantly on machine learning and deep learning approaches trained on limited datasets. Secondly, these models often recommended items without accurately identifying the specific product. Lastly, the results obtained from these studies were based on the performance of a single model, neglecting the potential benefits of ensemble methods. To overcome these limitations, we propose a novel two-stage content-based recommendation system that leverages deep learning techniques. This system aims to enhance the accuracy and effectiveness of fashion image classification and recommendation, addressing the above-mentioned issues. With this approach, the system gives recommendations that correspond only to the given query image. In this research, we comprehensively investigated the application of Transfer Learning with various pre-trained models. We focused on developing an advanced deep ensemble classifier designed explicitly for fashion image classification. The retrieval of similar products was examined using various similarity measures, and cosine similarity was used for the final retrieval.


Methods

This section discusses the problem definition, the proposed model, the candidate models, the Sparse DNN model, and the deep ensemble classifier. Products can be recommended in different ways. One is collaborative filtering, in which products liked by a user are recommended to similar users; the other is content-based filtering. Research on content-based filtering follows two directions: one suggests the k most similar items from a cluster, and the other recommends items from the product class (Deldjoo et al., 2018; Schedl et al., 2018; Wundervald, 2021). We adopted the approach of recommending k items from the product class, which works in two phases: one for classification and the other for retrieving similar images.

Problem definition

In preparation for addressing the research problem, we conducted experiments utilizing a Fashion product images dataset. Our approach involved extracting features from the data using various pre-trained models. To decrease the dimensionality of the features that were extracted, we employed PCA. Additionally, we employed several similarity measures to facilitate the retrieval of similar images for the recommendation. We used DenseNet201 as the feature extractor, and PCA was subsequently applied (Fig. 1).

Fig. 1 Content-based recommended system

Figure 2(a) is used as the input. The similar images retrieved for the test product using the cosine, Manhattan, and Euclidean similarity metrics are presented in Fig. 2(b), (c) and (d), respectively. From Fig. 2(b–d), it is observed that images unrelated to the test product are retrieved. To avoid these irrelevant results, we propose to first classify the given image. Given the user's input image and the image database, the main objective is to recommend similar items of the same product. The flow of the work is explained in Fig. 3.

Fig. 2 Retrieving the similar images for test product (a) using Cosine similarity (b), Manhattan (c) and Euclidean similarities (d)

Fig. 3 Proposed model for classification and recommendation process

This study proposes a deep ensemble method for categorizing products that learns from the predictions of candidate CNN models to categorize products more accurately. For this purpose, the transfer learning models DenseNet, Xception, MobileNet, VGG16, and VGG19 are fine-tuned on the fashion images. The probabilities obtained from the different models are passed as inputs to the deep ensemble classifier, which produces the final prediction; items from the predicted class are then recommended using the cosine similarity measure. The architecture of the proposed model is explained in Fig. 4.
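The second stage, retrieval restricted to the predicted class, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the catalog layout, the three-dimensional feature vectors, and the function name `recommend_same_class` are assumptions, and the predicted class is passed in by hand rather than produced by the ensemble.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_same_class(query_vec, predicted_class, catalog, k=2):
    """Rank only the catalog items whose label matches the predicted class."""
    scored = [
        (item_id, cosine_similarity(query_vec, vec))
        for item_id, label, vec in catalog
        if label == predicted_class
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _ in scored[:k]]


# Hypothetical catalog: (item id, class label, feature vector).
catalog = [
    ("shirt_1", "Shirts", [1.0, 0.1, 0.0]),
    ("shirt_2", "Shirts", [0.8, 0.3, 0.1]),
    ("shirt_3", "Shirts", [0.0, 1.0, 0.0]),
    ("jeans_1", "Jeans", [1.0, 0.0, 0.0]),
]

query = [1.0, 0.0, 0.0]
print(recommend_same_class(query, "Shirts", catalog, k=2))  # ['shirt_1', 'shirt_2']
```

Note that `jeans_1` has the highest raw cosine similarity to the query but is never returned, which is the effect the classification stage is meant to achieve.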

Fig. 4 The architecture of the Novel Deep Ensemble classifier

Pre-processing module

After loading the data, divide it into training and testing datasets. Load each image at the target size of (224, 224), convert it to an array, and pre-process it. Then reshape the images into the form (number of images, 224, 224, 3). Extract the features using each pre-trained model and store them separately. The pre-trained models form the base of the candidate models; this module's main goal is to take dataset images and derive semantic spatial representations from them. This study uses DenseNet, Xception, MobileNet, VGG16, and VGG19. Pass the features obtained from a pre-trained model to train the Sparse DNN, and repeat the process for all the pre-trained models. Save the fine-tuned models and use them to obtain the class probabilities. Concatenate the prediction probabilities obtained from each model and pass them to the deep ensemble classifier to get the final probabilities, which are used to assess model performance. Use the ensemble model to predict the class label of the given test product. In short, we first classify the given product using the deep ensemble classifier and then extract the top k similar images for recommendation.

Sparse DNN architecture

This study used the transfer learning models VGG16, VGG19, DenseNet, Xception, and MobileNet. In addition, the Sparse DNN architecture is used to obtain the probabilities for each model, as shown in Fig. 5. Images from the dataset first pass through a pre-processing module that normalizes the features and reshapes them as required by the pre-trained models. Semantic feature maps are then extracted by passing the images through a frozen convolutional base with ImageNet weights. Rather than being provided directly to the proposed classification head, the high-dimensional spatial feature maps are compressed using a Global Average Pooling (GAP) layer.

Fig. 5 Sparse DNN Architecture

Here we use three dense layers with 512, 256, and 128 neurons, respectively, each followed by a dropout layer. We use a dropout rate of 0.2 to keep the model from overfitting. The final layer is a dense layer with softmax activation. It has 143 neurons for the large dataset, which has 143 classes, 12 neurons for the Apparel dataset with 12 classes, and 6 neurons for the Shoe dataset. The probabilities generated by the softmax layer are fed to the proposed ensemble classifier.

Given a 2-D image I as input, the mathematical expression involved in applying the convolution operation using 2D-kernel K is in Eq. (1)

$$S\left( {m,n} \right) = \left( {I{*}K} \right)\left( {m,n} \right) = \mathop \sum \limits_{i} \mathop \sum \limits_{j} I\left( {i,j} \right)K\left( {m - i,n - j} \right)$$
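Eq. (1) can be checked on a small example. The sketch below is a plain-Python "full" convolution written directly from the formula; the function name `conv2d_full` and the 2 × 2 inputs are illustrative assumptions. Deep learning libraries usually implement the unflipped variant (cross-correlation) under the name convolution, so their outputs differ from this one unless the kernel is symmetric.

```python
def conv2d_full(image, kernel):
    """Full 2-D convolution: S(m, n) = sum_i sum_j I(i, j) * K(m - i, n - j)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * (iw + kw - 1) for _ in range(ih + kh - 1)]
    for i in range(ih):
        for j in range(iw):
            for p in range(kh):
                for q in range(kw):
                    # With m = i + p and n = j + q, kernel[p][q] is K(m - i, n - j).
                    out[i + p][j + q] += image[i][j] * kernel[p][q]
    return out


image = [[1, 2],
         [3, 4]]
kernel = [[0, 1],
          [2, 3]]

for row in conv2d_full(image, kernel):
    print(row)
# [0.0, 1.0, 2.0]
# [2.0, 10.0, 10.0]
# [6.0, 17.0, 12.0]
```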

ReLU is used as the activation function between the layers, as given in Eq. (2)

$$S\left( {x_{i} } \right) = \max \left\{ {0,x_{i} } \right\}$$

Here the function returns x when x exceeds zero; otherwise, it returns 0.

Batch Normalization is applied before the final dense softmax layer. It normalizes the inputs to that layer, as given in Eq. (3)

$$\hat{x}_{i} = \frac{x_{i} - \mu_{B}}{\sqrt{\sigma_{B}^{2} + \varepsilon}}$$

where \(\mu_{B}\) is the batch mean and \(\sigma_{B}^{2}\) is the batch variance.
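A small numeric sketch of Eq. (3) follows. It covers only the normalization step on a single feature across one batch; the learnable scale and shift that batch normalization layers usually apply afterwards are left out, and the function name and sample values are assumptions.

```python
import math


def batch_norm(values, eps=1e-5):
    """Normalize one feature over a batch: (x - batch mean) / sqrt(batch variance + eps)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)
print(normalized)

# After normalization the batch has (approximately) zero mean and unit variance.
new_mean = sum(normalized) / len(normalized)
new_var = sum((v - new_mean) ** 2 for v in normalized) / len(normalized)
print(new_mean, new_var)
```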

The weights are optimized using the update rule in Eq. (4), in which the learning rate \(\eta\) is divided by the square root of the gradient history.

$$W_{t+1} = W_{t} - \frac{\eta}{\sqrt{\hat{v}_{t} + \varepsilon}}\hat{m}_{t}$$

where the bias-corrected moment \(\hat{m}_{t}\) and the bias-corrected history \(\hat{v}_{t}\) are given in Eqs. (5) and (6)

$$\hat{m}_{t} = \frac{{m_{t} }}{{1 - \beta_{1}^{t} }}$$
$$\hat{v}_{t} = \frac{{v_{t} }}{{1 - \beta_{2}^{t} }}$$
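Eqs. (4) to (6) have the form of the Adam optimizer, and one update step can be sketched for a single scalar weight. The recurrences for \(m_t\) and \(v_t\) are not written out in the text, so the ones below are taken from the standard Adam formulation and should be read as an assumption, as should the default hyperparameters. Following Eq. (4), \(\varepsilon\) sits under the square root here; the original Adam paper adds it outside the root instead.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update for a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # Eq. (5): bias-corrected moment
    v_hat = v / (1 - beta2 ** t)                # Eq. (6): bias-corrected history
    w = w - lr * m_hat / math.sqrt(v_hat + eps)  # Eq. (4)
    return w, m, v


# Toy objective f(w) = w^2 with gradient 2w, starting from w = 1.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 6):
    w, m, v = adam_step(w, 2 * w, m, v, t)
    print(t, w)
```

In the first step the bias correction gives \(\hat{m}_1 = 2\) and \(\hat{v}_1 = 4\), so the weight moves by almost exactly the learning rate, regardless of the gradient's scale.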

Categorical cross-entropy is used as the loss function because the data has multiple classes. With \(y_{i}\) the true label and \(\hat{y}_{i}\) the predicted probability for class i, the categorical cross-entropy is given in Eq. (7)

$$L = - \sum\limits_{i = 1}^{n} y_{i} \log \hat{y}_{i}$$

The probabilities of a multinoulli distribution are commonly predicted using the softmax function shown in Eq. (8)

$${\text{softmax}}\left( S_{i} \right) = \frac{\exp \left( S_{i} \right)}{\sum\nolimits_{j = 1}^{n} \exp \left( S_{j} \right)}$$
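Eqs. (7) and (8) can be tried together on a toy example. This is a sketch with assumed function names and made-up scores; the loss is written with the conventional leading minus sign so that it is non-negative, and the maximum score is subtracted before exponentiating, a common trick for numerical stability that leaves the result unchanged.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Negative log-likelihood of a one-hot label under predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


scores = [2.0, 1.0, 0.1]          # made-up outputs of the last dense layer
probs = softmax(scores)
label = [1, 0, 0]                 # one-hot label: the first class is correct

print(probs)
print(categorical_cross_entropy(label, probs))
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1, and for a uniform two-class prediction it equals ln 2.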

The proposed model is explained step by step in Algorithm 1.

Algorithm 1
figure a

Deep Ensemble classifier for the prediction task

Deep ensemble classifier

Ensemble methods are mainly categorized into sequential ensemble techniques and parallel ensemble techniques. Ensemble classifiers improve performance by learning from multiple models rather than one, merging them into a single model that is more accurate because it draws on what the individual models have learned. Stacking, bagging, and boosting are the commonly used ensemble approaches. They are suitable for classification and regression tasks because they increase accuracy and reduce bias and variance. The drawback of conventional ensembles is that they ignore the confidence attached to each label and take into account only the final predicted label. We therefore use a deep ensemble classifier that takes as inputs the confidences produced by the multiple candidate models.

In this work, predicted probabilities from five different candidate models are obtained and passed to the proposed ensemble classifier, which has five input layers. These five input layers are followed by a fusion layer that combines the best features from the inputs. Three fully connected layers follow, with 800, 500, and 150 neurons, respectively. Each is followed by a dropout of 0.2, and all the components of the network are regularized to avoid overfitting. The final dense layer has k neurons, one per class, with softmax activation. The model uses Adam to optimize the parameters and sparse categorical cross-entropy as the loss function. We used early stopping to prevent the model from overfitting.
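How the five probability vectors reach the ensemble can be illustrated with a short sketch. It assumes that the fusion step is a plain concatenation, as the pre-processing description suggests; the function name `fuse_probabilities`, the three-class setting, and the numbers are made up for illustration, and the dense layers that follow are not modelled.

```python
def fuse_probabilities(per_model_probs, tol=1e-6):
    """Concatenate one softmax vector per candidate model into a single input vector."""
    n_classes = len(per_model_probs[0])
    fused = []
    for probs in per_model_probs:
        if len(probs) != n_classes:
            raise ValueError("all models must score the same set of classes")
        if abs(sum(probs) - 1.0) > tol:
            raise ValueError("each vector should be a softmax output summing to 1")
        fused.extend(probs)
    return fused


# Made-up softmax outputs for one image: five candidate models, three classes.
candidate_outputs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.8, 0.1, 0.1],
    [0.5, 0.4, 0.1],
]

fused = fuse_probabilities(candidate_outputs)
print(len(fused))  # 15
```

Under this assumption the fused vector has 5 × k entries, that is 30 for the 6-class Shoe dataset, 60 for the 12-class Apparel dataset, and 715 for the 143-class large dataset. Unlike a vote over predicted labels, the vector keeps each model's confidence in every class.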


Results

This section describes the datasets used to evaluate the model, the experimental environment in which the evaluations were conducted, the evaluation metrics used to assess the model's performance, and a comprehensive analysis of the obtained results.


Three datasets are used to observe the performance of the model. The first is the Fashion product images (Large) dataset, the second is the Fashion product images (Apparel) dataset, which contains only 12 classes, and the third is the Shoe dataset taken from Kaggle (Aggarwal, 2019; Yogesh, 2021). The datasets are summarized in Table 1. The Fashion product images (Large) dataset contains 44 k images in 143 classes. The Apparel dataset, extracted from the large dataset, has 12 classes. The Shoe dataset, downloaded from Kaggle, has 6 classes, each containing 249 images. The details of the Fashion product images (Apparel) dataset and the Shoe dataset are provided in Table 2.

Table 1 Details of dataset used in experimentation
Table 2 Description of Fashion product images (Apparel) dataset and shoe dataset

Experimental environment

Experimental studies use the Windows operating system, specifically Windows 10, version 21H2. The hardware setup includes an Intel(R) Xeon(R) Silver 4114 CPU with a clock speed of 2.20 GHz and an NVIDIA Quadro RTX 5000 graphics card. The experiments primarily utilize the CPU's processing power and the GPU's computational capabilities. To facilitate the execution of the experiments, scripts are developed and written in the Python programming language, which allows for efficient implementation and control of various experimental procedures and data analysis. This combination of hardware and software components provides a robust and versatile platform for conducting experiments in a controlled and efficient manner.

Evaluation metrics

In our work, the main objective is to perform classification and similarity measurement tasks using a specific model. To assess the effectiveness and accuracy of this model, we evaluate its performance using classification metrics. These metrics provide a quantitative analysis of how well the model can classify different instances or data points into predefined categories or classes. By evaluating the model's performance using classification metrics, we can gain insights into its ability to identify and assign instances to the correct categories accurately. This evaluation allows us to measure important aspects such as precision, recall, accuracy, and F1 score, which provide a comprehensive understanding of the model's classification capabilities.

In addition to classification, our work also focuses on similarity measurement, which quantifies the likeness or resemblance between different instances or data points. By incorporating similarity measurement into our evaluation, we can determine how well the model captures and represents the similarities between instances, which is crucial in domains such as information retrieval, recommendation systems, and clustering.

Classification metrics

The metrics employed for the analysis of the proposed method, in addition to classification accuracy, are precision, recall, and F1-score. Accuracy, precision, recall, and F1 score are defined in Eqs. (9)–(12).

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$
$${\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}$$
$${\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
$${\text{F1 Score}} = \frac{{2 \times {\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}}$$
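The four metrics can be computed directly from the confusion counts of one class. The sketch below uses made-up counts and an assumed function name; it returns 0 for a ratio whose denominator is zero, which is one common convention rather than something specified in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 score from confusion counts for one class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for one class in a 100-image test set.
metrics = classification_metrics(tp=8, tn=85, fp=2, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.9300
# precision: 0.8000
# recall: 0.6154
# f1: 0.6957
```

The example also shows why accuracy alone can mislead for a small class: accuracy is 93% even though 5 of the 13 items in the class are missed.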

Similarity measures

Typically, a recommendation system requires a similarity matrix to recommend products to the user. There are many ways to measure the similarity between two products, including Euclidean distance, Manhattan distance, Minkowski distance, and Hamming distance. Many works have studied similarity metrics, and most authors have worked with Cosine similarity and Euclidean distance. Scikit-learn, a Python library, provides a cosine similarity function that yields a matrix of similarity scores between items. The scores are sorted, and the items or products with the highest similarity scores are recommended. Cosine similarity, Manhattan distance, and Euclidean distance are used in this experimentation, and their equations for two points A and B are shown in Eqs. (13)–(15), respectively.

$$\cos \theta = \frac{A \cdot B}{\left\| A \right\| \left\| B \right\|}$$
$$D_{m} = \mathop \sum \limits_{i = 1}^{n} \left| {A_{i} - B_{i} } \right|$$
$$D_{e} = \sqrt {\mathop \sum \limits_{i = 1}^{n} \left( {A_{i} - B_{i} } \right)^{2} }$$
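The three measures in Eqs. (13)–(15) behave differently on the same pair of vectors, which a short plain-Python sketch makes visible. The function names and vectors are illustrative assumptions; in practice a library routine such as scikit-learn's cosine similarity function would be applied to whole feature matrices.

```python
import math


def cosine_similarity(a, b):
    """Eq. (13): dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def manhattan_distance(a, b):
    """Eq. (14): sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Eq. (15): square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the magnitude

print(cosine_similarity(a, b))
print(manhattan_distance(a, b))
print(euclidean_distance(a, b))
```

Here the cosine similarity is 1 (up to rounding) because the vectors point in the same direction, while the Manhattan and Euclidean distances are 6 and √14, since both are sensitive to magnitude. When ranking, the highest cosine scores are the best matches, whereas for the two distances the smallest values are.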

Result analysis on Fashion product images (Apparel) dataset

This section discusses the results obtained with the various pre-trained models for the Apparel dataset, which contains 12 classes, and for the large dataset, which contains 143 classes. The size of the initial dataset, including the attributes, is 44,000 × 11. The 12 Apparel categories are listed in Table 3 under the class column. Using the article type, the image data for these 12 classes was extracted and combined. A new data frame of size 14,795 × 2 was created, holding the name and type of each image file.

Table 3 The portion of the data in the Fashion product images (Apparel) dataset

For our initial trials, we considered the Fashion product images dataset. In this work, predicted probabilities from 5 different candidate models are obtained and passed to the proposed ensemble classifier with five input layers. These five input layers are followed by a fusion layer that combines the best features from the inputs. The proposed ensemble model receives the initial predictions from the regularised classification heads as its inputs. After splitting the data, the number of samples used for training and testing is given in Table 3. These details are needed to interpret the performance analysis of the different models. The performance analysis of the different candidate models on the fashion product dataset is described in Tables 4 and 5.

Table 4 Performance analysis of VGG19, Xception, and MobileNet approach on the Fashion product images (Apparel) dataset
Table 5 Performance analysis of VGG16 and DenseNet on Fashion product images (Apparel) dataset

The performance of VGG19, Xception, and MobileNet is given in Table 4, and that of VGG16 and DenseNet in Table 5. The precision, recall, and F1-Score of the various models are illustrated in Figs. 6, 7, and 8, respectively. The results show that the items in class 0 are not classified correctly because the data contain few images of that class: there are only 15 images, of which three are in the test data. MobileNet, VGG16, and VGG19 each misclassify at least one item in that class. Class 1 is predicted perfectly (100%) by MobileNet and DenseNet, whereas VGG19 and Xception give identical values and VGG16 scores low. Class 2 is fully classified by DenseNet, followed by VGG16 and VGG19. Class 3 has the highest accuracy with VGG19, followed by VGG16 and DenseNet, while Xception and MobileNet give lower precision. Class 4 (Track pants) is classified well by DenseNet, and class 5 (Dresses) is fully classified by VGG16. DenseNet has the highest accuracy on class 6 (Trousers), which has 105 images to classify. Shorts are classified with almost 99% accuracy by VGG16, followed by Xception. MobileNet classifies the Jeans category with 95% accuracy. Tops are classified with 89% accuracy by VGG16; the lowest is Xception with 75%. Shirts have the highest classification accuracy at 98%, and VGG16 and DenseNet predict them with 97% accuracy. T-shirts are classified with 97% accuracy by VGG16 and MobileNet.
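
The class-wise precision, recall, and F1-Score discussed above can be computed directly from the true and predicted labels. The sketch below uses made-up labels in which class 0 has only three test samples, mirroring the small support described for class 0; it shows how a single misclassified item changes recall by a third when the class is that small. The function name and the data are assumptions for illustration only.

```python
def per_class_metrics(y_true, y_pred, n_classes):
    """Return precision, recall, and F1 for each class index."""
    metrics = {}
    for c in range(n_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Hypothetical labels: class 0 has 3 test samples, class 1 has 5.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 0]

metrics = per_class_metrics(y_true, y_pred, n_classes=2)
for cls, m in metrics.items():
    print(cls, m)
```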

Fig. 6 Class-wise performance comparison of various pre-trained models on the Fashion product images (Apparel) dataset in Precision

Fig. 7 Class-wise performance comparison of various pre-trained models on the Fashion product images (Apparel) dataset in Recall

Fig. 8 Class-wise performance comparison of various pre-trained models on the Fashion product images (Apparel) dataset in F1-Score

Result analysis on Fashion product images (Large) dataset and Shoe dataset

The proposed model is also applied to the Fashion product images (Large) dataset, which contains 143 classes. Because class-wise analysis is impractical with so many classes, Table 6 reports the overall performance of each model together with the deep ensemble classifier. Table 6 also includes the outcomes of several alternative models for comparison, so it summarizes the results of applying the different candidate models to both the Fashion product images (Large) dataset and the Shoe dataset.

Table 6 Performance comparison of various models on the Fashion product images (Large) and Shoe dataset

Figure 9 illustrates the overall performance of the various models on the fashion and shoe datasets. On the Fashion product images (Large) dataset, MobileNet, DenseNet, and VGG16 performed well compared with VGG19 and Xception. VGG19 had the lowest classification accuracy (87.15%), and MobileNet had the highest (88.9%). On the Shoe dataset, MobileNet again achieves the highest classification accuracy, followed by DenseNet and VGG16. VGG19 and Xception perform similarly on both datasets.

Fig. 9 Overall performance comparison of various pre-trained models on Fashion product images (Large) dataset and Shoe dataset

We extend our experiments to the Shoe dataset to assess the efficiency of the suggested ensemble strategy on small datasets. The total number of samples and the number of samples in each class are given in Table 7. The results of training the baseline candidate models on the Shoe dataset are shown in Tables 8 and 9.

Table 7 Shoe dataset and portion of samples in test data
Table 8 Performance outcomes of MobileNet, Xception, and VGG16 models on the Shoe dataset
Table 9 Performance outcomes of DenseNet, and VGG19 models on the Shoe dataset

Table 8 shows the performance of MobileNet, Xception, and VGG16, and Table 9 that of DenseNet and VGG19. In addition, Figs. 10, 11, and 12 present the precision, recall, and F1-Score of the various models on the Shoe dataset, respectively. Class 0 (Flip_Flops) is well classified by VGG16 and DenseNet, but the other models do not perform well on this class. VGG19 and DenseNet perform well on class 1 (Soccer Shoes). MobileNet and VGG19 give the best performance for class 2 (Boots). Class 3 (Sneakers) is predicted correctly by VGG19 and MobileNet. Class 4 (Loafers) is predicted accurately by MobileNet but wrongly by VGG19. Class 5 (Sandals) is predicted correctly by DenseNet. Overall, MobileNet and DenseNet perform well on the Shoe dataset. MobileNet had the maximum accuracy (81.1%), whereas all other models achieved accuracy below 80%. Compared with VGG19, the Xception, DenseNet, and VGG16 models produce higher accuracy. According to these findings, the MobileNet architecture is better suited to a small dataset, with lower misclassification rates for product recommendation.

Fig. 10 Class-wise performance comparison of various pre-trained models on the Shoe dataset in Precision

Fig. 11 Class-wise performance comparison of various pre-trained models on the Shoe dataset in Recall

Fig. 12 Class-wise performance comparison of various pre-trained models on the Shoe dataset in F1-Score

Results obtained with deep ensemble classifier

The results obtained with the deep ensemble model are described in Table 10. For the Fashion product images (Apparel) dataset, Shirts, T-shirts, Shorts, Jeans, and Track Pants are classified accurately. Trousers, Dresses, Jackets, and Skirts are classified reasonably well, but Waistcoat, Stockings, and Tops are not predicted well. For the Shoe dataset, sandals are correctly categorized, and the remaining classes are classified reasonably well.

Table 10 Performance outcome of deep ensemble classifier on Fashion Dataset and Shoe Dataset

We also verified which combination of models gives the best accuracy. First, we combined MobileNet and DenseNet, which gives an accuracy of 94.09%; the combination of MobileNet and VGG19 reaches the same accuracy. When DenseNet is combined with MobileNet and VGG19, accuracy improves, but combining four models brings no further improvement. Finally, the accuracy reaches around 96% when all the models are combined.
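
A search over model combinations of this kind can be sketched as follows. The example enumerates every subset of at least two models and scores each on a small validation set. Several things here are assumptions rather than the authors' procedure: the model names, probabilities, and labels are made up, and each subset is combined by simple probability averaging, whereas the paper combines the models with a trained ensemble classifier.

```python
from itertools import combinations

def averaged_prediction(prob_vectors):
    """Average the class probabilities and return the winning class index."""
    n_classes = len(prob_vectors[0])
    avg = [sum(v[c] for v in prob_vectors) / len(prob_vectors)
           for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])

def subset_accuracy(model_probs, y_true, subset):
    """Accuracy of the averaged prediction for one subset of models."""
    correct = 0
    for i, label in enumerate(y_true):
        pred = averaged_prediction([model_probs[m][i] for m in subset])
        correct += pred == label
    return correct / len(y_true)

def rank_subsets(model_probs, y_true, min_size=2):
    """Score every subset and sort by accuracy, preferring smaller subsets on ties."""
    names = sorted(model_probs)
    results = []
    for k in range(min_size, len(names) + 1):
        for subset in combinations(names, k):
            results.append((subset_accuracy(model_probs, y_true, subset), subset))
    return sorted(results, key=lambda r: (-r[0], len(r[1]), r[1]))

# Hypothetical per-sample probabilities (2 classes, 4 validation samples).
model_probs = {
    "model_a": [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4], [0.2, 0.8]],
    "model_b": [[0.6, 0.4], [0.9, 0.1], [0.2, 0.8], [0.4, 0.6]],
    "model_c": [[0.8, 0.2], [0.3, 0.7], [0.3, 0.7], [0.7, 0.3]],
}
y_true = [0, 1, 1, 1]

ranked = rank_subsets(model_probs, y_true)
for accuracy, subset in ranked:
    print(f"{accuracy:.2f}", subset)
```

In this toy data the best pair outperforms the full set of three, which illustrates why adding a model does not always raise accuracy.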

Table 11 compares the outcome of the proposed model with other existing works on the fashion dataset. The accuracy of the proposed model is 36% higher than that of the model of Gharaei et al. (2021), 30% higher than that of Nocentini et al. (2022), 10% higher than that of Suvarna and Balakrishna (2022a), 7% higher than that of Suvarna and Balakrishna (2022b), and 5% higher than that of Kolisnik et al. (2021). We tested the proposed model with three different datasets: Fashion product images (Apparel), Fashion product images (Large), and the Shoe dataset. The model continues to perform well even with a large amount of data. A comparison of the proposed model with other works is also shown in Fig. 13. The query images a, b, c, and d are shown in Fig. 14. The products recommended for each test image using the different similarity measures are presented in Figs. 15–26. After classifying the given product, we retrieved similar images by applying the cosine similarity measure to the features already obtained during the pre-processing phase. Here we used the features from the pre-trained model with the highest accuracy.

Table 11 Comparison of Fashion product images dataset with existing works
Fig. 13 Comparison of the proposed model performance with recently published deep learning models

Fig. 14 Fashion product images used for query images (a) Top, (b) Waistcoat, (c) Bag, (d) Shoe

Fig. 15 Retrieved results for fashion image-14(a) using Cosine similarity

From the above results, one can observe that the retrieved results match the original product. The results retrieved with the Manhattan and Euclidean distances match less than 80% of the time, whereas those retrieved with cosine similarity match above 96%. We also observed that the retrieved results are based mainly on the color and pattern of the query image in Fig. 14a, as can be seen in Figs. 15, 16, and 17. In Fig. 14b the query image is a top with a check pattern, and the recommendation system returns tops with checks and also matches the gesture of the person. Images of the same color are presented first, and the system then looks for the pattern and gesture of the image based on the similarity measure. The same behavior can be found in Figs. 18a–d, 19a, d, and 20b, c. In Fig. 14c the query image is a green bag. The results produced by the different similarity measures are similar, as can be observed in Figs. 21a, b, 22d, and 23c, d. In Fig. 14d the query image is a white shoe with yellow and black stripes. The system recommends similar shoes in Figs. 24d, 25a, c, and 26b, d. We also grouped the obtained results based on the pattern and color of the query images.

Fig. 16 Retrieved results for fashion image-14(a) using Euclidean distance

Fig. 17 Retrieved results for fashion image-14(a) using Manhattan distance

Fig. 18 Obtained results for fashion image-14(b) using cosine similarity

Fig. 19 Obtained results for fashion image-14(b) using Euclidean distance

Fig. 20 Obtained results for fashion image-14(b) using Manhattan distance

Fig. 21 Obtained results for fashion image-14(c) using cosine similarity

Fig. 22 Retrieved results for fashion image-14(c) using Euclidean distance

Fig. 23 Obtained results for fashion image-14(c) using Manhattan distance

Fig. 24 Obtained results for fashion image-14(d) using cosine similarity

Fig. 25 Obtained results for fashion image-14(d) using Euclidean distance

Fig. 26 Obtained results for fashion image-14(d) using Manhattan distance

User Evaluation: We selected 100 users for manual testing. Each user was asked to select, from the database, 10 images of interest for a given image. Separately, results were retrieved for the same image using the different similarity measures. The items selected by the users and the results retrieved by the proposed model match in 90% of cases, which indicates that the products recommended by the proposed model agree with user choices about 90% of the time.
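
One way to quantify such agreement is the fraction of user-selected items that also appear in the retrieved list, averaged over users. The paper does not state the exact formula, so the definition below, the function names, and the item identifiers are all assumptions made for illustration.

```python
def match_rate(user_selected, retrieved):
    """Fraction of the user's selected items that also appear in the retrieved list."""
    selected = set(user_selected)
    if not selected:
        return 0.0
    return len(selected & set(retrieved)) / len(selected)

def mean_match_rate(sessions):
    """Average match rate over (user_selected, retrieved) pairs, one per user."""
    rates = [match_rate(selected, retrieved) for selected, retrieved in sessions]
    return sum(rates) / len(rates)

# Hypothetical item IDs: each user picks 10 items for the same query image.
sessions = [
    (list(range(1, 11)), [1, 2, 3, 4, 5, 6, 7, 8, 9, 42]),   # 9 of 10 match
    (list(range(1, 11)), [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]),   # 10 of 10 match
]

overall = mean_match_rate(sessions)
print(f"{overall:.2%}")
```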


In the context of the COVID-19 pandemic, during which the online retail industry has grown substantially, our research contributes to the advancement of RS for fashion. By accurately categorizing fashion images and providing personalized recommendations, our approach can cater to the evolving needs and preferences of consumers. In this study, we introduce an approach for predicting the category of fashion images with a deep ensemble classifier. The proposed model leverages multiple candidate models, namely DenseNet, Xception, MobileNet, and the two VGG variants VGG16 and VGG19, which are fine-tuned on the fashion images. Using transfer learning, the deep ensemble model is trained to classify fashion products and subsequently retrieve similar items from a comprehensive database, thus enabling the development of a fashion recommendation system. To evaluate the proposed model, three datasets are used: Fashion Products Images (Apparel), Fashion Products Images (Large), and a Shoe dataset. Through comparative analysis, we demonstrate that the proposed method significantly improves predictive accuracy compared with existing approaches. The deep ensemble classifier captures the complex patterns and features of fashion images, allowing for more accurate and reliable categorization.

Our findings highlight the potential of deep ensemble models in fashion image classification and recommendation systems. Integrating multiple candidate models enhances the overall predictive power, enabling more robust and accurate classification results. Furthermore, the successful retrieval of similar fashion items from the database demonstrates our proposed approach's practical applicability and potential utility in real-world scenarios. The results of this study contribute to the growing body of research in computer vision and fashion recommendation systems. The demonstrated improvement in performance underscores the effectiveness of our deep ensemble classifier in tackling the challenges of fashion image classification and recommendation. Future research directions may include exploring additional candidate models and evaluating the proposed approach on more extensive and diverse datasets to validate its effectiveness and generalizability.

Availability of data and materials

Data in this research paper will be shared upon request with the corresponding author.



I am very thankful to VFSTR Deemed to be University, for providing the research infrastructure for my research work.


Not applicable.

Author information

Authors and Affiliations



Conceptualization, BS; methodology, BS and SB; validation, BS and SB; formal analysis, BS and SB; writing (original draft preparation), BS; writing (review and editing), BS and SB; supervision, SB and BS. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Buradagunta Suvarna.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Suvarna, B., Balakrishna, S. Enhanced content-based fashion recommendation system through deep ensemble classifier with transfer learning. Fash Text 11, 24 (2024).
