A comparison of conventional and deep learning methods of image classification Porównanie metod klasycznego i głębokiego uczenia maszynowego w klasyfikacji obrazów

The aim of the research is to compare traditional and deep learning methods in image classification tasks. The conducted experiment covers the analysis of five neural network models: two models with a multi-layer perceptron architecture (an MLP with two hidden layers and an MLP with three hidden layers) and three models with a convolutional architecture (the three VGG blocks model, AlexNet and GoogLeNet). The models were applied to the task of image classification and tested on two different datasets: CIFAR-10 and MNIST. They were evaluated for classification performance, training speed, and the effect of the complexity of the dataset on the training outcome.


Introduction
Nowadays, image classification methods play an important role in a wide variety of areas of life. Image classification is the process of extracting classes of information from a multiband bitmap; in other words, it is the problem of receiving an input image and determining its class (cat, dog, etc.) or a group of probable classes that best characterize the image. This paper presents a comparison of conventional and deep learning methods of image classification.
The Multilayer Perceptron (MLP) is the most popular type of artificial neural network. It is a class of feedforward artificial neural networks. This type of network typically consists of one input layer, several hidden layers and one output layer. Each node in an MLP is a neuron with a nonlinear activation function (except the input nodes). Although this type of network ignores the spatial information of the image, a lightweight MLP with 2-3 layers can easily cope with simple datasets like MNIST [1]. MNIST is a voluminous database of handwritten digit samples [2]. In the paper [3], an MLP network with a single hidden layer was able to reach 43.4% accuracy. The multilayer-perceptron-based architecture was once commonly used for computer vision, and is now increasingly being replaced by the Convolutional Neural Network (CNN) [3,4] and other machine learning methods. For example, the paper [5] compares the MLP with other machine learning methods such as the decision tree, logistic regression and the support vector machine for solving image classification problems.
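The structure described above can be sketched as a forward pass in plain numpy: a flattened 28x28 = 784-dimensional MNIST-style input, two hidden layers with nonlinear activations, and a ten-class output. The layer sizes, activations and weight initialization below are illustrative assumptions, not the configuration used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    # subtract the row maximum for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Weights for a 784 -> 128 -> 64 -> 10 network (ten digit classes)
W1, b1 = rng.normal(0, 0.01, (784, 128)), np.zeros(128)
W2, b2 = rng.normal(0, 0.01, (128, 64)), np.zeros(64)
W3, b3 = rng.normal(0, 0.01, (64, 10)), np.zeros(10)

def mlp_forward(x):
    h1 = relu(x @ W1 + b1)        # first hidden layer
    h2 = relu(h1 @ W2 + b2)       # second hidden layer
    return softmax(h2 @ W3 + b3)  # class probabilities

probs = mlp_forward(rng.random((5, 784)))  # a batch of 5 flattened "images"
print(probs.shape)                          # (5, 10)
```

Note that the input is flattened before the first layer, which is exactly why the MLP discards the spatial arrangement of pixels.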
Artificial networks based on the CNN architecture are considered universal [6], because they are used for a wide range of tasks, from botany [7,8,9,10] and geography [11] to medical diagnostics [12,13,14,15]. CNN-based models take into account the spatial information of an image, which gives this type of architecture an advantage over MLP-like networks for image classification tasks. Another difference between the MLP and CNN architectures is that the layers in a CNN are not fully connected as they are in an MLP. Through the use of a special convolution operation, a convolutional neural network can simultaneously reduce the amount of information stored in memory, thanks to which it copes better with higher-resolution pictures, and highlight the reference features of the image, such as edges and contours. At the next level of processing, repeatable fragments of textures can be recognized from these edges and contours, which can then be combined into fragments of the image. There are many types of convolutional neural network architectures, and many modifications of them have been developed to make the trained model perform better [16,13]. The paper [17] proposes methods for automatically designing CNN architectures for image classification using a genetic algorithm. Not only architectures are being modified, but also ways of solving problems. For example, the paper [18] shows how image segmentation techniques have evolved.
The machine learning process itself consists of preparing appropriate data containing the necessary rules and a description of the object's properties, as well as selecting the optimal parameters for the model being trained. These factors increase the impact of the selected training dataset on training efficiency [19,10]. The dataset is usually divided into several parts: training data, which is used to train the model; validation data, which is used by machine learning engineers during the design phase to tune the hyperparameters of the model; and test data, which is used to evaluate the performance of the already trained model. Sometimes validation data is used as test data. The number of images in the dataset, their size, and the number of images in each of the categories by which they are classified all affect training efficiency. As mentioned above, if the model is to be trained successfully, it must be well parameterized and optimized. The paper [20] describes the optimization problems faced by a machine learning specialist. According to the paper [21], choosing the correct activation function also plays a critical role in model training. The wrong selection of parameters can lead to overfitting or underfitting [22].
Underfitting is a situation in which no function in the chosen parametric family of functions describes the data well. The most common reason for underfitting is that the complexity of the data structure is higher than the complexity of the model the researcher came up with. The solution to this problem is to make the model more complex and find a better description of the effects present in the data.
Overfitting is the opposite of underfitting and occurs when the model is too complex and universal: the error of the trained algorithm on the objects of the test sample turns out to be significantly higher than the average error on the training sample. There are techniques to avoid overfitting the model. For example, increasing the size of the training sample can help; if collecting more data is not possible, various transformations (rotation, reflection, scaling, etc.) can be performed on the already existing set of images. Techniques such as cross-validation and L1/L2 regularization can also help to avoid the problem of overfitting. One of the most effective techniques to prevent overfitting is to add dropout layers to the neural network architecture. With dropout layers, the model ignores a subset of the network units with a given probability, which reduces interdependent learning among units that could lead to overfitting. However, with dropout layers the model will take more epochs to converge.
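The dropout mechanism described above can be illustrated with a minimal numpy sketch of "inverted" dropout: during training, each unit is zeroed with probability p, and the surviving activations are scaled by 1/(1-p) so that the expected layer output is unchanged; at inference time dropout is a no-op. The shapes and probability below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, p, training=True):
    if not training or p == 0.0:
        return activations  # at inference time dropout does nothing
    mask = rng.random(activations.shape) >= p   # keep a unit with probability 1-p
    return activations * mask / (1.0 - p)       # rescale survivors by 1/(1-p)

h = np.ones((4, 8))           # a toy layer of activations, all equal to 1.0
out = dropout(h, p=0.5)
print(out)  # roughly half the units are zeroed, the rest scaled to 2.0
```

Because the rescaling happens during training, the same network can be used unchanged at inference time.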
To predict how the trained model will behave in practice, the performance of the model is evaluated. Different performance metrics are used to evaluate different algorithms. Metrics such as the Confusion Matrix, Accuracy, Precision, Recall, Specificity and F1 Score are commonly used for classification tasks. All of the above metrics use the numbers of true positive, true negative, false positive and false negative predictions. A true positive is when the model correctly predicts a positive class, and a true negative is when the model correctly predicts a negative class. False positives and false negatives, respectively, are cases where the model incorrectly predicts a positive or negative class. Correctly selected metrics are the key to an accurate assessment of model performance.
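As a small illustration of these four outcomes, the counts and accuracy can be computed directly for a binary (positive/negative) problem; the labels below are invented for the example.

```python
# Invented ground-truth labels and model predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(y_true)   # correct predictions over all predictions
print(tp, tn, fp, fn)   # 3 3 1 1
print(accuracy)         # 0.75
```

The four counts are exactly the cells of the 2x2 confusion matrix mentioned above.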
To carry out this research work, a machine learning framework or library is needed. To solve the image classification problems in this work, the open-source TensorFlow library from Google was chosen. TensorFlow offers many out-of-the-box solutions that make training models faster and easier. The layers API of the TensorFlow library provides a simpler interface to the layers commonly used in deep learning models. An example of classification performance and qualitative analysis using the TensorFlow library can be seen in the paper [23]. A systematic overview of using TensorFlow for image classification can be found in the paper [24]. This work conducts an experiment based on the classification performance and qualitative analysis of conventional and deep learning methods of image classification. The thesis of this study is: "CNN obtains better performance in the task of image classification than MLP". The detailed research hypotheses are: 1. CNN-based architectures give better accuracy than MLP; 2. models with the MLP architecture give lower classification accuracy than CNN-based models when classifying color images; 3. CNN-type networks train faster than MLP.

Research implementation
The research covered two tests. The first checks whether CNN-type architectures give higher classification accuracy than the multilayer perceptron, and also whether the choice of a black-and-white dataset affects the classification accuracy when a multilayer perceptron is used. The second test examines and compares the training speed of the neural network architectures studied in this article.
Two datasets were chosen for training and evaluating the models: the MNIST database (60000 training and 10000 test images) of black-and-white samples of handwritten digits from 0 to 9 (ten classes), each 28x28 pixels, and the CIFAR-10 dataset [25], which consists of 32x32 color images in 10 classes, with 50000 training images and 10000 test images.
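A common preprocessing step for both datasets is to scale pixel values to [0, 1] and one-hot encode the ten class labels; the sketch below shows this on synthetic arrays shaped like MNIST and CIFAR-10 batches (the real images are not loaded here, and the batch size is an arbitrary assumption).

```python
import numpy as np

rng = np.random.default_rng(0)
mnist_batch = rng.integers(0, 256, size=(32, 28, 28))     # grayscale, 28x28
cifar_batch = rng.integers(0, 256, size=(32, 32, 32, 3))  # color, 32x32, 3 channels
labels = rng.integers(0, 10, size=32)                     # ten classes

# Scale 0..255 pixel intensities into the [0, 1] range
x_mnist = mnist_batch.astype("float32") / 255.0
x_cifar = cifar_batch.astype("float32") / 255.0

# One-hot encoding: row i of the identity matrix selected by labels[i]
y_onehot = np.eye(10)[labels]

print(x_mnist.shape, x_cifar.shape)  # (32, 28, 28) (32, 32, 32, 3)
print(y_onehot.shape)                # (32, 10)
```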
The performance of the models is evaluated on the basis of the Accuracy and F1 score metrics, as well as the value of the loss function. Accuracy is a way to measure how often the algorithm classifies data correctly; it is the number of correctly predicted data points out of all the data points. The F1 score is a metric for determining how accurate a test is. It is calculated from the test's precision and recall, with precision equaling the number of true positive results divided by the total number of positive results, including those that were incorrectly identified, and recall equaling the number of true positive results divided by the total number of samples that should have been identified as positive. The F1 score is the harmonic mean of precision and recall.
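The definitions above reduce to three one-line formulas; here is a worked example with invented outcome counts.

```python
# Invented counts of prediction outcomes for the example
tp, fp, fn = 6, 2, 3

precision = tp / (tp + fp)   # true positives over all positive predictions: 6/8
recall = tp / (tp + fn)      # true positives over all actual positives: 6/9

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.75 0.667 0.706
```

The harmonic mean pulls F1 toward the smaller of the two values, so a model cannot score well by maximizing only one of precision or recall.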
For the experiment, two models of the MLP type and three convolutional models were chosen: an MLP with two hidden layers, an MLP with three hidden layers, the three VGG blocks model, AlexNet and GoogLeNet.
The MLP is a feedforward neural network with an input layer of source neurons, at least one hidden layer of computational neurons (two and three hidden layers in these cases), and an output layer of computational neurons. The input layer receives signals from the environment and redistributes them to all neurons in the hidden layer.
The basic idea behind the VGG architectures is to use more layers with smaller filters. There are VGG-16 and VGG-19 versions with 16 and 19 layers respectively. In this experiment, a model with three VGG blocks is implemented.
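The "more layers with smaller filters" idea can be quantified with a short calculation: a stack of n 3x3 convolutions has the same receptive field as one larger filter, but fewer parameters. The channel count below is an illustrative assumption, not the configuration used in the experiment.

```python
def receptive_field(n_layers, k=3):
    # each additional kxk convolution grows the receptive field by k - 1
    return 1 + n_layers * (k - 1)

def conv_params(k, channels):
    # weight count of a kxk convolution mapping C channels to C channels
    return k * k * channels * channels

C = 64  # assumed channel width for the comparison
print(receptive_field(3))        # 7: three stacked 3x3 layers see a 7x7 region
print(3 * conv_params(3, C))     # 110592 weights for the three-layer stack
print(conv_params(7, C))         # 200704 weights for a single 7x7 layer
```

So the stacked small filters cover the same region with roughly half the weights, while also inserting extra nonlinearities between the layers.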
The AlexNet architecture consists of five convolutional layers, between which pooling layers and normalization layers are located; three fully connected layers complete the neural network.
GoogLeNet is a deep architecture with 22 layers. The goal was to develop a neural network with the highest computational efficiency. To do this, Google came up with the so-called Inception module; the entire architecture consists of many such modules, following one after another. The main idea behind the Inception module is that it is itself a small local network. Its work consists in applying several filters to the input in parallel. The filter outputs are combined to create an output that goes to the next layer.
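The parallel-branches-then-concatenate structure of the Inception module can be sketched schematically in numpy. Here each branch is modeled as a 1x1 convolution (a linear map applied at every spatial position); the branch widths are illustrative assumptions standing in for the 1x1, 3x3 and 5x5 branches of the real module.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random((8, 8, 32))  # a height x width x channels feature map

def conv1x1(inp, out_channels):
    # a 1x1 convolution is a per-pixel matrix multiplication over channels
    w = rng.normal(0, 0.1, (inp.shape[-1], out_channels))
    return inp @ w

branch_a = conv1x1(x, 16)   # the 1x1 branch
branch_b = conv1x1(x, 24)   # stands in for the 3x3 branch
branch_c = conv1x1(x, 8)    # stands in for the 5x5 branch

# The branch outputs are stacked along the channel axis, as in Inception
out = np.concatenate([branch_a, branch_b, branch_c], axis=-1)
print(out.shape)  # (8, 8, 48): channels from all branches combined
```

The key property is that every branch sees the same input and the module's output width is the sum of the branch widths.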

Test 1 implementation
Each model was trained in a loop 20 times to maximum accuracy on the MNIST and CIFAR-10 datasets, and the training results (Accuracy, Loss, F1 score) were recorded in a CSV file for further analysis. The accuracy and loss charts and the confusion matrix chart were saved to disk. SGD optimizers and data generators were used for all models. At the beginning of each loop step, random seeds were set, the training dataset was randomly split into training (80%) and validation (20%) subsets, and a new instance of the SGD optimizer, the data generator and the model were created and compiled. After training, the model was saved to disk, then its object was deleted and the session was cleared.
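The data handling of one loop step can be sketched as follows: fix a seed, shuffle, and split the training set into 80% training / 20% validation subsets. Synthetic arrays stand in for the real images and labels, and the seed value is an arbitrary assumption.

```python
import numpy as np

seed = 7                               # assumed seed for the example
rng = np.random.default_rng(seed)

x = np.arange(1000).reshape(1000, 1)   # 1000 stand-in "images"
y = np.arange(1000) % 10               # stand-in labels for 10 classes

idx = rng.permutation(len(x))          # random shuffle of sample indices
cut = int(0.8 * len(x))                # the 80/20 boundary
train_idx, val_idx = idx[:cut], idx[cut:]

x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]

print(len(x_train), len(x_val))        # 800 200
```

Setting the seed at the start of each loop step makes every repetition of the split reproducible while still varying between steps if the seed is changed.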

Test 2 implementation
All models were trained with the same settings as in the first test, 10 times for up to 20 epochs on the MNIST dataset. The accuracy and loss of the first, fifth, tenth, fifteenth and twentieth epochs, as well as the accuracy, loss and F1 score for the test dataset, were recorded in the CSV file at each loop step.

Results of first test
During the analysis of the results, the mean values of the accuracy, the loss function and the F1 score were calculated for each model. The mean loss for each model, without division by dataset, is presented in Figure 1. As can be seen from the results, the MLP-type architectures presented in this paper do a good job with the simple black-and-white MNIST dataset, but their classification accuracy and F1 score for the color CIFAR-10 dataset are much lower than those of the CNN-type architectures. However, the loss function value for both MLP architectures is lower than the value for GoogLeNet, but higher than for the other two CNN-type architectures presented in this paper (three VGG blocks and AlexNet). The three VGG blocks architecture showed the best results in terms of accuracy, loss function and F1 score. The AlexNet architecture shows the third-best results in terms of accuracy and F1 score and the second-best result in terms of the loss function.

Results of second test
During the analysis of the results, the mean values of the final accuracy, the loss function and the F1 score were calculated, as well as the mean values of the loss function, validation loss function, accuracy and validation accuracy for the first, fifth, tenth, fifteenth and twentieth epochs. Figure 7 shows the change in classification accuracy for each model during training. The results show that AlexNet initially trains the slowest of all and, in terms of classification accuracy, catches up with the MLP-type models only closer to the fifteenth epoch. GoogLeNet has worse results in the first epoch than the MLP-type models, but quickly overtakes them after the fifth epoch. The three VGG blocks model has the best training speed and retains it throughout every epoch.

Conclusions
The aim of the study was to compare the performance of convolutional and traditional neural network architectures. The research for this thesis required familiarization with machine learning and deep learning issues. Convolutional networks are the most widely used neural networks for image classification tasks, and it was concluded during this study that CNN-type networks are the best choice for this purpose due to their higher classification accuracy and lower loss function values compared with MLP-type architectures. In terms of training speed, the three VGG blocks network showed the best results, while AlexNet showed the worst. These results are due to the filter size influencing the training speed. As mentioned earlier, the VGG models use small filters, which is why the three VGG blocks model achieved the best results in the training speed test. For the same reason, it cannot be argued that convolutional neural networks train faster than MLP-type networks. It was also confirmed that MLP networks give lower classification accuracy than CNN-based models when classifying color images, but give higher accuracy for simple black-and-white images than for color images. The obtained results partially confirm the main thesis stated at the beginning of this work.