THE INFLUENCE OF THE PRINCIPAL COMPONENT ANALYSIS OF TEXTURE FEATURES ON THE CLASSIFICATION QUALITY OF SPONGE TISSUE IMAGES

The aim of this article was to determine the effect of principal component analysis on the results of classification of spongy tissue images. Four hundred computed tomography images of the spine (L1 vertebra) were used for the analyses. The images came from fifty healthy patients and fifty patients diagnosed with osteoporosis. The obtained tissue image samples, 50x50 pixels in size, were subjected to texture analysis. As a result, feature descriptors based on a grey level histogram, gradient matrix, run length (RL) matrix, co-occurrence matrix, autoregressive model and wavelet transform were obtained. The features were ranked from the most important to the least important, and the first fifty features from the ranking were used for further experiments. These data were subjected to principal component analysis, which resulted in a set of six new features. Subsequently, both sets (50 features and 6 components) were classified using five different methods: Naive Bayes classifier, Multilayer Perceptron, Hoeffding Tree, 1-Nearest Neighbour and Random Forest. The best results were obtained for the data subjected to principal component analysis and classified using 1-Nearest Neighbour. This procedure yielded high values of the TPR and PPV parameters, both equal to 97.5%. For the other classifiers, the use of principal component analysis worsened the results by an average of 2%.


Introduction
The development of information technologies makes it possible to apply them in more and more areas, and has prompted many studies on their use in medicine. Medical imaging is a dynamically developing area. Thanks to various methods of computer image analysis, it is possible to reduce diagnostic errors resulting from the limitations of the human eye and to uncover mathematical relationships in the images of the examined tissues.
One of the methods of image analysis increasingly used in research is texture analysis [1]. Texture represents image properties such as directionality (pattern direction) and porosity. On this basis, it is possible to distinguish images of tissues with lesions, as well as to delineate image regions that meet specific conditions [2].
Texture analysis yields a set of up to 290 features of a given image, each taking a specific numerical value [4,12]. Some of these features are correlated with each other or take similar values for images of tissue with lesions and of healthy tissue [13]. The most valuable features in the whole set are those that take different ranges of numerical values for the two groups. Identifying these features allows them to be used effectively in the classification process [5].
Because of the volume of data obtained during texture analysis, attempts are made to reduce it before building the classifier. The aim is to retain the maximum amount of information in the smallest possible data set. This makes it possible to limit classification errors that result from taking irrelevant features into account [6].
The two main techniques for reducing a set of features are feature selection and feature extraction [14]. The former consists in choosing the most important features from the entire set, which may become the basis for the construction of the classifier.
Depending on the selection method used, a certain number of the most important features in the set is retained [7]. Feature extraction, in turn, creates a new feature space of lower dimension than the source space [10]. Principal component analysis (PCA) is one of the most frequently used methods of reducing the number of dimensions [9].
This article presents the results of applying principal component analysis to the extraction of texture features from computed tomography images of the spongy tissue of the lumbar spine. The components obtained were used to build five different classifiers, and the values obtained for each classification quality indicator were analysed. These results were compared with the classification results for the set of 50 tissue image features selected in the feature importance ranking.

Material
The research material was obtained from the results of computed tomography of the lumbosacral (L-S) section of the spine of 100 patients. Fifty of them belonged to a group without a diagnosis of osteoporosis or osteopenia. The other fifty belonged to a group diagnosed with osteoporosis.
From the series of images showing the interior of the L1 vertebra with the spongy tissue (Fig. 1), 4 cross-sections were selected for each patient. The images selected for further examination were saved in the BMP format. One image sample of the examined tissue was obtained from each of the selected cross-sections.
The size of the extracted samples was chosen to make maximum use of the texture area carrying potential information in the image of the vertebra cross-section (Fig. 2). As a result, four hundred samples with dimensions of 50x50 pixels were obtained. Sample images of tissue from healthy and sick patients are presented below (Fig. 3).
IAPGOŚ 3/2020 p-ISSN 2083-0157, e-ISSN 2391-6761

Method
The tissue samples obtained from the images were subjected to texture analysis. As a result, 290 features described by specific numerical values were obtained. These features were ranked from the most important to the least important. The 50 features with the highest positions in the ranking were used for further research and subjected to principal component analysis.

Texture analysis
Image analysis was carried out with the MaZda program (version 4.6) [12]. This program makes it possible to analyse grey-scale images and determine the numerical values of image features. The set of features was obtained on the basis of:
• histogram (9 features: histogram mean, histogram variance, histogram skewness, histogram kurtosis, percentiles 1%, 10%, 50%, 90% and 99%),
• gradient (5 features: absolute gradient mean, absolute gradient variance, absolute gradient skewness, absolute gradient kurtosis, percentage of pixels with nonzero gradient),
• run length matrix (5 features x 4 directions: run length nonuniformity, grey level nonuniformity, long run emphasis, short run emphasis, fraction of image in runs),
• co-occurrence matrix (

Distribution of feature significance
The 290 features obtained as a result of the texture analysis were used to create a ranking of importance. The ranking is aimed at selecting the features that best describe the differences between the studied groups and at rejecting correlated features. The distribution of feature values is visualised in Fig. 4. Panel A shows the first features in the ranking, for which there is a clear difference between the groups in the distribution of feature values. Panel B shows the last features in the ranking.
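The article does not name the criterion used to build the ranking, so the following is only a minimal sketch of how such a ranking could be produced. It assumes a Fisher-type score (squared difference of group means divided by the sum of group variances) and does not implement the rejection of correlated features. The function names `fisher_score` and `rank_features` and the toy data are hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(healthy, sick):
    """Assumed criterion: squared gap of group means over summed group variances."""
    gap = (mean(healthy) - mean(sick)) ** 2
    spread = pvariance(healthy) + pvariance(sick)
    # A feature that is constant within both groups gets an infinite score
    # if the group means differ, and zero otherwise.
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def rank_features(samples, labels, top_k):
    """Return indices of the top_k features, most discriminative first.

    samples: list of feature vectors (one list of floats per image sample)
    labels:  list of 0 (healthy) / 1 (sick), same length as samples
    """
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        healthy = [row[j] for row, y in zip(samples, labels) if y == 0]
        sick = [row[j] for row, y in zip(samples, labels) if y == 1]
        scores.append((fisher_score(healthy, sick), j))
    # Sort by descending score; ties keep the lower feature index first.
    scores.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scores[:top_k]]


# Toy data (made up): feature 0 separates the groups, feature 1 does not.
toy_samples = [[1.0, 5.0], [1.2, 4.0], [3.0, 5.1], [3.2, 3.9]]
toy_labels = [0, 0, 1, 1]
ranking = rank_features(toy_samples, toy_labels, top_k=2)
```

With 290 texture features per sample, calling `rank_features(samples, labels, top_k=50)` would return a 50-feature subset analogous to the one used in the following steps, under the assumed criterion.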

Principal component analysis
The principal component analysis (PCA) algorithm is based on matrix calculation [3]. The goal is to find a matrix of principal components Y that represents the input matrix X in the new space [9].
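The article gives no formulas, so the relation between X and Y can be written out in the standard textbook form below. Apart from X and Y, all symbols (n, p, k, C, W, Λ) are notation introduced here for illustration.

```latex
% X: n x p data matrix (n samples, p features), columns centred to zero mean
C = \frac{1}{n-1} X^{\top} X, \qquad
C = W \Lambda W^{\top}, \qquad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p), \quad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0

% Keep the first k eigenvectors (the first k columns of W) and project:
Y = X W_k, \qquad W_k \in \mathbb{R}^{p \times k}
```

In this study p = 50 and k = 6. The variance of the j-th column of Y equals λ_j, so successive components carry progressively less of the variability in the data.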
Principal component analysis serves, among other purposes, to reduce the number of variables or to identify patterns between variables. The method consists in determining components which are linear combinations of the examined variables. The goal is to find new variables, the smallest possible subset of which will contain as much information as possible about the entire variability in the data set. The new set of variables forms an orthogonal basis in the feature space. Variables are selected in such a way that the first one represents as much variation as possible in the data [8].
The set of 50 features selected in the importance ranking was subjected to principal component analysis. As a result of this analysis, a set of 6 components was obtained. A visualisation of the distribution of their values is presented in Figure 6 and Figure 7 [8,9].
The characteristic values of the newly created components are presented in Table 1. Comparing these values shows that the component in the first position has the largest standard deviation and the largest range of values. The components in subsequent positions have progressively smaller ranges of values and lower standard deviations.
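The decreasing standard deviation of successive components can be reproduced on toy data. The article does not state which software was used for PCA, so the sketch below is not the authors' implementation. It assumes centred data, the sample covariance matrix, and power iteration with deflation as a simple, dependency-free way to obtain the leading eigenvectors. All function names and the toy numbers are hypothetical.

```python
import math


def center(rows):
    """Subtract the column mean from every column."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def covariance(rows):
    """Sample covariance matrix (divisor n - 1) of already centred rows."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / (n - 1) for b in range(p)]
            for a in range(p)]


def top_eigenpair(mat, iters=500):
    """Dominant eigenvalue and unit eigenvector by power iteration."""
    p = len(mat)
    vec = [1.0 / (j + 1) for j in range(p)]  # fixed, asymmetric start vector
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iters):
        nxt = [sum(mat[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:  # no variance left to explain
            return 0.0, vec
        vec = [v / norm for v in nxt]
    # Rayleigh quotient of the converged unit vector.
    value = sum(vec[a] * sum(mat[a][b] * vec[b] for b in range(p))
                for a in range(p))
    return value, vec


def pca(rows, k):
    """Project rows onto the first k principal axes; return (scores, axes)."""
    x = center(rows)
    mat = covariance(x)
    p = len(mat)
    axes = []
    for _ in range(k):
        value, vec = top_eigenpair(mat)
        axes.append(vec)
        # Deflation: remove the variance already explained by this axis.
        mat = [[mat[a][b] - value * vec[a] * vec[b] for b in range(p)]
               for a in range(p)]
    scores = [[sum(r[j] * axis[j] for j in range(p)) for axis in axes]
              for r in x]
    return scores, axes


def column_std(scores, j):
    """Sample standard deviation of component j."""
    col = [s[j] for s in scores]
    m = sum(col) / len(col)
    return math.sqrt(sum((v - m) ** 2 for v in col) / (len(col) - 1))


# Toy data (made up): six samples, three correlated features.
toy = [[2.0, 1.9, 0.5], [0.0, 0.2, 0.4], [4.0, 4.1, 0.6],
       [1.0, 0.8, 0.5], [3.0, 3.2, 0.3], [5.0, 4.9, 0.7]]
toy_scores, toy_axes = pca(toy, k=2)
stds = [column_std(toy_scores, j) for j in range(2)]
```

Here `stds[0]` is larger than `stds[1]`, and the two axes are orthogonal, which mirrors the pattern described for the six components. For data of the size used in the study, a tested linear algebra library would be a more practical choice than this hand-written iteration.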

Classification
Two sets of features were classified. The first one contained the 50 features occupying the highest positions in the importance ranking. The second set contained the 6 new features obtained by principal component analysis. Five types of classifiers were built:
• Naive Bayes Classifier (NBC),
• Multilayer Perceptron (MP),
• Hoeffding Tree (HT),
• 1-Nearest Neighbour (1-NN),
• Random Forest (RF).
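Of the five classifiers, 1-Nearest Neighbour is the simplest to write down: a new sample receives the label of the closest training sample. The article does not specify the distance measure or the validation scheme, so the sketch below assumes Euclidean distance and uses made-up two-dimensional feature vectors. The names `euclidean` and `predict_1nn` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(train_x, train_y, query):
    """Label of the single closest training sample (ties: first in the list)."""
    best_index = min(range(len(train_x)),
                     key=lambda i: euclidean(train_x[i], query))
    return train_y[best_index]


# Toy feature vectors (made up); 0 = healthy, 1 = osteoporosis.
train_x = [[0.1, 0.2], [0.3, 0.1], [2.0, 2.1], [2.2, 1.9]]
train_y = [0, 0, 1, 1]
pred_a = predict_1nn(train_x, train_y, [0.2, 0.3])  # close to the healthy samples
pred_b = predict_1nn(train_x, train_y, [1.9, 2.0])  # close to the sick samples
```

Because 1-NN relies directly on distances, its output depends on how the input features are scaled and on how many dimensions enter the distance.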
To assess the quality of the classifiers, the following indicators commonly used in medicine were applied:
• general classification accuracy (ACC): the probability of correct classification of cases into both categories,
• true positive rate (TPR): the probability that a truly sick case is classified into the sick group,
• true negative rate (TNR): the probability that a truly healthy case is classified into the healthy group,
• positive predictive value (PPV): the proportion of cases assigned to the sick group that are truly sick,
• negative predictive value (NPV): the proportion of cases assigned to the healthy group that are truly healthy.
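The five indicators can be computed from the four cells of a confusion matrix using their standard definitions. The sketch below assumes that "sick" is the positive class. The function name `quality_indicators` and the counts in the example are hypothetical and are not taken from the study.

```python
def quality_indicators(tp, fn, tn, fp):
    """Return ACC, TPR, TNR, PPV and NPV as fractions between 0 and 1.

    tp: sick cases classified as sick       fn: sick cases classified as healthy
    tn: healthy cases classified as healthy fp: healthy cases classified as sick
    """
    return {
        "ACC": (tp + tn) / (tp + fn + tn + fp),
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }


# Hypothetical counts for illustration only.
example = quality_indicators(tp=90, fn=10, tn=80, fp=20)
```

TPR and TNR are conditioned on the true state of the patient, whereas PPV and NPV are conditioned on the classifier's decision, so the two pairs generally take different values.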

Results
The classification results are presented in the tables below (Table 2 and Table 3).
For the set of the first 50 features from the importance ranking, the highest indicator values were obtained for the Random Forest classifier. Among its indicators, TPR and PPV reached the highest values (95%). The Multilayer Perceptron also achieved the same TNR and NPV values as Random Forest (94.50%). The worst results were achieved by the Naive Bayes and Hoeffding Tree classifiers. In both cases, TPR and PPV were only 87.5%.
For the set of 6 components, the best results were obtained for the 1-NN classifier. Its TPR and PPV values are 97.5%. The ACC value was slightly lower (96.75%). As with the previous set, the worst results were obtained for the Naive Bayes classifier. The Hoeffding Tree classifier achieved TNR and NPV values only half a percent higher.
Comparing the obtained results, we can conclude that applying principal component analysis led to better final results for the 1-NN classifier. The results of the best classifier for each set differ by 2 percentage points for ACC, 2.5 for TPR, 1.5 for TNR, 2.5 for PPV, and 1.5 for NPV. For the other classifiers, however, the results deteriorated. Random Forest, the most effective classification method for the set of 50 features, gave indicator values 3% lower for the set of 6 components.

Conclusions
In the present experiment, the most effective classification procedure turned out to be the 1-NN classifier applied to the set obtained by principal component analysis. This procedure gave results 2% better than the best classification of the basic set of 50 features. For the other classifiers, the values of the classification indicators deteriorated after principal component analysis was applied.
These results indicate that principal component analysis is of limited usefulness in improving the quality of classification: the method improves the results of only selected classifiers, here 1-NN. Building a diagnostic system based on the procedure presented in this article may improve the diagnosis of the condition of the spongy tissue.