USAGE OF ARTIFICIAL NEURAL NETWORKS IN THE DIAGNOSIS OF KNEE JOINT DISORDERS

The following article addresses the issue of automatic knee disorder diagnosis using neural networks. We propose several hybrid neural network architectures which aim to classify abnormalities using MRI (magnetic resonance imaging) images acquired from a publicly available dataset. To construct such combinations of models we used pretrained Alexnet, Resnet18 and Resnet34 downloaded from Torchvision. Experiments showed that for certain abnormalities our models can achieve up to 90% accuracy.


Introduction
Knee joint disorders are a problem closely linked to the human aging process. Such disorders are outcomes of everyday work and of accidents that lead to physical damage. One of the most effective methods of diagnosing such injuries is the analysis of MRI images.
In this project we tried to construct a hybrid neural network architecture that could accurately classify knee joint abnormalities using MRI images published by Stanford University as "A Knee MRI Dataset And Competition" [8]. The researchers from the Stanford ML Group also published an article [1] presenting the results of their models, which served as a reference point for the scores achieved by our neural nets. We also acknowledge that, for a better understanding of our task, we analysed Ahmed Besbes's implementation available at [5].
The goal of the whole project is to check whether a single person's exam, consisting of 3 planes (axial, coronal and sagittal), indicates the occurrence of an injury such as an abnormality, an ACL tear or a meniscal tear. Each exam was viewed and tagged with labels by medical doctors.

Fig. 1. Single exam's planes
This article is structured as follows. At the beginning of the text we describe the dataset and the idea behind the experiment. After that we present the types of neural nets and statistical methods used. The last part of the paper is dedicated to the experiment's results and a summary.

Dataset characteristics
The dataset consists of 1,370 MRI examinations taken at Stanford University Medical Center. Each examination has 3 labels indicating the presence of abnormality, ACL tears and meniscal tears. The occurrence of an ACL tear or a meniscal tear means that the abnormality label will be positive, but it doesn't work the other way round. That means the abnormality label covers not only ACL tears and meniscal tears but also other types of abnormalities not specified among the labels.
The MRI images were taken using various devices (GE Discovery, GE Healthcare, Waukesha, WI). Moreover, two magnetic field strengths were used: 3.0 T (55.6% of exams) and 1.5 T for the rest of the exams.
The data uploaded by Stanford University was already preprocessed. That included converting DICOM (Digital Imaging and Communications in Medicine) files to PNG format and rescaling them to 256×256 resolution. Given that the images did not have the same pixel intensity, the researchers used a standardization algorithm based on pixel intensities taken from the training dataset. The algorithm itself was run on both the training and testing datasets.
In order to enhance the training dataset we performed augmentation consisting of random rotation, transposition and horizontal flip.
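The augmentation step can be sketched as follows. This is a minimal PyTorch sketch under stated assumptions: the text only names the three operations, so the application probabilities are assumed to be 0.5 and rotation is restricted to multiples of 90 degrees for simplicity.

```python
import torch

def augment(x: torch.Tensor) -> torch.Tensor:
    """Randomly rotate, transpose and horizontally flip a stack of MRI
    slices of shape (slices, H, W). Probabilities and angles are
    illustrative assumptions, not values from the paper."""
    if torch.rand(1) < 0.5:   # random rotation (90/180/270 degrees)
        x = torch.rot90(x, k=int(torch.randint(1, 4, (1,))), dims=(-2, -1))
    if torch.rand(1) < 0.5:   # transposition
        x = x.transpose(-1, -2)
    if torch.rand(1) < 0.5:   # horizontal flip
        x = torch.flip(x, dims=(-1,))
    return x

exam = torch.rand(32, 256, 256)   # one plane of one exam: 32 slices
augmented = augment(exam)
```

Because the slices are square (256×256), all three operations preserve the tensor's shape.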

Experiment description
For each plane we attempted to build a single model (called a submodel) specialized in a specific label. To boost the performance of the models we decided to use pretrained versions of Resnet18, Resnet34 and Alexnet downloaded from Torchvision [7], which served as the main parts of our submodels. An overview of their structures is available in [2, 3]. The idea behind a single net's functioning was to process the output of the pretrained model with average and max pooling, concatenate the results and finally perform calculations in a fully connected layer. The whole structure is presented in Fig. 2. The best-suited model for classifying the presence of a specific label using a given plane was chosen based on its performance and on the results of McNemar's test run between all models.
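A single submodel along these lines might be sketched as below. Assumptions not stated in the text: a Resnet18 backbone with its 512-channel feature maps, and max-aggregation of the per-slice scores into one exam-level score.

```python
import torch
import torch.nn as nn
from torchvision import models

class Submodel(nn.Module):
    """One-plane submodel sketch: pretrained backbone -> average and max
    pooling of its feature maps -> concatenation -> fully connected layer.
    The per-slice aggregation by max is an assumption."""
    def __init__(self, pretrained: bool = True):
        super().__init__()
        weights = models.ResNet18_Weights.DEFAULT if pretrained else None
        backbone = models.resnet18(weights=weights)
        # Keep everything up to (but excluding) Resnet's own avgpool and fc.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.avg = nn.AdaptiveAvgPool2d(1)
        self.max = nn.AdaptiveMaxPool2d(1)
        self.fc = nn.Linear(512 * 2, 2)   # concatenated avg+max features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: one plane's slices treated as a batch, shape (slices, 3, 256, 256)
        f = self.features(x)                                   # (slices, 512, 8, 8)
        pooled = torch.cat([self.avg(f), self.max(f)], dim=1)  # (slices, 1024, 1, 1)
        scores = self.fc(pooled.flatten(1))                    # (slices, 2)
        return scores.max(dim=0).values                        # one score pair per exam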
The main models were composed of 3 submodels, each taking a specific plane as input. The outcome of each submodel was sent to a linear layer ending with 2 neurons. All neural nets were trained using binary cross-entropy loss given by:

L = −w · [y · ln(σ(x)) + (1 − y) · ln(1 − σ(x))] (1)

where x stands for the predicted value, y for the actual value, σ for the logistic function and w for the loss weight. To balance the unequal label distribution we multiplied the losses from actual positive observations by the reversed proportion of the number of actual positive observations to the number of actual negative observations. This operation can be described as follows:

w = n_neg / n_pos (2)

where n_pos and n_neg are the numbers of actual positive and actual negative observations.
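This weighting scheme maps directly onto PyTorch's `BCEWithLogitsLoss`, whose `pos_weight` argument multiplies the positive-class term of the loss exactly as described. The label vector below is a toy example, not data from the paper.

```python
import torch
import torch.nn as nn

# Toy label vector: 2 positives, 4 negatives.
labels = torch.tensor([1., 0., 0., 0., 1., 0.])
n_pos, n_neg = labels.sum(), (1 - labels).sum()
pos_weight = n_neg / n_pos   # reversed proportion: 4 / 2 = 2

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
logits = torch.zeros_like(labels)   # sigma(0) = 0.5 for every observation
loss = criterion(logits, labels)
# Positive terms contribute 2 * ln(2), negative terms ln(2), averaged.
```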

Resnet
Resnet is a type of neural network created by a group of researchers from Microsoft. Their main objective was to solve the issue of degradation, which takes place during the training process. The symptom of this phenomenon is deteriorating loss values, not only on the training set but also on the test set, whenever the construction of the neural net is expanded with extra layers.
As a result, the researchers invented the residual block, which at that time differed from standard neural networks in its idea of reusing the input vector at both the beginning and the end of a set of layers.
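A minimal residual block of this kind can be sketched as follows (identity shortcut only; the channel counts and batch-norm placement follow the original Resnet design rather than anything stated here):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block sketch: the input x is added back to the output of
    a small stack of layers (the shortcut connection), then activated."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)   # input reused at the end of the block
```

The `out + x` term is the element-wise addition that distinguishes residual learning from a plain stack of layers.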

Alexnet
Alexnet is a type of neural network created by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. Its success was based on several innovations, not all of which were authored by Alexnet's creators, used all together in one architecture. Alexnet's training was spread across 2 graphics cards, which allowed updating of 2 parallel mapping series and efficient memory management. Alexnet's structure begins with 3 convolutional layers which share data between the graphics cards: the input mapping for each layer is built from the output tensors created by the previous layers placed on both graphics cards. The next 2 layers are also convolutional, but in this case they are independent in the sense of data sharing. They are followed by 3 fully connected layers.

McNemar's test and Wilson's confidence interval
To compare classifiers we used McNemar's test. The null hypothesis states that both classifiers disagree to the same extent. If the null hypothesis is rejected, it means that there is a possibility that both classifiers disagree in different ways. To perform McNemar's test a p-value should be calculated which, depending on the alpha value (here 0.05), confirms or rejects the null hypothesis: p-value > 0.05 → null hypothesis confirmed, p-value ≤ 0.05 → null hypothesis rejected.
In many cases the chi-square distribution is impossible to estimate since b + c < 25. For that reason we decided to use the exact p-value given by the following equation:

p = 2 · Σ_{i=b}^{n} C(n, i) · 0.5^i · (1 − 0.5)^(n−i) (9)

where n = b + c and b is the larger of the two discordant counts.
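The exact p-value of equation (9) can be computed directly; this pure-Python sketch takes the two discordant counts b and c (observations on which exactly one of the two classifiers is correct), with illustrative numbers only.

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the discordant counts b and c,
    using the binomial distribution with success probability 0.5."""
    n = b + c
    k = max(b, c)   # the larger discordant count
    p = 2 * sum(comb(n, i) * 0.5 ** i * (1 - 0.5) ** (n - i)
                for i in range(k, n + 1))
    return min(p, 1.0)   # the doubled tail can slightly exceed 1

p_value = mcnemar_exact_p(9, 3)   # illustrative counts, not paper results
```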
In order to estimate the values of the expected metrics on previously unseen data we calculated the Wilson confidence interval. A comprehensive overview is available in [9]. This method allowed us to construct a range of expected results with a given probability (95% in this project). The Wilson confidence interval is given by the following equation:

(p̂ + z²/(2n) ± z · sqrt(p̂(1 − p̂)/n + z²/(4n²))) / (1 + z²/n) (10)

where n is the number of observations, z is the z-score for the 95% confidence interval and p̂ is the proportion of positive observations.
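Equation (10) translates into a few lines of Python (the 87-out-of-100 figures below are purely illustrative, not results from the paper):

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson confidence interval for a proportion; z = 1.96 gives
    the 95% interval used in this project."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson_interval(87, 100)   # e.g. 87% accuracy measured on 100 exams
```

Unlike the naive normal-approximation interval, the Wilson interval always stays within [0, 1], which matters for metrics close to 0 or 1.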

Results of submodels
For each label, each plane, and each pretrained model we performed a training process lasting 10 epochs. Among the 10 checkpoints we selected the one which obtained the highest accuracy on the test set at the end of an epoch. If at least 2 checkpoints achieved the same accuracy, we chose the one with the highest AUC (area under curve). At the end of the selection process we ended up with 27 submodels, which we had to reduce to 9: one submodel per (plane, label) pair.
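The selection rule above reduces to a lexicographic maximum over (accuracy, AUC) pairs; a toy sketch with made-up checkpoint records:

```python
# Hypothetical per-epoch checkpoint records: (test accuracy, AUC, epoch).
checkpoints = [
    (0.85, 0.90, 1),
    (0.87, 0.88, 4),
    (0.87, 0.91, 7),   # ties on accuracy with epoch 4, wins on AUC
    (0.86, 0.93, 9),
]

# Highest accuracy first; ties broken by higher AUC.
best = max(checkpoints, key=lambda ckpt: (ckpt[0], ckpt[1]))
```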
To analyze the submodels' performance we calculated the following statistics: accuracy, precision, recall, F1 score and AUC. To take a deeper look into the delivered statistics we computed the Wilson confidence interval. Furthermore, we calculated p-values using McNemar's test to find out whether the submodels are statistically different.
Tables 3 and 4 present an example of the results obtained by the 3 submodels dedicated to abnormality classification using the axial plane. In this case, because of the high p-values, we decided to move forward with Resnet18, which has the smallest number of parameters. We would also like to present an insight into the training process of the chosen model. Fig. 7 and Fig. 8 show that, even though all models were pretrained beforehand, the loss levels reached during training and testing looked different for the submodel equipped with Alexnet and its equivalents with Resnets. The Alexnet submodel needed much more time to reach the loss level represented by the Resnet submodels.
Using the same strategy as described in the example above, we selected the rest of the submodels; an overview is presented in Tables 5 and 6.

Results of main models
To assess the main models' effectiveness we calculated the same metrics as in the submodels' case, this time expanding the analysis with specificity. In each of the 3 pairs of compared models the same pattern can be spotted: the differences in accuracy between Stanford's models and our models are low. For instance, the accuracy of Stanford's model responsible for classifying images with abnormality presence reached 0.85, compared to our model's 0.87.
It seems that the neural nets created by the Stanford ML Group work much better at detecting images which do not have the sought label. In contrast, our models excel at finding disorders in images that actually present a joint with a disorder. For example, the recall and specificity of Stanford's model classifying ACL tears reached 0.759 and 0.924, while the same metrics for our model stood at 0.9 and 0.924.
It is worth mentioning that Stanford's neural networks overtook our models when it comes to AUC levels. A possible explanation could be the difference in the way submodel checkpoints were selected among epochs during training. The original Stanford paper mentions that the researchers chose the versions which had the lowest average loss within an epoch. Our submodels, on the other hand, were picked according to the highest accuracy.

Summary
In conclusion, the models created to classify 3 types of knee joint disorders achieved results comparable to their equivalents from Stanford University. The differences in the way they were selected are without doubt good material for further research.
It seems that the topic of classifying knee joint injuries using neural nets is worth spending much more time on. In our opinion, aspects like the choice of pretrained model or the construction of the submodel could be explored much further. It is also clear to us that such models should help medical doctors not only with proper classification but also with pointing out the place where an injury is located. This was the main idea of the researchers from Stanford University, who implemented class activation mapping, a heatmap-generating technique which shows which parts of the image were significant for classification.

Fig. 3. Main models' architecture scheme. All neural nets were trained using binary cross-entropy loss given by:

L = −w · [y · ln(σ(x)) + (1 − y) · ln(1 − σ(x))] (1)

where x stands for the predicted value, y for the actual value, σ for the logistic function and w for the loss weight. To balance the unequal label distribution we multiplied the losses from actual positive observations by the reversed proportion of the number of actual positive observations to the number of actual negative observations.

Fig. 4. Residual block scheme. Microsoft's researchers formulated a hypothesis that it is possible to asymptotically approximate the outcomes of complicated functions. In the case of Resnet the authors took a step further and checked whether a couple of connected layers are able to approximate the outcome of the residual function:

F(x) = H(x) − x (3)

where H(x) is the covered mapping. The working of the residual block depicted in Fig. 4 can be described as follows:

y = F(x, {W_i}) + x (4)

where x and y are the input and output tensors of the residual block, and W_i describes the weights of the i-th layer. The full version of the F function can be expanded with the 2 layers visible in Fig. 4, which gives:

F = W_2 · σ(W_1 x) (5)

where σ is the Relu activation function. The transformation F + x is realized by a shortcut connection followed by element-wise addition.

Fig. 6. Alexnet's architecture scheme. One of the most groundbreaking innovations was the activation function called Relu, described by the following equation:

f(x) = max(0, x) (7)

At that time most neural networks used the hyperbolic tangent, which led to a slower training tempo. Another idea implemented in Alexnet was local response normalization, inspired by a phenomenon called lateral inhibition, in which an excited neuron disables its neighbors. This phenomenon leads to the specialization of overexcited neurons in detecting certain patterns. Formally, local response normalization looks as follows:

b^i_{x,y} = a^i_{x,y} / (k + α · Σ_{j=max(0, i−n/2)}^{min(N−1, i+n/2)} (a^j_{x,y})²)^β (8)

where a^i_{x,y} is the activity of the neuron computed by kernel i at position (x, y), N is the number of kernels in the layer, and k, n, α, β are constants.

Fig. 7 .
Fig. 7. Loss levels of submodels (classifying abnormality on train dataset) at the end of each epoch

Table 2. Contingency table. To compare the models' classification results we performed McNemar's test, an overview of which is available in [6] and [4]. This method uses a contingency table which splits the same observations, classified by the two compared models, into the 4 groups presented in Table 2.

Table 3 .
Classification of abnormality using axial plane

Table 5 .
Final submodels chosen to build main models

Table 6 .
Final submodels chosen to build main models

Table 7. Final models compared with their equivalents from Stanford University