PERFORMANCE COMPARISON OF MACHINE LEARNING ALGORITHMS FOR PREDICTIVE MAINTENANCE

The consequences of failures and unscheduled maintenance are the reason why engineers have been trying to increase the reliability of industrial equipment for years. In modern solutions, predictive maintenance is a frequently used method. It makes it possible to forecast failures and to warn of their likelihood. This paper presents a summary of the machine learning algorithms that can be used in predictive maintenance and a comparison of their performance. The analysis was based on a data set from Microsoft Azure AI Gallery. The paper presents a comprehensive approach to the issue, including feature engineering, preprocessing, dimensionality reduction techniques, and tuning of model parameters in order to obtain the highest possible performance. The research showed that, in the analysed case, the best algorithm achieved 99.92% accuracy on over 122 thousand test data records. In conclusion, predictive maintenance based on machine learning represents the future of machine reliability in industry.


Introduction
Today's industry is facing new problems associated with the constant growth of production as well as higher accuracy and safety requirements. In addition, the international market is very competitive in terms of prices, which are highly dependent on production speed and reliability. Machines and automatons are very important parts of a manufacturing process, so if a component fails, it causes financial losses related to downtime of the production process. Moreover, some failures may lead to safety violations, which are far more undesirable.
To avoid unwanted danger and financial losses, many maintenance strategies are used in industry. According to Susto et al. [23], maintenance approaches can be classified as follows:
- Corrective maintenance (also Run-to-Failure, R2F): this method consists of replacing or fixing a component after it fails. It is the most straightforward approach, but also the least effective one. It leads to additional costs associated with downtime and unscheduled maintenance, often including the delivery time of spare parts.
- Preventive maintenance (PvM): maintenance interventions are performed regularly to avoid unscheduled stoppages. The interval between interventions is based on knowledge about a given system component, but does not allow its full service life to be used. Thus, scheduled maintenance may cause additional costs related to unnecessary repairs.
- Predictive maintenance (PdM): the goal of PdM is to forecast failures before they occur. This is possible thanks to monitoring and data acquisition systems, which provide useful information about the history of the machine and its current state.
Predictions are based on historical data, defined health factors, engineering approaches and statistical inference methods. Machine learning algorithms have proved to be very effective in failure prediction and remaining useful life (RUL) estimation [10,17,24]. They can also be used in a wide range of industrial applications, such as engine soot emission prediction [18], gearbox failure prediction [11] and forecasting of robotic manipulation failures [20]. Moreover, predictive models are very popular in other fields of technology. In [22], the authors used a decision tree algorithm for hard disk drive failure prediction. Korvesis et al. [15] predicted failures from post-flight reports using random forest and support vector machine (SVM) models. Although machine learning methods are often used and give good results, scientists are still working on other interesting techniques [2,13]. Some forecasting tasks are complicated by missing maintenance history or other types of data, so the authors of [6] proposed a hybrid semi-supervised approach. Kanawaday and Sane [12] came up with the idea of first predicting production cycle parameters with an ARIMA (AutoRegressive Integrated Moving Average) model and then feeding these values to a supervised classifier.
This paper presents a comprehensive approach to predictive maintenance, in which the performance of eight machine learning algorithms with tuned parameters was compared. To the best of the author's knowledge, and according to [4], there is no such work in the literature.

Data structure and preprocessing
The data comes from Microsoft Azure AI Gallery and is intended for predictive maintenance modelling [25]. It consists of five datasets that contain useful information about a group of identical industrial machines. Every machine has its own identification number, which indicates the model and age of the machine. The first dataset includes real-time telemetry data, that is, timestamp, voltage, rotation, pressure and vibration values. Error messages are in the second dataset. The remaining datasets contain information about the machines, maintenance history (timestamp and replaced component ID) and failures (timestamp and broken component ID).
Preprocessing starts with feature engineering, which is important for extracting as much useful information as possible from the data. First of all, it should be determined how far back the algorithm should "look" in order to predict failures. This is the so-called lookback parameter, which is used to create lag features that constitute the short-term history of the machine. The width of this time window has to be discussed with an expert in the particular field. It is also important to remember that if the window is too long, the data will be too noisy for the algorithm to predict with satisfactory performance. On the other hand, if the time window is too short, it will contain too little information to determine the risk of failure. Further research on the lookback parameter is outside the scope of this work; a 24 h time window was chosen.
Creating lag features for telemetry data consists of calculating the mean and standard deviation for every third record in the dataset. Next, to capture a long-term effect, the mean and standard deviation of the last 24 hours are also calculated.
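As an illustration, the lag features described above can be sketched in plain Python. The function name `lag_features`, the sample values, the three-record short-term window and the use of the population standard deviation are assumptions made for this sketch; the paper does not give the original implementation.

```python
from statistics import mean, pstdev


def lag_features(values, window, step=3):
    """Return (last_index, mean, std) for trailing windows of `window` records,
    evaluated at every `step`-th record."""
    features = []
    for end in range(window, len(values) + 1, step):
        chunk = values[end - window:end]
        features.append((end - 1, mean(chunk), pstdev(chunk)))
    return features


# Hypothetical hourly voltage readings for one machine.
voltage = [170.0, 172.0, 174.0, 171.0, 169.0, 173.0]
short_term = lag_features(voltage, window=3)

# Long-term effect: a 24-record window (24 h for hourly data), every third record.
hourly = [float(i % 5) for i in range(48)]
long_term = lag_features(hourly, window=24)
```

In practice, a data-frame library with rolling-window support would probably be used for the same purpose; the plain-Python version only makes the windowing explicit.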
The error dataset contains a timestamp and an error message ID number for every machine. The number of errors of each type in the 24 h lag window has to be calculated in order to find out what impact it has on failure probability.
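A minimal sketch of this counting step is given below. The function `error_counts`, the error identifiers and the timestamps are hypothetical, and treating the window as open at its start and closed at the record time is an assumption.

```python
from collections import Counter
from datetime import datetime, timedelta


def error_counts(record_time, errors, window=timedelta(hours=24)):
    """Count errors of each type in the `window` ending at `record_time`.

    errors: iterable of (timestamp, error_id) pairs for a single machine.
    """
    start = record_time - window
    return Counter(err for ts, err in errors if start < ts <= record_time)


# Hypothetical error log of one machine.
errors = [
    (datetime(2015, 1, 3, 7), "error1"),
    (datetime(2015, 1, 3, 20), "error3"),
    (datetime(2015, 1, 4, 6), "error1"),
    (datetime(2015, 1, 6, 1), "error2"),
]
counts = error_counts(datetime(2015, 1, 4, 6), errors)
```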
Maintenance history is one of the most important datasets, so it is crucial for a company to build a system that collects such data. It is used to calculate the number of days since the last replacement of a given machine component, which provides very useful information about its degradation level.
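The feature described above could be computed, for example, as follows. The function name and the dates are hypothetical; the sketch assumes that the replacement timestamps of a component are sorted in ascending order.

```python
from bisect import bisect_right
from datetime import datetime


def days_since_last_replacement(record_time, replacement_times):
    """Days elapsed since the most recent replacement at or before `record_time`.

    replacement_times must be sorted ascending. Returns None when the
    component has not been replaced before `record_time`.
    """
    pos = bisect_right(replacement_times, record_time)
    if pos == 0:
        return None
    elapsed = record_time - replacement_times[pos - 1]
    return elapsed.total_seconds() / 86400.0


# Hypothetical replacement history of one component.
replacements = [datetime(2015, 1, 1, 6), datetime(2015, 3, 1, 6)]
age = days_since_last_replacement(datetime(2015, 3, 3, 18), replacements)
no_history = days_since_last_replacement(datetime(2014, 12, 1), replacements)
```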
Finally, all the datasets (including machine and failure information) are merged and prepared for labelling, i.e. assigning each record a class that says whether or not a fault has occurred. This is not the only way to do it. The authors of [19] considered each of the devices and their components separately, labelling them as faulty or not. Thibaux et al. [5] decided to distinguish between three classes: "impending failure detected", "not impending failure detected" and "uncertain about future failure". In this work, the issue is treated as a multiclass classification problem, in which the model predicts which of the four components will fail, or that none will. In addition, it was assumed that the prediction would take place 24 hours in advance, although this time should generally be chosen with regard to maintenance time and spare parts availability. This means that each data record located within 24 hours before a fault is marked as "incoming failure of component number x", and as "none" otherwise. Table 1 shows the structure of the labelled data and sample values.
Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks require input data of shape (batch_size, timesteps, features). To obtain this three-dimensional structure, it is necessary to run an algorithm that generates a 24-hour machine history (as an additional dimension) for each labelled data record.
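Both steps, labelling and generation of the 24-hour history, can be sketched as below. The names `label_record` and `build_history`, the component identifier and the sample data are hypothetical, and the inclusive window boundaries are an assumption rather than the procedure actually used in this work.

```python
from datetime import datetime, timedelta


def label_record(record_time, failures, horizon=timedelta(hours=24)):
    """Return the component failing within `horizon` after `record_time`, else "none".

    failures: list of (failure_time, component_id) pairs for one machine.
    """
    for failure_time, component in failures:
        if timedelta(0) <= failure_time - record_time <= horizon:
            return component
    return "none"


def build_history(rows, timesteps=24):
    """Turn (n_records, n_features) rows into (n_samples, timesteps, n_features).

    Each sample holds the `timesteps` consecutive records ending at one record.
    """
    return [rows[end - timesteps:end] for end in range(timesteps, len(rows) + 1)]


failures = [(datetime(2015, 1, 5, 6), "comp2")]
label_near = label_record(datetime(2015, 1, 4, 12), failures)  # 18 h before the fault
label_far = label_record(datetime(2015, 1, 3, 12), failures)   # 42 h before the fault

rows = [[float(i), 2.0 * i] for i in range(30)]  # 30 records, 2 features
history = build_history(rows)
```

Records earlier than the first full window have no complete history and are dropped in this sketch; padding would be an alternative design choice.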
The next step is to process the categorical data, which consists of mapping the ordinal features and encoding the nominal features as well as the class labels. Then the dataset is split into training and test subsets. Each data record corresponds to a point in time, so no random splitting or random sampling method can be used: past events must not be predicted from future events, which might happen with such methods, and it is unacceptable to use data that arose later than the point under consideration. Hence, a time-dependent splitting method was applied by selecting one point in time as the division point and ignoring the records for the 24 hours that follow it. This eliminates the risk of leakage of information (created during labelling) between the subsets. The splitting point was chosen as 2015-07-31 01:00:00, which gave a 60:40 ratio between training and test data.
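A minimal sketch of such a time-dependent split is shown below. The function `time_split` and the synthetic hourly records are hypothetical; only the splitting point and the 24-hour gap follow the text.

```python
from datetime import datetime, timedelta


def time_split(records, split_point, gap=timedelta(hours=24)):
    """Split (timestamp, ...) records chronologically.

    Records in [split_point, split_point + gap) are dropped so that labels
    built from future failures cannot leak between the subsets.
    """
    train = [r for r in records if r[0] < split_point]
    test = [r for r in records if r[0] >= split_point + gap]
    return train, test


# Synthetic hourly records covering three days.
start = datetime(2015, 7, 30)
records = [(start + timedelta(hours=h), h) for h in range(72)]
train, test = time_split(records, datetime(2015, 7, 31, 1))
```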
It has been shown that data preparation techniques such as normalization and standardization have a positive impact on the performance of prediction models [8]. Thus, the data was standardized before model training.
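Standardization can be sketched as follows. Estimating the mean and standard deviation on the training subset only and reusing them for the test subset is an assumption of this sketch (it is common practice, but the text does not state it), and the function names and values are hypothetical.

```python
from statistics import mean, pstdev


def fit_standardizer(train_column):
    """Return (mean, std) estimated on the training data only."""
    mu = mean(train_column)
    sigma = pstdev(train_column)
    # Guard against constant features to avoid division by zero.
    return mu, (sigma if sigma > 0 else 1.0)


def standardize(column, mu, sigma):
    """Rescale values to zero mean and unit variance using the given statistics."""
    return [(x - mu) / sigma for x in column]


train_pressure = [98.0, 100.0, 102.0]
test_pressure = [100.0, 104.0]
mu, sigma = fit_standardizer(train_pressure)
train_z = standardize(train_pressure, mu, sigma)
test_z = standardize(test_pressure, mu, sigma)
```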

Prediction models and validation
One of the breakthroughs in machine learning was the development of the perceptron learning rule by F. Rosenblatt [21]:

$$ w_j := w_j + \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)} $$

In the formula above, $\eta$ is the learning rate (0÷1), $y^{(i)}$ is the true class label of the $i$-th training sample, $\hat{y}^{(i)}$ is the predicted class label, and $x_j^{(i)}$ is the value of the $j$-th feature of that sample. The idea of updating weights has led researchers to develop more sophisticated models. In this article, the following classification algorithms have been used for failure prediction: logistic regression, support vector machines (SVM), decision tree, random forest, gradient boosting classifier, artificial neural network (ANN), convolutional neural network (CNN), long short-term memory (LSTM).
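For illustration, the perceptron learning rule can be sketched in a few lines of Python. This is the textbook form of the rule with a unit step activation, applied to a toy problem (logical AND) that is unrelated to the data set used in this work; the function names and the chosen learning rate are assumptions.

```python
def predict(weights, bias, x):
    """Unit step activation on the weighted sum of the inputs."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation >= 0.0 else 0


def train_perceptron(samples, labels, eta=0.5, epochs=10):
    """Apply w_j := w_j + eta * (y - y_hat) * x_j after every sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            update = eta * (y - predict(weights, bias, x))
            weights = [w + update * xi for w, xi in zip(weights, x)]
            bias += update
    return weights, bias


# Toy, linearly separable problem (logical AND).
samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 0, 1]
weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

The weights change only when a sample is misclassified, which is why training stops changing the model once all samples of a linearly separable problem are predicted correctly.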
Initially, the performance of the algorithms with default parameters was tested for different splitting and dimensionality reduction methods. It appeared that the proportions of training and test data have only a small impact on the accuracy, so the 60:40 split was used for further research. The next step was to investigate the influence of dimensionality reduction techniques such as principal component analysis (PCA) [9], linear discriminant analysis (LDA), generic univariate select (GUS), and recursive feature elimination (RFE) on prediction accuracy. For each algorithm, the method giving the best results was selected for further research. In this way, each prediction model was prepared for the parameter tuning process.
Selection of the best parameter values was made by applying a grid search algorithm, which evaluates every parameter combination from the grid and selects the one that gives the best result. This method usually involves k-fold cross-validation with random sampling, which is unacceptable in this case. Therefore, a time-series splitting method was used for the grid search to avoid overestimating the performance. Furthermore, for the neural networks, the best topology was chosen by manual testing and comparison. The best parameters for each model are listed in Table 2.
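The combination of grid search with time-series splitting can be sketched as below. The expanding-window scheme is an assumption (it is similar in spirit to `TimeSeriesSplit` in scikit-learn, but not identical), and `toy_score`, the parameter names and the grid values are placeholders rather than the settings used in this work.

```python
from itertools import product


def time_series_splits(n_samples, n_splits):
    """Yield (train_idx, val_idx) where validation always follows training in time."""
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = fold * k
        yield range(0, train_end), range(train_end, min(train_end + fold, n_samples))


def grid_search(param_grid, evaluate, n_samples, n_splits=3):
    """Return (best_params, best_mean_score) over all parameter combinations.

    evaluate(params, train_idx, val_idx) must return a score (higher is better).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [evaluate(params, tr, va)
                  for tr, va in time_series_splits(n_samples, n_splits)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, val_idx):
    # Placeholder for "fit the model on train_idx, score it on val_idx".
    return -abs(params["max_depth"] - 4) - params["learning_rate"]


folds = list(time_series_splits(12, 3))
grid = {"max_depth": [2, 4, 8], "learning_rate": [0.1, 0.5]}
best_params, best_score = grid_search(grid, toy_score, n_samples=12)
```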
In predictive maintenance, there is another important factor affecting performance. Failures are very rare occurrences among telemetry data, which leads to an imbalance in the label distribution. Hence, a classifier tends to predict majority class labels better than minority ones. There are many solutions to this problem. Among others, undersampling can be used, as done by the authors of [3]. Alternatively, class weighting can be applied to increase or decrease the algorithm's sensitivity towards specific classes. In this work, the methods proposed in [16] were used, in particular Tomek links and edited nearest neighbours (ENN). Oversampling was not used, because the number of newly added samples was unreasonably large compared with the improvement obtained, and the model training time became too long. Table 3 summarises the performance metrics of each model with default parameters and after data preprocessing and parameter tuning were applied.
The label distribution imbalance also affects how the model is evaluated. Failure-free records represent the vast majority of the dataset [14], so an algorithm can detect only a few faults while still maintaining high accuracy. Therefore, other performance metrics such as precision, recall and F1-score should be taken into account. These metrics are based on the numbers of positive and negative predictions for a single class, so for multi-class predictions their macro averages are calculated. The results for the CNN and LSTM algorithms confirm the problem described above: the accuracy values exceed 99% in both cases, but the other metrics are much lower.
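The effect described above can be reproduced with a small, made-up example in which a classifier never predicts a failure. The function names and the label values are hypothetical; assigning zero to a metric whose denominator is zero is a convention assumed for this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_metrics(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1_scores) / n


# 96 failure-free records and 4 failures; the classifier always answers "none".
y_true = ["none"] * 96 + ["comp1"] * 4
y_pred = ["none"] * 100
acc = accuracy(y_true, y_pred)
macro_precision, macro_recall, macro_f1 = macro_metrics(y_true, y_pred)
```

Here the accuracy is 0.96 although no failure is detected, while the macro recall is only 0.5, because the failure class contributes a recall of zero.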

Results
On the basis of the presented results, it would seem that there is no need for sophisticated methods of data preparation and parameter selection, since the results are very good and their improvement is only a fraction of a percent. However, for a company that would like to use such a system, any incorrect forecast can cost a lot of money, so it is reasonable to refine the algorithms as far as possible. The performance metrics of gradient boosting (marked in the table) are particularly interesting, as they deteriorated slightly after the model improvement. Nonetheless, overfitting decreased, so the risk of worse behaviour on new, previously unseen data is lower.
Among the introduced methods, the three that achieved the best results in predicting defects were chosen and their confusion matrices are presented (Fig. 1-3). Comparing these matrices and explicitly selecting the best algorithm requires several requirements to be determined for the system for which the application is intended. First of all, it is important to specify how expensive the so-called false alarms are, i.e. situations in which the model predicts a failure that does not actually occur. In addition, it is necessary to determine how harmful it is to forecast the failure of one component instead of another. Of course, not detecting an upcoming malfunction is the worst-case scenario, because unscheduled maintenance is the most expensive and avoiding it is desirable. It can be seen that the gradient boosting algorithm and the neural network have a similar number of undetected faults, but the latter raised as many as 90 false alarms. The fewest such situations occurred with the random forest, but it was less effective in detecting faults. For this work, the gradient boosting algorithm was chosen as the best, because it gives the fewest undetected faults while maintaining a reasonable number of false alarms.
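The three kinds of error discussed above can be counted directly from the true and predicted labels, as in the following sketch. The function name `alarm_summary` and the label values are hypothetical and the sample predictions are made up; they are not the results from Fig. 1-3.

```python
def alarm_summary(y_true, y_pred, healthy="none"):
    """Count false alarms, undetected faults and wrongly identified components."""
    pairs = list(zip(y_true, y_pred))
    return {
        # A failure is predicted although none occurs.
        "false_alarms": sum(t == healthy and p != healthy for t, p in pairs),
        # A failure occurs but the model predicts none.
        "undetected": sum(t != healthy and p == healthy for t, p in pairs),
        # A failure is predicted, but for the wrong component.
        "wrong_component": sum(
            t != healthy and p != healthy and t != p for t, p in pairs
        ),
    }


y_true = ["none", "none", "none", "comp1", "comp1", "comp2", "comp2", "none"]
y_pred = ["none", "comp1", "none", "comp1", "none", "comp1", "comp2", "none"]
summary = alarm_summary(y_true, y_pred)
```

Weighting these three counts by their respective costs would turn the comparison of models into the financial analysis mentioned in the text.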
The ROC (receiver operating characteristic) curves (Fig. 4) for the gradient boosting model show that failures of components 1 and 4 are detected with almost ideal efficiency, in contrast to failures of components 2 and 3, for which the area under the ROC curve (AUC) is even smaller than for the "none" class.
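For a multiclass problem, a per-class AUC can be obtained in a one-vs-rest manner. The sketch below uses the rank interpretation of the AUC (the probability that a randomly chosen record of the given class receives a higher score than a randomly chosen record of any other class); whether the curves in Fig. 4 were computed in exactly this way is an assumption, and the scores are made up.

```python
def auc_one_vs_rest(y_true, scores, positive):
    """AUC for class `positive` against all other classes.

    scores: predicted probability (or score) of the `positive` class per record.
    Ties between a positive and a negative record count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both positive and negative records are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = ["comp1", "none", "comp1", "none"]
comp1_scores = [0.9, 0.1, 0.4, 0.6]
auc_comp1 = auc_one_vs_rest(y_true, comp1_scores, positive="comp1")
```

The pairwise comparison is quadratic in the number of records, so for a test set of the size used here a sorting-based implementation would be preferable.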

Discussion
The conducted research shows that machine learning algorithms, in particular gradient boosting, random forest and ANN, give the best results in predicting failures of industrial machines. In predictive maintenance, the selection of the best algorithm is based on a financial analysis of the problem. In the presented case, gradient boosting was chosen. It obtained an accuracy of 99.92%, raised 34 false alarms and failed to detect 39 faults out of over 122 thousand test data records. This leads to the conclusion that predictive maintenance based on machine learning is the future of many industry sectors. However, it requires an appropriate early warning system. An example of such an application can be found in [1]. In addition, if maintenance of one component is much more expensive than that of the others, a class weighting mechanism can be used. As a result, the algorithm will be more sensitive to a failure of that component.
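One possible form of such a class weighting mechanism is sketched below. The inverse-frequency formula n_samples / (n_classes * class_count) is a common heuristic and not necessarily the one intended here; the function name, the cost multiplier and the label counts are hypothetical.

```python
from collections import Counter


def class_weights(labels, cost_multiplier=None):
    """Inverse-frequency class weights, optionally scaled by a per-class cost factor.

    cost_multiplier: optional dict mapping a class label to a factor greater
    than 1 for components whose failures are more expensive.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    weights = {c: n_samples / (n_classes * k) for c, k in counts.items()}
    for label, factor in (cost_multiplier or {}).items():
        weights[label] *= factor
    return weights


labels = ["none"] * 90 + ["comp1"] * 6 + ["comp2"] * 4
balanced = class_weights(labels)
# Assume a failure of comp2 is twice as costly as the others.
cost_aware = class_weights(labels, cost_multiplier={"comp2": 2.0})
```

Weights of this kind are typically passed to the training procedure so that errors on rare or expensive classes contribute more to the loss.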
The final performance of the production process depends on proper data preparation, algorithm selection and parameter tuning. This article presents a comprehensive approach to the predictive maintenance issue that considers all of these elements, as well as the evaluation criteria for such a system.
At the stage of selecting the artificial neural network topology, it was found that the best results are obtained with one hidden layer. This suggests that the classified data are approximately linearly separable, which is also supported by the fact that the SVM algorithm obtained the best results with a linear kernel. Nonetheless, in order to achieve the maximum possible predictive performance, the hidden correlations in the data need to be examined further.