Model of the text classification system using fuzzy sets

Classifying a work's subject area by its keywords is a relevant and important task. This article describes algorithms for classifying keywords by subject area. A model using both algorithms was developed and tested on test data. The results were compared with those of other existing algorithms suitable for these tasks, and the obtained results of the model were analysed. The algorithm can be used in real-life tasks.


Introduction
With the increasing amount of text information, it is very important to understand which area a text belongs to. When working with databases containing scientific papers and articles, it is necessary to determine the subject area of a work without looking through its entire text. In this case, work attributes such as keywords can be used. However, manually assigning keywords or subject areas takes too much time and resources, so automating the determination of text attributes saves both. Moreover, when assigning attributes to text information, a text may in reality belong not to one class but to several. Modern classification systems must support this capability.
The purpose of the article is to develop, describe, and test a text classification model. The model must have high accuracy and support non-binary classification.

Review of text analysis methods
In [15], a comparative analysis was made of the best way to classify complaint texts for a state online complaint service in Indonesia. The following algorithms participated in the comparison: Naive Bayes [16][17], Maximum Entropy [18][19], K-Nearest Neighbours [20], Random Forest [21][22], and Support Vector Machines, along with two ensemble strategies, hard voting and soft voting. The results indicate that, in general, all the ensemble methods performed better than the individual classifiers.
In [23], a keyword recognition system for texts using neural networks and hierarchical taxonomies was described. For each category in the taxonomy, a separate pre-trained neural network is used; the networks use the logistic activation function and the Adam solver. Using the combined hierarchical system of neural networks, an accuracy of 77.87% was achieved on a sample of 2843 documents.
In [24], a new approach to text categorization for category recognition in online newspapers was described. The approach has strict rules that determine whether a particular database can be used for classification. The Support Vector Machine (SVM) and Hidden Markov Model methods were chosen for classification. This approach gives high accuracy, with the best accuracy obtained when using the Hidden Markov Model (HMM) [25][26].
In [27], an automatic text classification system based on a genetic algorithm classifier was developed. Before classification with the genetic algorithm, the text data was pre-processed, converted into a suitable representation, and subjected to feature selection. When testing the system, 20291 documents from 6 categories were used. Using 1000 selected words, a performance of 0.748 was achieved, which is higher than with kNN and decision tree classifiers.
In [28], a method for investigating temporal patterns using keywords in the comments of online newspapers was proposed. The system should return a conclusion based on the content of a comment. The text data was cleaned and categorized. As a result, the following conclusions were drawn: the drop in activity is associated with the end of an event and the time before it; lexeme frequency maps were obtained; and temporal analysis gives a clear picture of changes in sentiment.
In [29], a system for extracting keywords and sentiment from Twitter posts was developed. The Archivist API was used to collect information, and the sample consisted of 40 000 tweets. Using the developed system with the knowledge enhancer and synonym binder from [30] improved the results by 0.1% to 55% in comparison with a plain keyword search.
In [30], semi-automatic context analysis and text correction using specialized linguistic graphs were applied, and a system for text correction and context analysis was developed as part of the logic of a web application. To develop the model, graphs composed of special neurons of different types were used, and a comparison with similar text-correction programs was performed. The results of [30] are: an effective graph model for writing words and punctuation marks appearing in the examined text; methods for obtaining texts from several sources for graph construction; implemented methods of text analysis and contextual adjustment; and a web application (website) that allows the implemented algorithms to be used.

Classification model using fuzzy sets
There are many ways to classify texts. This article considers a method suitable for classifying scientific papers across different subject areas. The model is based on fuzzy sets [31,32].
A fuzzy set is a set in which each element is matched with a real number in the range [0, 1] indicating the degree to which the element belongs to the set [31].
The idea of such a classifier is to assign a real number in the range [0, 1] to each possible variant. This classification method is more flexible than the usual one, where each variant is assigned an integer from the set {0, 1}. When writing the article, two versions of such a classifier were used and compared: one using key phrases and one using keywords.
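The difference between the two assignments can be sketched as follows (a minimal illustration; the subject-area names and membership values are hypothetical):

```java
import java.util.Map;

public class MembershipExample {
    // Crisp classification: each variant gets an integer from {0, 1}.
    public static Map<String, Integer> crisp() {
        return Map.of("Computer Science", 1, "Mathematics", 0);
    }

    // Fuzzy classification: each variant gets a membership degree in [0, 1].
    public static Map<String, Double> fuzzy() {
        return Map.of("Computer Science", 0.8, "Mathematics", 0.35);
    }
}
```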
The classifier model can be divided into two parts: training and classification proper. The model uses a normalized fuzzy set during training and classification; values are normalized upon classification.

Training model description
Let there be a set containing N subject areas, among which the classification should be performed. As input data for training, the subject area and the key phrases that relate to it are given in the form of a pair (1):

(P, {Tp1, ..., Tpk}), (1)

where P is the subject area and Tp1..Tpk are the key phrases.
Depending on Tpa, subject areas P are added to the corresponding set Ma (2).
A set Ma has the structure (2):

Ma = (Ipa, {(P1, i1), ..., (PN, iN)}), (2)

where Ipa is a key phrase, P1..PN are subject areas, and i1..iN are integers indicating the frequency of use of the key phrase in the corresponding subject area. It should be noted that the set does not allow duplicates. The Ma sets, each of which describes its own key phrase, make up the set Mp, which is the result of training (3):

Mp = {M1, ..., Mm}, (3)

where each of M1..Mm contains information about the subject areas in which a key phrase occurs and with what frequency. As a result, the more material is available for training, the greater the cardinality of the set Mp, and the more accurate the results the algorithm can provide.
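A minimal Java sketch of this training step, assuming (as in the implementation section below) that Mp is held in a nested HashMap mapping each key phrase to its per-subject-area frequencies; the class and method names are illustrative:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FuzzyTrainer {
    // Mp: key phrase -> (subject area -> frequency), mirroring the sets Ma and Mp.
    private final Map<String, Map<String, Integer>> mp = new HashMap<>();

    // One training step for a pair (P, {Tp1..Tpk}): increment the frequency
    // counter of the subject area P for every key phrase of the work.
    public void train(String subjectArea, List<String> keyPhrases) {
        for (String phrase : keyPhrases) {
            mp.computeIfAbsent(phrase, k -> new HashMap<>())
              .merge(subjectArea, 1, Integer::sum);
        }
    }

    public Map<String, Map<String, Integer>> getMp() { return mp; }
}
```

Each call to train processes one pair of the form (P, {Tp1..Tpk}); merge increments the frequency counter ia for the given subject area, so duplicates are counted rather than stored twice.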

Classifier model description
The input data for the finished classifier is the set S containing the key phrases of the work to be classified (4):

S = {s1, ..., sN}, (4)

where s1..sN are strings containing key phrases.
The key phrases have the form of strings, each of which comprises words separated by whitespace.
The output of the classifier is the set Out (5):

Out = {(P1, f1), ..., (PN, fN)}, (5)

where f1..fN are real numbers in the range [0, 1] that show the degree of belonging to a particular subject area, and P1..PN are the subject areas.
Each fa is also normalized by multiplying it by the normalization factor j (6):

j = 1 / max(a1, ..., aN), (6)

where max(a1..aN) is the maximal value in the set {a1..aN}. The classification algorithm is as follows:
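The following Java sketch shows one plausible reading of the classification and normalization steps, consistent with the definitions of S, Out, and the normalization factor j above; accumulating frequencies from Mp per subject area is an assumption, not a quotation of the original listing:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FuzzyClassifier {
    // Accumulate frequencies from Mp for the input phrases S, then normalize
    // so that the strongest subject area receives the value 1.0.
    public static Map<String, Double> classify(
            Map<String, Map<String, Integer>> mp, List<String> phrases) {
        Map<String, Double> out = new HashMap<>();
        for (String phrase : phrases) {
            Map<String, Integer> ma = mp.get(phrase);
            if (ma == null) continue;               // unknown phrase: no contribution
            for (Map.Entry<String, Integer> e : ma.entrySet()) {
                out.merge(e.getKey(), (double) e.getValue(), Double::sum);
            }
        }
        // Normalization factor j = 1 / max(a1..aN), so all fa fall in [0, 1].
        double max = out.values().stream()
                        .mapToDouble(Double::doubleValue).max().orElse(1.0);
        out.replaceAll((area, fa) -> fa / max);
        return out;
    }
}
```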

Classifier modification
To improve recognition by the classifier, a modification is proposed. The modification consists in using not key phrases but their components, keywords, for training and classification. Keywords can be obtained by splitting key phrases into words.
In the modified model of the classifier, keywords are given as Tp1..Tpk in (1) during training, and as s1..sN in (4) during classification.
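The splitting step can be sketched as follows (normalizing case is an added assumption, not stated in the text):

```java
import java.util.Arrays;
import java.util.List;

public class KeyPhraseSplitter {
    // The modified classifier trains and classifies on individual keywords,
    // obtained by splitting each key phrase on whitespace.
    public static List<String> toKeywords(String keyPhrase) {
        return Arrays.asList(keyPhrase.trim().toLowerCase().split("\\s+"));
    }
}
```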

Model and algorithm realization
The model using the above algorithms was implemented in Java. HashMap and ArrayList were used to represent the sets. The input data was supplied as a CSV (comma-separated values) file. Every row in the file contains the values: authors, title, link to the work, author keywords, index keywords, and subject area. The output data is printed to the console.
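A hypothetical sketch of reading one such row; it assumes simple rows without quoted fields containing commas, which a production CSV parser would need to handle:

```java
public class CsvRow {
    // One row of the export: authors, title, link to the work, author
    // keywords, index keywords, subject area (in that column order).
    public final String authors, title, link, authorKeywords, indexKeywords, subjectArea;

    public CsvRow(String line) {
        String[] f = line.split(",", -1);   // -1 keeps trailing empty fields
        authors = f[0]; title = f[1]; link = f[2];
        authorKeywords = f[3]; indexKeywords = f[4]; subjectArea = f[5];
    }
}
```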

Input data preparation
The Scopus database was chosen as the source of input data [33]. The database is free and allows the required attributes of works to be exported to a file in the required format. All papers from Lublin University of Technology were taken as the exported data. A total of 6543 works were selected, of which 655 were used for testing. The resulting data sample was divided into two parts, 90% and 10%: the first part was used for training, the second for testing. The classification was made between 27 subject areas.

Method of evaluating classification results
Since the classification is done with fuzzy sets, it is necessary to decide which values at the output of the classifier are considered sufficient to assign a subject area. The model keeps in the output set all values greater than 0.5. The threshold of 0.5 was chosen because, if fa is smaller than 50%, such a variant is less probable than the others.
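The thresholding step can be sketched as follows:

```java
import java.util.Map;
import java.util.stream.Collectors;

public class ThresholdFilter {
    // Keep only the subject areas whose membership value fa exceeds 0.5.
    public static Map<String, Double> filter(Map<String, Double> out) {
        return out.entrySet().stream()
                  .filter(e -> e.getValue() > 0.5)
                  .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```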
Then Out is compared with the set R (7):

R = {Pa1, ..., Pam}, (7)

where Pa1..Pam are the real subject areas for the key-phrase list S.
The goal of the classifier is to get an Out that is as close as possible to R. It is also necessary to evaluate the classification accuracy. The model uses the following algorithm:

Algorithm 2: Accuracy evaluation
Input: Out, R. The algorithm uses the following measures (8), (9):

|R ∩ Out| / |Out|, (8)

|R ∩ Out| / |R|, (9)

where |R ∩ Out| is the number of correctly determined elements in Out, |Out| is the total number of elements in Out, and |R| is the number of elements in R.
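The two ratios |R ∩ Out| / |Out| and |R ∩ Out| / |R|, which correspond to precision and recall over sets of subject areas, can be sketched in Java as:

```java
import java.util.HashSet;
import java.util.Set;

public class AccuracyEvaluation {
    // |R ∩ Out| / |Out|: share of the returned subject areas that are correct.
    public static double precision(Set<String> out, Set<String> r) {
        if (out.isEmpty()) return 0.0;
        Set<String> common = new HashSet<>(out);
        common.retainAll(r);                  // common = R ∩ Out
        return (double) common.size() / out.size();
    }

    // |R ∩ Out| / |R|: share of the real subject areas that were found.
    public static double recall(Set<String> out, Set<String> r) {
        if (r.isEmpty()) return 0.0;
        Set<String> common = new HashSet<>(out);
        common.retainAll(r);                  // common = R ∩ Out
        return (double) common.size() / r.size();
    }
}
```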

Text classification using popular classifiers
For comparison with the developed algorithm, classifiers of the following types were tested: Decision tree, Support-vector machine, and Backpropagation neural network. The following configurations were compared:
• using the first keyword,
• using the first three keywords,
• using the last three keywords,
• using three random keywords,
• using all keywords.
Since the model described in the article performs multiple classification, while the models used for comparison perform only binary classification, the developed model was also restricted to binary classification to make the comparison possible.
Model building was performed in KNIME, an open-source software platform for data analysis. KNIME allows you to work with data, perform data processing, and visualize the results. A visual editor is used to work in KNIME, and it is also possible to use code inserts in programming languages such as Java, Python, and JavaScript.

Analysis of the obtained results
The analysis is presented separately for the binary and non-binary model comparisons. The binary comparison covers the results of the two algorithms described in the article together with the Decision tree, Support-vector machine, and Backpropagation neural network models. The non-binary comparison covers the results of the two algorithms described in the article.

Analysis of the binary models
In Table 1, the simulation results for the binary classifiers are presented. When using a small number of keywords, the best results are obtained with the Decision tree. The results of the Support-vector machine and the neural network are almost unchanged when the number of keywords changes. When using the maximum number of keywords, the best results are obtained with the modified algorithm described in the article. A visual representation of the accuracy of these models is shown in Figure 1.

Analysis of the non-binary models
Two different models were implemented: for the standard version of the algorithm, and for the modified one.
For each model, 6554 work data sets were used for training and 655 for testing. When the operation of the system is considered successful if at least one true subject area is found, the standard algorithm succeeds in 79% of cases; the arithmetic mean for the standard model is 47% (Table 2). For the modified algorithm, more averaged values were obtained, and the total number of positive classifications was 78%; the arithmetic mean for the modified model is 40% (Table 3). The difference in the arithmetic means is due to the more blurred result of the modified algorithm. Considering that full compliance is not required for every data set, this is a satisfactory result. Figures 2 and 4 show the classification accuracy for the standard (Figure 2) and modified (Figure 4) algorithms.
Probability values are grouped for clarity. In contrast to the standard model, which clearly shows a standard probability distribution, in the modified model the distribution is blurred toward low probability values. Similarly, for the number of suitable subject areas, the modified model shows a softer distribution with fewer complete correspondences (Figures 4, 5). For the modified model, the average probability shifts below 50%. The majority of correctly classified samples (as in the standard model) have only one correctly classified subject area.

Conclusions
The described classification algorithm is suitable for determining the subject area from a work's keywords and key phrases. A model using both versions of the algorithm has been implemented and tested on real data. The obtained accuracy estimate of 79% makes it possible to determine with good accuracy not only the main subject area but also additional areas to which the work is only partially related. The standard version of the algorithm is more successful with large amounts of data, while the modified version is more flexible and can work with new and incomplete data. The developed models can also be used as binary classifiers with good accuracy. The developed algorithm can be applied in databases of scientific papers and to other types of text data.