A comparison of word embedding-based feature extraction techniques and deep learning models for natural disaster message classification

This research compares the classification performance of deep learning models on natural disaster messages from Twitter. The experiment covers three word embedding-based feature extraction techniques and five different deep learning models. The word embedding techniques used in this experiment are Word2Vec, fastText, and GloVe.


Introduction
Social media currently plays an essential role in every activity of the disaster management cycle. In the pre-disaster stage, social media can be used for early warning before a disaster occurs [1]. While a natural disaster is occurring, eyewitnesses at the scene share information about the situation, which can be used by volunteers or the government to deal with the impact of the disaster. In the post-disaster stage, social media users share messages containing information on relief that has been carried out or on locations that have not yet received assistance [2].
Natural disaster messages on social media are categorized into three classes: eyewitness, non-eyewitness, and don't-know [3]. Messages in the eyewitness category are natural disaster messages posted by eyewitnesses at the location when the disaster occurred. Messages in the non-eyewitness category are messages about natural disasters uploaded by users who are not eyewitnesses. In contrast, a message in the don't-know category contains words related to natural disasters, but its meaning is not about natural disasters.
The use of social media messages related to natural disasters in disaster management can be maximized with the help of artificial intelligence, which can find natural disaster messages faster [4]. Such a system classifies social media messages into the three categories mentioned above.
The word embedding-based feature extraction technique forms 1-dimensional (1D) data by concatenating word vectors [5], [6]. Sentence vectors can also be formed by arranging word vectors into a matrix (2D) [7]. The 2D data created by each word embedding technique, such as Word2Vec, GloVe, and fastText, can be combined into three layers to form 3-dimensional (3D) data [8]. The output of the feature extraction process is structured data.
The deep learning method that can process multidimensional structured data is the Convolutional Neural Network (CNN) [9]-[11]. A 1D CNN has been applied to text classification with a Word2Vec-based feature extraction technique [5]. The application of a 2D CNN to classify forest fire messages produced a good accuracy of 81.97% [7]; that study used three word embedding techniques to create 2D data, namely Word2Vec, fastText, and GloVe. Text classification with a 3D CNN has been done by combining word embedding-based 2D data into three layers, one per technique: Word2Vec [12], GloVe [13], and fastText [8].
Another deep learning technique commonly used to classify text is the Long Short-Term Memory network (LSTM). For sentiment analysis on the IMDB dataset [14], the LSTM model performed better than the 1D CNN model. In that study, the 1D CNN used concatenated 1D data from Word2Vec vectors, while the LSTM used tokenized input converted by a Word2Vec-based embedding layer. For classifying natural disaster messages with deep learning models, the LSTM model also performed better than the CNN [15]. There, the CNN used 2D data from GloVe vectors with a 2D CNN model, while the LSTM used tokenized input converted by a GloVe-based embedding layer.
Bidirectional Encoder Representations from Transformers (BERT) [16] is a recently popular deep learning method. BERT has achieved state-of-the-art results in a broad range of NLP tasks because of its ability to understand words more thoroughly [17]. BERT can provide a richer linguistic structure because linguistic knowledge is stored in its hidden states and attention maps [18].
The explanation above describes the methods used for feature extraction and classification in sentiment analysis and natural disaster message classification. However, a comprehensive comparative study of these methods is needed to determine which technique gives the best classification performance for natural disaster messages. This research is done to answer the following questions:
1. What is the classification performance of the 1D CNN, 2D CNN, and 3D CNN models using the three word embedding techniques in the feature extraction process?
2. What is the classification performance of the LSTM model using the three word embedding techniques in the feature extraction process?
3. What is the classification performance of the BERT model?
Existing research generally uses only one word embedding technique for feature extraction. This research also combines the feature extraction results of Word2Vec, GloVe, and fastText for processing with the 1D CNN, 2D CNN, 3D CNN, and LSTM models. The aim is to determine whether combining those techniques can improve natural disaster message classification performance.
This report is divided into five sections: 1) Introduction; 2) Dataset and method, explaining the dataset and classification algorithms used; 3) Research implementation, explaining the steps in the implementation of this study; 4) Results; and 5) Conclusions.

Dataset
The natural disaster message dataset used in this research comes from the research in [3]. Details of this dataset can be seen in Table 1. Each dataset has three categories or class labels: eyewitness, dont-know, and non-eyewitness. Examples of natural disaster messages from each class label can be seen in Table 2.

No  Message                                                        Class label
1   We're a family pulled from a flood                             eyewitness
2   I'm ready for these earthquake memes                           dont-know
3   Houston streets flood again, dampening July 4th celebrations
    https://t.co/3I1aOZkNBx https://t.co/4i5kHBcRjo                non-eyewitness

The first message is about natural disasters such as floods, uploaded by an eyewitness when the disaster occurred. The second message contains a word related to natural disasters (earthquake), but its meaning is not about natural disasters. The last message is about a natural disaster from the news, re-shared by Twitter users.

Research implementation
The steps in the implementation of the research can be seen in Figure 1.

Text Normalization & Word Padding
The four natural disaster message datasets were normalized with steps commonly performed in text classification: removing double spaces, punctuation marks, numbers, and non-alphanumeric characters [19]. The four normalized text datasets are used as input for the BERT method to create a classification model.
The number of words in each cleaned message is then counted. After that, every message in each dataset is brought to the same length with word padding based on the mean word count.
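As an illustration, the normalization and mean-based word padding steps can be sketched as follows. This is a minimal sketch; the regular expressions and the `<pad>` placeholder are illustrative assumptions, not the exact preprocessing code used in this study.

```python
import re

def normalize(text):
    # Lowercase, strip URLs, then keep letters only and collapse whitespace
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def pad_tokens(tokens, length, pad="<pad>"):
    # Truncate or right-pad every message to the same word count
    return (tokens + [pad] * length)[:length]

messages = ["Houston streets flood again!! 2021 https://t.co/xyz",
            "We're a family pulled from a flood"]
cleaned = [normalize(m).split() for m in messages]
# Mean message length drives the padding, as described above
mean_len = round(sum(len(t) for t in cleaned) / len(cleaned))
padded = [pad_tokens(t, mean_len) for t in cleaned]
```

After this step, every message in a dataset has the same number of tokens, so the word vectors of all messages can be stacked into arrays of a fixed shape.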

Feature Extraction
The next stage is feature extraction. There are three groups of techniques based on the dimensions of the output data: one-dimensional (1D), two-dimensional (2D), and three-dimensional (3D). Each of these techniques uses three word embedding techniques popular in classification studies, namely Word2Vec [12], fastText [20], and GloVe [13], [21]. Data structures with different dimensions are created using the three word embedding models. The formation of 1D data is explained as follows.
If v_i is the vector of the i-th word and n is the number of words in a sentence, then the 1D data is the sentence vector s formed by concatenating all word vectors:

s = [v_1, v_2, ..., v_n]    (1)

The value of n corresponds to the number of words determined with a statistically based word padding technique, namely the mean. Formula 2 shows how to create combined 1D data by concatenating the sentence vectors of the three word embedding techniques:

s_combined = [s_Word2Vec, s_fastText, s_GloVe]    (2)
The feature extraction results in this way produce structured data with 1D dimensions, as seen in Table 3. From these results, 16 structured datasets were obtained, which were used as input to create classification models using the 1D CNN method.

The formation of 2D data with each word embedding technique is explained as follows. If v_1, v_2, ..., v_n are the word vectors produced by one word embedding technique, the 2D data is formed by arranging them into a two-dimensional matrix of size n × m, as shown in Figure 2, where m is 100 (the embedding dimension). This research also proposes the formation of 2D data from a combination of the three word embedding techniques. The combined 2D data, shown in Figure 3, is a two-dimensional matrix with m columns and N rows, where m is 100 and N is 3 × n.

The feature extraction results in this way produce structured data with 2D dimensions, as seen in Table 4. From these results, 16 structured datasets are obtained, which are used as input to create classification models using the 2D CNN and LSTM methods.

3D data is created by combining the 2D data from the three word embedding techniques, and there are two ways of generating it. The first way can be seen in Figure 4: 3D data type 1 is generated in three layers, where the first layer is the 2D data from the Word2Vec technique, followed by the 2D data from the fastText and GloVe techniques. Represented as a matrix, 3D data type 1 has dimensions n × m × c, where n is the number of words in a sentence, m is 100, and c is 3. This method of forming 3D data type 1 follows the structure of 3D image data consisting of three RGB color channels.

The second way, 3D data type 2, follows the formation of 3D video. A video is a set of frames or images. In this study, a frame is formed by the three vectors of one word, taken from the three word embedding techniques; the next frame represents the second word, and so on, so this data has n frames. Defined as a matrix, 3D data type 2 has dimensions m × c × n, where m is 100, c is the number of word embedding techniques, namely 3, and n is the number of words in the sentence.
The feature extraction results in this way produce structured data with 3D dimensions, which can be seen in Table 5.
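The data shapes described above can be sketched with NumPy as follows. Random arrays stand in for the trained Word2Vec, fastText, and GloVe vectors, and the axis order of the 3D tensors is an assumption for illustration, since the text fixes only the dimension sizes.

```python
import numpy as np

n, m = 4, 100          # words per padded sentence, embedding dimension
rng = np.random.default_rng(0)
# Stand-ins for vectors from the three trained embedding models
w2v = rng.normal(size=(n, m))
ft  = rng.normal(size=(n, m))
glv = rng.normal(size=(n, m))

# 1D data: concatenate one technique's word vectors into a sentence vector
x_1d = w2v.reshape(-1)                                     # (n*m,)
x_1d_combined = np.concatenate([w2v, ft, glv], axis=None)  # (3*n*m,)

# 2D data: word vectors stacked row-wise into a matrix
x_2d = w2v                                                 # (n, m)
x_2d_combined = np.concatenate([w2v, ft, glv], axis=0)     # (3n, m)

# 3D type 1: one layer per embedding technique, like RGB channels
x_3d_t1 = np.stack([w2v, ft, glv], axis=-1)                # (n, m, 3)

# 3D type 2: one frame per word, each frame holding that word's
# three embedding vectors, like frames of a video
x_3d_t2 = np.stack([w2v, ft, glv], axis=1)                 # (n, 3, m)
```

Arranged this way, the same underlying word vectors feed the 1D CNN, 2D CNN/LSTM, and 3D CNN models, differing only in how the axes are laid out.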

Classification
Three deep learning methods are used in this research: CNN, LSTM, and BERT. The Convolutional Neural Network (CNN) is an artificial neural network initially used in image recognition and processing [22]. The input and output of each stage are arrays called feature maps; the output of a stage is a feature map of the processing results from all input locations. Each stage consists of three layers: the convolutional layer, the activation layer, and the pooling layer [22]. The convolutional layer is the first layer that receives input data directly. It performs the convolution operation on the previous layer's output to extract features from the input data. The pooling layer reduces the size of the feature map using a pooling operation; the two types most often used are average pooling and max pooling. The activation function, also known as the transfer function, determines the output of each neuron. In the CNN architecture, the activation function is applied in the final computation of the feature map, after the convolution or pooling calculation, to produce a feature pattern. Activation functions often used in research include the sigmoid, tanh, Rectified Linear Unit (ReLU), Leaky ReLU (LReLU), and Parametric ReLU [23].
The result of these three layers is a feature map in the form of a multi-dimensional array. The feature map is then flattened into a vector, which is processed by the fully connected layer, where all the activated neurons of the previous layer are connected to the neurons in the next layer [24]. Each activation of the prior layer must be converted into one-dimensional data before it can be linked to all neurons. The fully connected layer is usually used in the Multi-Layer Perceptron method to process data so that it can be classified.
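The stages above can be illustrated with a minimal NumPy sketch of one CNN stage (convolution, ReLU activation, max pooling) on a toy 1D input. This shows only the operations themselves, not the network architecture used in this study.

```python
import numpy as np

def conv1d(x, kernel):
    # Valid convolution (really cross-correlation, as in most DL libraries)
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def relu(x):
    # Activation layer: zero out negative responses
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    # Non-overlapping max pooling; any trailing remainder is dropped
    trimmed = x[: len(x) // size * size]
    return trimmed.reshape(-1, size).max(axis=1)

x = np.array([1.0, -2.0, 3.0, 0.5, -1.0, 2.0])
feat = max_pool(relu(conv1d(x, np.array([1.0, -1.0]))))
# conv -> [3.0, -5.0, 2.5, 1.5, -3.0]; relu -> [3.0, 0.0, 2.5, 1.5, 0.0];
# pool (size 2) -> [3.0, 2.5]. The flatten step would then turn the pooled
# feature maps into one vector for the fully connected layer.
```

The 2D and 3D CNN models work the same way, with kernels and pooling windows that slide over two or three axes instead of one.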
Long Short-Term Memory is a variant of the Recurrent Neural Network (RNN). LSTM can be trained while overcoming the vanishing gradient problem that makes training difficult for RNNs [25]. LSTM can learn long patterns from sequential data because it prevents vanishing gradients. LSTM still follows the same principle as the RNN; what differentiates them is the cell content. An RNN cell is simple: it contains only one layer of neurons with the tanh activation function. An LSTM cell is more complex because it contains more than one layer of neurons; these layers are called gates [26].
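A single LSTM cell step can be sketched in NumPy as follows, using random toy parameters. The four gate layers (input i, forget f, cell candidate g, output o) are exactly what distinguishes the LSTM cell from a plain tanh RNN cell.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # One LSTM time step; W, U, b hold the stacked parameters of the
    # four gate layers: input (i), forget (f), candidate (g), output (o)
    z = W @ x + U @ h_prev + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g        # cell state carries long-range information
    h = o * np.tanh(c)            # hidden state / output of the cell
    return h, c

d, hdim = 3, 2                    # toy input and hidden sizes
rng = np.random.default_rng(1)
W = rng.normal(size=(4 * hdim, d))
U = rng.normal(size=(4 * hdim, hdim))
b = np.zeros(4 * hdim)
h, c = lstm_step(rng.normal(size=d), np.zeros(hdim), np.zeros(hdim), W, U, b)
```

The additive update of the cell state c is what lets gradients flow over many time steps without vanishing, in contrast to the repeated tanh squashing in a plain RNN.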

Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained contextual word representation model based on the Masked Language Model (MLM), using bidirectional Transformers [18]. The BERT model architecture is a multi-layer bidirectional Transformer encoder; the Transformer architecture it builds on uses stacked self-attention and point-wise fully connected layers. There are two steps in the BERT framework, namely pre-training and fine-tuning [18]. BERT pre-training does not use the traditional left-to-right or right-to-left method but uses Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). MLM is a fill-in-the-blank task: the model uses the context words around the mask token to predict what the masked word should be, while NSP predicts whether the second of two given sentences follows the first. After pre-training, BERT is fine-tuned: the model is initialized with the pre-trained parameters, and all parameters are tuned using labeled data from the downstream task.
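The MLM idea can be sketched as follows. This is a simplified illustration: real BERT pre-training masks about 15% of tokens and sometimes keeps or randomizes the selected token instead of always inserting [MASK].

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=42):
    # Replace a random subset of tokens with [MASK]; during pre-training
    # the model learns to predict the original token at each masked position
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the label the model must recover
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

tokens = "houston streets flood again dampening july celebrations".split()
masked, targets = mask_tokens(tokens)
```

Because the model must reconstruct a token from context on both sides, the representations it learns are bidirectional, unlike a left-to-right language model.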
Model building was implemented in the Python programming language with the Keras and TensorFlow libraries. In this study, three CNN models were built: 1D CNN, 2D CNN, and 3D CNN. Table 6 shows the parameters used in each of these models, while the LSTM and BERT models were built with the parameters shown in Table 7. Table 1 shows the difference in the number of samples in each class, which means this is an imbalanced data classification case. Therefore, the classification performance metrics used in this research are the F1 score and ROC AUC.
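For reference, the macro-averaged F1 score, which weights all classes equally and is therefore informative under class imbalance, can be computed as follows. This is a small self-contained sketch with toy labels, not the evaluation code of this study.

```python
def f1_per_class(y_true, y_pred, label):
    # Precision/recall/F1 for one class, treating it as the positive class
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(y_true, y_pred, labels):
    # Unweighted mean over classes, so minority classes count equally
    return sum(f1_per_class(y_true, y_pred, c) for c in labels) / len(labels)

labels = ["eyewitness", "dont-know", "non-eyewitness"]
y_true = ["eyewitness", "dont-know", "non-eyewitness", "non-eyewitness"]
y_pred = ["eyewitness", "non-eyewitness", "non-eyewitness", "non-eyewitness"]
score = macro_f1(y_true, y_pred, labels)
```

With these toy labels the eyewitness class scores F1 = 1.0, dont-know scores 0.0, and non-eyewitness scores 0.8, so a plain accuracy of 75% would hide the complete failure on the minority class that the macro F1 exposes.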

Results
This research consists of 15 experiments for each dataset, so the total number of experiments carried out is 60. Figure 5 shows a comparison of the classification performance on the earthquake dataset. The 1D CNN model produced the highest classification performance with the fastText word embedding technique, and the lowest was the BERT model.
Figure 8 shows a comparison of the classification performance on the hurricane dataset. The 3D CNN type 2 model produced the highest classification performance, and the lowest was the LSTM model with the GloVe word embedding technique. The performance gain from feature extraction combining the three word embedding techniques works well when used on 1D data with the 1D CNN classification method. This combination also improves the performance of the LSTM classification model.
Figure 9 shows the average F1 score by classification method. It shows that the 3D CNN type 2 method, using the feature extraction method proposed in this study, provides the highest average performance compared to the other classification methods. In this research, BERT, a state-of-the-art classification method, could not perform well in the case of natural disaster message classification.
Figure 10 shows the average F1 score by word embedding method. The highest classification performance is produced by the fastText-based feature extraction method. However, the proposed feature extraction method combining the three word embedding methods works better than the GloVe and Word2Vec methods.
Further analysis of the performance of the 3D CNN type 2 model looks at the predictive performance for each class in the dataset. Figure 11 shows the average AUC of each class using this model. The predictive performance of the eyewitness class is below that of the other two classes. The main reason for this low average performance is that the number of messages in the eyewitness category is smaller than in the other categories, except for the earthquake dataset. Another cause is that eyewitness messages are generally concise, often only 2-3 words, so the structured data formed from them contains many zero padding values. Both factors can affect the model training process and decrease classification performance [27].
The prediction performance per class was also analyzed by comparing the model built with fastText-based feature extraction against the model built by combining the three word embedding techniques. The results can be seen in Figure 12, which shows a decrease in the prediction performance for the eyewitness class when the three word embedding techniques are combined. Figure 12 also shows that the predictive performance of the eyewitness class is below that of the other two classes, for the same reasons explained in the previous paragraph.

Conclusions
From the results of this study, it can be concluded that the formation of 3D data type 2 with the 3D CNN model, the method proposed here, provides better average performance across the four natural disaster message classification cases. In addition, the proposed feature extraction combining the three word embedding techniques can improve classification performance with the 1D CNN and LSTM classification models.
However, the average performance for predicting messages from eyewitnesses is still lower than for the other message categories. This is because the number of eyewitness messages is smaller than in the other categories, resulting in an imbalanced classification problem that decreases performance on the minority class.
Future research will focus on solving unbalanced data classification cases by balancing the data before creating a classification model. This step is expected to improve the prediction performance for eyewitness category messages.
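One simple balancing strategy of this kind is random oversampling, sketched below with toy data: minority-class messages are duplicated at random until every class matches the majority class size. This is an illustrative sketch of one possible approach, not the method the future work commits to.

```python
import random
from collections import Counter

def oversample(samples, labels, seed=7):
    # Duplicate random minority-class samples until every class matches
    # the majority class size (simple random oversampling)
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

x = ["m1", "m2", "m3", "m4", "m5"]
y = ["non-eyewitness"] * 3 + ["eyewitness"] * 2
bx, by = oversample(x, y)
```

Oversampling must be applied only to the training split; balancing before the train/test split would leak duplicated test messages into training.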