APPLYING A Q-GRAM BASED MULTIPLE STRING MATCHING ALGORITHM FOR APPROXIMATE MATCHING

We consider the application of a multiple pattern matching algorithm (Multi AOSO on q-Grams, MAG) to approximate pattern matching. We propose an on-line approach that translates approximate pattern matching into multiple pattern matching (so-called partitioning into exact search). The presented solution allows relatively fast searching for multiple patterns in a text with up to k differences (or mismatches). This paper compares the solution based on the MAG algorithm with the one presented in [4]. Experiments on DNA, English, Proteins and XML texts with up to k errors show that the proposed algorithm achieves relatively good results in practical use.


Introduction
Approximate string matching is a well-known and widely studied problem in stringology, with applications in spell checking, spam filtering, matching of nucleotide sequences, etc. Given a pattern P_{1..m} and a text T_{1..n}, both over an alphabet Σ of size σ, where m ≤ n, the task is to find all positions in the text where the pattern matches the text with up to k errors (k is the maximal number of differences/mismatches). We consider three operations that cause a difference: insertion, deletion, and substitution. The error level is also expressed as the difference ratio, defined as α = k / m.
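The problem statement above can be illustrated with the classical dynamic-programming scan, which keeps one column of edit distances and lets a match start anywhere in the text. This is a minimal sketch for illustration only, not the method proposed in this paper; the function name approx_end_positions is hypothetical.

```python
def approx_end_positions(pattern: str, text: str, k: int) -> list[int]:
    """Return 1-based end positions j such that some substring of `text`
    ending at j is within edit distance k of `pattern`."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of the text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]   # value of col[i-1] in the previous column
        col[0] = 0           # an occurrence may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(
                prev_diag + (pattern[i - 1] != c),  # match or substitution
                old + 1,                            # extra text character
                col[i - 1] + 1,                     # missing pattern character
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

For instance, searching for "abc" in "xxabcxx" with k = 1 reports the end positions of "ab", "abc" and "abcx". The scan costs O(mn) time, which is why filtering approaches such as the one studied here try to avoid running it over the whole text.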
Multiple pattern matching is a classic problem with applications in bioinformatics, bibliographic data analysis, information retrieval, virus detection, data filtering, and other areas. The problem is to find the positions of patterns P = {P_1, …, P_r} in a text T of length n, where the text and the patterns are over a common alphabet Σ of size σ. The patterns may be of the same length or of different lengths.
As stated before, approximate string matching is a fundamental problem in text processing and has been heavily explored [1,2,4,6,7,8,10]. Most classical approaches are based on filtration followed by verification, e.g., under Levenshtein or Hamming distance. Pattern matching problems can be divided into two categories: on-line and off-line (index, semi-index). We consider the on-line approach in this paper.
Our method is closely related to previous work [4], whose authors explore in depth the use of a multiple pattern matching algorithm for approximate pattern matching. Their algorithm is optimal on average for low and intermediate difference ratios (up to 1/2). The authors present several variants of this algorithm; we refer to all of them by the single name AOSMASM.
The purpose of this research is to adapt the MAG algorithm to approximate string matching with k differences/mismatches, to examine its efficiency, and to compare it with another existing solution.

Multi AOSO on q-Grams (MAG)
Multi AOSO on q-grams is an on-line algorithm designed for multiple pattern matching. It scans the text only once to find a set of patterns and returns their positions in the text. MAG is a complex algorithm based on Shift-Or (AOSO) that uses many techniques widely adopted in text processing, such as q-grams, pattern superimposition, bit-parallelism and alphabet size reduction. We chose this algorithm because it achieves quite good results in practical use and fits our requirements.
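To illustrate the bit-parallel idea that MAG builds on, the following is a minimal sketch of the basic Shift-Or algorithm for a single pattern. It is not AOSO or MAG themselves (no q-grams, superimposition or alphabet reduction), and the function name shift_or_search is hypothetical.

```python
def shift_or_search(pattern: str, text: str) -> list[int]:
    """Return 0-based start positions of exact occurrences of `pattern`."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    all_ones = (1 << m) - 1
    # mask[c] has bit i cleared exactly when pattern[i] == c.
    mask: dict[str, int] = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, all_ones) & ~(1 << i)
    # Bit i of `state` is 0 when pattern[:i+1] matches the text
    # ending at the current position.
    state = all_ones
    hits = []
    for j, c in enumerate(text):
        state = ((state << 1) | mask.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            hits.append(j - m + 1)
    return hits
```

Each text character is processed with one shift and one OR, independently of m, as long as the pattern fits into a machine word (Python integers hide this limit).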
MAG uses q-grams, i.e., contiguous (or non-contiguous [3]) substrings (factors) of q characters of a string. q-grams have been widely used in approximate (single and multiple) string matching [10] as a filtering method, but also to speed up exact matching of a single pattern by treating the q-grams as a super-alphabet [5].
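As a small illustration of q-grams used as a super-alphabet, the sketch below extracts contiguous q-grams and maps a DNA q-gram to a single integer symbol. The helper names and the fixed A, C, G, T ordering are assumptions made for this example and are not taken from the MAG implementation.

```python
def qgrams(s: str, q: int) -> list[str]:
    """All contiguous q-grams of s, in text order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def encode_dna_qgram(gram: str) -> int:
    """Map a DNA q-gram to one symbol of a super-alphabet of size 4**q."""
    code = 0
    for c in gram:
        code = code * 4 + "ACGT".index(c)  # assumed symbol order A, C, G, T
    return code
```

With q = 2 the string "GGACAC" yields the q-grams GG, GA, AC, CA, AC, each of which becomes one of 16 super-symbols.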

Counting filter
The counting filter [2,6,7] is a filtration method used to discard most of the search space that does not meet a certain criterion. It is one of the most popular algorithms for finding approximate occurrences with k differences (i.e., matching with up to k Levenshtein errors) or k mismatches (i.e., matching with up to k Hamming errors). The filter is based on a simple idea: the algorithm counts the occurrences of each alphabet symbol in the pattern and then slides a window (of size m) through the text, checking by how much the symbol counts in the window differ from those of the pattern. If this difference is less than or equal to k, verification is triggered and a possible match is reported.
Lemma ([3]): If there are i ≤ j such that ed(T_{i..j}, P) ≤ k, then T_{j−m+1..j} includes at least m − k characters of P.
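The lemma translates into a simple sliding-window test. The following is a minimal sketch of such a counting filter; the function name counting_filter and the 0-based end positions are choices made for this example.

```python
from collections import Counter


def counting_filter(pattern: str, text: str, k: int) -> list[int]:
    """Return 0-based end positions j of length-m text windows that
    contain at least m - k characters of the pattern (as a multiset)."""
    m = len(pattern)
    # need[c] > 0 means the window still lacks that many copies of c.
    need = Counter(pattern)
    matched = 0            # window characters that are covered by the pattern
    candidates = []
    for j, c in enumerate(text):
        if need[c] > 0:    # incoming character is still wanted
            matched += 1
        need[c] -= 1
        if j >= m:         # drop the character leaving the window
            out = text[j - m]
            need[out] += 1
            if need[out] > 0:
                matched -= 1
        if j >= m - 1 and matched >= m - k:
            candidates.append(j)
    return candidates
```

Note that the filter ignores character order: for the pattern "abc" and k = 0 it reports the window "cab" in "xxcabxx", so every candidate still has to be verified.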

MAG for approximate pattern matching (MAGA)
We present a solution that allows relatively fast searching for a pattern in a text with up to k differences (or mismatches). Our solution is based on the following Lemma: If the pattern P_{1..m} is divided into k+1 sub-patterns, then every occurrence of the pattern with up to k differences (or mismatches) can be found by searching for all of the sub-patterns in the text T_{1..n} and verifying each found position for a match with up to k differences (or mismatches).
In other words, we search for the k+1 pieces of the pattern in the text using multiple pattern matching, and when one of them is found we verify whether the pattern occurs with up to k differences at the found position. For example, if P = "abcdemogpcba" and k = 1, then we divide the pattern P into two pieces (sub-patterns) P_1 = "abcdem" and P_2 = "ogpcba". If one character of the pattern is modified, deleted or added, one of these two sub-patterns changes while the other stays exactly the same. For instance, changing the second character of P gives P' = "axcdemogpcba", and hence P'_1 = "axcdem" and P'_2 = "ogpcba". Comparing the sub-patterns of P and P', we find that P_2 and P'_2 are still the same. The sub-patterns may be found with a single pattern matching algorithm by running it k+1 times, but a faster solution is to use a multiple pattern matching algorithm.
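The partitioning step from the example can be sketched as follows. The function name split_pattern is hypothetical, and spreading a possible remainder over the first pieces is an assumption of this sketch; the examples in the text only use pattern lengths divisible by k+1.

```python
def split_pattern(pattern: str, k: int) -> list[tuple[int, str]]:
    """Split `pattern` into k+1 contiguous pieces.

    Returns (offset, piece) pairs, where offset is the 0-based position
    of the piece inside the original pattern."""
    m = len(pattern)
    pieces = k + 1
    if m < pieces:
        raise ValueError("pattern too short for k+1 non-empty pieces")
    base, extra = divmod(m, pieces)
    result = []
    offset = 0
    for i in range(pieces):
        # Assumption: the first `extra` pieces get one more character.
        length = base + (1 if i < extra else 0)
        result.append((offset, pattern[offset:offset + length]))
        offset += length
    return result
```

For P = "abcdemogpcba" and k = 1 this returns the pieces "abcdem" (offset 0) and "ogpcba" (offset 6); the second piece still occurs unchanged in "axcdemogpcba".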
We use the Multi AOSO on q-Grams (MAG) algorithm [9], described in section 1.1, to search for all k+1 pieces of the pattern; if any piece is found in the text, verification is triggered. The verification uses the Counting Filter combined with dynamic programming (for Levenshtein distance). In order to find the exact position of a match, the algorithm needs to determine where the filter should start (1) and finish (2), where pos is the position of the sub-pattern in the text, offset is the position of the sub-pattern in the original pattern, and k is the number of allowed differences. When the filter finds a position where the pattern with k differences may occur, the verification method is executed. For example, for k = 2 the pattern P = "GGACACCAGAGGCGGGGA" is divided into three sub-patterns P_1 = "GGACAC", P_2 = "CAGAGG", P_3 = "CGGGGA", which are merged into a single pattern
P" = [CG][AG][AG][ACG][AG][ACG]
where each symbol is a super-symbol of the super-alphabet. The original sub-patterns are stored together with their offsets and used for later verification. If P" is found, the algorithm looks up the sub-pattern that matches at that position and then verifies whether the pattern P matches with up to k differences in the text window that starts at position (1) and ends at position (2) (Fig. 1). We use a Levenshtein distance implementation to validate the position. If the validation is successful (the number of differences is less than or equal to k), the match is reported. The method is easily adapted to k-mismatches by running Hamming distance verification at position pos − offset in place of the Counting Filter.
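The superimposition and the k-mismatches variant can be sketched together as follows. This is a simplified illustration under several assumptions: a naive scan stands in for MAG, the pieces are cut to the equal length ⌊m/(k+1)⌋, positions are 0-based, and all function names are hypothetical.

```python
def superimpose(pieces: list[str]) -> list[frozenset[str]]:
    """Merge equal-length pieces into one pattern of character classes."""
    length = len(pieces[0])
    if any(len(p) != length for p in pieces):
        raise ValueError("pieces must have equal length")
    return [frozenset(p[i] for p in pieces) for i in range(length)]


def class_string(classes: list[frozenset[str]]) -> str:
    """Render character classes in the bracket notation used in the text."""
    return "".join("[" + "".join(sorted(c)) + "]" for c in classes)


def search_k_mismatches(pattern: str, text: str, k: int) -> list[int]:
    """Return 0-based start positions where `pattern` occurs in `text`
    with at most k mismatches (Hamming distance)."""
    m = len(pattern)
    piece_len = m // (k + 1)
    if piece_len == 0:
        raise ValueError("pattern too short for k+1 non-empty pieces")
    pieces = [(i * piece_len, pattern[i * piece_len:(i + 1) * piece_len])
              for i in range(k + 1)]
    classes = superimpose([p for _, p in pieces])
    matches = set()
    for pos in range(len(text) - piece_len + 1):
        window = text[pos:pos + piece_len]
        # Filter step: the superimposed pattern must match the window.
        if not all(ch in cls for ch, cls in zip(window, classes)):
            continue
        # Look up which original sub-pattern (if any) really occurs here.
        for offset, piece in pieces:
            if window != piece:
                continue
            start = pos - offset            # candidate start of the pattern
            if start < 0 or start + m > len(text):
                continue
            errors = sum(a != b for a, b in zip(pattern, text[start:start + m]))
            if errors <= k:
                matches.add(start)
    return sorted(matches)
```

For the three sub-patterns of the example, class_string(superimpose(...)) reproduces [CG][AG][AG][ACG][AG][ACG]. The superimposed pattern accepts more strings than the three pieces (for example "CAAAAA"), which is why the sub-pattern lookup precedes the verification.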

Preliminary experimental results
The performance of the proposed solution was evaluated on datasets from the widely used Pizza & Chili corpus (http://pizzachili.dcc.uchile.cl/); we used 200MB files of DNA, English, Proteins and XML texts.
The code of the competing algorithms was obtained from its authors and compiled as suggested. All our code was implemented in C++ and compiled with g++ -O3. The computer was equipped with an Intel i3-2100 CPU 3.1 GHz (128KB L1, 512KB L2 and 3 MB L3 cache) and 4 GB of 1333MHz DDR3 RAM, and was running Debian 3.2.63 x86_64.
In all experiments we ran MAGA with AOSO parameters set to U = 4 and K = 2. The parameter q (i.e., the q-gram size) took values from {2, 4, 6, 8}. We chose two variants of MAG with different alphabet mappings. We used combined alphabet mapping (q-grams are created on the fly, without a mapping table): mag_dna_lx for DNA and mag_lx for the other datasets, where x is the value of the parameter l, which specifies the size of the super-alphabet (2^l). We used different values of l for DNA and for the other datasets: for DNA we set l = 2 for m > 32 and l = 3 for m ≤ 32; for all other datasets l was fixed to 3 for k = 1 and m = 128, and to 4 for all other combinations of m and k. The pattern length is constrained by two major factors, the q-gram size and the number of differences (k), therefore we restricted the parameters as follows: m ≥ 16 for k = 1, m ≥ 32 for k ∈ {2, 3}, and m ≥ 64 for k = 4. The AOSMASM algorithm was tested with all parameters described by its authors (see [1] for more detail). There are too many variants of the mentioned algorithms to present on a chart, so we present only the most efficient variant (the best result) of each solution.
MAGA is much faster than AOSMASM for k less than 3 on English text (Fig. 2b), but its speed decreases much faster with growing k, so it is slower for k = 3. The results for DNA and Proteins (Fig. 2a, 2c) are worse: AOSMASM is only slightly slower for k = 1 but much faster for larger k (by up to two orders of magnitude). A possible reason is the quite small alphabet. With a very small alphabet, as in the case of DNA, adjacent q characters practically never produce unique q-grams, which in turn triggers verification more often. The worst case is the XML file (Fig. 2d), where MAGA is slower for all k. This may be caused by the nature of XML files, in which tags repeat many times in the text.
This phenomenon reduces the uniqueness of q-grams, causing many verifications.
Figure 3 illustrates the speed of the two algorithms as a function of r (the number of patterns). The chart shows that the performance of both solutions is almost the same for one pattern, but the advantage of MAGA grows with the number of patterns. The performance ratio (search speed of MAGA divided by the search speed of AOSMASM) equals 1.03 for r = 1, while for r = 10000 it equals 6 for English (Fig. 3b) and up to 12 for DNA (Fig. 3a). For Proteins (Fig. 3c), MAGA is slower for r = 1 (performance ratio 0.76) but reaches a performance ratio of 2.09 for r = 10000. The results are also promising for the XML file (Fig. 3d), which is a weak point in Fig. 2 but shows much higher effectiveness for r = 1000 and r = 10000.
In Fig. 4 we examine the behaviour of both solutions when the number of patterns is increased from r = 100 (Fig. 2) to r = 10000. As expected, the performance of MAGA improves considerably (in comparison to AOSMASM) when the number of patterns is increased. For DNA (Fig. 4a) and Proteins (Fig. 4c) MAGA is more efficient than AOSMASM for k less than 3, while for English and XML its results are better for k less than or equal to 3. Overall, it cannot be clearly stated which solution is better for a given k, because many other parameters (e.g., alphabet size, number of patterns) strongly influence the performance.
Figure 5 shows the performance of searching for 100 patterns in DNA, English, Proteins and XML texts as a function of pattern length. MAGA achieves better results for patterns longer than 32 for English and DNA (Fig. 5a, 5b), longer than 64 for XML (Fig. 5d), and longer than 16 for Proteins (Fig. 5c). MAGA is built on top of the MAG algorithm, which uses q-grams, and these have a major impact on performance. The speed may be raised by increasing q, which on the other hand is limited by the pattern length. This forces the use of a smaller q for shorter patterns, which hurts performance. This is one of the reasons why MAGA achieves much better results for long patterns than for short ones.

Conclusions and future work
Experiments show that the proposed algorithm achieves relatively good results in practical use. MAGA is more efficient than AOSMASM if k is relatively small, and its advantage grows when a large number of patterns (e.g., 10,000) is searched (6-fold speedup). Since MAGA handles a large number of patterns better than its competitors, the results of the tests that we ran with only 100 patterns could be more favourable for larger pattern sets. MAGA achieves satisfactory results in all cases where a large number (1000, 10000, etc.) of long (> 32) patterns has to be searched in datasets with a fairly large alphabet (> 4) and a small number of differences (≤ 3). We found that the MAGA algorithm may be applied to different problems of approximate pattern matching. There is still a lot of research and experimental work to be done, concerning various AOSO parameter combinations, tests on a larger number of patterns, different alphabet mappings (other variants of MAG) and different datasets. We believe there is significant potential in the proposed approach, which should stimulate future research.