CONSTRUCTION AND VERIFICATION OF MATHEMATICAL MODEL OF MASS SPECTROMETRY DATA

The article presents issues concerning construction, adjustment and implementation of mass spectrometry mathematical model based on Gaussians and Mixture Models and the mean spectrum. This task is essential to the analysis and it needs specification of many parameters of the model.


Introduction
Mass spectrometry is popular, widely used technique of determination of complex data composition.It is essential technique used in proteomic, applied to identification of proteins and their dependencies in biological tissues [9,10].Proteomic approaches to interpretation of biological phenomenon on the level of proteins constitutes opportunities for development and advancement of medical diagnosis in many diseases area.In particular the proteomic analysis offers great promise to understanding the process of tumor development in human organism and its response to the therapy.Proteomics gives hope to the oncology development through better understanding, improvement diagnosis and development of new more efficient drugs.
The most important proteomic branches are: identification of proteins in the analysed sample, proteins features characteristic, specification of the proteins number in the sample and comparison of the proteins features.
Proteomic methods [22] must be able to deal with huge amount of data and information.High-bandwidth mass spectrometry, identification of protein complexes, studies of protein mixtures and evaluation of proteins expression requires advanced techniques.The development of proteomic techniques and tools gives opportunities of detection and analysis of proteins primarily in terms of medical diagnosis and new methods of treatments.Also statistical analysis, especially classification plays an important role in the process of the analysis of large data sets.There are still ongoing search for efficient methods of classification fulfilling the high requirements for high sensitivity and specificity.Moreover, the identification of proteins is a hard task which needs to consider many factors [34] (not always unique amino acid sequences, possible several different coding genes, existence of polymorphism genes resulting in different proteins variants, posttranslational modifications, etc).Today proteomics examines the complex protein interactions and modifications in terms of simultaneous analysis of thousands of data.Proteins identification is usually carried out by comparing the measured properties of the protein to those known and documented, available in biological (proteomic) data bases.The mass of protein derived in the mass spectrometric studies is one of the most commonly used properties.It enables low or non-invasive study of protein profiles in blood, plasma or urine.
Mass spectrometry is an analytical technique that allows accurate measurement of mass to charge ratio of the proteins.Is used to identify chemical compounds and to determine their structure and elemental composition.In proteomics, this method is used primarily to determine the composition of complex mixtures, in particular the identification of proteins.The spectrometer work consists of three stages: ionization of molecules, the selection of charged particles and their detection.
One of the most popular mass spectrometry used in the proteomic research is Maldi-Tof.It is based on using of matrix absorbing the laser beam.Sputtered and spared in the electric field ions hit the detector which determines the mass of ions on the basic of their velocities and time of flight through the spectrometer [26].
Result of spectrometric studies is presented in the form of the mass spectrum.The spectrum presents the dependence of mass-tocharge ratio (M/Z) and intensity.Intensity determines the number of ions that hit the detector in a small, fixed time interval.This interval is determined is the time resolution of the instrument and usually varies in the range 1 -4 nanoseconds.Examples of mass spectra are presented in Fig 1.

Models of mass spectra analysis
A model is mathematical data representation used to present a process or phenomenon in a simplified manner.This way of the process description allows better understanding of its characteristics.For example, the model can be created through the construction of the classifier using a specific learning set and data set.
In the case of high-bandwidth mass spectrometry data important issue is to determine the purpose and tasks of the analysis.The next step is to select appropriate methods and tools for the data exploration.The analysis is performed to answer questions about patterns and the most important features in the test dataset.The phases of this analysis are presented in Fig 2 .The analysis of data groups may be based on different kind of method such as clusterization (unsupervised learning method), classification and regression.Proteomic, data are often analysed IAPGOŚ 1/2013 ISSN 2083-0157 with support vector machines (SVM [4,35,36]).Highly multidimensional data need also dimension reduction techniques like PCA (Principal Component Analysis) [35], PLS (Partial Least Squares) [38], ICA (Independent Component Analysis [11,19] or MDS (multidimensional scaling) [34].They enable to choose the most significant features from the classifier point of view.However, most of dimension reduction methods have limitations, like independence of the analyzed variables, or necessity of the normal distribution.

Fig. 2. Typical for proteomics construction and the data model analysis scheme
Before mass spectrometry classification analysis data need to be pre-processed and prepared.It is of great importance for the process of the further analysis and quality of obtained results.
Pre-processing steps may vary depending on the specific type of data and proposed exploration method.In the case of proteomic data coming from mass spectrometry studies noise correction, baseline correction, normalisation, spectra alignment are usually required [26,27].Sometimes also missing data are handled.Noise removal and baseline correction is usually necessary due to matrix noise, other sample contamination and instability.Amount of noise in the spectrum is a determinant of the spectrum quality.Noise and baseline corrections can be performed with multiple shifted windows with defined width, discrete-wavelet transformation (especially undecimated discrete-wavelet transformation, UDWT [16,20,21]), the least-squares digital polynomial filter (Savitzky and Golay filters) or nonparametric smoothing (locally-weighted linear regression with specified window size and type of kernel).There are also methods considering peaks shape and localization, based on the rule the higher the weight, the wider and lower peaks.The important issue is also resolution of the device.Resolution is determined the percentage of the weight.While the weight increases, the device resolution worsens.That is why spectra of low molecular mass have better resolution and their peaks are more separable.The shape of the peaks, however, varies along with the spectrum.The higher the M / Z is, the lower and wider peaks may become.The most popular method of normalization consists in scaling all spectra to total ion current (TIC) value or to constant noise.Sometimes also trimming, binning and the mean spectrum calculation is performed.
The primary source of information about proteins, their sequences and the genes encoding them constitutes biological databases.Data contained in such databases come from research experiments and their interpretation, publications and other databases.Research centres which undertake the construction and maintenance of biological databases often cooperate and exchange data.The problem of biological databases are huge growth of information and the associated need of standardization and structuring of stored information.Biological data, in particular protein data, are difficult to manage because of the continuous flow of information and lack of uniform standards and methods of naming classification.Continuous exchange of data contained in different databases enable frequent updates.On the other hand, there is a need of continuous control of data quality and consistency.There are conducted works on automation of this process and developed of new comparison and integration methods.The best-known biological databases are: UniProt, NBCI, KEGG, EXPASY, HPRD, EPO-KB.
However, before the classification and proteins searching one need to find peaks of the spectrum.There are different models used to perform this task.The most popular one is based on local maxima and minima chosen from the mass spectrum [26,33,39].The real peaks can be chosen only among local maxima which are higher than a the signal to noise ratio (S/N) [15,23,24].It enables detection of peaks with the highest values of intensity [1,37].Other methods try to distinguish between noise and the real peaks considering the shape of peaks [17,18].
A model proposed by K. Coombes, K. Baggerly, J. Morris et al [10,26] defines distribution of the mass spectrum of the signal in accordance with eq. 1. ( (1) where f(t) is the observed signal, B(t) -Baseline, N -normalization ratio, e(t) -noise.Correct signal is defined as S(t).It can be modeled as a sum of independent, sometimes overlapped peaks, each of which corresponds to a single protein.Peak shapes can be estimated empirically on the based on physical simulation of the ToF analyzer process.The noise e(t) is defined as white noise [14].Baggerly et al. [2] considering the inclusion of the additional noise model time-dependent factor.The method assumes using of the mean spectrum, undecimated discrete wavelet transformation (UDWT) denoising and local minima and maxima analysis.
There is also a group of methods modeling spectra with a set of member functions.Decomposition may be based on the wavelet transformations [13,30] or composition of Levy processes [8].
Mixture models are a good way of large data sets modeling.They are usually used to model natural phenomena and biological processes.They can be also applied in image processing and clustering.Mixture models are often complex, they consist of many individual probability distributions.Mixture models allow interpret the whole population as a composite of an adequate number of sub-populations, which enable to perform detailed analysis and obtain better estimation.In practice, the most commonly used mixture models are based on Gaussian distributions.Such mixtures are known as Gaussian Mixture Models (GMM).
The main task associated with mixture models is to determine their parameters.The number of unknown parameters is 3k-1, where k is the number of Gaussian mixture's components.
For each mixture component one need to estimate both its Gaussian parameters and its weight.The parameters estimation task may occur to be complex.The more components are included in the mixture, the harder and more time consuming is the estimation task.
The task of mixture models parameters solving can be treated as a missing data problem.It can be formulated as a task of determining the membership of a group of data points to one of the distributions in the mixture.This membership is unknown and must be estimated.Parameters of the model should therefore be chosen so that the data points were represented by their membership to the individual components of the mixture.

Expectation-Maximization Algorithm
The parameters of the mixture model need be estimated with a method, which is able to handle the missing values.In case of complex problems where the number of parameters to estimate is large, typical estimation methods, like the maximum likelihood, are not appropriate to solve this tasks.An additional difficulty is the existence of many local likelihood function extremes.Therefore, likelihood maximisation of the data fit to the Gaussian mixture model can be performed with Expectation-Maximization algorithm (EM) [12].The EM method assumes the existence of hidden variables.In the case of mixture models the hidden variables represent variables defining affiliation of the each observation to one of the Gaussian components.
EM algorithm is an iterative method, consisting of a repeated measures (E and M).Diagram of its operation is shown in Figure 3. E and M steps are executed in a loop until the accuracy is reached.In each subsequent step new parameters values are calculated.If the number of components is known, the initiation step of the algorithm is to determine the initial conditions.
The use of GMM for spectra modeling is based on the assumption that one peak is represented by a single Gaussian distribution.Each Gaussian defines and models one component of the spectrum.To calculate the model parameters one need to use a modified version of the algorithm Expectation-Maximization.This version is adjusted to the nature of spectrometry data.
Application of the EM algorithm requires adjustment of input data, because the standard version of EM requires a onedimensional input vector.For the purposes of spectrometric data a weighted version of the algorithm was implemented.It considers the intensity of the spectrum characterizing the repetition number of the corresponding M / Z values.

Mass spectra modeling using Gaussians and GMM
The protein mass spectra decomposition methodology might be based on Gaussian Mixture Models.This way of mass spectra processing is possible and gives good results, which has been confirmed by the results of the analysis of real data (presented later in this paper).
This mass spectral analysis method is based on peaks modelling with Gaussian distributions.Such modelling involves the appropriate choice of distributions' parameters.It enable to represent peak shapes and to model the spectra in the easy way.
Using of Gaussian distributions to model peaks allows to consider measurement errors which also can be modelled using Gaussian distributions.In most of the known methods these errors must be corrected by the difficult process of alignment of the peaks in different spectra collections.Gaussian distributions modelling does not need the aligning process, because when probability distributions are used, measurement ambiguities are allowed for a single spectrum.
A single Gaussian distributions may be applied, however, only in the case of spectra with a perfectly separable peaks.In practice, the spectra analysis is done using mixtures of distributions instead of single distributions.
Moreover, the possibility of modelling mass spectra using mixtures justifies the idea of spectra modelling with single Gaussian distributions.GMM spectra modelling is based on the assumption that the individual peaks are modelled by Gaussian distributions.However, the use of the mixture instead of single Gaussians considers the additional interaction between closely located peaks.Mass spectra reflect the number of processes occurring in the human body.These processes are often correlated and it has its reflection in a spectrum and in the lack ofseparability between the individual peaks.This fact should be upheld, as it brings information together.Application of mixtures enables considering the dependences between peaks and modelling of overlapped measurement errors present in adjacent regions of the spectrum.Separate peak identification could lead to incorrect variance estimation of individual Gaussians, since it is impossible to make their total separation.Simultaneous peak modelling can solve this problem.The measurement inaccuracy correction conducted in this manner allows to increase the quality of the assessment model.Solving the parameters of the mixture, however, needs to define many properties, such as the number of model components, the initial parameters values, the convergence criterion, calculations accuracy or the use of the mean spectrum.

Problems of the model implementation
Expectation-Maximization algorithm is efficient method of GMM estimation.However, to obtain repeatable, reliable results one needs to appropriately chose all parameters, such as the model parameters, its correctness, number of the model parameters, initial values, stop criterions, calculation accuracy and quality assurance.
Parameters of the pre-processing can be adjusted to the specific data before the main procedure of analysis.The order of these operations is fixed and includes: averaging technical repetition, outliers detection, baseline correction, normalization, interpolation, calculation of the mean spectrum.This order is a standard which has been developed over the last few years of research in this field.The most important parameters of the preprocessing are: the window size of the baseline correction and using (or not) the mean spectrum.
The important element of the analysis is appropriate choice of the mixture options, in particular the number of components.There is a possibility to use different methods of number of components estimating such as the BIC criterion, the basic functions of peak detection, statistical methods determining the density distribution of the parameters.
What is more, the characteristic of the EM algorithm is of great importance for the analysis.The level of the estimation task depends on the number of components and the sample size.
The EM algorithm generates some specific types of errors.The most frequent one is merging of distributions.This phenomenon occurs when at least two distributions with similar means value occurs.The merging probability is greater, if also standard deviations are similar.When the number of components is fixed, this join results in generation of additional distributions, which usually have small weight.Another type of error is generation of distributions with large standard deviation and low weight.In practice it results in long, flat distributions.Sometimes additional distributions with small standard deviation are also formed.
Another important feature of the EM algorithm is that the better and quicker adjustment are usually obtained for the components with the larger weights.This is a desirable feature, when the goal of the analysis is to find main elements of the modelled process.But it can be problematic in the case of high requirements concerning parameters fitting.
EM algorithm is an iterative, non-linear algorithm.Its convergence is fast only in the initial phase of operation (Figure 4).After a dozen of iterations the speed of approach significantly decreases and the results of successive iterations differ very little.This feature of the algorithm might longer the duration of the whole method, especially in the case of a high accuracy specified.The algorithm has also high computational complexity, especially the M step.Despite the relatively slow convergence, premature interruption of the algorithm can cause errors.The most common problem is to find a local maximum.This situation illustrates the Figure 4.The method cannot find the parameters fulfilling the criterion of the maximum likelihood method and it remains at the point where the estimation satisfies the conditions of insufficient local maximum.

Fig. 4. The example of convergence of the EM algorithm
The local maximum problem is usually caused by improper initial values of the algorithm.To improve the method results the multiple repetitions technique can be used.It involves multiple runs of the algorithm with the initial values changed.This solution enables to choose the estimation with the maximum value of the likelihood function.This method gives good results and it effectively improves the estimation level.However, it causes extending of the duration time and higher calculation complexity.
The problem of the local maximum might be improved with careful selection of the input parameters.
The most obvious and quickest way to obtain initial parameters is the random selection.However, such a choice can be made in several ways.One is the random distribution of data into specified number of elements (this number is usually equal to the number of mixture components) [25].The is also possibility to use one of the clustering algorithms, such as k-means or hierarchical clustering.However, using this methods might be costly, especially when it has to cover a large group of data.That is why there solutions which are based on a randomly selected samples.An alternative method of the input parameters determination is generation based on the primary peak detection method based on local extremes.Obtained results, with Gaussian arousals added, allows quick setting of the input parameters localized around the relevant procedures results.
The task of mixture's parameters estimation requires known number of components.Obtaining this number of components is important and hard task, especially in case of complicated mixtures with overlapped peaks.
One of the most efficient methods of the components number are information criteria.
The most stable of the analyzed criteria occurred criteria: BIC, AIC, AWE and NEC.The results presented on the graph are very similar, because the important factor in their formula is the value of likelihood function.It makes those criteria to be monotonic.Too small values of components result in significantly increase criteria values.The point there the graph is stabilizing should be treated as the optimal value.Using of the likelihood function makes the criteria value continuously growing, it is slight increase.Implementation of AWE and NEC criteria enforce maximization of the AWE criterion and minimization of the NEC criterion.AMIR and ICOMP criteria do not show so clear upward trend with an increase of the components number.
The last important issue is identify the type of convergence criterion and its accuracy [29].The most commonly used criterion is are based on the maximum likelihood method.There are, however, other criteria like the difference between two successive values of likelihood function or different distance metrics between two successive values of the likelihood function or between successive estimates of parameters.According our simulations all tested criteria give similar results of parameters values.However, some of them run faster and require a smaller number of iterations [30].The best results were obtained for the chi2 distance between the two constituent values of the likelihood function.Chi2 distance calculated for constituent values of parameters also gave good results in a relatively small number of steps compared.The worst results gave the distance between successive values of the parameters based on the sums of gradients between successive values of the parameters.

Fig. 5. Comparison of criterions of components number estimation
According our simulations all tested criteria give similar results of parameters values.However, some of them run faster and require a smaller number of iterations [30].The best results were obtained for the chi2 distance between the two constituent values of the likelihood function.Chi2 distance calculated for constituent values of parameters also gave good results in a relatively small number of steps compared.The worst results gave the distance between successive values of the parameters based on the sums of gradients between successive values of the parameters.
The important aspect is also determination of calculation accuracy.Too low accuracy can cause the local maximum problem.Too high value will lead to an excessive increase of the computation time length.
Besides the mathematical analysis, the significant aim is identification of proteins or peptides present in the sample.Implementation of classification enables initial determination of predictive model power that allows for performing the functional analysis of identified peaks.Data classification is difficult because of a strong correlation of data combined with its high dimensionality.Row spectra, composed of many thousands of features and dozens of objects require a two-stage dimensionality reduction.The first stage is decomposition with Gaussian mask put on all spectra.This operation enable to reduce the dimensionality to several hundred of features.The second step was the selection of the most informative features in the process of dimension reduction performed with such methods like T-test, SVM-RFE or PLS.Due to the high level of correlation it is necessary to the implement classification based on features.Prepared in this way data set can be subjected to SVM learning and classifying.Appropriate adjustment of the classifier and features choice enable efficient peaks discriminatory.Selection of classification parameters and the proper number of features can be performed with Multiple Random Validation.Results performed on a medical dataset have shown that classification conducted in this way is possible and effective.Total error obtained at the level of 18% without the information leak confirms the effectiveness of the classification method.
GMM based spectra modelling, preprocessing and classification could be supplemented with biological interpretation based on dedicated databases.Integration of identification process with biological databases can be used to verify the results from the biological point of view.This verification is essential because of I type error (False Discovery Rate), typical classification task error.It is important to minimize this error, but it is impossible to do it using classification methods only.Therefore, the opportunity to conduct biological verification gives another point of view.The analysis helps to improve explaining of the processes occurring in living organisms.

Results
The method was tested with the set containing mice subjected to irradiation.This data set was selected due to the specificity of data.The relevant aspects are the large number of biological and technical repetitions and the possibility of treating samples as if they came from a single organism.The analyzed data set contains twelve repetitions: six biological and two technical repeats performed on five mice from one litter.The analysis conducted on this set is primarily a comparison of results obtained using two methods: with the mean spectrum and without it.
The analysis with the mean spectrum involves standard, described earlier steps: • baseline correction -the operation is necessary due to typical character of the spectrum, • interpolation -standardisation of points on the independent axis, • normalization -reduction of all spectra to one common area under the curve, • the mean spectrum calculation.
The second method used in the analysis does not require the calculation of the spectrum average.In this case the mandatory pre-processing procedure consists of only baseline correction.
Normalization and interpolation, however, were also perform to standardise the data.
The graphs presented in Figure 6 illustrates the results of the decomposition analysis.
Illustrative. Figure 7 shows the placement of individual spectral peaks in the confidence intervals designated for spectrum average.The graph illustrates that the medium of individual peaks in the spectra belong to the respective confidence intervals.This analysis was conducted to test the method and check its repeatability.
Figure 8 shows the box graph describing errors (differences) derived from multiple runs of a decomposition procedure for the same data set for methods based on the mean spectrum.Distances between the obtained M / Z values were considered.

Summary
Comparative analysis of methods using the mean spectrum and without it indicates that the use of the mean spectrum allows for increase efficiency and speeding up the analysis.However, results obtained for both types of analysis are similar.It indicates, that the method is reliable.Analysis also proved a high degree of reproducibility.Results of the decomposition analysis are comparable to the results obtained with other methods [28].Moreover, in case of low level of separability it proved to be more flexible than other method.It also handle the complexity of the spectra.
Presented mass spectra processing method can be used not only for the spectra, whose peaks are narrow and do not overlap.The method also allows modeling of spectrum with a more complex structure with overlapped peaks characterized by a large variance.For these spectra, the methods based on local maxima and minima may fail.This problem is exemplified in the Figure 9.It presents using of mspeaks method (available in Matlab Bioinformatic Toolbox).
This method is based on the local maxima and minima as well as the height of the potential peaks.Those values are used to find the center point of individual peaks.Fig 9a,b present the result of the method for spectra with narrow peaks.Identification of such peaks runs smoothly and well-chosen parameters enable fast and efficient identification of peaks.Fig 9c,d,e shows the use of various configuration options for the problem of the broader spectrum with overlapping peaks.It can be seen that results are characterized with considerable redundancy of the number of identified peaks.

Fig. 3 .
Fig. 3.The flow diagram of the EM algorithm