Formal analysis of order in the local structure of the nucleotide sequences
The definition of chain order and the integral characteristics of order, in particular for nucleotide sequences, were presented in previous papers. These characteristics proved highly sensitive to the arrangement of components. The possibility of comparison, classification, and hashing based on the introduced formalism and the characteristics of order has been considered. The approach developed by the authors represents the local structure of symbolic sequences of arbitrary nature by numerical sequences that reflect the arrangement of their components. The generally accepted method for studying large arrays of measurement data, linguistic texts, nucleotide sequences, and other long sequences is the "window scan". This paper describes means for analysing the local structure of complete full-length sequences based on the characteristics of order of separate fragments (L-grams), called the functions of characteristics of order. The values of these functions are calculated as follows:

$$\Delta_{ij} = x_{i+1,j} - x_{ij}, \qquad x_{ij},\; x_{i+1,j} \in [s \cdot k,\; s \cdot k + l]$$

$$f_G(k,l,s) = \sum_{j}\sum_{i} \log_2 \Delta_{ij}$$

$$f_g(k,l,s) = \frac{f_G(k,l,s)}{l}$$

$$f_{\Delta g}(k,l,s) = \Bigl(\prod_{j}\prod_{i} \Delta_{ij}\Bigr)^{1/l}$$

$$f_r(k,l,s) = \frac{f_{\Delta g}(k,l,s)}{f_D(k,l,s)}$$

where $x_{ij}$ is the position of the i-th occurrence of the j-th element of the alphabet within the current fragment; k is the fragment number; s is the step size (when s = 1 the fragments are L-grams); l is the window length; $f_G(k,l,s)$ is the depth function; $f_g(k,l,s)$ is the average remoteness function; $f_{\Delta g}(k,l,s)$ is the average geometric interval function; $f_r(k,l,s)$ is the regularity function; $f_D(k,l,s)$ is the descriptive information function. A larger window allows detecting longer fragments with similar order. As the fragment length increases, the function values tend to the value of the corresponding integral characteristic of the full-length sequence.
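As a rough illustration of how such window functions might be computed, the Python sketch below evaluates the depth, average remoteness, and average geometric interval for one fragment. It assumes that an interval is the distance between consecutive occurrences of the same symbol inside the window, that the logarithm is base 2, and that normalisation is by the window length l. The names `window_intervals` and `order_functions` are hypothetical, this is not the authors' software, and the descriptive information and regularity functions are left out.

```python
import math


def window_intervals(seq, k, l, s=1):
    """Intervals between consecutive occurrences of each symbol in fragment k.

    Assumption: the fragment is seq[s*k : s*k + l], and the first occurrence
    of a symbol inside the fragment contributes no interval.
    """
    start = s * k
    window = seq[start:start + l]
    if len(window) < l:
        raise ValueError("fragment extends past the end of the sequence")
    last_seen = {}   # symbol -> position of its previous occurrence
    intervals = []
    for pos, symbol in enumerate(window):
        if symbol in last_seen:
            intervals.append(pos - last_seen[symbol])
        last_seen[symbol] = pos
    return intervals


def order_functions(seq, k, l, s=1):
    """Depth, average remoteness and average geometric interval of fragment k."""
    intervals = window_intervals(seq, k, l, s)
    depth = sum(math.log2(d) for d in intervals)   # sum of log2 of intervals
    remoteness = depth / l                         # depth per window position
    geometric_interval = 2.0 ** remoteness         # equals (prod of intervals)^(1/l)
    return {
        "depth": depth,
        "remoteness": remoteness,
        "geometric_interval": geometric_interval,
    }


if __name__ == "__main__":
    # Hypothetical toy sequence; every nucleotide repeats with interval 4.
    print(order_functions("ACGTACGT", k=0, l=8, s=1))
```

With s = 1 and k running over all start positions, the same call yields one value per L-gram, which is the numerical sequence that a window scan would plot.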
Reducing the fragment length allows individual function values to be used for detecting finer features of the arrangement of components within the window. Preliminary studies showed that the relationship between the window length and the dispersion of the characteristic values is hyperbolic. However, when the window length is reduced to the cardinality of the alphabet (m = 4), this dependence breaks down. Thus, as expected, the uncertainty in the location of a fragment is linked to the uncertainty of the function values obtained for a given window length. Selecting an optimal window length for various tasks, including its dependence on the cardinality of the alphabet of the original sequence, requires additional research. Software for calculating and displaying the functions of characteristics of order was developed and tested on ribosomal RNA of several organisms. The research revealed the influence of fragment (L-gram) length on the shape of the functions of characteristics of order. The possibility of using these functions to find similar or overlapping fragments in one or more sequences is considered, as well as the inverse problem of finding occurrences of specified fragments in a complete genome sequence. Besides the means noted above, representing the order of nucleotide sequences by functions also allows the use of classical mathematical methods, such as mathematical analysis, spectral analysis, and correlation analysis, which would be impossible in direct analysis of the symbolic sequences themselves. The graphical representation of the functions of characteristics of order also makes expert analysis of long nucleotide sequences, including complete genome sequences, possible.
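One way to probe how the spread of function values depends on window length is to scan a sequence with several window lengths and compare the variance of the resulting values. The Python sketch below does this for the average remoteness only, under the same assumed reading as before: intervals between consecutive occurrences of a symbol inside the window, base-2 logarithm, normalisation by the window length. The names `remoteness`, `remoteness_profile`, and `dispersion_by_window` are hypothetical, and the sketch neither reproduces the authors' software nor confirms the hyperbolic relationship they report.

```python
import math
import statistics


def remoteness(window):
    """Average remoteness of one fragment (assumed: sum of log2 intervals / length)."""
    last_seen = {}   # symbol -> position of its previous occurrence
    depth = 0.0
    for pos, symbol in enumerate(window):
        if symbol in last_seen:
            depth += math.log2(pos - last_seen[symbol])
        last_seen[symbol] = pos
    return depth / len(window)


def remoteness_profile(seq, l, s=1):
    """Window scan: one average-remoteness value per fragment of length l, step s."""
    if not 0 < l <= len(seq):
        raise ValueError("window length must be between 1 and len(seq)")
    count = (len(seq) - l) // s + 1   # number of complete fragments
    return [remoteness(seq[s * k:s * k + l]) for k in range(count)]


def dispersion_by_window(seq, lengths, s=1):
    """Population variance of the profile for each window length in `lengths`."""
    return {
        l: statistics.pvariance(remoteness_profile(seq, l, s))
        for l in lengths
    }


if __name__ == "__main__":
    # Hypothetical toy input; a real study would load an rRNA or genome sequence.
    toy = "ACGTTGCAACGGTTAACCGTAGCTAGCTTAGGCATCGA"
    for l, var in dispersion_by_window(toy, [4, 8, 16, 32]).items():
        print(l, var)
```

Plotting the returned variances against l for a real sequence would show whether the dependence looks hyperbolic and where it starts to deviate as l approaches the alphabet size.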
Keywords
chain's order, sequence, nucleotide sequence, order characteristics, functions of order characteristics, L-grams, local structure of nucleotide chain
Authors
Name | Organization | Email |
Gumenyuk Alexander S. | Omsk State Technical University | gumas45@mail.ru |
Pozdnichenko Nikolay N. | Omsk State Technical University | nick670@yandex.ru |
Shpynov Stanislav N. | Gamaleya Institute of Epidemiology and Microbiology | stan63@inbox.ru |
Formal analysis of order in the local structure of the nucleotide sequences | Tomsk State University Journal of Control and Computer Science. 2014. № 4(29).