Serial and complex description of the order of components in data sets
The paper describes the concept of the «order of a data sequence», which is defined as a special kind of tuple, the «vector of order». The components of an order are integers that do not exceed its length $n$; the distinct numbers $j \le m \le n$, taken in the order of their first occurrence, increase by one. Earlier works on the formal description and analysis of the order of data sequences considered long tuples (symbolic sequences, data sets) in which distinct components nearly always alternate throughout the chain, while series (runs of identical elements in a row) are rare and short. Computer processing of large «text» data sets (prose, poems, musical compositions, nucleotide sequences) showed that the characteristics of order are highly sensitive to the arrangement of components (letters, words, notes, etc.) in long and very long tuples.

The proposed means represent a symbolic sequence by its order. Decomposing the order yields congeneric sequences, in each of which the places occupied by identical elements ($n_j$ in number) are marked with an integer and all other positions are empty. In these congeneric sequences the interval between the nearest occupied positions is used as the basic measure and is calculated as $\Delta_j$, the number of empty positions plus one.

This paper discusses means for the analysis of ordered data sets that consist mainly of alternating series of identical messages, for example digitized images or sequences of measured values. These means are a set of «serial» characteristics of order, which are defined, named, and denoted in the same way as the previously introduced «interval» characteristics of order. The series length is used as the basic measured value; it is calculated as the number of consecutive occupied places in the $j$-th congeneric sequence. The length of a series is denoted $\tau_{ij}$, where $i$ is the index of the series in the $j$-th congeneric sequence.
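The construction described above can be illustrated with a minimal Python sketch. The function names (`order_vector`, `congeneric_sequences`, `series_lengths`, `intervals`) are hypothetical and are not taken from the paper. The sketch also assumes that intervals are measured only between neighbouring occupied positions; any boundary convention used in the full paper is not reproduced here.

```python
from itertools import groupby


def order_vector(seq):
    """Replace each element by the 1-based rank of its first appearance."""
    first_seen = {}
    order = []
    for item in seq:
        if item not in first_seen:
            first_seen[item] = len(first_seen) + 1
        order.append(first_seen[item])
    return order


def congeneric_sequences(order):
    """For each component j, keep j at its own positions and None elsewhere."""
    return {j: [j if x == j else None for x in order] for j in sorted(set(order))}


def series_lengths(mask):
    """Lengths of runs of consecutive occupied positions in one congeneric sequence."""
    return [
        len(list(run))
        for occupied, run in groupby(mask, key=lambda x: x is not None)
        if occupied
    ]


def intervals(mask):
    """Intervals between neighbouring occupied positions (empty positions plus one).

    Assumption: boundary intervals (before the first and after the last
    occupied position) are ignored in this sketch.
    """
    positions = [k for k, x in enumerate(mask) if x is not None]
    return [b - a for a, b in zip(positions, positions[1:])]


order = order_vector("aaabbacc")
print(order)                      # [1, 1, 1, 2, 2, 1, 3, 3]
masks = congeneric_sequences(order)
print(series_lengths(masks[1]))   # [3, 1]
print(intervals(masks[1]))        # [1, 1, 3]
```

For the toy string `"aaabbacc"` the first congeneric sequence contains two series (lengths 3 and 1), which shows why series lengths carry more information than intervals when identical elements mostly occur in runs.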
Some of the serial characteristics of the system are given below. The serial volume of the $j$-th congeneric sequence, the product of the lengths $\tau_{ij}$ of its series, is defined as (1), and the serial volume of the complete sequence is defined as (2):

$$V_j = \prod_{i} \tau_{ij}, \tag{1}$$

$$V = \prod_{j=1}^{m} V_j. \tag{2}$$

The total spread of all series of the $j$-th congeneric sequence is defined as (3), and the total spread of all series in the complete sequence is defined as (4):

$$L_j = n_j \cdot \log_2 \tau_{gj}, \tag{3}$$

$$L = n \cdot \log_2 \tau_g, \tag{4}$$

where $\tau_{gj}$ is the geometric mean length of the series in the $j$-th congeneric sequence; $\tau_g$ is the geometric mean length of the series in the complete sequence; $l_j = \log_2 \tau_{gj}$ is the average spread of the series in the $j$-th congeneric sequence; $l = \log_2 \tau_g$ is the average spread of the series in the complete sequence. The «complete» serial description of an order is defined by the following distributions: $\{L_j\}$, {
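As a rough illustration of the serial characteristics, the following Python sketch computes, for each distinct symbol, the series lengths, their product (taken here as the serial volume), the base-2 logarithm of that product (taken here as the total spread), and the geometric mean series length. It assumes that averaging is done over the number of series; the exact meaning of the counts $n_j$ and $n$ in the full paper may differ. The function names are hypothetical.

```python
import math
from itertools import groupby


def series_lengths(seq, symbol):
    """Lengths of all runs of `symbol` in `seq`."""
    return [len(list(run)) for value, run in groupby(seq) if value == symbol]


def serial_characteristics(seq):
    """Per-symbol serial volume, total spread and geometric mean series length.

    Assumption: the volume is the product of series lengths, and the spread
    is log2 of the volume, i.e. (number of series) * log2(geometric mean).
    """
    result = {}
    for symbol in dict.fromkeys(seq):  # distinct symbols in order of first occurrence
        lengths = series_lengths(seq, symbol)
        volume = math.prod(lengths)
        result[symbol] = {
            "lengths": lengths,
            "volume": volume,
            "spread": math.log2(volume),
            "geometric_mean": volume ** (1 / len(lengths)),
        }
    return result


stats = serial_characteristics("aaabbacc")
print(stats["a"]["lengths"], stats["a"]["volume"])  # [3, 1] 3
print(stats["b"]["spread"])                         # 1.0
```

Under these assumptions the volume of the complete sequence is the product of the per-symbol volumes, and the total spread is the sum of the per-symbol spreads.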
Keywords
order, order characteristics, remoteness, spread of series, inter-nucleotide distance
Authors
Name | Organization | Email |
Gumenyuk Alexander S. | Omsk State Technical University | gumas45@mail.ru |
Serial and complex description of the order of components in data sets | Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitelnaja tehnika i informatika – Tomsk State University Journal of Control and Computer Science. 2018. № 42. DOI: 10.17223/19988605/42/4