To the question of the statistical analysis of big data
The search for regularities in Big Data is receiving more and more scientific attention. Naturally, the parameter estimation methods and hypothesis tests of classical mathematical statistics are applied to this end. At the same time, it turns out that well-established estimation methods become ineffective because of the "curse of dimensionality". Most statistical hypothesis tests are suitable only for samples of very limited size. Tests that can formally be used for sample sizes n → ∞ in practice lead to an unjustified rejection of the hypothesis being tested. In estimation methods that operate on ungrouped data, the computational costs grow dramatically as the size of the analyzed samples increases, and the convergence of the iterative algorithms used to construct the estimates deteriorates. The non-robustness of the estimates becomes an essential factor.

The reason that rules out applying many hypothesis tests to Big Data samples is the dependence of the distributions of their statistics on n, together with the availability of only short tables of critical values. At reasonable sizes of n, this obstacle can be removed by computer technology and statistical simulation methods, used to find the empirical distribution of the statistic needed for decision making. The reason for incorrect conclusions when using tests with known limiting distributions of the statistics is that sample sizes n in Big Data are "practically unlimited", while the data are recorded with limited accuracy.

For a fixed number of intervals, the computational cost of parameter estimation from grouped data does not change as the sample size grows; it increases only with the number of intervals k. It is recommended to use maximum likelihood estimates (MLEs) for grouped samples: these estimates are robust and asymptotically efficient. For small k, the quality of the estimates can be improved by asymptotically optimal grouping, which minimizes the loss of Fisher information caused by grouping.

Using a Big Data sample as an example, it is shown how the result of applying Pearson's χ² goodness-of-fit test for a simple and a composite hypothesis depends on the number of intervals and on the grouping method. It is shown that there are no obstacles to applying Pearson's χ² test to large samples, and that it retains both its positive qualities and its inherent drawbacks (the conclusions are ambiguous and depend essentially on the chosen number of intervals and on the grouping method).

The distributions of the statistics of the goodness-of-fit tests (Kolmogorov, Cramér–von Mises–Smirnov, and Anderson–Darling) were studied by statistical simulation methods, depending on the accuracy with which the observations are recorded (on the possible number of unique values in the samples). It follows from the obtained results that, when analyzing Big Data with the corresponding nonparametric goodness-of-fit tests, the statistics should not be computed over the entire large array, but over samples extracted from the "general population", whose role in this case is played by the Big Data array under analysis.
The size of the extracted sample should take into account the recording accuracy of the data (the number of possible unique values in the sample) and should not exceed some value n_max at which, for the given accuracy, the distribution G(S_{n_max} | H_0) of the test statistic under the tested hypothesis H_0 does not yet differ from the limiting distribution G(S | H_0) of this statistic. The presented results make it possible to estimate n_max for the tests considered. The estimates of n_max for the Kolmogorov test are substantially lower than those for the Cramér–von Mises–Smirnov and Anderson–Darling tests. The obtained estimates of n_max can be used when applying the goodness-of-fit tests in Big Data analysis.
Keywords
Big Data, parameter estimation, hypothesis testing, goodness-of-fit tests

Authors
Name | Organization | E-mail
Lemeshko Boris Yurievich | Novosibirsk State Technical University | Lemeshko@ami.nstu.ru |
Lemeshko Stanislav Borisovich | Novosibirsk State Technical University | skyer@mail.ru |
Semenova Mariya Alexandrovna | Novosibirsk State Technical University | vedernikova.m.a@gmail.com |
To the question of the statistical analysis of big data // Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitelnaja tehnika i informatika – Tomsk State University Journal of Control and Computer Science. 2018. № 44. DOI: 10.17223/19988605/44/5