To question of the statistical analysis of big data | Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitelnaja tehnika i informatika – Tomsk State University Journal of Control and Computer Science. 2018. № 44. DOI: 10.17223/19988605/44/5

To question of the statistical analysis of big data

The search for regularities in Big Data is receiving more and more scientific attention. Naturally, methods of parameter estimation and tests of hypotheses from classical mathematical statistics are applied for this purpose. At the same time, well-established estimation methods turn out to be ineffective because of the "curse of dimensionality". Most tests of statistical hypotheses are suited to the analysis of samples of very limited size. Tests that can formally be used for sample sizes n → ∞ lead in practice to unjustified rejection of the hypothesis being tested. In estimation methods that operate on ungrouped data, computational costs grow dramatically as the sample size increases, and the convergence of the iterative algorithms used to construct the estimates worsens. The non-robustness of the estimates becomes an essential factor. Many tests cannot be applied to Big Data samples because the distributions of their statistics depend on n and only short tables of critical values are available. For reasonable n, this obstacle can be removed by using statistical simulation on a computer to find the empirical distribution of the statistic needed to make a decision. Tests with known limiting distributions of their statistics lead to incorrect conclusions because the sizes n in Big Data are "practically unlimited" while the data are recorded with limited accuracy. For parameter estimation from grouped data with a fixed number of intervals k, the computational costs do not change as the sample size increases; they grow only with k. Maximum likelihood estimates (MLE) computed from grouped samples are recommended. These estimates are robust and asymptotically efficient.
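The point about grouped data can be illustrated with a minimal sketch. It is not the authors' implementation: the exponential model, the interval boundaries, the sample size, and the helper names `group_counts`, `grouped_loglik`, and `grouped_mle_rate` are all assumptions chosen for illustration. After one counting pass over the sample, every evaluation of the grouped log-likelihood costs O(k) regardless of n.

```python
import bisect
import math
import random


def group_counts(sample, edges):
    """Count observations per interval [edges[i], edges[i+1]); the last is [edges[-1], inf)."""
    counts = [0] * len(edges)
    for x in sample:
        counts[bisect.bisect_right(edges, x) - 1] += 1
    return counts


def grouped_loglik(rate, edges, counts):
    """Grouped log-likelihood sum(n_i * log P_i) for an exponential law; costs O(k)."""
    k = len(edges)
    total = 0.0
    for i, n_i in enumerate(counts):
        upper = math.exp(-rate * edges[i + 1]) if i + 1 < k else 0.0
        p_i = math.exp(-rate * edges[i]) - upper  # probability of interval i
        if n_i:
            total += n_i * math.log(p_i)
    return total


def grouped_mle_rate(edges, counts, lo=1e-2, hi=50.0, tol=1e-8):
    """Golden-section search for the maximum (the log-likelihood is concave in rate)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if grouped_loglik(c, edges, counts) < grouped_loglik(d, edges, counts):
            a = c  # the maximum lies in [c, b]
        else:
            b = d  # the maximum lies in [a, d]
    return (a + b) / 2.0


random.seed(1)
true_rate = 2.0  # assumed value, used only to simulate data
sample = [random.expovariate(true_rate) for _ in range(100_000)]
edges = [0.0, 0.1, 0.25, 0.45, 0.75, 1.25]  # k = 6 intervals, chosen arbitrarily
counts = group_counts(sample, edges)  # the only step that touches all n observations
rate_hat = grouped_mle_rate(edges, counts)
print(counts, round(rate_hat, 3))
```

The interval boundaries here are not an asymptotically optimal grouping; they only show that the optimization step works with k counts rather than n observations.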
For small k, the quality of the estimates can be improved by asymptotically optimal grouping, which minimizes the loss of Fisher information caused by grouping. Using a Big Data sample as an example, it is shown how the result of applying Pearson's χ² test to simple and composite hypotheses depends on the number of intervals and on the grouping method. It is shown that there are no obstacles to applying Pearson's χ² test to large samples, and that it retains both its merits and its inherent drawbacks (the conclusions are ambiguous and depend substantially on the chosen number of intervals and on the grouping method). The distributions of the statistics of nonparametric goodness-of-fit tests (Kolmogorov, Cramer-Mises-Smirnov, and Anderson-Darling) were studied by statistical simulation as a function of the accuracy with which observations are recorded (that is, of the number of possible unique values in the samples). The results imply that, when Big Data are analyzed with these nonparametric goodness-of-fit tests, the statistics should be computed not over the entire large array but over samples drawn from the "general population", whose role is played here by the Big Data array being analyzed. The size of the extracted sample should take into account the accuracy with which the data are recorded (the number of possible unique values in the sample) and should not exceed some value n_max at which, for the given accuracy, the distribution G(S_{n_max} | H0) of the test statistic under a true hypothesis H0 does not differ from the limiting distribution G(S | H0) of this statistic. The presented results make it possible to estimate n_max for the tests considered. The estimates of n_max for the Kolmogorov test are substantially lower than those for the Cramer-Mises-Smirnov and Anderson-Darling tests. The obtained estimates of n_max can be used when applying goodness-of-fit tests in Big Data analysis.
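The effect of limited recording accuracy on the Kolmogorov test can be sketched as follows. This is an illustration under stated assumptions, not the simulation procedure of the paper: the standard normal model, the rounding step of 0.1, the array size, and the subsample size of 200 are arbitrary choices, and 200 is not an estimate of n_max. The statistic is taken as sqrt(n) * D_n without any small-sample correction, and the helper names are hypothetical.

```python
import math
import random


def normal_cdf(x):
    """CDF of the standard normal law (the fully specified H0 in this sketch)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def kolmogorov_statistic(sample, cdf):
    """Return sqrt(n) * D_n for a simple hypothesis with a fully specified cdf."""
    xs = sorted(sample)
    n = len(xs)
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return math.sqrt(n) * max(d_plus, d_minus)


def kolmogorov_pvalue(s, terms=100):
    """Upper tail 1 - K(s) of the limiting Kolmogorov distribution."""
    if s < 0.2:  # the tail equals 1 to many digits here
        return 1.0
    tail = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * s * s)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, tail))


random.seed(7)
# H0 is true by construction, but every value is recorded with accuracy 0.1.
big_array = [round(random.gauss(0.0, 1.0), 1) for _ in range(100_000)]
subsample = random.sample(big_array, 200)  # arbitrary size, not an n_max estimate

s_full = kolmogorov_statistic(big_array, normal_cdf)
s_sub = kolmogorov_statistic(subsample, normal_cdf)
print("full array:", s_full, kolmogorov_pvalue(s_full))
print("subsample: ", s_sub, kolmogorov_pvalue(s_sub))
```

On the full rounded array, the jumps of the empirical distribution function at the few distinct recorded values dominate the statistic, so a true H0 is rejected; on a moderate subsample, the rounding contributes much less to the statistic.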


Keywords

Big Data, parameter estimation, hypothesis testing, goodness-of-fit tests

Authors

Name | Organization | E-mail
Lemeshko Boris Yurievich | Novosibirsk State Technical University | Lemeshko@ami.nstu.ru
Lemeshko Stanislav Borisovich | Novosibirsk State Technical University | skyer@mail.ru
Semenova Mariya Alexandrovna | Novosibirsk State Technical University | vedernikova.m.a@gmail.com
