Adaptive Estimation of Weight Coefficient in a Time-weighted Incremental EM-algorithm for Data Streams
Application of data stream clustering is quite general. Applications of data streams include mining data generated by sensor networks, meteorological analysis, stock market analysis, computer network traffic monitoring, long-term population-based, cognitive, real-time sociological studies etc. These applications involve datasets that are far too large to fit in main memory and are typically stored in a secondary storage device. Real-time processing of large volumes of data requires efficient, fast algorithms and data compression. Moreover, recently obtained objects are often more important than the old ones, so several clustering algorithms implement a damped window model with a user-defined decay coefficient. This coefficient is not obvious. An important task in the development of machine learning algorithms is to reduce the number of algorithm parameters (user-defined parameter) through the development of adaptive mechanisms. Such a mechanism for determining the decay coefficient proposed in this paper. A Gaussian mixture data stream algorithm is constructed with damped window model. Decay coefficient for the object according damped window is w=w(At) where At is time since object arrived from the stream into cluster, weight of the cluster W(t) is sum of all its objects. Recent objects receive higher weight than older ones, and the weights of the objects decrease with time. It is usually to assume w(At)=e (a>0), with following properties: 1) when arrives, the object weight is equal to one: w(0)=1; 2) over time object weight monotonic decreases to zero: lim w(At) = 0; At 3) if weight of object w(A^) after At! units of time is known, then after more At2 it can be easily recalculates as w(Ati+At2) = w(Atj)w(At2); 4) if weight of cluster W(t) for the instant in time t is known, and during the next time period At there was no new objects arrived in this cluster, then weight of this cluster can be recalculates as W(t+At) = W(t)w(At). Let at initial instant in time we have a set of K Gaussian clusters Ck that described by their means ц and correlation matrixes Xk. Weight Wk at initial time defined as a number of objects in cluster Ck. Let {x1, x2, . ,.}eR is data stream, i.e. set of objects with timestamps t1, t2, .... When a new object x with timestamp t arrives to the cluster Ck, the following algorithm executes. Input: new object x, cluster parameters: mean correlation matrix X, weight Wk (k=1,2,...K), time period since last object arrives At. 1. Recalculate weights Wk = eWk (k=1,2,...,K). Wk 9k ( x\ цk, Z K ) 2. Calculate probabilities of belonging x to the k-th cluster щ =--~i~\-. Put x into the most probable cluster. ZmWФ* (( Mi'' Z ) 3. For this cluster (in with x is, index of cluster is dropped) recalculate mean ц, correlation matrix X and weight W: W+ W (|~- x) ^ 2~+- W=W+1, w+1 i+ = T ~ ; 2+ = w+1 w+1 where indexes - and + corresponds to the values before and after recalculation. Note that weight function depends on a>0, that may be defined individually for each Ck. We propose the following algorithm to recalculate a separately for each cluster at the time the object arrives into cluster. Input: xi, x2,...,xN - last N objects putted into cluster, t1, ,tN - their timestamps, |0 - cluster mean Z0 - cluster correlation matrix. 1. At initial instance of time t0 adopt a=0, i.e. we assume that cluster is not moving. 2. When object xi with timestamp ti (i1 + x2 + ...+ xN)/Nand ft-t0). v = (||0) _ 3. If i=N calculate ^n ( ) using accumulated variables. Calculate a = -| v |ln e/r, where r is Euclidean distance be- 2 i=1i 0 tween |0 and confidence ellipse boundary (confidence probability p = 1 - e) in direction of v vector. Set t0=tN, assign |0 and Z0 a corresponding values of cluster at the time tN and return to step 1. We perform an experiment using an imitation model of Gaussian mixture clusters with moving centroids. Some results are shown in this article. Proposed algorithm is undemanding to resources (time, memory) and therefore is suitable for real-time monitoring in large dynamic systems, such as computer systems and networks. Quality of decay coefficient adaptation is acceptable as follows from the experimental data.
Keywords
алгоритм кластеризации,
поток данных,
системы реального времени,
затухающая оконная модель,
адаптивный алгоритм,
dustering,
time-weight,
data stream,
Gaussian mixture model,
damped window modelAuthors
Nissenbaum Olga V. | Tyumen State University | onissenbaum@rambler.ru |
Kharchenko Anastasia M. | Tyumen State University | kharchenkoam.86@gmail.com |
Всего: 2
References
Ding S., Wu F., Qian J., Jia H., Jin F. Research on data stream clustering algorithms // Artif Intell Rev. 2015. V. 43. P. 593-600.
Munro J., Paterson M. Selection and Sorting with Limited Storage // Theoretical Computer Science. 1980. Р. 315-323.
Henzinger M., Raghavan P., Rajagopalan S. Computing on Data Streams // Digital Equipment Corporation. SRC TN-1998-011. Au gust 1998.
Ackerman M., Martens R., Raupach C., Swierkot K., Lammersen C., Sohler C. StreamKM++: A clustering algorithm for data streams // ACM J. Exper. Algor. 2012. V. 17, No. 1.
Aggarwal C.C. A segment-based framework for modeling and mining data streams // Knowl. Inf. Syst. 2010. V. 30, No. 1. P. 1-29.
Csernel B., Clerot F., Hebrail G. Data stream clustering over tilted windows through sampling // Proceedings of the Knowledge Dis covery from Data Streams Workshop (ECML/PKDD). 2006.
Dang X.H., Lee V.C.S., Ng W.K., Ciptadi A., Ong K.-L. An EM-based algorithm for clustering data streams in sliding windows // Proceedings of the 14th International Conference on Database Systems for Advanced Applications. Lecture Notes in Computer Science. Springer, 2009. V. 5463. P. 230-235.
Khalilian M., Mustapha N. Data stream clustering: challenges and issues // Proceedings of International Multi Conference of Engi neers and Computer Scientists. 2010. P. 566-569.
Li Y., Tan B.H. Data stream clustering algorithm based on affinity propagation and density // Advanced Materials Res. 2011. V. 267. P. 444-449.
Ong K.L, Li W., Ng W.-K., Lim E.-P. SCLOPE: An algorithm for clustering data streams of categorical attributes // Proceedings of the 6th International Conference on Data Warehousing and Knowledge Discovery (KDD'04). 2004. P. 209-218.
Silva J.A., Hruschka E.R. Extending k-means-based algorithms for evolving data streams with variable number of clusters // Pro ceedings of the 4th International Conference on Machine Learning and Applications (ICMLA'11). 2011. V. 2. P. 14-19.
Cao F., Zhou A.Y. Fast clustering of data streams using graphics processors // Journal of Software. 2007. V. 18, No. 2. P. 291-302.
Zhu W.H., Yin J., Xie Y.H. Arbitrary shape cluster algorithm for clustering data stream // Journal of Software. 2006. V. 17, No. 3. P. 379-387.
Chandrika J., Ananda Kumar K.R. Dynamic Clustering Of High Speed Data Streams // International Journal of Computer Science Issues. 2012. V. 9, Issue 2, No. 1. P. 224-228. На рис. 3 приведена визуализация одного из 30 результатов. Эллипсами обозначены доверительные области кластеров с вероятностью 0,9 (пунктирная линия - истинные, сплошная - полученные в предложенном алгоритме) в момент окончания эксперимента.
Vorontsov K., Frei O., Apishev M., Romov P., Suvorova M., Yanina A. Non-bayesian additive regularization for multimodal topic modeling of large collections // Proceedings of the 2015 Workshop on Topic Models: Post-Processing and Applications. 2015. P. 29-37.
Spirin N., Vorontsov K. Learning to rank with nonlinear monotonic ensemble // International Workshop on Multiple Classifier Sys tems: Springer Berlin Heidelberg, 2011. P. 16-25.
Qian Quan, Chao-Jie Xiao, Rui Zhang. Grid-based Data Stream Clustering for Intrusion Detection // International Journal of Net work Security. 2013. V. 15, No. 1. P. 1-8.
Daniel A., Sternberg, Kacey Ballard, Joseph L. Hardey at al. The Largest Human Cognitive Performance Dataset Reveals Insights into the Effects of Lifestyle Factors and Aging // Frontiers in Human Neuroscience. 2013. V. 7. Article 292.
Zhang X., Sebag M., Germain-Renaud C. Multi-scale real-time grid monitoring with job stream mining // Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'09). 2009. P. 420-427.
Shurygin A.M., Strigunova M.S. Prediction of strong earthquakes with the use of scalar product of matrices with a sliding window // Pattern Recognition and Image Analysis (Advances in Mathematiccal Theory and Applications). 2007. V. 17, No. 1. P. 153-162.
Ниссенбаум О.В., Присяжнюк А.С. Адаптивный алгоритм отслеживания аномальной активности в компьютерной сети на основании характерных изменений оценок альтернирующего потока // Прикладная дискретная математика. Приложение. 2010. № 3. С. 55-58.
Ниссенбаум О.В. Алгоритм кластеризации потока данных с изменяющимися параметрами распределения // Вестник Тю менского государственного университета. 2013. № 7. С. 180-186.
Cao F., Ester M., Qian W., Zhou A. Density-based clustering over an evolving data stream with noise // Proceedings of the 6th SI- AM International Conference on Data Mining. 2006. P. 328-339.
Mingzhou Song, Hongbin Wang. Highly efficient incremental estimation of Gaussian mixture models for online data stream cluster ing // Proceedings of SPIE 5803. 2005. P. 174-183.