A STEP TOWARDS THE MAJORITY-BASED CLUSTERING VALIDATION DECISION FUSION METHOD


Abstract

A variety of clustering validation indices (CVIs) are aimed at validating the results of clustering analysis and determining which clustering algorithm performs best. Different validation indices may be appropriate for different clustering algorithms or partition dissimilarity measures; however, the most suitable index to use in practice remains unknown. A single CVI is generally unable to handle the wide variability and scalability of the data or to cope successfully with all contexts. Therefore, one popular approach is to combine multiple CVIs and fuse their votes into a final decision. The aim of this work is to analyze the majority-based decision fusion method. The experimental work consisted of designing and implementing the NbClust majority-based decision fusion method and then evaluating the performance of the CVIs with different clustering algorithms and dissimilarity measures in order to discover the best validation configuration. Moreover, the authors proposed enhancing the standard majority-based decision fusion method with straightforward rules to maximize the efficiency of the validation procedure. The results showed that the designed enhanced method with an invasive validation configuration could cope with almost all data sets (99%) across different experimental factors (density, dimensionality, number of clusters, etc.).
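The idea of fusing CVI votes by majority can be illustrated with a minimal Python sketch. It shows only the plain majority rule, not the enhanced method proposed in the paper, whose rules the abstract does not specify. The function name `majority_vote`, the index labels, the vote values, and the tie-break (prefer the smallest number of clusters) are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(votes):
    """Fuse per-index votes for the number of clusters by simple majority.

    Returns (winning k, number of votes it received, all tied candidates).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    top = max(counts.values())
    # Every candidate k that received the highest vote count.
    tied = sorted(k for k, c in counts.items() if c == top)
    # Tie-break is an assumption of this sketch: prefer the smallest k.
    return tied[0], top, tied


# Hypothetical votes: the number of clusters each validation index proposes.
cvi_votes = {"index_a": 3, "index_b": 3, "index_c": 4, "index_d": 3, "index_e": 2}

best_k, n_votes, tied = majority_vote(list(cvi_votes.values()))
print(f"fused decision: k={best_k} ({n_votes} of {len(cvi_votes)} votes)")
```

Returning the tied candidates alongside the winner keeps ambiguous outcomes visible, which is where additional decision rules would presumably apply.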


Keywords

clustering; clustering validation index; decision fusion method


Published: 2021-06-30


Panskyi, T., & Mosorov, V. (2021). A STEP TOWARDS THE MAJORITY-BASED CLUSTERING VALIDATION DECISION FUSION METHOD. Informatyka, Automatyka, Pomiary W Gospodarce I Ochronie Środowiska, 11(2), 4-13. https://doi.org/10.35784/iapgos.2596

Taras Panskyi, tpanski@kis.p.lodz.pl
Lodz University of Technology, Institute of Applied Computer Science, Lodz, Poland
http://orcid.org/0000-0002-0416-8711
Volodymyr Mosorov
Lodz University of Technology, Lodz, Poland
http://orcid.org/0000-0001-6016-8671