A STEP TOWARDS THE MAJORITY-BASED CLUSTERING VALIDATION DECISION FUSION METHOD
Taras Panskyi
tpanski@kis.p.lodz.plLodz University of Technology, Institute of Applied Computer Science, Lodz, Poland (Poland)
http://orcid.org/0000-0002-0416-8711
Volodymyr Mosorov
Lodz University of Technology, Lodz, Poland (Poland)
http://orcid.org/0000-0001-6016-8671
Abstract
A variety of clustering validation indices (CVIs) aimed at validating the results of clustering analysis and determining which clustering algorithm performs best. Different validation indices may be appropriate for different clustering algorithms or partition dissimilarity measures; however, the best suitable index to use in practice remains unknown. A single CVI is generally unable to handle the wide variability and scalability of the data and cope successfully with all the contexts. Therefore, one of the popular approaches is to use a combination of multiple CVIs and fuse their votes into the final decision. The aim of this work is to analyze the majority-based decision fusion method. Thus, the experimental work consisted of designing and implementing the NbClust majority-based decision fusion method and then evaluating the CVIs performance with different clustering algorithms and dissimilarity measures in order to discover the best validation configuration. Moreover, the author proposed to enhance the standard majority-based decision fusion method with straightforward rules for the maximum efficiency of the validation procedure. The result showed that the designed enhanced method with an invasive validation configuration could cope with almost all data sets (99%) with different experimental factors (density, dimensionality, number of clusters, etc.).
Keywords:
clustering, clustering validation index, decision fusion methodReferences
Akoglu L., Tong H., Koutra D.: Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery 29(3), 2015, 626–688.
DOI: https://doi.org/10.1007/s10618-014-0365-y
Google Scholar
Arbelaitz O., Gurrutxaga I., Muguerza J., Pérez J., Perona I.: An extensive comparative study of cluster validity indices. Pattern Recognition 46(1), 2013, 243–256.
DOI: https://doi.org/10.1016/j.patcog.2012.07.021
Google Scholar
Bailey K.D.: Typologies and Taxonomies: An introduction to classification techniques (quantitative applications in the social sciences). SAGE Publications, Thousand Oaks 1994.
DOI: https://doi.org/10.4135/9781412986397
Google Scholar
Ball G.H., Hall D.J.: ISODATA, a Novel Method of Data Analysis and Pattern Classification. Stanford Research Institute 1965.
Google Scholar
Bandyopadhyay S., Maulik U: Nonparametric genetic clustering: comparison of validity indices. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 31(1), 2001, 120–125.
DOI: https://doi.org/10.1109/5326.923275
Google Scholar
Beale E.M.L.: Cluster Analysis. Scientific Control Systems, London 1969.
Google Scholar
Bezdek J., Li W., Attikiouzel Y., Windham M.: A geometric approach to cluster validity for normal mixtures. Soft Computing – A Fusion of Foundations, Methodologies and Applications 1(4), 1997, 166 –179.
DOI: https://doi.org/10.1007/s005000050019
Google Scholar
Bezdek J., Pal N.: Some new indexes of cluster validity. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 28(3), 1998, 301–315.
DOI: https://doi.org/10.1109/3477.678624
Google Scholar
Berkhin P.: A Survey of Clustering Data Mining Techniques. Grouping Multidimensional Data. Springer, Berlin 2006.
Google Scholar
Braune C., Besecke S., Kruse R.: Density Based Clustering: Alternatives to DBSCAN, Partitional Clustering Algorithms. Springer, Cham 2014.
DOI: https://doi.org/10.1007/978-3-319-09259-1_6
Google Scholar
Brock G., Pihur V., Datta S., Datta S.: clValid: An R Package for Cluster Validation. Journal of Statistical Software 25(4), 2008, 1–22.
DOI: https://doi.org/10.18637/jss.v025.i04
Google Scholar
Brun M., Sima C., Hua J., Lowey J., Carroll B., Suh E., Dougherty E.: Model-based evaluation of clustering validation measures. Pattern Recognition 40(3), 2007, 807–824.
DOI: https://doi.org/10.1016/j.patcog.2006.06.026
Google Scholar
Calinski T., Harabasz J.: A dendrite method for cluster analysis. Communications in Statistics – Theory and Methods 3(1), 1974, 1–27.
DOI: https://doi.org/10.1080/03610927408827101
Google Scholar
Cannataro M., Congiusta A., Mastroianni C., Pugliese A., Talia D., Trunfio P.: Grid-Based Data Mining and Knowledge Discovery. Intelligent Technologies for Information Analysis. Springer, Berlin 2004.
DOI: https://doi.org/10.1007/978-3-662-07952-2_2
Google Scholar
Celebi M.: Partitional clustering algorithms. Springer, Cham 2015.
DOI: https://doi.org/10.1007/978-3-319-09259-1
Google Scholar
Charrad M., Ghazzali N., Boiteau V., Niknafs A.: NbClust: AnRPackage for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software 61(6), 2014, 1–36.
DOI: https://doi.org/10.18637/jss.v061.i06
Google Scholar
Cho K., Lee J.: Grid-Based and Outlier Detection-Based Data Clustering and Classification. Communications in Computer and Information Science. Springer, Berlin 2011.
DOI: https://doi.org/10.1007/978-3-642-20975-8_14
Google Scholar
Chou C., Su M., Lai E.: A new cluster validity measure and its application to image compression. Pattern Analysis and Applications 7(2), 2004, 205–220.
DOI: https://doi.org/10.1007/s10044-004-0218-1
Google Scholar
Davies D., Bouldin D.: A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1(2), 1979, 224–227.
DOI: https://doi.org/10.1109/TPAMI.1979.4766909
Google Scholar
Deng M., Liu Q., Cheng T., Shi Y.: An Adaptive Spatial Clustering Algorithm Based On Delaunay Triangulation. Computers, Environment and Urban Systems 35, 2011, 320–332.
DOI: https://doi.org/10.1016/j.compenvurbsys.2011.02.003
Google Scholar
Dimitriadou E.: cclust: Convex Clustering Methods and Clustering Indexes. R package version 0.6-18, 2014.
Google Scholar
Dimitriadou E., Dolňicar S., Weingessel A.: An examination of indexes for determining the number of clusters in binary data sets. Psychometrika 67(1), 2002, 137–159.
DOI: https://doi.org/10.1007/BF02294713
Google Scholar
Dubes R.: How many clusters are best? – An experiment. Pattern Recognition 20(6), 1987, 645–663.
DOI: https://doi.org/10.1016/0031-3203(87)90034-3
Google Scholar
Duda R., Hart P: Pattern classification and scene analysis. Wiley, New York 1973.
Google Scholar
Duda R, Hart P., Stork D.: Pattern classification. Wiley, New York 2001.
Google Scholar
Dunn J.: Well-Separated Clusters and Optimal Fuzzy Partitions. Journal of Cybernetics 4(1), 1974, 95–104.
DOI: https://doi.org/10.1080/01969727408546059
Google Scholar
Embrechts E., Gatti C., Linton J., Roysam B.: Hierarchical Clustering for Large Data Sets. Advances in Intelligent Signal Processing and Data Mining. Springer, Berlin 2013.
DOI: https://doi.org/10.1007/978-3-642-28696-4_8
Google Scholar
Estivill-Castro V., Lee I.: Argument Free Clustering For Large Spatial Point-Data Sets Via Boundary Extraction From Delaunay Diagram. Computers, Environment and Urban Systems 26, 2002, 315–334.
DOI: https://doi.org/10.1016/S0198-9715(01)00044-8
Google Scholar
Fränti P., Mariescu-Istodor R., Zhong C.: XNN Graph, Lecture Notes in Computer Science, 10029, 2016, 207–217.
DOI: https://doi.org/10.1007/978-3-319-49055-7_19
Google Scholar
Frey T., van Groenewoud H.: A Cluster Analysis of the D 2 Matrix of White Spruce Stands in Saskatchewan Based on the Maximum-Minimum Principle. The Journal of Ecology 60(3), 1972, 873–886.
DOI: https://doi.org/10.2307/2258571
Google Scholar
Friedman H., Rubin J.: On Some Invariant Criteria for Grouping Data. Journal of the American Statistical Association 62(320), 1967, 1159–1178.
DOI: https://doi.org/10.1080/01621459.1967.10500923
Google Scholar
Granichin O., Volkovich Z., Toledano-Kitai D.: Cluster Validation. Intelligent Systems Reference Library. Springer, Berlin 2015.
DOI: https://doi.org/10.1007/978-3-642-54786-7_7
Google Scholar
Gurrutxaga I., Muguerza J., Arbelaitz O., Pérez J., Martín J.: Towards a standard methodology to evaluate internal cluster validity indices. Pattern Recognition Letters 32(3), 2011, 505–515.
DOI: https://doi.org/10.1016/j.patrec.2010.11.006
Google Scholar
Halim Z., J. Khattak J.: Density-based clustering of big probabilistic graphs. Evolving Systems 10, 2019, 333–350.
DOI: https://doi.org/10.1007/s12530-018-9223-2
Google Scholar
Halkidi M., Batistakis Y., Vazirgiannis M.: On Clustering Validation Techniques. Journal of Intelligent Information Systems 17(2/3), 2001, 107–145.
DOI: https://doi.org/10.1023/A:1012801612483
Google Scholar
Handl J., Knowles J.: Multi-Objective Clustering and Cluster Validation. Studies in Computational Intelligence. Springer, Berlin 2006.
Google Scholar
Halkidi M., Vazirgiannis M.: A density-based cluster validity approach using multi-representatives. Pattern Recognition Letters, 29(6), 2008, 773–786.
DOI: https://doi.org/10.1016/j.patrec.2007.12.011
Google Scholar
Halkidi M., Vazirgiannis M.: Clustering validity assessment: finding the optimal partitioning of a data set. Proceedings 2001 IEEE International Conference on Data Mining. IEEE, San Jose 2001.
Google Scholar
Halkidi M., Vazirgiannis M., Batistakis Y.: Quality Scheme Assessment in the Clustering Process. Lecture Notes in Computer Science. Springer, Berlin 2000.
DOI: https://doi.org/10.1007/3-540-45372-5_26
Google Scholar
Hartigan J.A.: Clustering Algorithms. John Wiley & Sons, New York 1975.
Google Scholar
Hennig C.: Methods for merging Gaussian mixture components. Advances in Data Analysis and Classification 4, 2010, 3–34.
DOI: https://doi.org/10.1007/s11634-010-0058-3
Google Scholar
Hornik K.: A CLUE for CLUster Ensembles. Journal of Statistical Software 14(12), 2005, 1–25.
DOI: https://doi.org/10.18637/jss.v014.i12
Google Scholar
Hubert L., Levin J.: A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin 83(6), 1976, 1072–1080.
DOI: https://doi.org/10.1037/0033-2909.83.6.1072
Google Scholar
Kryszczuk K., Hurley P.: Estimation of the Number of Clusters Using Multiple Clustering Validity Indices. Lecture Notes in Computer Science, Springer, Berlin 2010.
DOI: https://doi.org/10.1007/978-3-642-12127-2_12
Google Scholar
Krzanowski W., Lai Y.: A Criterion for Determining the Number of Groups in a Data Set Using Sum-of-Squares Clustering. Biometrics 44(1), 1988, 23–34.
DOI: https://doi.org/10.2307/2531893
Google Scholar
Lu J., Zhang G., Ruan D., Wu F.: Multi-objective group decision making: methods, software and applications with fuzzy set techniques. Imperial College Press, London 2007.
DOI: https://doi.org/10.1142/p505
Google Scholar
Maalel W., Zhou K., Martin A., Elouedi Z.: Belief Hierarchical Clustering, Belief Functions: Theory and Applications. Lecture Notes in Computer Science. Springer, Cham 2014.
DOI: https://doi.org/10.1007/978-3-319-11191-9_8
Google Scholar
Marriott F.: Practical Problems in a Method of Cluster Analysis. Biometrics 27(3), 1971, 501–514.
DOI: https://doi.org/10.2307/2528592
Google Scholar
McClain J., Rao V.: CLUSTISZ: A Program to Test for the Quality of Clustering of a Set of Objects. Journal of Marketing Research 12(4), 1975, 456–460.
Google Scholar
Meyer D., Dimitriadou E., Hornik K., Weingessel A., Leisch F.: E1071: Misc Functions of the Department of Statistics, Probability Theory Group. R package version 1.6-8, 2017.
Google Scholar
Milligan G.: An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika 45(3), 1980, 325–342.
DOI: https://doi.org/10.1007/BF02293907
Google Scholar
Milligan G., Cooper M.: An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2), 1985, 159–179.
DOI: https://doi.org/10.1007/BF02294245
Google Scholar
Nerurkar P., Pavate A., Shah M., Jacob S.: Performance of Internal Cluster Validations Measures for Evolutionary Clustering. Advances in Intelligent Systems and Computing. Springer, Singapore 2018.
DOI: https://doi.org/10.1007/978-981-13-1513-8_105
Google Scholar
Nieweglowski L.: clv: Cluster Validation Techniques. R package version 0.3-2.1, 2014.
Google Scholar
Oliveira J., Pedrycz W.: Advances in fuzzy clustering and its applications. John Wiley & Sons Ltd, Chichester 2007.
Google Scholar
Peng Q., Wang Y., Ou G., Tian Y., Huang L., Pang W.: Partitioning Clustering Based on Support Vector Ranking. Lecture Notes in Computer Science. Springer, Cham 2016.
DOI: https://doi.org/10.1007/978-3-319-49586-6_52
Google Scholar
Ratkowsky D.A., Lance G.N.: A Criterion for Determining the Number of Groups in a Classification. Australian Computer Journal 10(3), 1978, 115–117.
Google Scholar
Rezaei M., Fränti P.: Set Matching Measures for External Cluster Validity. IEEE Transactions on Knowledge and Data Engineering 28(8), 2016, 2173–2186.
DOI: https://doi.org/10.1109/TKDE.2016.2551240
Google Scholar
Rousseeuw P.: Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 1987, 53–65.
DOI: https://doi.org/10.1016/0377-0427(87)90125-7
Google Scholar
Roux M.: A Comparative Study of Divisive and Agglomerative Hierarchical Clustering Algorithms. Journal of Classification 35(2), 2018, 345–366.
DOI: https://doi.org/10.1007/s00357-018-9259-9
Google Scholar
Sarle W.S.: Cubic Clustering Criterion, SAS Technical Report A-108. SAS Institute Inc, Cary 1983.
Google Scholar
Saemi B., Hosseinabadi A., Kardgar M., Balas V., Ebadi H.: Nature Inspired Partitioning Clustering Algorithms: A Review and Analysis. Advances in Intelligent Systems and Computing. Springer, Cham 2017.
DOI: https://doi.org/10.1007/978-3-319-62524-9_9
Google Scholar
Scott A., Symons M.: Clustering Methods Based on Likelihood Ratio Criteria. Biometrics 27(2), 1971, 387–397.
DOI: https://doi.org/10.2307/2529003
Google Scholar
Shim Y., Chung J., Choi I.: A Comparison Study of Cluster Validity Indices Using a Nonhierarchical Clustering Algorithm. International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC'06). IEEE, Vienna 2005.
Google Scholar
Steinley D., Henson R.: OCLUS: An Analytic Method for Generating Clusters with Known Overlap. Journal of Classification 22(2), 2005, 221–250.
DOI: https://doi.org/10.1007/s00357-005-0015-6
Google Scholar
Tan P., Steinbach M., Kumar V.: Introduction to data mining. Pearson, 2005.
Google Scholar
Vathy-Fogarassy A., Abonyi J.: Graph-Based Clustering and Data Visualization Algorithms. Springer, London 2013.
DOI: https://doi.org/10.1007/978-1-4471-5158-6
Google Scholar
Walesiak M., Dudek A.: clusterSim: Searching for Optimal Clustering Procedure for a Data Set. R package version 0.43-4, 2014.
Google Scholar
Yera A., Arbelaitz O., Jodra J., Gurrutxaga I., Pérez J., Muguerza J.: Analysis of several decision fusion strategies for clustering validation. Strategy definition, experiments and validation. Pattern Recognition Letters 85, 2017, 42–48.
DOI: https://doi.org/10.1016/j.patrec.2016.11.009
Google Scholar
Zahn C.: Graph-Theoretical Methods For Detecting And Describing Gestalt Clusters. IEEE Transactions on Computers C-20, 1971, 68–86.
DOI: https://doi.org/10.1109/T-C.1971.223083
Google Scholar
Žalik K., Žalik B.: Validity index for clusters of different sizes and densities. Pattern Recognition Letters 32(2), 2011, 221–234.
DOI: https://doi.org/10.1016/j.patrec.2010.08.007
Google Scholar
Zhong C., Miao D., Wang R.: A Graph-Theoretical Clustering Method Based On Two Rounds Of Minimum Spanning Trees. Pattern Recognition 43, 2010, 752–766.
DOI: https://doi.org/10.1016/j.patcog.2009.07.010
Google Scholar
Authors
Taras Panskyitpanski@kis.p.lodz.pl
Lodz University of Technology, Institute of Applied Computer Science, Lodz, Poland Poland
http://orcid.org/0000-0002-0416-8711
Authors
Volodymyr MosorovLodz University of Technology, Lodz, Poland Poland
http://orcid.org/0000-0001-6016-8671
Statistics
Abstract views: 360PDF downloads: 284
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Most read articles by the same author(s)
- Fatma Mbarek, Volodymyr Mosorov, Rafał Wojciechowski, WEB SERVER LATENCY REDUCTION STUDY , Informatyka, Automatyka, Pomiary w Gospodarce i Ochronie Środowiska: Vol. 7 No. 3 (2017)