A STEP TOWARDS THE MAJORITY-BASED CLUSTERING VALIDATION DECISION FUSION METHOD

Taras Panskyi

tpanski@kis.p.lodz.pl
Lodz University of Technology, Institute of Applied Computer Science, Lodz, Poland (Poland)
http://orcid.org/0000-0002-0416-8711

Volodymyr Mosorov


Lodz University of Technology, Lodz, Poland (Poland)
http://orcid.org/0000-0001-6016-8671

Abstract

A variety of clustering validation indices (CVIs) aimed at validating the results of clustering analysis and determining which clustering algorithm performs best. Different validation indices may be appropriate for different clustering algorithms or partition dissimilarity measures; however, the best suitable index to use in practice remains unknown. A single CVI is generally unable to handle the wide variability and scalability of the data and cope successfully with all the contexts. Therefore, one of the popular approaches is to use a combination of multiple CVIs and fuse their votes into the final decision. The aim of this work is to analyze the majority-based decision fusion method. Thus, the experimental work consisted of designing and implementing the NbClust majority-based decision fusion method and then evaluating the CVIs performance with different clustering algorithms and dissimilarity measures in order to discover the best validation configuration. Moreover, the author proposed to enhance the standard majority-based decision fusion method with straightforward rules for the maximum efficiency of the validation procedure. The result showed that the designed enhanced method with an invasive validation configuration could cope with almost all data sets (99%) with different experimental factors (density, dimensionality, number of clusters, etc.).


Keywords:

clustering, clustering validation index, decision fusion method

Akoglu L., Tong H., Koutra D.: Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery 29(3), 2015, 626–688.
DOI: https://doi.org/10.1007/s10618-014-0365-y   Google Scholar

Arbelaitz O., Gurrutxaga I., Muguerza J., Pérez J., Perona I.: An extensive comparative study of cluster validity indices. Pattern Recognition 46(1), 2013, 243–256.
DOI: https://doi.org/10.1016/j.patcog.2012.07.021   Google Scholar

Bailey K.D.: Typologies and Taxonomies: An introduction to classification techniques (quantitative applications in the social sciences). SAGE Publications, Thousand Oaks 1994.
DOI: https://doi.org/10.4135/9781412986397   Google Scholar

Ball G.H., Hall D.J.: ISODATA, a Novel Method of Data Analysis and Pattern Classification. Stanford Research Institute 1965.
  Google Scholar

Bandyopadhyay S., Maulik U: Nonparametric genetic clustering: comparison of validity indices. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 31(1), 2001, 120–125.
DOI: https://doi.org/10.1109/5326.923275   Google Scholar

Beale E.M.L.: Cluster Analysis. Scientific Control Systems, London 1969.
  Google Scholar

Bezdek J., Li W., Attikiouzel Y., Windham M.: A geometric approach to cluster validity for normal mixtures. Soft Computing – A Fusion of Foundations, Methodologies and Applications 1(4), 1997, 166 –179.
DOI: https://doi.org/10.1007/s005000050019   Google Scholar

Bezdek J., Pal N.: Some new indexes of cluster validity. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 28(3), 1998, 301–315.
DOI: https://doi.org/10.1109/3477.678624   Google Scholar

Berkhin P.: A Survey of Clustering Data Mining Techniques. Grouping Multidimensional Data. Springer, Berlin 2006.
  Google Scholar

Braune C., Besecke S., Kruse R.: Density Based Clustering: Alternatives to DBSCAN, Partitional Clustering Algorithms. Springer, Cham 2014.
DOI: https://doi.org/10.1007/978-3-319-09259-1_6   Google Scholar

Brock G., Pihur V., Datta S., Datta S.: clValid: An R Package for Cluster Validation. Journal of Statistical Software 25(4), 2008, 1–22.
DOI: https://doi.org/10.18637/jss.v025.i04   Google Scholar

Brun M., Sima C., Hua J., Lowey J., Carroll B., Suh E., Dougherty E.: Model-based evaluation of clustering validation measures. Pattern Recognition 40(3), 2007, 807–824.
DOI: https://doi.org/10.1016/j.patcog.2006.06.026   Google Scholar

Calinski T., Harabasz J.: A dendrite method for cluster analysis. Communications in Statistics – Theory and Methods 3(1), 1974, 1–27.
DOI: https://doi.org/10.1080/03610927408827101   Google Scholar

Cannataro M., Congiusta A., Mastroianni C., Pugliese A., Talia D., Trunfio P.: Grid-Based Data Mining and Knowledge Discovery. Intelligent Technologies for Information Analysis. Springer, Berlin 2004.
DOI: https://doi.org/10.1007/978-3-662-07952-2_2   Google Scholar

Celebi M.: Partitional clustering algorithms. Springer, Cham 2015.
DOI: https://doi.org/10.1007/978-3-319-09259-1   Google Scholar

Charrad M., Ghazzali N., Boiteau V., Niknafs A.: NbClust: AnRPackage for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software 61(6), 2014, 1–36.
DOI: https://doi.org/10.18637/jss.v061.i06   Google Scholar

Cho K., Lee J.: Grid-Based and Outlier Detection-Based Data Clustering and Classification. Communications in Computer and Information Science. Springer, Berlin 2011.
DOI: https://doi.org/10.1007/978-3-642-20975-8_14   Google Scholar

Chou C., Su M., Lai E.: A new cluster validity measure and its application to image compression. Pattern Analysis and Applications 7(2), 2004, 205–220.
DOI: https://doi.org/10.1007/s10044-004-0218-1   Google Scholar

Davies D., Bouldin D.: A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1(2), 1979, 224–227.
DOI: https://doi.org/10.1109/TPAMI.1979.4766909   Google Scholar

Deng M., Liu Q., Cheng T., Shi Y.: An Adaptive Spatial Clustering Algorithm Based On Delaunay Triangulation. Computers, Environment and Urban Systems 35, 2011, 320–332.
DOI: https://doi.org/10.1016/j.compenvurbsys.2011.02.003   Google Scholar

Dimitriadou E.: cclust: Convex Clustering Methods and Clustering Indexes. R package version 0.6-18, 2014.
  Google Scholar

Dimitriadou E., Dolňicar S., Weingessel A.: An examination of indexes for determining the number of clusters in binary data sets. Psychometrika 67(1), 2002, 137–159.
DOI: https://doi.org/10.1007/BF02294713   Google Scholar

Dubes R.: How many clusters are best? – An experiment. Pattern Recognition 20(6), 1987, 645–663.
DOI: https://doi.org/10.1016/0031-3203(87)90034-3   Google Scholar

Duda R., Hart P: Pattern classification and scene analysis. Wiley, New York 1973.
  Google Scholar

Duda R, Hart P., Stork D.: Pattern classification. Wiley, New York 2001.
  Google Scholar

Dunn J.: Well-Separated Clusters and Optimal Fuzzy Partitions. Journal of Cybernetics 4(1), 1974, 95–104.
DOI: https://doi.org/10.1080/01969727408546059   Google Scholar

Embrechts E., Gatti C., Linton J., Roysam B.: Hierarchical Clustering for Large Data Sets. Advances in Intelligent Signal Processing and Data Mining. Springer, Berlin 2013.
DOI: https://doi.org/10.1007/978-3-642-28696-4_8   Google Scholar

Estivill-Castro V., Lee I.: Argument Free Clustering For Large Spatial Point-Data Sets Via Boundary Extraction From Delaunay Diagram. Computers, Environment and Urban Systems 26, 2002, 315–334.
DOI: https://doi.org/10.1016/S0198-9715(01)00044-8   Google Scholar

Fränti P., Mariescu-Istodor R., Zhong C.: XNN Graph, Lecture Notes in Computer Science, 10029, 2016, 207–217.
DOI: https://doi.org/10.1007/978-3-319-49055-7_19   Google Scholar

Frey T., van Groenewoud H.: A Cluster Analysis of the D 2 Matrix of White Spruce Stands in Saskatchewan Based on the Maximum-Minimum Principle. The Journal of Ecology 60(3), 1972, 873–886.
DOI: https://doi.org/10.2307/2258571   Google Scholar

Friedman H., Rubin J.: On Some Invariant Criteria for Grouping Data. Journal of the American Statistical Association 62(320), 1967, 1159–1178.
DOI: https://doi.org/10.1080/01621459.1967.10500923   Google Scholar

Granichin O., Volkovich Z., Toledano-Kitai D.: Cluster Validation. Intelligent Systems Reference Library. Springer, Berlin 2015.
DOI: https://doi.org/10.1007/978-3-642-54786-7_7   Google Scholar

Gurrutxaga I., Muguerza J., Arbelaitz O., Pérez J., Martín J.: Towards a standard methodology to evaluate internal cluster validity indices. Pattern Recognition Letters 32(3), 2011, 505–515.
DOI: https://doi.org/10.1016/j.patrec.2010.11.006   Google Scholar

Halim Z., J. Khattak J.: Density-based clustering of big probabilistic graphs. Evolving Systems 10, 2019, 333–350.
DOI: https://doi.org/10.1007/s12530-018-9223-2   Google Scholar

Halkidi M., Batistakis Y., Vazirgiannis M.: On Clustering Validation Techniques. Journal of Intelligent Information Systems 17(2/3), 2001, 107–145.
DOI: https://doi.org/10.1023/A:1012801612483   Google Scholar

Handl J., Knowles J.: Multi-Objective Clustering and Cluster Validation. Studies in Computational Intelligence. Springer, Berlin 2006.
  Google Scholar

Halkidi M., Vazirgiannis M.: A density-based cluster validity approach using multi-representatives. Pattern Recognition Letters, 29(6), 2008, 773–786.
DOI: https://doi.org/10.1016/j.patrec.2007.12.011   Google Scholar

Halkidi M., Vazirgiannis M.: Clustering validity assessment: finding the optimal partitioning of a data set. Proceedings 2001 IEEE International Conference on Data Mining. IEEE, San Jose 2001.
  Google Scholar

Halkidi M., Vazirgiannis M., Batistakis Y.: Quality Scheme Assessment in the Clustering Process. Lecture Notes in Computer Science. Springer, Berlin 2000.
DOI: https://doi.org/10.1007/3-540-45372-5_26   Google Scholar

Hartigan J.A.: Clustering Algorithms. John Wiley & Sons, New York 1975.
  Google Scholar

Hennig C.: Methods for merging Gaussian mixture components. Advances in Data Analysis and Classification 4, 2010, 3–34.
DOI: https://doi.org/10.1007/s11634-010-0058-3   Google Scholar

Hornik K.: A CLUE for CLUster Ensembles. Journal of Statistical Software 14(12), 2005, 1–25.
DOI: https://doi.org/10.18637/jss.v014.i12   Google Scholar

Hubert L., Levin J.: A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin 83(6), 1976, 1072–1080.
DOI: https://doi.org/10.1037/0033-2909.83.6.1072   Google Scholar

Kryszczuk K., Hurley P.: Estimation of the Number of Clusters Using Multiple Clustering Validity Indices. Lecture Notes in Computer Science, Springer, Berlin 2010.
DOI: https://doi.org/10.1007/978-3-642-12127-2_12   Google Scholar

Krzanowski W., Lai Y.: A Criterion for Determining the Number of Groups in a Data Set Using Sum-of-Squares Clustering. Biometrics 44(1), 1988, 23–34.
DOI: https://doi.org/10.2307/2531893   Google Scholar

Lu J., Zhang G., Ruan D., Wu F.: Multi-objective group decision making: methods, software and applications with fuzzy set techniques. Imperial College Press, London 2007.
DOI: https://doi.org/10.1142/p505   Google Scholar

Maalel W., Zhou K., Martin A., Elouedi Z.: Belief Hierarchical Clustering, Belief Functions: Theory and Applications. Lecture Notes in Computer Science. Springer, Cham 2014.
DOI: https://doi.org/10.1007/978-3-319-11191-9_8   Google Scholar

Marriott F.: Practical Problems in a Method of Cluster Analysis. Biometrics 27(3), 1971, 501–514.
DOI: https://doi.org/10.2307/2528592   Google Scholar

McClain J., Rao V.: CLUSTISZ: A Program to Test for the Quality of Clustering of a Set of Objects. Journal of Marketing Research 12(4), 1975, 456–460.
  Google Scholar

Meyer D., Dimitriadou E., Hornik K., Weingessel A., Leisch F.: E1071: Misc Functions of the Department of Statistics, Probability Theory Group. R package version 1.6-8, 2017.
  Google Scholar

Milligan G.: An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika 45(3), 1980, 325–342.
DOI: https://doi.org/10.1007/BF02293907   Google Scholar

Milligan G., Cooper M.: An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2), 1985, 159–179.
DOI: https://doi.org/10.1007/BF02294245   Google Scholar

Nerurkar P., Pavate A., Shah M., Jacob S.: Performance of Internal Cluster Validations Measures for Evolutionary Clustering. Advances in Intelligent Systems and Computing. Springer, Singapore 2018.
DOI: https://doi.org/10.1007/978-981-13-1513-8_105   Google Scholar

Nieweglowski L.: clv: Cluster Validation Techniques. R package version 0.3-2.1, 2014.
  Google Scholar

Oliveira J., Pedrycz W.: Advances in fuzzy clustering and its applications. John Wiley & Sons Ltd, Chichester 2007.
  Google Scholar

Peng Q., Wang Y., Ou G., Tian Y., Huang L., Pang W.: Partitioning Clustering Based on Support Vector Ranking. Lecture Notes in Computer Science. Springer, Cham 2016.
DOI: https://doi.org/10.1007/978-3-319-49586-6_52   Google Scholar

Ratkowsky D.A., Lance G.N.: A Criterion for Determining the Number of Groups in a Classification. Australian Computer Journal 10(3), 1978, 115–117.
  Google Scholar

Rezaei M., Fränti P.: Set Matching Measures for External Cluster Validity. IEEE Transactions on Knowledge and Data Engineering 28(8), 2016, 2173–2186.
DOI: https://doi.org/10.1109/TKDE.2016.2551240   Google Scholar

Rousseeuw P.: Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 1987, 53–65.
DOI: https://doi.org/10.1016/0377-0427(87)90125-7   Google Scholar

Roux M.: A Comparative Study of Divisive and Agglomerative Hierarchical Clustering Algorithms. Journal of Classification 35(2), 2018, 345–366.
DOI: https://doi.org/10.1007/s00357-018-9259-9   Google Scholar

Sarle W.S.: Cubic Clustering Criterion, SAS Technical Report A-108. SAS Institute Inc, Cary 1983.
  Google Scholar

Saemi B., Hosseinabadi A., Kardgar M., Balas V., Ebadi H.: Nature Inspired Partitioning Clustering Algorithms: A Review and Analysis. Advances in Intelligent Systems and Computing. Springer, Cham 2017.
DOI: https://doi.org/10.1007/978-3-319-62524-9_9   Google Scholar

Scott A., Symons M.: Clustering Methods Based on Likelihood Ratio Criteria. Biometrics 27(2), 1971, 387–397.
DOI: https://doi.org/10.2307/2529003   Google Scholar

Shim Y., Chung J., Choi I.: A Comparison Study of Cluster Validity Indices Using a Nonhierarchical Clustering Algorithm. International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC'06). IEEE, Vienna 2005.
  Google Scholar

Steinley D., Henson R.: OCLUS: An Analytic Method for Generating Clusters with Known Overlap. Journal of Classification 22(2), 2005, 221–250.
DOI: https://doi.org/10.1007/s00357-005-0015-6   Google Scholar

Tan P., Steinbach M., Kumar V.: Introduction to data mining. Pearson, 2005.
  Google Scholar

Vathy-Fogarassy A., Abonyi J.: Graph-Based Clustering and Data Visualization Algorithms. Springer, London 2013.
DOI: https://doi.org/10.1007/978-1-4471-5158-6   Google Scholar

Walesiak M., Dudek A.: clusterSim: Searching for Optimal Clustering Procedure for a Data Set. R package version 0.43-4, 2014.
  Google Scholar

Yera A., Arbelaitz O., Jodra J., Gurrutxaga I., Pérez J., Muguerza J.: Analysis of several decision fusion strategies for clustering validation. Strategy definition, experiments and validation. Pattern Recognition Letters 85, 2017, 42–48.
DOI: https://doi.org/10.1016/j.patrec.2016.11.009   Google Scholar

Zahn C.: Graph-Theoretical Methods For Detecting And Describing Gestalt Clusters. IEEE Transactions on Computers C-20, 1971, 68–86.
DOI: https://doi.org/10.1109/T-C.1971.223083   Google Scholar

Žalik K., Žalik B.: Validity index for clusters of different sizes and densities. Pattern Recognition Letters 32(2), 2011, 221–234.
DOI: https://doi.org/10.1016/j.patrec.2010.08.007   Google Scholar

Zhong C., Miao D., Wang R.: A Graph-Theoretical Clustering Method Based On Two Rounds Of Minimum Spanning Trees. Pattern Recognition 43, 2010, 752–766.
DOI: https://doi.org/10.1016/j.patcog.2009.07.010   Google Scholar

Download


Published
2021-06-30

Cited by

Panskyi, T., & Mosorov, V. (2021). A STEP TOWARDS THE MAJORITY-BASED CLUSTERING VALIDATION DECISION FUSION METHOD. Informatyka, Automatyka, Pomiary W Gospodarce I Ochronie Środowiska, 11(2), 4–13. https://doi.org/10.35784/iapgos.2596

Authors

Taras Panskyi 
tpanski@kis.p.lodz.pl
Lodz University of Technology, Institute of Applied Computer Science, Lodz, Poland Poland
http://orcid.org/0000-0002-0416-8711

Authors

Volodymyr Mosorov 

Lodz University of Technology, Lodz, Poland Poland
http://orcid.org/0000-0001-6016-8671

Statistics

Abstract views: 369
PDF downloads: 284


Most read articles by the same author(s)

1 2 > >>