Comparative Analysis of Distance Measures in Bug Report Clustering using Agglomerative Hierarchical Clustering

Krisnawan Hartanto; Suprapto Suprapto

doi:10.56873/jitu.9.1.6065

Authors

Krisnawan Hartanto School of Multi Media, Indonesia
Suprapto Suprapto

DOI:

https://doi.org/10.56873/jitu.9.1.6065

Keywords:

Agglomerative Hierarchical Clustering, AHC, Bug Report, Cluster, IDF

Abstract

Grouping bug reports into clusters can help verify and validate bugs during the software development cycle. One clustering method is Agglomerative Hierarchical Clustering (AHC), which relies on distance calculations to determine the degree of similarity between clusters. One such distance measure is the Jaccard coefficient. Its drawback is that it considers only the set of words shared by two documents and ignores how important those words are. Previous research added Inverse Document Frequency (IDF) weighting to the Jaccard coefficient to account for the importance of word groups; in this research the result is referred to as the weighted Jaccard coefficient. Clustering is carried out using AHC combined with the weighted Jaccard coefficient, and the resulting silhouette score is compared with that of AHC using the plain Jaccard coefficient. The results indicate that increasing term complexity reduces cluster quality, with silhouette scores dropping from 13.13% (bigram) to 0.45% (4-gram). Furthermore, many clusters exhibited negative silhouette scores, highlighting the difficulty of separating high-dimensional bug data with unsupervised methods. In contrast, the supervised classification baseline achieved significantly higher accuracy. This paper contributes a critical analysis demonstrating that, while the weighted Jaccard coefficient captures semantic nuance, unsupervised clustering remains insufficient for this domain compared with supervised approaches.
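
To make the difference between the two distance measures concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not give the exact weighting formula, so the smoothed IDF, the sum-of-weights form of the weighted Jaccard coefficient, the whitespace tokenizer, the function names, and the three sample bug reports are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Return the set of word n-grams in a text (simple whitespace tokenizer)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def idf_weights(docs):
    """Smoothed IDF per term; the smoothing is an assumption, not from the paper."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def jaccard_distance(a, b):
    """Plain Jaccard distance: every shared term counts equally."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def weighted_jaccard_distance(a, b, idf):
    """IDF-weighted Jaccard distance: rare shared terms count more (assumed form)."""
    union = a | b
    if not union:
        return 0.0
    shared = sum(idf[t] for t in a & b)
    total = sum(idf[t] for t in union)
    return 1.0 - shared / total


# Hypothetical bug report summaries, used only for illustration.
reports = [
    "app crashes on login screen",
    "app crashes on startup",
    "login screen shows wrong font",
]
docs = [ngrams(r, n=2) for r in reports]
idf = idf_weights(docs)

plain = jaccard_distance(docs[0], docs[1])
weighted = weighted_jaccard_distance(docs[0], docs[1], idf)
print(f"plain={plain:.4f} weighted={weighted:.4f}")
```

In this toy example the two bigrams shared by the first two reports also appear in more than one document, so their IDF weights are lower than those of the unshared bigrams and the weighted distance comes out larger than the plain one. A pairwise matrix built from either function could then be passed to any AHC implementation that accepts precomputed distances, and the silhouette scores of the resulting clusterings compared.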

