Comparative Analysis of Distance Measures in Bug Report Clustering using Agglomerative Hierarchical Clustering
DOI: https://doi.org/10.56873/jitu.9.1.6065
Keywords: Agglomerative Hierarchical Clustering, AHC, Bug Report, Cluster, IDF
Abstract
Grouping bug reports into clusters can assist in verifying and validating bugs during the software development cycle. One clustering method is Agglomerative Hierarchical Clustering (AHC), which relies on distance calculations to determine the degree of similarity between clusters. One such distance measure is the Jaccard coefficient. Its disadvantage is that it considers only the words shared between two documents, not their importance. Previous research added Inverse Document Frequency (IDF) weighting to the Jaccard coefficient to account for the importance of word groups; this research refers to the result as the weighted Jaccard coefficient. Clustering is carried out using AHC combined with the weighted Jaccard coefficient, and the resulting silhouette score is compared with that of AHC using the plain Jaccard coefficient. Results indicate that increasing term complexity reduces cluster quality, with silhouette scores dropping from 13.13% (bigram) to 0.45% (4-gram). Furthermore, many clusters exhibited negative silhouette scores, highlighting the difficulty of separating high-dimensional bug data using unsupervised methods. In contrast, the supervised classification baseline achieved significantly higher accuracy. This paper contributes a critical analysis demonstrating that, while the weighted Jaccard coefficient captures semantic nuance, unsupervised clustering remains insufficient for this domain compared to supervised approaches.
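To make the difference between the two distance measures concrete, the following is a minimal sketch in Python. The abstract does not state the exact weighting formula, so this sketch assumes one common form: the summed IDF of the shared n-grams divided by the summed IDF of all n-grams in either document. The function names, the whitespace tokenizer, and the three sample bug reports are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Return the set of word n-grams in a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def idf_weights(docs):
    """IDF per term: log(N / df), where df is the number of documents containing it."""
    df = Counter(term for doc in docs for term in doc)
    n_docs = len(docs)
    return {term: math.log(n_docs / count) for term, count in df.items()}


def jaccard_distance(a, b):
    """Plain Jaccard distance: every shared term counts equally."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def weighted_jaccard_distance(a, b, idf):
    """Assumed weighted form: terms contribute their IDF instead of a count of 1."""
    union_weight = sum(idf[t] for t in a | b)
    if union_weight == 0:
        # Only happens when both sets are empty or hold only terms found in every document.
        return 0.0
    return 1.0 - sum(idf[t] for t in a & b) / union_weight


# Hypothetical bug report summaries, for illustration only.
reports = [
    "app crashes on login screen",
    "app crashes on startup screen",
    "login button text is misaligned",
]
docs = [ngrams(r, n=2) for r in reports]
idf = idf_weights(docs)

plain = jaccard_distance(docs[0], docs[1])
weighted = weighted_jaccard_distance(docs[0], docs[1], idf)
print(f"plain Jaccard distance:    {plain:.4f}")
print(f"weighted Jaccard distance: {weighted:.4f}")
```

In this toy corpus the two crash reports share only bigrams that also occur in another document, so those bigrams receive a lower IDF and the weighted distance comes out larger than the plain one. A pairwise matrix built from either function could then be passed to an AHC implementation that accepts precomputed distances.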
License
Copyright (c) 2026 Journal of Information Technology and its Utilization

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The proposed policy for journals that offer open access
Authors who publish with this journal agree to the following terms:
- Copyright on any article is retained by the author(s).
- Authors grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work’s authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal’s published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
