No silver bullet in software analytics: Understanding the impact of model tuning metrics on the performance of software defects prediction models

Moataz Chouchen; Ali Ouni

doi:10.1007/s10664-026-10845-z


Concordia University

Research output: Contribution to journal › Article › peer-review

Abstract

Software analytics leverages machine learning models to extract insights from historical data on software projects. These models come with configurable parameters, known as hyperparameters, which govern their characteristics, such as the number of trees in a random forest. Hyperparameter optimization is crucial for achieving optimal performance in several software engineering problems, such as software defect prediction (SDP). To perform hyperparameter optimization, an appropriate tuning metric must be chosen to guide the search for the optimal hyperparameter settings. However, the impact of the chosen tuning metric on models’ performance remains unexplored. In this paper, we address this gap by examining the impact of the hyperparameter tuning metric on the performance of software analytics models, using SDP as a case study. First, we review 105 previously published SDP studies to understand whether researchers report the tuning metrics they employ. To further understand the impact of hyperparameter tuning metrics on model performance, we conduct an empirical study on an SDP dataset comprising 28 releases, tuning and evaluating 4 widely used models with 8 tuning metrics and 3 common performance metrics. Our literature review reveals that researchers report the tuning metric used in only 29% of the cases, which poses a threat to the replicability of most SDP studies. Our empirical study yields several important findings: (i) selecting the appropriate tuning metric can enhance SDP model performance by up to 150% (observed for K-nearest neighbors when evaluated with the MCC score), (ii) tuning metrics can conflict with one another, (iii) they exhibit a high degree of model rank inconsistency in 7% of the cases, and (iv) training data attributes such as data complexity and class overlap can be good indicators of the performance of the different tuning metrics.
Hence, researchers are encouraged to report their chosen tuning metrics to improve study reproducibility. Practitioners are advised to explore multiple tuning metrics to find the best-performing model. Furthermore, attention to tuning metrics with high rank inconsistency is recommended, since it might lead to a significant performance improvement. Finally, researchers and practitioners are also encouraged to further investigate how the different tuning metrics behave.
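
The effect described in the abstract, that the tuning metric decides which hyperparameter setting wins, can be illustrated with a minimal sketch. It is not the authors' experimental setup: the three K-nearest-neighbor settings and their confusion-matrix counts below are hypothetical values invented for illustration, and the helper names `scores` and `select_best` are assumptions, not part of any library. The sketch only shows that ranking the same candidates by recall, precision, accuracy, F1, or MCC can select different configurations, which then differ in their MCC score.

```python
import math


def scores(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical validation results for three KNN settings on a release with
# 20 defective and 80 clean modules. These counts are illustrative only and
# are NOT taken from the paper.
CANDIDATES = {
    "k=1": dict(tp=14, fp=20, fn=6, tn=60),
    "k=5": dict(tp=10, fp=6, fn=10, tn=74),
    "k=15": dict(tp=4, fp=1, fn=16, tn=79),
}


def select_best(candidates, tuning_metric):
    """Return the candidate name with the highest score on the tuning metric."""
    return max(
        candidates,
        key=lambda name: scores(**candidates[name])[tuning_metric],
    )


# Tune on each metric, then evaluate the selected setting with MCC.
for metric in ("accuracy", "precision", "recall", "f1", "mcc"):
    chosen = select_best(CANDIDATES, metric)
    mcc = scores(**CANDIDATES[chosen])["mcc"]
    print(f"tuned on {metric:9s} -> {chosen:5s} (MCC = {mcc:.3f})")
```

With these made-up counts, tuning on recall favors the smallest neighborhood and tuning on precision favors the largest, while accuracy, F1, and MCC agree on the middle setting. In a real study the scores would come from cross-validation or a held-out release rather than fixed counts.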

Original language	English
Article number	127
Journal	Empirical Software Engineering
Volume	31
Issue number	5
DOIs	https://doi.org/10.1007/s10664-026-10845-z
State	Published - Sept. 2026


@article{5a106abe7bbf4cb9a5d8196eb76cc28d,

title = "No silver bullet in software analytics: Understanding the impact of model tuning metrics on the performance of software defects prediction models",

keywords = "Empirical software engineering, Hyper-parameters tuning, Machine learning, Software analytics, Software defect prediction",

author = "Moataz Chouchen and Ali Ouni",

note = "Publisher Copyright: {\textcopyright} The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2026.",

year = "2026",

month = sep,

doi = "10.1007/s10664-026-10845-z",

language = "English",

volume = "31",

journal = "Empirical Software Engineering",

issn = "1382-3256",

publisher = "Springer Netherlands",

number = "5",

}

No silver bullet in software analytics: Understanding the impact of model tuning metrics on the performance of software defects prediction models. / Chouchen, Moataz; Ouni, Ali.
In: Empirical Software Engineering, Vol. 31, No. 5, 127, 09.2026.

TY - JOUR

T1 - No silver bullet in software analytics

T2 - Understanding the impact of model tuning metrics on the performance of software defects prediction models

AU - Chouchen, Moataz

AU - Ouni, Ali

N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2026.

PY - 2026/9

Y1 - 2026/9

KW - Empirical software engineering

KW - Hyper-parameters tuning

KW - Machine learning

KW - Software analytics

KW - Software defect prediction

UR - https://www.scopus.com/pages/publications/105037997493

U2 - 10.1007/s10664-026-10845-z

DO - 10.1007/s10664-026-10845-z

M3 - Journal Article

AN - SCOPUS:105037997493

SN - 1382-3256

VL - 31

JO - Empirical Software Engineering

JF - Empirical Software Engineering

IS - 5

M1 - 127

ER -