No silver bullet in software analytics: Understanding the impact of model tuning metrics on the performance of software defects prediction models

Moataz Chouchen; Ali Ouni

doi:10.1007/s10664-026-10845-z

Moataz Chouchen, Ali Ouni

Concordia University

Research output: Contribution to journal › Journal Article › peer-review

Abstract

Software analytics leverages machine learning models to extract insights from historical data on software projects. These models come with configurable parameters, known as hyperparameters, which govern their characteristics, such as the number of trees in a random forest. Hyperparameter optimization is crucial for achieving optimal performance in several software engineering problems, such as software defect prediction (SDP). To perform hyperparameter optimization, an appropriate tuning metric should be set to guide the search for optimal hyperparameter settings. However, the impact of the chosen tuning metric on models’ performance remains unexplored. In this paper, we address this gap by examining the impact of the hyperparameter tuning metric on the performance of software analytics models, using SDP as a case study. First, we investigate 105 previously published SDP studies to understand whether researchers report the tuning metrics they employ. To further understand the impact of hyperparameter tuning metrics on model performance, we conduct an empirical study on an SDP dataset comprising 28 releases, tuning and evaluating 4 widely used models with 8 tuning metrics and 3 common performance metrics. Our literature review reveals that researchers report the tuning metric used in only 29% of the cases, which poses a threat to the replicability of most SDP studies. Our empirical study unveils several important findings: (i) selecting the appropriate tuning metric can enhance SDP model performance by up to 150% (observed for K-nearest neighbor when evaluated with the MCC score), (ii) the tuning metrics can be conflicting, (iii) they exhibit a high degree of model rank inconsistency in 7% of the cases, and (iv) training data attributes such as data complexity and class overlap can be good indicators of the performance of the different tuning metrics.
Hence, researchers are encouraged to report their chosen tuning metrics to improve study reproducibility. Practitioners are advised to explore multiple tuning metrics to find the best-performing model. Furthermore, attention to tuning metrics with high rank inconsistency is recommended, since it might lead to a significant performance improvement. Finally, researchers and practitioners are also encouraged to further understand how the different tuning metrics behave.
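
As a rough illustration of how tuning metrics can conflict, the following minimal Python sketch selects the "best" hyperparameter setting under several tuning metrics. It is not the authors' experimental pipeline: the candidate names (`k=1`, `k=5`, `k=15`) and the validation confusion-matrix counts are hypothetical values chosen only for illustration, and the four metrics shown are common choices rather than the eight metrics used in the study.

```python
import math

# Each metric takes confusion-matrix counts: true/false positives and negatives.
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fp, fn, tn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(tp, fp, fn, tn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, fn, tn):
    # Matthews correlation coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

TUNING_METRICS = {"accuracy": accuracy, "recall": recall, "f1": f1, "mcc": mcc}

# HYPOTHETICAL validation results (tp, fp, fn, tn) for three K-nearest neighbor
# settings on 100 modules, 20 of them defective. Not data from the paper.
CANDIDATES = {
    "k=1": (14, 20, 6, 60),
    "k=5": (10, 6, 10, 74),
    "k=15": (6, 0, 14, 80),
}

def select_best(metric_fn):
    """Return the candidate setting that maximizes the given tuning metric."""
    return max(CANDIDATES, key=lambda name: metric_fn(*CANDIDATES[name]))

# The setting chosen by hyperparameter tuning depends on the tuning metric.
selected = {name: select_best(fn) for name, fn in TUNING_METRICS.items()}

for metric_name, setting in selected.items():
    print(f"tuning on {metric_name:>8} selects {setting}")
```

With these made-up counts, the metrics do not agree on a single setting, which is the kind of disagreement that makes reporting the tuning metric relevant for replicability.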

Original language	English
Article number	127
Journal	Empirical Software Engineering
Volume	31
Issue number	5
DOIs	https://doi.org/10.1007/s10664-026-10845-z
Publication status	Published - Sept 2026

Keywords

Empirical software engineering
Hyper-parameters tuning
Machine learning
Software analytics
Software defect prediction

Access to Document

10.1007/s10664-026-10845-z

Cite this

@article{5a106abe7bbf4cb9a5d8196eb76cc28d,

title = "No silver bullet in software analytics: Understanding the impact of model tuning metrics on the performance of software defects prediction models",

abstract = "Software analytics leverages machine learning models to extract insights from historical data on software projects. These models come with configurable parameters, known as hyperparameters, which govern their characteristics, such as the number of trees in a random forest. Hyperparameter optimization is crucial for achieving optimal performance in several software engineering problems, such as software defect prediction (SDP). To perform hyperparameter optimization, an appropriate tuning metric should be set to guide the search for optimal hyperparameter settings. However, the impact of the chosen tuning metric on models{\textquoteright} performance remains unexplored. In this paper, we address this gap by examining the impact of the hyperparameter tuning metric on the performance of software analytics models, using SDP as a case study. First, we investigate 105 previously published SDP studies to understand whether researchers report the tuning metrics they employ. To further understand the impact of hyperparameter tuning metrics on model performance, we conduct an empirical study on an SDP dataset comprising 28 releases, tuning and evaluating 4 widely used models with 8 tuning metrics and 3 common performance metrics. Our literature review reveals that researchers report the tuning metric used in only 29\% of the cases, which poses a threat to the replicability of most SDP studies. Our empirical study unveils several important findings: (i) selecting the appropriate tuning metric can enhance SDP model performance by up to 150\% (observed for K-nearest neighbor when evaluated with the MCC score), (ii) the tuning metrics can be conflicting, (iii) they exhibit a high degree of model rank inconsistency in 7\% of the cases, and (iv) training data attributes such as data complexity and class overlap can be good indicators of the performance of the different tuning metrics.
Hence, researchers are encouraged to report their chosen tuning metrics to improve study reproducibility. Practitioners are advised to explore multiple tuning metrics to find the best-performing model. Furthermore, attention to tuning metrics with high rank inconsistency is recommended, since it might lead to a significant performance improvement. Finally, researchers and practitioners are also encouraged to further understand how the different tuning metrics behave.",

keywords = "Empirical software engineering, Hyper-parameters tuning, Machine learning, Software analytics, Software defect prediction",

author = "Moataz Chouchen and Ali Ouni",

note = "Publisher Copyright: {\textcopyright} The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2026.",

year = "2026",

month = sep,

doi = "10.1007/s10664-026-10845-z",

language = "English",

volume = "31",

journal = "Empirical Software Engineering",

issn = "1382-3256",

publisher = "Springer Netherlands",

number = "5",

}

TY - JOUR

T1 - No silver bullet in software analytics

T2 - Understanding the impact of model tuning metrics on the performance of software defects prediction models

AU - Chouchen, Moataz

AU - Ouni, Ali

N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2026.

PY - 2026/9

Y1 - 2026/9

AB - Software analytics leverages machine learning models to extract insights from historical data on software projects. These models come with configurable parameters, known as hyperparameters, which govern their characteristics, such as the number of trees in a random forest. Hyperparameter optimization is crucial for achieving optimal performance in several software engineering problems, such as software defect prediction (SDP). To perform hyperparameter optimization, an appropriate tuning metric should be set to guide the search for optimal hyperparameter settings. However, the impact of the chosen tuning metric on models’ performance remains unexplored. In this paper, we address this gap by examining the impact of the hyperparameter tuning metric on the performance of software analytics models, using SDP as a case study. First, we investigate 105 previously published SDP studies to understand whether researchers report the tuning metrics they employ. To further understand the impact of hyperparameter tuning metrics on model performance, we conduct an empirical study on an SDP dataset comprising 28 releases, tuning and evaluating 4 widely used models with 8 tuning metrics and 3 common performance metrics. Our literature review reveals that researchers report the tuning metric used in only 29% of the cases, which poses a threat to the replicability of most SDP studies. Our empirical study unveils several important findings: (i) selecting the appropriate tuning metric can enhance SDP model performance by up to 150% (observed for K-nearest neighbor when evaluated with the MCC score), (ii) the tuning metrics can be conflicting, (iii) they exhibit a high degree of model rank inconsistency in 7% of the cases, and (iv) training data attributes such as data complexity and class overlap can be good indicators of the performance of the different tuning metrics.
Hence, researchers are encouraged to report their chosen tuning metrics to improve study reproducibility. Practitioners are advised to explore multiple tuning metrics to find the best-performing model. Furthermore, attention to tuning metrics with high rank inconsistency is recommended, since it might lead to a significant performance improvement. Finally, researchers and practitioners are also encouraged to further understand how the different tuning metrics behave.

KW - Empirical software engineering

KW - Hyper-parameters tuning

KW - Machine learning

KW - Software analytics

KW - Software defect prediction

UR - https://www.scopus.com/pages/publications/105037997493

U2 - 10.1007/s10664-026-10845-z

DO - 10.1007/s10664-026-10845-z

M3 - Journal Article

AN - SCOPUS:105037997493

SN - 1382-3256

VL - 31

JO - Empirical Software Engineering

JF - Empirical Software Engineering

IS - 5

M1 - 127

ER -
