DiffGAN: A Test Generation Approach for Differential Testing of Deep Neural Networks for Image Analysis

Research output: Contribution to journal › Journal article › Peer-reviewed

Abstract

Deep Neural Networks (DNNs) are increasingly deployed across a wide range of applications, from image classification to autonomous driving. However, ensuring their reliability remains a challenge, and in many situations, alternative models with similar functionality and accuracy levels are available. Traditional accuracy-based evaluations often fail to capture behavioral differences between such models, particularly when testing datasets are limited, making it challenging to select or optimally combine models. Differential testing addresses this limitation by generating test inputs that expose discrepancies in the behavior of DNN models. However, existing differential testing approaches face significant limitations: many rely on access to model internals or are constrained by the availability of seed inputs, limiting their generalizability and effectiveness. In response to these challenges, we propose DiffGAN, a black-box test generation approach for differential testing of DNN models. Our approach, though adaptable to other domains, is specific to DNN models for image classification tasks, a highly prevalent application area. Our method relies on a Generative Adversarial Network (GAN) and the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to generate diverse and valid triggering inputs that effectively reveal behavioral discrepancies between models. It employs two custom fitness functions, one focused on diversity and the other on divergence, to guide the exploration of the GAN input space and identify discrepancies between the models' outputs. By strategically searching the GAN input space, we show that DiffGAN can effectively generate inputs with specific features that trigger differences in behavior for the models under test. Unlike traditional white-box methods, DiffGAN does not require access to the internal structure of the models, which makes it applicable to a wider range of situations. We evaluate DiffGAN on a benchmark comprising eight pairs of DNN models trained on two widely used image classification datasets. Our results demonstrate that DiffGAN significantly outperforms a state-of-the-art (SOTA) baseline, generating four times more triggering inputs, with higher diversity and validity, within the same testing budget. Furthermore, we show that the generated inputs can be used to improve the accuracy of a machine learning-based model selection mechanism, which dynamically selects the best-performing model based on input characteristics and can thus serve as a smart output voting mechanism when alternative models are used together.
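To make the search loop described above concrete, the following sketch (not the authors' implementation) illustrates how a GAN's latent space might be searched with NSGA-II for inputs on which two classifiers disagree. The generator, model_a, and model_b objects are toy stand-ins, the pymoo dependency is an assumption, and the L1 output distance and pixel-space diversity measures are placeholders for the paper's custom divergence and diversity fitness functions.

import numpy as np
import torch
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import Problem
from pymoo.optimize import minimize

LATENT_DIM = 100  # dimensionality of the GAN latent space (assumption)

class DifferentialTestProblem(Problem):
    """Two objectives, both negated because pymoo minimizes:
    (1) divergence between the two models' predictions on a generated image,
    (2) diversity of that image with respect to the rest of the population."""

    def __init__(self, generator, model_a, model_b):
        super().__init__(n_var=LATENT_DIM, n_obj=2, xl=-3.0, xu=3.0)
        self.generator, self.model_a, self.model_b = generator, model_a, model_b

    def _evaluate(self, X, out, *args, **kwargs):
        z = torch.as_tensor(X, dtype=torch.float32)
        with torch.no_grad():
            images = self.generator(z)                       # (pop, C, H, W)
            pa = torch.softmax(self.model_a(images), dim=1)  # black-box: outputs only
            pb = torch.softmax(self.model_b(images), dim=1)
        divergence = (pa - pb).abs().sum(dim=1)              # L1 distance between predictions
        flat = images.flatten(1)
        diversity = torch.cdist(flat, flat).mean(dim=1)      # mean pixel-space distance to population
        out["F"] = np.column_stack([-divergence.numpy(), -diversity.numpy()])

# Toy stand-ins for the pretrained GAN generator and the two models under test.
generator = torch.nn.Sequential(torch.nn.Linear(LATENT_DIM, 28 * 28), torch.nn.Tanh(),
                                torch.nn.Unflatten(1, (1, 28, 28)))
model_a = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
model_b = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))

result = minimize(DifferentialTestProblem(generator, model_a, model_b),
                  NSGA2(pop_size=50), ("n_gen", 30), seed=1, verbose=False)

# Keep latent vectors whose generated images the two models label differently.
z = torch.as_tensor(np.atleast_2d(result.X), dtype=torch.float32)
with torch.no_grad():
    imgs = generator(z)
    triggering = imgs[model_a(imgs).argmax(dim=1) != model_b(imgs).argmax(dim=1)]
print(f"{len(triggering)} triggering inputs found")

Because only the models' outputs are queried, the loop stays black-box, which is the property the abstract emphasizes over white-box approaches.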

Original language: English
Pages (from - to): 3284-3309
Number of pages: 26
Journal: IEEE Transactions on Software Engineering
Volume: 51
Issue number: 12
DOIs
Status: Published - 2025
