Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation

Research output: Chapter in a book, report, or conference proceedings › Contribution to a collective work from a colloquium or conference › Peer-reviewed

Abstract

Advances in self-distillation have shown that when knowledge is distilled from a teacher to a student using the same deep learning (DL) model, student performance can surpass that of the teacher, particularly when the model is over-parameterized and the teacher is trained with early stopping. Alternatively, ensemble learning also improves performance, although training, storing, and deploying multiple DL models becomes impractical as the number of models grows. Even distilling a deep ensemble into a single student model, or applying weight-averaging methods, first requires training multiple teacher models and does not fully leverage the inherent stochasticity of DL models for generating and distilling diversity. These constraints are particularly prohibitive in resource-constrained or latency-sensitive applications, e.g., on wearable devices. This paper proposes to train only one model and to generate multiple diverse teacher representations using distillation-time dropout. However, generating these representations stochastically leads to noisy representations that are misaligned with the learned task. To overcome this problem, a novel stochastic self-distillation (SSD) training strategy is introduced that filters and weights teacher representations so that distillation draws only on task-relevant representations, using student-guided knowledge distillation: the student representation at each distillation step is used to guide the distillation process. Experimental results (code and supplementary material available at: https://github.com/haseebaslam95/SSD) on real-world affective computing, wearable/biosignal (UCR Archive), HAR, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods without increasing the model size at either training or testing time, and it incurs negligible computational overhead compared to ensemble learning and weight-averaging methods.
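
The following is a minimal PyTorch sketch of the idea described in the abstract: a single model with dropout kept active at distillation time produces multiple stochastic teacher representations, and the current student representation is used to weight them before distilling. The toy encoder, the number of samples, the cosine-similarity softmax weighting, and the MSE representation loss are illustrative assumptions, not the authors' exact SSD formulation, which is specified in the paper and the linked repository.

```python
# Minimal sketch of student-guided distillation from stochastic teacher
# representations. Sampling count, cosine-similarity weighting, and the MSE
# representation loss are illustrative assumptions, not the exact SSD method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallEncoder(nn.Module):
    """Toy encoder with dropout so that stochastic forward passes differ."""
    def __init__(self, in_dim=32, rep_dim=16, p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Dropout(p),
            nn.Linear(64, rep_dim),
        )

    def forward(self, x):
        return self.net(x)

def stochastic_teacher_reps(model, x, n_samples=8):
    """Draw diverse teacher representations by keeping dropout active
    (train mode) while blocking gradient flow through the teacher passes."""
    model.train()  # keep dropout on so each pass gives a different representation
    with torch.no_grad():
        return torch.stack([model(x) for _ in range(n_samples)])  # (S, B, D)

def student_guided_distill_loss(student_rep, teacher_reps, temperature=1.0):
    """Weight each sampled teacher representation by its cosine similarity to
    the current student representation, then distill from the similarity-
    weighted average (an assumed weighting scheme, not the paper's exact one)."""
    sims = F.cosine_similarity(
        teacher_reps, student_rep.unsqueeze(0), dim=-1)           # (S, B)
    weights = F.softmax(sims / temperature, dim=0).unsqueeze(-1)  # (S, B, 1)
    target = (weights * teacher_reps).sum(dim=0)                  # (B, D)
    return F.mse_loss(student_rep, target.detach())

# Usage: a single model serves as both teacher (stochastic passes) and student.
model = SmallEncoder()
x = torch.randn(4, 32)
teacher_reps = stochastic_teacher_reps(model, x, n_samples=8)
student_rep = model(x)  # gradient-carrying student pass
loss = student_guided_distill_loss(student_rep, teacher_reps)
loss.backward()
```

Weighting by similarity to the student representation is one way to down-weight noisy, task-misaligned samples; the paper's student-guided filtering may differ in both the similarity measure and the distillation objective.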

Original language: English
Title of host publication: Machine Learning and Knowledge Discovery in Databases. Research Track - European Conference, ECML PKDD 2025, Proceedings
Editors: Rita P. Ribeiro, Carlos Soares, João Gama, Bernhard Pfahringer, Nathalie Japkowicz, Pedro Larrañaga, Alípio M. Jorge, Pedro H. Abreu
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 235-253
Number of pages: 19
ISBN (print): 9783032061058
DOIs
Publication status: Published - 2026
Event: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2025 - Porto, Portugal
Duration: 15 Sept 2025 - 19 Sept 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 16018 LNCS
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Conference

Conference: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2025
Country/Territory: Portugal
City: Porto
Period: 15/09/25 - 19/09/25
