TY - GEN
T1 - Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation
AU - Aslam, Muhammad Haseeb
AU - Martinez, Clara
AU - Pedersoli, Marco
AU - Koerich, Alessandro Lameiras
AU - Etemad, Ali
AU - Granger, Eric
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
PY - 2026
Y1 - 2026
N2 - Advances in self-distillation have shown that when knowledge is distilled from a teacher to a student using the same deep learning (DL) model, the student can surpass the teacher, particularly when the model is over-parameterized and the teacher is trained with early stopping. Alternatively, ensemble learning also improves performance, although training, storing, and deploying multiple DL models becomes impractical as the number of models grows. Even distilling a deep ensemble into a single student model, or applying weight-averaging methods, first requires training multiple teacher models and does not fully leverage the inherent stochasticity of DL models for generating and distilling diverse representations. These constraints are particularly prohibitive in resource-constrained or latency-sensitive applications, e.g., on wearable devices. This paper proposes to train only one model and generate multiple diverse teacher representations using distillation-time dropout. However, generating these representations stochastically leads to noisy representations that are misaligned with the learned task. To overcome this problem, a novel stochastic self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations so that knowledge is distilled only from task-relevant representations, using student-guided knowledge distillation, where the student representation at each distillation step guides the distillation process. Experimental results (code and supplementary material available at: https://github.com/haseebaslam95/SSD) on real-world affective computing, wearable/biosignal (UCR Archive), HAR, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods without increasing the model size at either training or testing time, while incurring negligible computational overhead compared to ensemble learning and weight-averaging methods.
AB - Advances in self-distillation have shown that when knowledge is distilled from a teacher to a student using the same deep learning (DL) model, the student can surpass the teacher, particularly when the model is over-parameterized and the teacher is trained with early stopping. Alternatively, ensemble learning also improves performance, although training, storing, and deploying multiple DL models becomes impractical as the number of models grows. Even distilling a deep ensemble into a single student model, or applying weight-averaging methods, first requires training multiple teacher models and does not fully leverage the inherent stochasticity of DL models for generating and distilling diverse representations. These constraints are particularly prohibitive in resource-constrained or latency-sensitive applications, e.g., on wearable devices. This paper proposes to train only one model and generate multiple diverse teacher representations using distillation-time dropout. However, generating these representations stochastically leads to noisy representations that are misaligned with the learned task. To overcome this problem, a novel stochastic self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations so that knowledge is distilled only from task-relevant representations, using student-guided knowledge distillation, where the student representation at each distillation step guides the distillation process. Experimental results (code and supplementary material available at: https://github.com/haseebaslam95/SSD) on real-world affective computing, wearable/biosignal (UCR Archive), HAR, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods without increasing the model size at either training or testing time, while incurring negligible computational overhead compared to ensemble learning and weight-averaging methods.
KW - Deep Learning
KW - Dropout
KW - Self Distillation
KW - Student-Guided Knowledge Distillation
KW - Time-Series
UR - https://www.scopus.com/pages/publications/105020015589
U2 - 10.1007/978-3-032-06106-5_14
DO - 10.1007/978-3-032-06106-5_14
M3 - Contribution to conference proceedings
AN - SCOPUS:105020015589
SN - 9783032061058
T3 - Lecture Notes in Computer Science
SP - 235
EP - 253
BT - Machine Learning and Knowledge Discovery in Databases. Research Track - European Conference, ECML PKDD 2025, Proceedings
A2 - Ribeiro, Rita P.
A2 - Soares, Carlos
A2 - Gama, João
A2 - Pfahringer, Bernhard
A2 - Japkowicz, Nathalie
A2 - Larrañaga, Pedro
A2 - Jorge, Alípio M.
A2 - Abreu, Pedro H.
PB - Springer Science and Business Media Deutschland GmbH
T2 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2025
Y2 - 15 September 2025 through 19 September 2025
ER -