Foundation models for autonomous driving: A comprehensive survey

Sonda Fourati; Wael Jaafar; Noura Baccar; Safwan Alfattani; Rami Langar

doi:10.1016/j.engappai.2026.114805


École de technologie supérieure
Mediterranean Institute of Technology
King Abdulaziz University

Research output: Contribution to journal › Short survey › Peer-reviewed

Abstract

Large Language Models (LLMs) have showcased remarkable proficiency in various information-processing tasks. They excel at data extraction, literature summarization, content generation, predictive modeling, decision-making, and system control. Moreover, Vision-Language Models (VLMs) and Multimodal LLMs (MLLMs), collectively referred to in this work as Cross-modal Language Models (XLMs), integrate multiple data modalities with language understanding, thereby advancing Autonomous Driving Systems (ADS). On the implemented Artificial Intelligence (AI) side, we analyze core techniques such as prompt engineering, supervised fine-tuning, reinforcement learning from human feedback, knowledge distillation, quantization and pruning, and safety alignment/verification, together with edge-aware deployment strategies. On the application of AI side, we map XLMs capabilities to the driving stack, including perception, prediction, planning, control, and human–machine interaction/vehicle-to-everything, and summarize how XLMs improve scene understanding, intent forecasting, decision-making, and closed-loop control by coupling natural-language reasoning with multimodal sensory inputs, such as panoramic images, Light Detection and Ranging (LiDAR), and radar. In this survey, we synthesize the state of XLMs for ADS: we review the relevant literature on ADS and XLMs, including their architectures, tools, and frameworks. We then compare deployment approaches across the driving stack and summarize datasets, simulators, and benchmarks for both open- and closed-loop evaluation. Finally, we analyze key challenges, such as grounding and hallucination, long-tail robustness, real-time and resource constraints, safety alignment and verification, and data governance and privacy, and outline research directions toward safe, efficient, and trustworthy XLM-enabled ADS.

Original language	English
Article number	114805
Journal	Engineering Applications of Artificial Intelligence
Volume	176
DOIs	https://doi.org/10.1016/j.engappai.2026.114805
State	Published - 15 Jul 2026


Cite this

@article{1a9c530fe21442bca68352e84e2689b7,

title = "Foundation models for autonomous driving: A comprehensive survey",

abstract = "Large Language Models (LLMs) have showcased remarkable proficiency in various information-processing tasks. They excel at data extraction, literature summarization, content generation, predictive modeling, decision-making, and system control. Moreover, Vision-Language Models (VLMs) and Multimodal LLMs (MLLMs), collectively referred to in this work as Cross-modal Language Models (XLMs), integrate multiple data modalities with language understanding, thereby advancing Autonomous Driving Systems (ADS). On the implemented Artificial Intelligence (AI) side, we analyze core techniques such as prompt engineering, supervised fine-tuning, reinforcement learning from human feedback, knowledge distillation, quantization and pruning, and safety alignment/verification, together with edge-aware deployment strategies. On the application of AI side, we map XLMs capabilities to the driving stack, including perception, prediction, planning, control, and human–machine interaction/vehicle-to-everything, and summarize how XLMs improve scene understanding, intent forecasting, decision-making, and closed-loop control by coupling natural-language reasoning with multimodal sensory inputs, such as panoramic images, Light Detection and Ranging (LiDAR), and radar. In this survey, we synthesize the state of XLMs for ADS: we review the relevant literature on ADS and XLMs, including their architectures, tools, and frameworks. We then compare deployment approaches across the driving stack and summarize datasets, simulators, and benchmarks for both open- and closed-loop evaluation. Finally, we analyze key challenges, such as grounding and hallucination, long-tail robustness, real-time and resource constraints, safety alignment and verification, and data governance and privacy, and outline research directions toward safe, efficient, and trustworthy XLM-enabled ADS.",

keywords = "Application of artificial intelligence, Autonomous Driving Systems, Cross-modal Language Models, Datasets and simulators implemented artificial intelligence, Decision making and planning, Edge deployment for real-time inference, Foundation Models, Large Language Models, Multimodal large language models, Perception, prediction, planning, and control, Prompt engineering, Reinforcement Learning from Human Feedback, Safety alignment and verification, Vision Foundation Models, Vision-Language Models",

author = "Sonda Fourati and Wael Jaafar and Noura Baccar and Safwan Alfattani and Rami Langar",

note = "Publisher Copyright: {\textcopyright} 2026 The Authors.",

year = "2026",

month = jul,

day = "15",

doi = "10.1016/j.engappai.2026.114805",

language = "English",

volume = "176",

journal = "Engineering Applications of Artificial Intelligence",

issn = "0952-1976",

publisher = "Elsevier Ltd",

}

TY - JOUR

T1 - Foundation models for autonomous driving

T2 - A comprehensive survey

AU - Fourati, Sonda

AU - Jaafar, Wael

AU - Baccar, Noura

AU - Alfattani, Safwan

AU - Langar, Rami

PY - 2026/7/15

Y1 - 2026/7/15

N2 - Large Language Models (LLMs) have showcased remarkable proficiency in various information-processing tasks. They excel at data extraction, literature summarization, content generation, predictive modeling, decision-making, and system control. Moreover, Vision-Language Models (VLMs) and Multimodal LLMs (MLLMs), collectively referred to in this work as Cross-modal Language Models (XLMs), integrate multiple data modalities with language understanding, thereby advancing Autonomous Driving Systems (ADS). On the implemented Artificial Intelligence (AI) side, we analyze core techniques such as prompt engineering, supervised fine-tuning, reinforcement learning from human feedback, knowledge distillation, quantization and pruning, and safety alignment/verification, together with edge-aware deployment strategies. On the application of AI side, we map XLMs capabilities to the driving stack, including perception, prediction, planning, control, and human–machine interaction/vehicle-to-everything, and summarize how XLMs improve scene understanding, intent forecasting, decision-making, and closed-loop control by coupling natural-language reasoning with multimodal sensory inputs, such as panoramic images, Light Detection and Ranging (LiDAR), and radar. In this survey, we synthesize the state of XLMs for ADS: we review the relevant literature on ADS and XLMs, including their architectures, tools, and frameworks. We then compare deployment approaches across the driving stack and summarize datasets, simulators, and benchmarks for both open- and closed-loop evaluation. Finally, we analyze key challenges, such as grounding and hallucination, long-tail robustness, real-time and resource constraints, safety alignment and verification, and data governance and privacy, and outline research directions toward safe, efficient, and trustworthy XLM-enabled ADS.

AB - Large Language Models (LLMs) have showcased remarkable proficiency in various information-processing tasks. They excel at data extraction, literature summarization, content generation, predictive modeling, decision-making, and system control. Moreover, Vision-Language Models (VLMs) and Multimodal LLMs (MLLMs), collectively referred to in this work as Cross-modal Language Models (XLMs), integrate multiple data modalities with language understanding, thereby advancing Autonomous Driving Systems (ADS). On the implemented Artificial Intelligence (AI) side, we analyze core techniques such as prompt engineering, supervised fine-tuning, reinforcement learning from human feedback, knowledge distillation, quantization and pruning, and safety alignment/verification, together with edge-aware deployment strategies. On the application of AI side, we map XLMs capabilities to the driving stack, including perception, prediction, planning, control, and human–machine interaction/vehicle-to-everything, and summarize how XLMs improve scene understanding, intent forecasting, decision-making, and closed-loop control by coupling natural-language reasoning with multimodal sensory inputs, such as panoramic images, Light Detection and Ranging (LiDAR), and radar. In this survey, we synthesize the state of XLMs for ADS: we review the relevant literature on ADS and XLMs, including their architectures, tools, and frameworks. We then compare deployment approaches across the driving stack and summarize datasets, simulators, and benchmarks for both open- and closed-loop evaluation. Finally, we analyze key challenges, such as grounding and hallucination, long-tail robustness, real-time and resource constraints, safety alignment and verification, and data governance and privacy, and outline research directions toward safe, efficient, and trustworthy XLM-enabled ADS.

KW - Application of artificial intelligence

KW - Autonomous Driving Systems

KW - Cross-modal Language Models

KW - Datasets and simulators implemented artificial intelligence

KW - Decision making and planning

KW - Edge deployment for real-time inference

KW - Foundation Models

KW - Large Language Models

KW - Multimodal large language models

KW - Perception, prediction, planning, and control

KW - Prompt engineering

KW - Reinforcement Learning from Human Feedback

KW - Safety alignment and verification

KW - Vision Foundation Models

KW - Vision-Language Models

UR - https://www.scopus.com/pages/publications/105035856838

U2 - 10.1016/j.engappai.2026.114805

DO - 10.1016/j.engappai.2026.114805

M3 - Short survey

AN - SCOPUS:105035856838

SN - 0952-1976

VL - 176

JO - Engineering Applications of Artificial Intelligence

JF - Engineering Applications of Artificial Intelligence

M1 - 114805

ER -