Foundation models for autonomous driving: A comprehensive survey

Sonda Fourati; Wael Jaafar; Noura Baccar; Safwan Alfattani; Rami Langar

doi:10.1016/j.engappai.2026.114805

École de technologie supérieure
Mediterranean Institute of Technology
King Abdulaziz University

Research output: Contribution to journal › Short survey › peer-review

Abstract

Large Language Models (LLMs) have showcased remarkable proficiency in various information-processing tasks. They excel at data extraction, literature summarization, content generation, predictive modeling, decision-making, and system control. Moreover, Vision-Language Models (VLMs) and Multimodal LLMs (MLLMs), collectively referred to in this work as Cross-modal Language Models (XLMs), integrate multiple data modalities with language understanding, thereby advancing Autonomous Driving Systems (ADS). On the implemented Artificial Intelligence (AI) side, we analyze core techniques such as prompt engineering, supervised fine-tuning, reinforcement learning from human feedback, knowledge distillation, quantization and pruning, and safety alignment/verification, together with edge-aware deployment strategies. On the application of AI side, we map XLMs capabilities to the driving stack, including perception, prediction, planning, control, and human–machine interaction/vehicle-to-everything, and summarize how XLMs improve scene understanding, intent forecasting, decision-making, and closed-loop control by coupling natural-language reasoning with multimodal sensory inputs, such as panoramic images, Light Detection and Ranging (LiDAR), and radar. In this survey, we synthesize the state of XLMs for ADS: we review the relevant literature on ADS and XLMs, including their architectures, tools, and frameworks. We then compare deployment approaches across the driving stack and summarize datasets, simulators, and benchmarks for both open- and closed-loop evaluation. Finally, we analyze key challenges, such as grounding and hallucination, long-tail robustness, real-time and resource constraints, safety alignment and verification, and data governance and privacy, and outline research directions toward safe, efficient, and trustworthy XLM-enabled ADS.

Original language	English
Article number	114805
Journal	Engineering Applications of Artificial Intelligence
Volume	176
DOIs	https://doi.org/10.1016/j.engappai.2026.114805
Publication status	Published - 15 Jul 2026

Keywords

Application of artificial intelligence
Autonomous Driving Systems
Cross-modal Language Models
Datasets and simulators implemented artificial intelligence
Decision making and planning
Edge deployment for real-time inference
Foundation Models
Large Language Models
Multimodal large language models
Perception, prediction, planning, and control
Prompt engineering
Reinforcement Learning from Human Feedback
Safety alignment and verification
Vision Foundation Models
Vision-Language Models

Access to Document

10.1016/j.engappai.2026.114805

Cite this

@article{1a9c530fe21442bca68352e84e2689b7,

title = "Foundation models for autonomous driving: A comprehensive survey",

abstract = "Large Language Models (LLMs) have showcased remarkable proficiency in various information-processing tasks. They excel at data extraction, literature summarization, content generation, predictive modeling, decision-making, and system control. Moreover, Vision-Language Models (VLMs) and Multimodal LLMs (MLLMs), collectively referred to in this work as Cross-modal Language Models (XLMs), integrate multiple data modalities with language understanding, thereby advancing Autonomous Driving Systems (ADS). On the implemented Artificial Intelligence (AI) side, we analyze core techniques such as prompt engineering, supervised fine-tuning, reinforcement learning from human feedback, knowledge distillation, quantization and pruning, and safety alignment/verification, together with edge-aware deployment strategies. On the application of AI side, we map XLMs capabilities to the driving stack, including perception, prediction, planning, control, and human–machine interaction/vehicle-to-everything, and summarize how XLMs improve scene understanding, intent forecasting, decision-making, and closed-loop control by coupling natural-language reasoning with multimodal sensory inputs, such as panoramic images, Light Detection and Ranging (LiDAR), and radar. In this survey, we synthesize the state of XLMs for ADS: we review the relevant literature on ADS and XLMs, including their architectures, tools, and frameworks. We then compare deployment approaches across the driving stack and summarize datasets, simulators, and benchmarks for both open- and closed-loop evaluation. Finally, we analyze key challenges, such as grounding and hallucination, long-tail robustness, real-time and resource constraints, safety alignment and verification, and data governance and privacy, and outline research directions toward safe, efficient, and trustworthy XLM-enabled ADS.",

keywords = "Application of artificial intelligence, Autonomous Driving Systems, Cross-modal Language Models, Datasets and simulators implemented artificial intelligence, Decision making and planning, Edge deployment for real-time inference, Foundation Models, Large Language Models, Multimodal large language models, Perception, prediction, planning, and control, Prompt engineering, Reinforcement Learning from Human Feedback, Safety alignment and verification, Vision Foundation Models, Vision-Language Models",

author = "Sonda Fourati and Wael Jaafar and Noura Baccar and Safwan Alfattani and Rami Langar",

note = "Publisher Copyright: {\textcopyright} 2026 The Authors.",

year = "2026",

month = jul,

day = "15",

doi = "10.1016/j.engappai.2026.114805",

language = "English",

volume = "176",

journal = "Engineering Applications of Artificial Intelligence",

issn = "0952-1976",

publisher = "Elsevier Ltd",

}

TY - JOUR

T1 - Foundation models for autonomous driving

T2 - A comprehensive survey

AU - Fourati, Sonda

AU - Jaafar, Wael

AU - Baccar, Noura

AU - Alfattani, Safwan

AU - Langar, Rami

PY - 2026/7/15

Y1 - 2026/7/15

N2 - Large Language Models (LLMs) have showcased remarkable proficiency in various information-processing tasks. They excel at data extraction, literature summarization, content generation, predictive modeling, decision-making, and system control. Moreover, Vision-Language Models (VLMs) and Multimodal LLMs (MLLMs), collectively referred to in this work as Cross-modal Language Models (XLMs), integrate multiple data modalities with language understanding, thereby advancing Autonomous Driving Systems (ADS). On the implemented Artificial Intelligence (AI) side, we analyze core techniques such as prompt engineering, supervised fine-tuning, reinforcement learning from human feedback, knowledge distillation, quantization and pruning, and safety alignment/verification, together with edge-aware deployment strategies. On the application of AI side, we map XLMs capabilities to the driving stack, including perception, prediction, planning, control, and human–machine interaction/vehicle-to-everything, and summarize how XLMs improve scene understanding, intent forecasting, decision-making, and closed-loop control by coupling natural-language reasoning with multimodal sensory inputs, such as panoramic images, Light Detection and Ranging (LiDAR), and radar. In this survey, we synthesize the state of XLMs for ADS: we review the relevant literature on ADS and XLMs, including their architectures, tools, and frameworks. We then compare deployment approaches across the driving stack and summarize datasets, simulators, and benchmarks for both open- and closed-loop evaluation. Finally, we analyze key challenges, such as grounding and hallucination, long-tail robustness, real-time and resource constraints, safety alignment and verification, and data governance and privacy, and outline research directions toward safe, efficient, and trustworthy XLM-enabled ADS.

KW - Application of artificial intelligence

KW - Autonomous Driving Systems

KW - Cross-modal Language Models

KW - Datasets and simulators implemented artificial intelligence

KW - Decision making and planning

KW - Edge deployment for real-time inference

KW - Foundation Models

KW - Large Language Models

KW - Multimodal large language models

KW - Perception, prediction, planning, and control

KW - Prompt engineering

KW - Reinforcement Learning from Human Feedback

KW - Safety alignment and verification

KW - Vision Foundation Models

KW - Vision-Language Models

UR - https://www.scopus.com/pages/publications/105035856838

U2 - 10.1016/j.engappai.2026.114805

DO - 10.1016/j.engappai.2026.114805

M3 - Short survey

AN - SCOPUS:105035856838

SN - 0952-1976

VL - 176

JO - Engineering Applications of Artificial Intelligence

JF - Engineering Applications of Artificial Intelligence

M1 - 114805

ER -
