Automated Road Environment Perception for Safety Assessment with Tuning‑Free Vision Foundation Models

Yin, Shuyi

Date

2025-08-01

Abstract

Persistent roadway fatalities in the United States expose the limitations of reactive safety practice: engineering countermeasures are usually deployed only after crashes accumulate and patterns emerge. The Safe System philosophy seeks to break this cycle by insisting that the road environment be designed to anticipate human error, and the International Road Assessment Programme (iRAP) Star Ratings put this philosophy into practice by scoring every 100-meter segment on its ability to prevent human errors and to tolerate them when they occur. Yet producing these ratings still relies on labor-intensive image annotation that covers only a few dozen kilometers per annotator per day. Meanwhile, vast archives of street-level imagery already cover road networks at scale, and recent vision foundation models can segment objects they were not explicitly trained to recognize. Closing the gap between this technological capacity and today’s slow manual workflows, so that Star Ratings can be generated quickly, consistently, and at scale, is the core mission of this dissertation. To that end, this research pursues three tightly linked objectives. First, it diagnoses why existing computer vision pipelines fall short of iRAP needs, formalizes road environment perception as a domain task for safety assessment, and identifies the Segment Anything Model (SAM) as a promising backbone for exploration. Second, it designs pipelines that recast infrastructure recognition as a semantic segmentation task, enriching SAM with either domain-aware priors that guide mask labeling or tailored prompts that encode infrastructure class descriptions, all while preserving the model’s zero-shot versatility. Third, it integrates these methods into real-world workflows within elements of the Safe System framework, demonstrating end-to-end scalability in agency contexts. These objectives materialize in three core contributions. 
First, this dissertation demonstrates that emerging transportation big data sources, such as street-level images and connected vehicle data, streamline and automate key steps in road safety assessment, reducing manual effort while preserving analytical rigor for road infrastructure classes. Second, this dissertation introduces two vision pipelines that leverage tuning-free foundation models, augmented with domain priors and engineered class prompts, to steer semantic segmentation for infrastructure element recognition without task-specific fine-tuning. The first pipeline, grid-based sampling, derives SAM’s class-agnostic masks, injects domain knowledge through uncertainty-weighted votes from specialized domain models, and then employs an iterative sampler to close coverage gaps for road infrastructure object classes, improving segmentation performance for those classes on the research dataset. The second pipeline, target sampling, accelerates segmentation of high-priority infrastructure elements by pairing SAM with a multi-modal foundation model that converts engineered class descriptions into informative box and point prompts, eliminating post-hoc voting and achieving zero-shot transferability. Third, this dissertation integrates the proposed vision pipelines into a unified probabilistic framework that quantifies the recognition uncertainty of speed limit signs in street-level images, fuses these observations with connected vehicle data, models sampling- and penetration-rate variability, and estimates link-level posterior distributions of speed compliance risk at network scale. In conclusion, this research underscores the imperative for automated safety assessment that places Safe Roads at the center of the Safe System paradigm, recasts that need as a road environment perception challenge, and formalizes the challenge as a tractable semantic segmentation problem. 
By combining data-driven priors, domain-tuned prompts, and uncertainty-aware fusion, the proposed techniques harness vision foundation models to convert ubiquitous street-level imagery into a scalable source for iRAP-aligned safety assessments. The resulting pipelines accelerate network-wide safety assessments, have the potential to lower survey costs, and open new transportation-domain research frontiers on prompt engineering, foundation model customization, and multi-modal risk analysis.
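
The uncertainty-weighted voting step described for the grid-based pipeline can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the entropy-based weight, the function names `entropy_weight` and `label_mask`, the dictionary-based data layout, and the toy probabilities are all assumptions chosen for illustration. The idea shown is only that each domain model's per-pixel vote inside a class-agnostic mask counts for more when that model's prediction is confident.

```python
import math
from collections import defaultdict


def entropy_weight(probs):
    """Hypothetical confidence weight in [0, 1].

    Returns 1 for a one-hot prediction and 0 for a uniform one,
    using one minus the normalized Shannon entropy.
    """
    k = len(probs)
    if k < 2:
        return 1.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - h / math.log(k)


def label_mask(mask_pixels, model_outputs, class_names):
    """Assign one class to a class-agnostic mask by weighted voting.

    mask_pixels:   list of (row, col) pixels inside the mask.
    model_outputs: list of dicts, one per domain model, mapping
                   (row, col) -> list of per-class probabilities.
    class_names:   class name for each probability index.
    """
    scores = defaultdict(float)
    for output in model_outputs:
        for px in mask_pixels:
            probs = output.get(px)
            if probs is None:  # this model has no prediction for the pixel
                continue
            best = max(range(len(probs)), key=probs.__getitem__)
            scores[class_names[best]] += entropy_weight(probs)
    if not scores:
        return None
    return max(scores, key=scores.get)


# Toy example (made-up numbers): model_a is confident about "guardrail"
# on two pixels, while model_b leans only weakly toward "road" everywhere.
classes = ["road", "guardrail"]
mask = [(0, 0), (0, 1), (0, 2)]
model_a = {(0, 0): [0.05, 0.95], (0, 1): [0.05, 0.95], (0, 2): [0.60, 0.40]}
model_b = {(0, 0): [0.55, 0.45], (0, 1): [0.55, 0.45], (0, 2): [0.55, 0.45]}

print(label_mask(mask, [model_a, model_b], classes))
```

In this toy case a plain majority count would favor "road" (four weak votes against two), whereas the weighted vote selects "guardrail" because the two confident votes carry most of the weight. Other uncertainty measures could be substituted for the entropy weight without changing the structure.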