Evaluating Vision-Language-Action Models in Robotic Manipulation: Performance, Implementation, and Comparison with Deterministic Systems

Chen, Min; Kemple, Jake
2025-08-01
Kemple_washington_0250O_28398.pdf
https://hdl.handle.net/1773/53215
Thesis (Master's)--University of Washington, 2025

Robotic manipulation systems typically use deterministic policies for perception, decision-making, and task planning, which achieve millimeter-level precision but require extensive specialized development and cannot easily generalize to new tasks. Emerging vision-language-action (VLA) foundation models promise to reduce this specialized effort and inflexibility through learned multimodal reasoning. However, their practicality in the real world and associated development costs remain largely unknown. This thesis presents a real-world comparison of a strong open-source VLA foundation model (OpenVLA-7B) against a fine-tuned deterministic control system. Both systems are evaluated on identical hardware consisting of a WidowX 250 6-DoF robotic arm, an Intel RealSense D415 camera, an NVIDIA Jetson AGX Orin edge computer, and the ROS 2 (Robot Operating System 2) framework. Each system repeatedly executes a pick-and-place robotic task under randomized initial conditions. Performance is measured using goal-oriented, object-centric metrics of accuracy, repeatability, and cycle time, adapted from ISO 9283 standards. Additionally, a qualitative analysis examines the installation effort and configuration challenges associated with each system. The primary contributions of this research include: (i) a comparative evaluation of performance and setup complexity between robotic systems utilizing a VLA-based control policy and conventional deterministic control logic; (ii) documentation of hardware, software, and configuration challenges encountered during VLA system implementation; and (iii) qualitative insights from real-world deployment, emphasizing usability and adaptability.
Results indicate that current VLA foundation models underperform deterministic control systems in accuracy, repeatability, and cycle time, limiting their immediate viability for production-level robotic tasks. However, the inherent flexibility of VLA models suggests strong potential as future replacements for deterministic approaches, contingent upon fine-tuning, further optimization, better integration frameworks, and improved overall performance. These findings offer practical insights and set realistic expectations for developers considering a transition from deterministic robotics systems to VLA-based implementations.

application/pdf
en-US
CC BY
Embodied AI; Robotic Manipulation; Robotics; Vision-Language-Action; VLA; VLA Models
Computer science; Robotics; Artificial intelligence
Computing and software systems
Thesis
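
The abstract names accuracy, repeatability, and cycle time, adapted from ISO 9283, as its metrics but does not give formulas. As a rough illustration only, the sketch below uses the standard ISO 9283 definitions of positional accuracy (distance from the mean attained position to the target) and positional repeatability (mean radial deviation plus three sample standard deviations). Applying them to final object placements, and every function name and data layout shown here, are assumptions for illustration; the thesis's actual adaptation may differ.

```python
import math
from statistics import mean, stdev

# Hypothetical helpers: names and data layout are illustrative assumptions,
# not taken from the thesis.

def barycenter(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

def position_accuracy(target, attained):
    """ISO 9283-style accuracy: distance from the barycenter of the
    attained positions to the commanded (target) position."""
    return math.dist(barycenter(attained), target)

def position_repeatability(attained):
    """ISO 9283-style repeatability: mean distance to the barycenter
    plus three sample standard deviations of those distances."""
    center = barycenter(attained)
    distances = [math.dist(p, center) for p in attained]
    return mean(distances) + 3 * stdev(distances)

def mean_cycle_time(durations_s):
    """Average duration of one pick-and-place cycle, in seconds."""
    return mean(durations_s)
```

With object placements recorded as `(x, y)` tuples in millimeters, `position_accuracy(target, placements)` and `position_repeatability(placements)` would each return a single value in millimeters per system, which makes a side-by-side comparison of the two control approaches straightforward. `position_repeatability` needs at least two trials because it uses the sample standard deviation.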