Banerjee, Ashis G
Parsa, Behnoosh
2021-03-19
2020
Parsa_washington_0250E_22404.pdf
http://hdl.handle.net/1773/46847

Thesis (Ph.D.)--University of Washington, 2020

With increasingly high interest in assistive robots and smart surveillance systems, we need a powerful perception mechanism to describe the events in a scene. However, achieving accurate perception models is not trivial, since even a single perception task admits unlimited possible scenarios. Developing analytically driven models for such systems is too optimistic a goal; hence, supervised learning, a sub-field of function approximation, has become very popular in robotic perception. Supervised learning is the task of learning a function that maps an input to an output based on example input-output pairs.

Scene understanding is even more involved when it comes to solving Human Action Recognition (HAR) problems. In HAR, the task is to classify human activities from an image or to determine the atomic actions composing an activity in a video. In video-based HAR, there are exponentially many ways that humans can perform the same task. Moreover, the variety in posture and the speed at which people perform activities make solving HAR tasks even more challenging. Therefore, models should be designed to learn the common underlying spatial and temporal properties of human activity in order to generalize.

This thesis is dedicated to designing perception models for recognizing human actions and determining the ergonomic risk associated with them. Specifically, Part I focuses on the Human Activity Segmentation (HAS) problem in long videos, which is the task of semantically segmenting long videos into distinct actions in an offline framework. In Part II, we present our designs for solving online HAR problems, i.e., recognizing human activities in an observed batch of frames.
Since the performance of computer vision algorithms also depends on the quality and relevance of the training data, in Part I we introduce a new dataset for an indoor object manipulation task, called the University of Washington Indoor Object Manipulation (UW-IOM) dataset.

application/pdf
en-US
CC BY-NC

Keywords: Computer Vision; Deep Learning; Graph Convolutional Networks; Human Activity Recognition; Human Postural Assessment; Video Semantic Segmentation; Mechanical engineering

Deep Learning Methods for Video-Based Human Activity Recognition in Industrial Settings

Thesis