Vision Transformer-based Model for Human Action Recognition in Still Images
Keywords:
Human action recognition, HAR, Vision transformer, ViT, Transformer encoder

Abstract
Human action recognition is a critical task in computer vision, enabling systems to understand and interpret human actions from images. Action recognition in still images poses a unique challenge, as traditional methods often rely on temporal information that is absent in static images. This study explores the advantages of Vision Transformers (ViTs) for recognizing human actions in still images, exploiting their ability to capture complex patterns and relationships in visual data. We propose a robust method that employs spatial self-attention mechanisms to highlight the features associated with various human poses and contexts, addressing challenges such as occlusion and pose variation by focusing attention on key pose and contextual cues. We conduct extensive experiments on two benchmark datasets, Stanford40 and PASCAL VOC 2012 Action, where the proposed model achieves accuracies of 97.4% and 94.8%, respectively. The experimental results demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on both still-image datasets. The high accuracy suggests that the ViT model generalizes well across different action categories, even when the data include variations in human posture, background, and scene complexity.
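To make the described pipeline concrete, the sketch below shows how a standard ViT backbone can be fine-tuned for 40-class still-image action classification, matching the Stanford40 label set. This is a minimal illustration, not the authors' implementation: the ViT-B/16 backbone, ImageNet pre-trained weights, and input resolution are assumptions chosen for the example.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative backbone choice: ViT-B/16 pre-trained on ImageNet.
# The paper's exact backbone and training recipe are not specified here.
model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)

# Replace the classification head for the 40 action classes of Stanford40.
num_classes = 40
model.heads.head = nn.Linear(model.heads.head.in_features, num_classes)

# The ViT encoder applies multi-head self-attention over the grid of 16x16
# image patches, letting every patch attend jointly to pose and context cues.
dummy = torch.randn(1, 3, 224, 224)  # one RGB still image, resized to 224x224
logits = model(dummy)                # shape: (1, 40), one score per action class
print(logits.shape)
```

In practice the head (and optionally the full encoder) would be fine-tuned with a cross-entropy loss over the action labels; the same setup carries over to PASCAL VOC 2012 Action by changing `num_classes`.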