Publications

Swin-MSTP: Swin Transformer with Multi-Scale Temporal Perception for Continuous Sign Language Recognition

  • Venue: Neurocomputing
  • Category: Journal Publication

Continuous sign language recognition (CSLR) aims to recognize and interpret sequences of sign language gestures in videos. Most current CSLR frameworks pair spatial feature extractors based on convolutional neural networks (CNNs) with temporal convolutional networks (TCNs) for sequence learning. However, CNN-based spatial feature extractors apply the same convolutional kernel uniformly across all regions of an image, which limits their capacity to capture fine-grained details, such as fingers and facial features, that are essential for CSLR. In addition, sign languages contain signs of varying lengths that cannot be accurately modeled by a TCN with a fixed kernel size. To address these issues, we present the Swin multi-scale temporal perception (Swin-MSTP) framework, which uses the Swin Transformer as the spatial feature extractor, capturing fine spatial details and providing a stronger contextual understanding of the sign language elements in video frames. The Swin Transformer is combined with the MSTP module, which extracts temporal features at multiple scales to cover signs of different durations. Experimental results show that our single-modality system outperforms existing methods, including multimodal frameworks, on the CSL dataset. The model also achieves competitive performance on the Phoenix2014, Phoenix2014-T, and CSL-Daily datasets. The code is available at https://github.com/snalyami/Swin-MSTP.
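
To make the multi-scale temporal idea concrete, here is a minimal PyTorch sketch of one way such a module can be realized: parallel 1D convolutions with different kernel sizes over the frame axis, so that short and long signs fall within different receptive fields. This is not the authors' implementation (see the linked repository for that); the class name, kernel sizes, and fusion by summation are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MultiScaleTemporalConv(nn.Module):
    """Sketch of a multi-scale temporal block: parallel 1D convolutions
    with different kernel sizes along the time axis, fused by summation."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                # padding = k // 2 keeps the temporal length unchanged
                nn.Conv1d(channels, channels, k, padding=k // 2),
                nn.BatchNorm1d(channels),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -- per-frame features from the
        # spatial backbone (e.g., a Swin Transformer), stacked over time
        return sum(branch(x) for branch in self.branches)


if __name__ == "__main__":
    # Toy input: 2 videos, 512-dim frame features, 64 frames each
    frames = torch.randn(2, 512, 64)
    block = MultiScaleTemporalConv(channels=512)
    print(block(frames).shape)  # torch.Size([2, 512, 64])
```

Because each branch preserves the temporal length, the fused output can feed directly into a downstream sequence model, with each time step now summarizing context at several temporal scales.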