Publications

SignVLM: a pre-trained large video model for sign language recognition

  • Authority: PeerJ Computer Science
  • Category: Journal Publication

Sign language recognition (SLR) plays a vital role in integrating people with hearing impairments into the community. It enables the recognition of sign gestures and their conversion into spoken language. One of the main challenges in developing SLR systems is the lack of annotated datasets, an issue that is especially pronounced for low-resourced sign languages. To address this, we propose a pre-trained large vision model, SignVLM, for SLR. This work explores the capability of the contrastive language-image pre-training (CLIP) model for SLR: the CLIP image encoder extracts spatial features from the sign video frames, while a Transformer decoder is used for temporal learning. The proposed model has been evaluated on four different sign languages using the KArSL, WLASL, LSA64, and AUTSL datasets. Several evaluation settings are considered, including zero-shot and few-shot learning. The proposed model outperformed other models on the KArSL, WLASL, and LSA64 datasets and achieved comparable performance on the AUTSL dataset. The results demonstrate that the proposed model generalizes to new datasets with few training samples. The code and data are available at https://github.com/Hamzah-Luqman/signVLM.
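To illustrate the general idea of pairing a frozen CLIP image encoder with a Transformer decoder for temporal aggregation, the sketch below shows one possible arrangement. It is a minimal, hypothetical example (class, module, and parameter names such as ClipSignClassifier are illustrative and not taken from the SignVLM repository); the authors' actual implementation is available at the link above.

    # Hypothetical sketch: CLIP frame encoder + Transformer decoder for SLR.
    # Names are illustrative; this is not the SignVLM reference implementation.
    import torch
    import torch.nn as nn
    from transformers import CLIPVisionModel


    class ClipSignClassifier(nn.Module):
        def __init__(self, num_classes: int, num_frames: int = 16,
                     clip_name: str = "openai/clip-vit-base-patch32",
                     freeze_clip: bool = True):
            super().__init__()
            # Pre-trained CLIP image encoder: one spatial feature per frame.
            self.clip = CLIPVisionModel.from_pretrained(clip_name)
            if freeze_clip:
                for p in self.clip.parameters():
                    p.requires_grad = False
            d_model = self.clip.config.hidden_size  # 768 for the ViT-B/32 variant
            # Learnable frame-position embeddings and a single class query token.
            self.pos_emb = nn.Parameter(torch.zeros(1, num_frames, d_model))
            self.query = nn.Parameter(torch.zeros(1, 1, d_model))
            # Transformer decoder: the class query cross-attends over the
            # per-frame features to aggregate temporal information.
            layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8,
                                               batch_first=True)
            self.temporal = nn.TransformerDecoder(layer, num_layers=2)
            self.head = nn.Linear(d_model, num_classes)

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, time, 3, 224, 224), already CLIP-preprocessed
            b, t = frames.shape[:2]
            feats = self.clip(pixel_values=frames.flatten(0, 1)).pooler_output
            feats = feats.view(b, t, -1) + self.pos_emb[:, :t]
            query = self.query.expand(b, -1, -1)
            pooled = self.temporal(tgt=query, memory=feats)  # (b, 1, d_model)
            return self.head(pooled.squeeze(1))  # class logits

    # Example: 100-class recognition from a batch of two 16-frame clips.
    model = ClipSignClassifier(num_classes=100)
    logits = model(torch.randn(2, 16, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 100])

Freezing the CLIP backbone and training only the temporal decoder and classifier is one way such a design can remain data-efficient, which is consistent with the few-shot setting described in the abstract.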