In recent years, gesture recognition has been widely used in intelligent driving, virtual reality, and human-computer interaction. With the development of artificial intelligence, deep learning has achieved remarkable success in computer vision. To help researchers better understand the current state of gesture recognition in video, this article provides a detailed survey of recent deep-learning-based methods for video gesture recognition. The reviewed methods are broadly categorized into three groups according to the type of neural network used for recognition: two-stream convolutional neural networks, 3D convolutional neural networks, and long short-term memory (LSTM) networks. In this review, we discuss the advantages and limitations of existing techniques, focusing on how they extract spatiotemporal structure information from a video sequence, and consider future research directions.
Vision-based three-dimensional (3D) human hand shape and pose estimation has attracted significant attention recently owing to its key role in various applications, such as natural human-computer interaction. With the availability of large-scale annotated hand datasets and the rapid development of deep neural networks (DNNs), numerous DNN-based data-driven methods have been proposed for accurate and rapid hand shape and pose estimation. Nonetheless, complicated hand articulation, depth and scale ambiguities, occlusions, and finger similarity remain challenging. In this study, we present a comprehensive survey of state-of-the-art 3D hand shape and pose estimation approaches using RGB-D cameras. Related RGB-D cameras, hand datasets, and a performance analysis are also discussed to provide a holistic view of recent achievements. We also discuss the research potential of this rapidly growing field.
Gesture recognition has attracted significant attention because of its wide range of potential applications. Although multi-modal gesture recognition has made significant progress in recent years, a popular approach is still to simply fuse the prediction scores at the end of each branch. This often ignores the complementary features among different modalities in the early stages and fails to fuse them into a more discriminative feature.
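The late score fusion described above can be illustrated with a minimal sketch. It assumes two independently trained branches (RGB and depth) that each output one logit per gesture class; the function names and the example logits are hypothetical and are not taken from any specific system.

```python
import math

def softmax(logits):
    """Convert raw class logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_score_fusion(rgb_logits, depth_logits):
    """Average the per-class probabilities of two branches.

    The branches interact only here, after each has already produced
    its final prediction scores.
    """
    rgb_probs = softmax(rgb_logits)
    depth_probs = softmax(depth_logits)
    return [(r + d) / 2.0 for r, d in zip(rgb_probs, depth_probs)]

# Hypothetical logits for a three-class problem.
fused = late_score_fusion([2.0, 0.5, 0.1], [0.2, 1.5, 0.3])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because the two branches meet only at the score level, any intermediate features that one modality could contribute to the other are never combined, which is the limitation noted above.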
This paper proposes an Adaptive Cross-modal Weighting (ACmW) scheme to exploit complementary features from RGB-D data. The scheme learns the relations among different modalities by combining the features of different data streams. The proposed ACmW module has two key functions: (1) fusing complementary features from multiple streams through an adaptive one-dimensional convolution; and (2) modeling the correlation of multi-stream complementary features in the time dimension. By combining these two functions, the proposed ACmW can automatically analyze the relationship between the complementary features from different streams and fuse them in the spatial and temporal dimensions.
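To make the two functions more concrete, the following is a rough sketch of the general idea, not the ACmW module itself. It assumes each stream provides one feature vector per frame, gates the two streams channel by channel using a one-dimensional convolution across channels, and then mixes the fused features along the time axis with a second one-dimensional convolution. All function names, the gating formula, and the fixed kernel values are illustrative assumptions; in a trained model the kernels would be learned.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; output length equals input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out

def cross_modal_fuse(rgb_feat, depth_feat, channel_kernel):
    """Fuse two feature vectors of one frame with per-channel gates.

    Assumed gating: a 1D convolution over the summed channels yields one
    gate per channel, which blends the RGB and depth features.
    """
    summed = [r + d for r, d in zip(rgb_feat, depth_feat)]
    gates = [sigmoid(v) for v in conv1d_same(summed, channel_kernel)]
    return [g * r + (1.0 - g) * d
            for g, r, d in zip(gates, rgb_feat, depth_feat)]

def temporal_mix(frames, temporal_kernel):
    """Apply a 1D convolution along the time axis, one channel at a time."""
    num_channels = len(frames[0])
    columns = []
    for c in range(num_channels):
        series = [frame[c] for frame in frames]
        columns.append(conv1d_same(series, temporal_kernel))
    return [[columns[c][t] for c in range(num_channels)]
            for t in range(len(frames))]

def fuse_streams(rgb_seq, depth_seq, channel_kernel, temporal_kernel):
    """Cross-modal gating per frame, followed by temporal mixing."""
    fused = [cross_modal_fuse(r, d, channel_kernel)
             for r, d in zip(rgb_seq, depth_seq)]
    return temporal_mix(fused, temporal_kernel)

# Hypothetical input: 3 frames, 4 channels per stream.
rgb_seq = [[0.2, 0.4, 0.1, 0.9], [0.3, 0.5, 0.2, 0.8], [0.1, 0.6, 0.3, 0.7]]
depth_seq = [[0.5, 0.1, 0.4, 0.2], [0.6, 0.2, 0.3, 0.1], [0.4, 0.3, 0.5, 0.2]]
output = fuse_streams(rgb_seq, depth_seq, [0.25, 0.5, 0.25], [0.2, 0.6, 0.2])
```

In contrast to score-level fusion, the streams here are combined at the feature level, so a channel that is informative in one modality can influence the fused representation before any classification takes place.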
Extensive experiments validate the effectiveness of the proposed method and show that it outperforms state-of-the-art methods on the IsoGD and NVGesture datasets.
There is a large group of deaf-mutes worldwide, and sign language is a major communication tool in this community. Deaf-mutes need to be able to communicate with hearing people, and hearing people also need to understand sign language, which creates a great demand for sign language tuition. Although a large number of books have been written about sign language, learning it through reading alone is inefficient, and the same can be said of watching videos. To solve this problem, we developed a smartphone-based interactive Chinese sign language teaching system that facilitates sign language learning.
The system provides the learner with several learning modes and captures the learner's actions using the front camera of the smartphone. At present, the system provides a vocabulary of 1000 frequently used words, and learners can evaluate their signing by subjective or objective comparison. In the word recognition mode, users can perform the sign for any word in the vocabulary, and the system returns the top three retrieved candidates, which helps remind learners what the sign is.
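The top-three retrieval step can be sketched as follows. The sketch assumes the recognizer produces one similarity score per vocabulary word; the words and scores shown are hypothetical placeholders, and the scoring model itself is not shown.

```python
import heapq

def top_k_candidates(scores, k=3):
    """Return the k (word, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical recognizer output for one performed sign.
scores = {
    "hello": 0.12,
    "thanks": 0.55,
    "friend": 0.08,
    "school": 0.20,
    "family": 0.05,
}
top3 = top_k_candidates(scores)
top3_words = [word for word, _ in top3]
```

Returning several candidates rather than a single label lets the learner see near misses, which is useful when two signs look similar.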
The system provides interactive learning that enables users to learn sign language efficiently. It adopts an algorithm based on point cloud recognition to evaluate a user's sign and takes about 700 ms of inference time per sample, which meets the real-time requirements.
This interactive learning system reduces the communication barriers between deaf-mutes and hearing people.