WIMI Hologram Academy: Vision-based Human-Computer Gesture Interaction Technology in Virtual Reality


HONG KONG, July 18, 2022 (GLOBE NEWSWIRE) -- WIMI Hologram Academy, working in partnership with the Holographic Science Innovation Center, has written a new technical article describing vision-based human-computer gesture interaction technology in virtual reality. The article follows below:

Interaction is one of the three defining characteristics of virtual reality. Human-computer interaction in virtual reality refers to the user interacting, through interactive devices, with objects in the computer-generated virtual world in a convenient and natural way. Scientists from WIMI Hologram Academy of WIMI Hologram Cloud Inc. (NASDAQ: WIMI) discussed a new technology in virtual reality interaction: vision-based human-computer gesture interaction technology.

1. Vision-based gesture interaction technology

Gestures are one of the most important forms of non-verbal communication between people, and one of the most important ways for humans to interact with VR virtual environments. The accuracy and speed of gesture recognition directly affect the accuracy, fluency and naturalness of human-computer interaction. Vision-based gesture interaction technology requires users to wear no devices; it is convenient, natural and expressive, fits the general trend toward natural human-computer interaction, and has a wide range of applications. As an important part of human-computer interaction, vision-based gesture interaction is essential for realizing natural interaction between humans and VR virtual environments, and its application prospects are broad.

Vision-based gesture interaction uses gesture recognition methods to realize human-computer interaction. The interaction process consists of four main steps: 1) Data acquisition: images of the human hand are acquired by a camera. 2) Hand detection and segmentation: detect whether there is a hand in the input image and, if so, locate it and segment the hand region. 3) Gesture recognition: extract features from the hand region and classify the gesture. 4) Control of the virtual environment: send the recognition result to the virtual environment control system to drive the virtual character or object to perform specific movements. Among these steps, gesture recognition is the core of the whole gesture interaction process, while hand detection and segmentation are its foundation.
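
The following is a minimal sketch of this four-step loop, assuming OpenCV for camera capture. The functions detect_and_segment_hand, classify_gesture and send_to_vr are hypothetical placeholders for the detection, recognition and control stages described above, not an implementation from the article.

```python
# Minimal sketch of the four-step gesture interaction loop described above.
# The three helper functions are hypothetical placeholders.
import cv2

def detect_and_segment_hand(frame):
    """Hypothetical: return a cropped hand region, or None if no hand is found."""
    ...

def classify_gesture(hand_roi):
    """Hypothetical: return a gesture label for the segmented hand region."""
    ...

def send_to_vr(label):
    """Hypothetical: forward the recognized gesture to the VR control system."""
    ...

cap = cv2.VideoCapture(0)                        # 1) data acquisition from the camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hand_roi = detect_and_segment_hand(frame)    # 2) hand detection and segmentation
    if hand_roi is not None:
        label = classify_gesture(hand_roi)       # 3) gesture recognition
        send_to_vr(label)                        # 4) drive the virtual environment
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
```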

Gesture recognition is the key technology of gesture interaction: it directly determines the effectiveness of the interaction and plays a pivotal role in the whole process. The following sections introduce gesture recognition technology.

1.1 Hand detection and segmentation

Hand detection and segmentation are the first step of gesture recognition and its foundation. Hand detection determines whether there is a hand in the image and locates it; hand segmentation extracts the hand region from the image for subsequent processing and helps reduce the computational effort. Generally, objects can be characterized by three visual features: shape, texture, and color. At typical viewing distances the texture of the hand is smooth and low in contrast, so texture features offer little advantage for hand detection; most current methods therefore rely on shape and color features. Accordingly, common hand detection methods fall into the following categories: shape-based, skin color-based, and motion information-based.

1.1.1 Shape feature-based method

Shape is an important feature for describing the content of an image. The shape of the hand is distinctive, so differences in shape can be used to extract the hand from the image. It is also possible to train classifiers on shape information using image training sets. Such methods treat hand detection as classification-based object detection and usually assume that the shapes of different hand gestures differ from each other far more than the shapes of the same gesture performed by different people. They often use features such as the Histogram of Oriented Gradients (HOG), Haar wavelets, and the Scale-Invariant Feature Transform (SIFT).
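
As a rough illustration (not the article's implementation), the sketch below treats hand detection as binary classification of fixed-size image patches: HOG features feed a linear SVM that separates "hand" from "background". The 64x64 patch size, the random placeholder data, and the helper name hog_features are assumptions for the example.

```python
# Minimal sketch: shape-based hand detection as patch classification with
# HOG features and a linear SVM. The training patches below are random
# placeholders; a real system would use labelled hand/background patches.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(patch_gray):
    # patch_gray: a 64x64 grayscale patch with values in [0, 1].
    return hog(patch_gray, orientations=9,
               pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# Hypothetical training data: 10 "hand" (1) and 10 "background" (0) patches.
train_patches = [np.random.rand(64, 64) for _ in range(20)]
train_labels = np.array([1] * 10 + [0] * 10)

X = np.array([hog_features(p) for p in train_patches])
clf = LinearSVC().fit(X, train_labels)

# Classify a new candidate window as hand (1) or background (0).
candidate = np.random.rand(64, 64)
print(clf.predict([hog_features(candidate)]))
```

In practice such a classifier is run over a sliding window or over candidate regions to localize the hand, as in classification-based object detection.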

1.1.2 Skin color-based method

Human skin color generally differs from the background to some extent, and skin color is naturally invariant to translation and rotation, so it is largely unaffected by shooting viewpoint, pose, etc. Skin color-based methods are therefore less computationally intensive and faster, and they are a common choice for hand detection. However, they are easily affected by individual differences in skin tone, lighting, skin-colored backgrounds, etc. To use skin color information for hand detection, a color space (RGB, HSV, YCbCr, YUV, etc.) must first be selected. To enhance the robustness of skin color detection under different lighting conditions, a color space that separates the luminance and chrominance components (e.g., HSV, YCbCr) is preferred.
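
A minimal sketch of such a skin color threshold in the YCrCb space is shown below, assuming OpenCV. The Cr/Cb bounds are commonly cited heuristics rather than values from this article, and they typically need tuning for the camera, lighting and user.

```python
# Minimal sketch: skin-color hand segmentation in YCrCb, which separates
# luminance (Y) from chrominance (Cr, Cb). Thresholds are heuristic.
import cv2
import numpy as np

def skin_mask(frame_bgr):
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 133, 77], dtype=np.uint8)     # Y, Cr, Cb lower bounds
    upper = np.array([255, 173, 127], dtype=np.uint8)  # Y, Cr, Cb upper bounds
    mask = cv2.inRange(ycrcb, lower, upper)
    # Morphological opening removes small skin-colored noise in the background.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

frame = cv2.imread("hand.jpg")            # hypothetical input image
if frame is not None:
    cv2.imwrite("hand_mask.png", skin_mask(frame))
```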

1.1.3 Motion information-based method

Motion information can also be used to detect hands. However, this places certain requirements on the person and the background: hand movements cannot be too fast, the person and background should remain relatively still, and the scene lighting should remain stable. When the image acquisition equipment is fixed and the background is stationary or changes very little, this is called static background detection. There are three main detection methods in this case: the optical flow method, the inter-frame difference method and the background difference method. The optical flow method obtains comprehensive scene information, including not only the gesture but also other information such as the surrounding scene. Without any prior information about the image, it can detect the moving target independently, so it is more self-contained and more widely applicable, but it is computationally complex and has difficulty meeting real-time requirements without acceleration techniques. The inter-frame difference method is simpler and faster and can eliminate the influence of external factors to a certain extent; it is relatively stable but less accurate, the extracted target boundaries are often incomplete, and it places stricter requirements on the interval between adjacent frames. The background difference method is also simple and fast and detects moving targets more completely, but it can only be applied when the camera observes a fixed, static background; its false detection rate is high, and the detected motion area often includes regions other than the hand (such as the arm). Motion information can be used alone to detect the hand, or combined with other visual cues to detect the hand region.
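
As an illustrative sketch only, the snippet below contrasts the two simpler approaches for a fixed camera: inter-frame differencing via cv2.absdiff and background differencing via OpenCV's MOG2 background subtractor. The threshold value of 25 is an arbitrary assumption.

```python
# Minimal sketch: motion-based foreground detection with a fixed camera,
# using inter-frame differencing and a background subtractor.
import cv2

cap = cv2.VideoCapture(0)                     # fixed camera assumed
bg_subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY) if ok else None

while ok:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Inter-frame difference: fast, but target boundaries are often incomplete.
    motion_mask = cv2.threshold(cv2.absdiff(gray, prev_gray),
                                25, 255, cv2.THRESH_BINARY)[1]

    # Background difference: more complete motion regions, but may include
    # moving areas other than the hand (such as the arm).
    fg_mask = bg_subtractor.apply(frame)

    prev_gray = gray
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
```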

1.2 Gesture recognition

Gesture recognition is a key technology for gesture interaction. It is the process of extracting features from the segmented hand region and classifying the gesture. It can also be understood as classifying points or trajectories in the model parameter space into subsets of that space: a static gesture corresponds to a point in the model parameter space, while a dynamic gesture corresponds to a trajectory. Gesture recognition methods are broadly classified into template matching methods, machine learning methods, and hidden Markov model methods.

1.2.1 Template matching approach

The template matching method is one of the earliest and simplest pattern recognition methods and is mostly used for static gesture recognition. The input image is matched against a template (a point set, curve or shape) and classified according to the degree of similarity. Common similarity measures include Euclidean distance, Hausdorff distance, and cosine similarity. Contour edge matching, elastic graph matching, etc. are all template matching methods. Template matching is simple and fast, relatively insensitive to lighting, background and pose, and widely applicable; however, its classification accuracy is limited, the number of recognizable gesture types is small, and it is best suited to small sample sets whose shape varies little.
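
To make the idea concrete, here is a minimal nearest-template classifier sketch. The gesture names and feature vectors are invented placeholders, and Euclidean distance stands in for any of the similarity measures listed above.

```python
# Minimal sketch: template matching for static gestures. Each gesture class is
# represented by one stored feature template; an input is assigned to the class
# whose template is closest.
import numpy as np

templates = {                            # hypothetical gesture templates
    "fist":  np.array([0.9, 0.1, 0.2]),
    "palm":  np.array([0.1, 0.9, 0.8]),
    "point": np.array([0.5, 0.2, 0.9]),
}

def classify_by_template(features, templates):
    # Return the label of the template with the smallest Euclidean distance.
    return min(templates, key=lambda k: np.linalg.norm(features - templates[k]))

print(classify_by_template(np.array([0.15, 0.85, 0.75]), templates))  # prints "palm"
```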

1.2.2 Machine learning based approach

Machine learning uses statistical methods to solve problems involving uncertainty. It is dedicated to the study of computer algorithms that generate models from data, i.e. "learning algorithms". With a learning algorithm, a model can be generated from data and then used to make appropriate judgments when facing new situations. Machine learning is developing rapidly and is currently a hot research area in computer applications. Much static gesture recognition is based on machine learning methods. Commonly used algorithms include support vector machines, artificial neural networks, and AdaBoost.

The support vector machine is a binary classification model. Its basic form is the maximum-margin linear classifier defined on the feature space, and it can be extended to a nonlinear classifier using kernel methods. Its learning strategy is margin maximization, which can be formalized as a convex quadratic programming problem that has a globally optimal solution.

Artificial neural networks, born in the early 1940s, are massively parallel interconnected networks of simple, adaptive units that can simulate the responses of the biological nervous system to the real world. They offer strong fault tolerance, robustness, high parallelism, adaptivity, anti-interference capability and self-learning ability. With the arrival of the deep learning boom, neural networks have received renewed attention and are widely used in problems such as speech recognition and image classification. There are many kinds of neural networks, and their gesture recognition rate is generally limited by the hand detection model, the training samples, etc.

The boosting algorithm is a statistical learning method that turns weak learning algorithms into a strong one. It constructs a series of basic classifiers (weak classifiers) by iteratively adjusting the weight distribution of the training data and linearly combines them into a strong classifier. The original boosting algorithm requires knowing in advance an upper bound on the error of the weak classifiers, which is difficult in practice. AdaBoost removes this requirement and is widely applied in human detection and recognition. Its advantages are as follows: AdaBoost provides a framework within which sub-classifiers can be constructed by a variety of methods, and simple weak classifiers can be used; it requires no prior knowledge of the weak classifiers, nor an error bound known in advance; and the accuracy of the final strong classifier depends on the accuracy of all the weak classifiers, so their capabilities can be fully exploited. However, during training AdaBoost causes the weights of difficult samples to grow exponentially, biasing the training toward those samples; this affects the error calculation and the selection of classifiers and reduces accuracy. In addition, AdaBoost is susceptible to noise, its performance depends on the choice of weak classifiers, and training the weak classifiers is time-consuming.
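
For illustration only, the sketch below trains the two classifiers discussed above on random placeholder feature vectors using scikit-learn: an RBF-kernel SVM (margin maximization plus the kernel trick) and AdaBoost, whose default weak learner is a depth-1 decision tree (a stump). The feature dimensionality and gesture labels are assumptions.

```python
# Minimal sketch: SVM and AdaBoost on placeholder gesture feature vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))        # hypothetical 16-dim gesture features
y = rng.integers(0, 3, size=100)      # hypothetical labels for 3 gesture types

svm = SVC(kernel="rbf").fit(X, y)     # margin maximization + kernel method
ada = AdaBoostClassifier(n_estimators=50).fit(X, y)   # weak learner defaults
                                                      # to a depth-1 tree (stump)

sample = rng.normal(size=(1, 16))
print(svm.predict(sample), ada.predict(sample))
```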

1.2.3 Hidden Markov model approach

Hidden Markov models (HMMs) are probabilistic models of temporal sequences. They describe a process in which a hidden Markov chain generates an unobservable random sequence of states, and each state in turn generates an observation, producing a random observation sequence. HMMs are well suited to describing sequential data and are particularly suitable for context-dependent situations. The hidden Markov model is an extension of the Markov chain, a dynamic Bayesian network with a simple structure, and a well-known directed graphical model; as a typical probabilistic-statistical method it is widely used in fields such as speech recognition and gesture recognition. For gesture recognition, HMMs are better suited to continuous gesture recognition, especially for complex gestures involving context. HMM training and recognition are computationally intensive; in particular, when analyzing continuous signals, the state transitions require computing a large number of probability densities and many parameters, which makes sample training and target recognition slow. To mitigate this, discrete hidden Markov models are generally used in gesture recognition systems.
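
As a toy sketch of the discrete-HMM approach (not the article's system), the code below scores a sequence of quantized trajectory symbols against one small HMM per gesture using the forward algorithm, and picks the most likely gesture. All model parameters and the observation sequence are invented placeholders.

```python
# Minimal sketch: discrete-HMM gesture classification. Each gesture has its own
# HMM (start, transition and emission matrices); the gesture whose model gives
# the observation sequence the highest likelihood wins.
import numpy as np

def forward_likelihood(obs, start, trans, emit):
    """Probability of a discrete observation sequence under an HMM (forward algorithm)."""
    alpha = start * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    return alpha.sum()

# Hypothetical models: 2 hidden states, 3 observation symbols per gesture.
models = {
    "swipe": (np.array([0.8, 0.2]),
              np.array([[0.7, 0.3], [0.4, 0.6]]),
              np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])),
    "circle": (np.array([0.5, 0.5]),
               np.array([[0.5, 0.5], [0.5, 0.5]]),
               np.array([[0.2, 0.6, 0.2], [0.3, 0.4, 0.3]])),
}

observed = [0, 0, 1, 2, 2]   # quantized hand-trajectory symbols (placeholder)
best = max(models, key=lambda g: forward_likelihood(observed, *models[g]))
print(best)
```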

2. Conclusion

Vision-based gesture interaction is an important way for humans to interact with virtual environments. It is natural and convenient and is of great significance to the immersive experience of virtual reality. Although much progress has been made, many problems remain to be solved, such as hand detection against complex backgrounds, integration with other interaction methods, and functional integration. Vision-based gesture interaction has important scientific value and broad application prospects. With the increasing demand for immersive experiences in virtual reality, vision-based gesture interaction will certainly play an important role in virtual reality.

Founded in August 2020, WIMI Hologram Academy is dedicated to the exploration of holographic AI vision and to research on basic science and innovative technologies driven by human vision. The Holographic Science Innovation Center, in partnership with WIMI Hologram Academy, is committed to exploring the unknown technology of holographic AI vision, attracting, gathering, and integrating relevant global resources and superior forces, promoting comprehensive innovation with scientific and technological innovation as the core, and carrying out basic science and innovative technology research.

Contacts
Holographic Science Innovation Center
Email: pr@holo-science.com