## 1 Introduction

With the emergence of depth sensors like the Microsoft Kinect, there has been renewed interest in the analysis of gestures using both color (or grayscale) and depth information. A standard procedure for estimating features from videos is to extract points of interest in the video (such as SIFT [1]) and subsequently obtain feature descriptors at these interest points [2], which are then used to train a classifier. One of the most common methods used by researchers in this regard is the Bag-of-Features (BoF) model [3]. The BoF model is applied to videos represented by a set of interest-point-based descriptors. A dictionary of codewords representing the set of descriptors from all the videos is estimated, usually as the centers of descriptor clusters. Each descriptor in a video is then mapped to the closest codeword in the dictionary, resulting in a histogram over the codewords; two descriptors are considered similar if they map to the same codeword. The similarity between two videos is approximated by the similarity between their descriptor histograms. This histogram similarity is a coarse measure, owing to the binary assignment of each descriptor to a single cluster. Recently, Bo et al. [4] applied kernel methods to measure descriptor similarity and devised an efficient method to use these kernels to evaluate image similarity for object recognition in images. In this paper, we show how the concept of efficient kernels can be extended to estimate a kernel-based similarity between videos.
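For illustration, the BoF pipeline described above can be sketched in a few lines of NumPy. The dictionary is assumed to be precomputed (e.g., the centers of K-means clusters over descriptors pooled from all videos); all names and sizes here are illustrative, not the paper's.

```python
import numpy as np

def bof_histogram(descriptors, codewords):
    """Assign each descriptor to its nearest codeword and return the
    normalized histogram of assignments (the video's BoF vector)."""
    dists = np.linalg.norm(descriptors[:, None, :] - codewords[None, :, :], axis=2)
    counts = np.bincount(dists.argmin(axis=1), minlength=len(codewords))
    return counts / counts.sum()

# Toy usage: the dictionary would normally come from K-means over all videos.
rng = np.random.default_rng(0)
codewords = rng.normal(size=(8, 16))   # 8 codewords, 16-D descriptors
video_a = rng.normal(size=(50, 16))    # 50 descriptors from video A
video_b = rng.normal(size=(60, 16))    # 60 descriptors from video B
similarity = bof_histogram(video_a, codewords) @ bof_histogram(video_b, codewords)
```

The dot product of the two normalized histograms is the standard BoF similarity between the videos.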

Further, the BoF procedure has an additional drawback: it inherently leads to a global comparison of the similarity between two videos. Pooling descriptors from different locations in a video into a single histogram leads to a loss of spatio-temporal information. Two distinct gestures that contain similar movements performed in a different order will generate similar descriptors, though at different locations; their histograms under the BoF model will nevertheless be similar, and the gestures will receive the same class label under any classification based on these histograms. To overcome this drawback, in this paper we further extend the efficient kernels for videos to a spatio-temporal pyramid-based multiresolution model.

In the proposed method, which we call Multiresolution Match Kernels (MMK), we bring together the concepts of efficient match kernels [4] and spatio-temporal pyramids [5] for gesture video classification. We introduce a multiresolution spatio-temporal feature representation of a video, which entails a histogram-based similarity measure at multiple resolutions. By replacing the binary histogram match with a kernel function, we obtain a fine-grained, rich similarity measure. However, the computational cost of estimating kernels at multiple resolutions is high; we therefore derive a cost-effective formulation for MMK, motivated by [4], at multiple resolutions. We apply the MMK to classify American Sign Language (ASL) hand gestures obtained using a Microsoft Kinect. The MSRGesture3D dataset [6] has depth videos of hands performing gestures across users. We demonstrate how MMK improves recognition accuracies with increased pyramid resolution and show that it performs better than existing techniques.

The remainder of this paper is organized as follows. Section 2 discusses earlier related work, followed by a description of the proposed MMK method in Section 3. Section 4 presents our experimental results, followed by our conclusions in Section 5.

## 2 Related Work

Gesture recognition has been studied extensively for several decades, and has been applied in various fields ranging from social robotics to sign language recognition. Until recently, gesture recognition was mostly studied using color (RGB) or grayscale videos. The introduction of the Kinect has renewed interest in gesture recognition using color and depth (RGB-D) videos. Weinland et al. [2] provide a survey of gesture recognition techniques with a unique classification of the different procedures used. One of the foremost techniques for video feature representation, the BoF, was popularized by Laptev et al. [3], where Histogram of Oriented Gradients (HOG) and Histogram of Optical Flow (HOF) descriptors were estimated from local neighborhoods of spatio-temporal interest points in a video. A BoF procedure was then applied to estimate histogram vectors that represented the videos.

Grauman and Darrell [7] extended the standard BoF by introducing a pyramid match kernel for object recognition. The inclusion of spatial information into a pyramid-based BoF was formalized by Lazebnik et al. [5]. Spatio-temporal pyramids for videos have been implemented in [8, 9], which use temporal pyramid alignment and histogram intersection, respectively, to estimate video similarities. In this paper, we improve upon the pyramid similarity measure by using Multiresolution Match Kernels (MMK). We use the idea of efficient kernels [4] to overcome the coarse-binning shortcoming of the BoF by efficiently estimating kernel-based similarity between descriptors across videos. We now describe our method.

## 3 Multiresolution Match Kernels

In the BoF procedure, every gesture video is represented as a histogram. The data being binned is the set of feature descriptors from the video, and the bin centers are a set of dictionary vectors. Let $X=\{x_1,\ldots,x_p\}$ represent the feature descriptors in a video, and $V=\{v_1,\ldots,v_D\}$ be a set of dictionary vectors. A feature descriptor $x$ is quantized into a binary vector $\phi(x)=[\phi_1(x),\ldots,\phi_D(x)]^{T}$, where $\phi_i(x)$ is 1 if $v_i$ is the closest dictionary vector to $x$ and 0 otherwise (as in [4]). The histogram feature vector for $X$ is then given by $\bar{\phi}(X)=\frac{1}{|X|}\sum_{x\in X}\phi(x)$. A linear kernel measuring video similarity between two videos $X$ and $Y$ is a dot product over the histogram vectors, given by $K_B(X,Y)=\bar{\phi}(X)^{T}\bar{\phi}(Y)$. This is the standard BoF similarity measure for two videos.
$K_B(X,Y)$ can be expressed in terms of a positive definite function $\delta(x,y)$, where $\delta(x,y)$ is 1 if $x$ and $y$ are quantized to the same codeword, and 0 otherwise:
$$K_B(X,Y)=\frac{1}{|X||Y|}\sum_{x\in X}\sum_{y\in Y}\delta(x,y).$$
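The equivalence between the histogram dot product and the pairwise match function can be checked numerically. The following NumPy sketch (with illustrative sizes) verifies that the dot product of the normalized histograms equals the average of $\delta(x,y)$ over all descriptor pairs.

```python
import numpy as np

def quantize(descriptors, codewords):
    """Index of the nearest codeword for each descriptor."""
    d = np.linalg.norm(descriptors[:, None, :] - codewords[None, :, :], axis=2)
    return d.argmin(axis=1)

rng = np.random.default_rng(1)
codewords = rng.normal(size=(6, 8))
X, Y = rng.normal(size=(20, 8)), rng.normal(size=(30, 8))
ix, iy = quantize(X, codewords), quantize(Y, codewords)

# Histogram form: dot product of normalized BoF vectors.
hx = np.bincount(ix, minlength=6) / len(ix)
hy = np.bincount(iy, minlength=6) / len(iy)
hist_similarity = hx @ hy

# Match-function form: average of delta(x, y) over all descriptor pairs,
# where delta is 1 iff the two descriptors share a codeword.
delta_similarity = (ix[:, None] == iy[None, :]).mean()

assert np.isclose(hist_similarity, delta_similarity)
```

Both forms bin a descriptor to exactly one codeword, which is precisely the coarseness the kernel formulation below relaxes.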

As mentioned earlier, the BoF representation of $X$ does not retain information about the spatio-temporal locations of the original feature descriptors. To overcome this drawback, we propose a multiresolution spatio-temporal pyramid representation for videos. A video is represented as a single cuboid of dimensions $width \times height \times frames$. We split the video cuboid into smaller cuboids called voxels. At level 0 the video is made up of only 1 voxel. At level 1 the video is split in half along each dimension, giving 8 voxels. At level $l$ the video has $8^{l}$ voxels. A spatio-temporal pyramid at level $L$ consists of the sequence of all the voxels generated from levels $0,\ldots,L$. The number of voxels for a spatio-temporal pyramid is therefore $P=\sum_{l=0}^{L}8^{l}$. Figure 1 represents video voxels for the first few levels; the dots refer to the interest points that fall within different voxels at different levels. We represent the descriptors for a given level as an ordered partition over the descriptor set in the video. An ordered partition for level $l$ is represented as $X^{l}=\{X^{l,1},\ldots,X^{l,m_l}\}$, where $m_l=8^{l}$ is the number of partitions (voxels) at level $l$, with $\bigcup_{m}X^{l,m}=X$. For $x\in X^{l,m}$, where $X^{l,m}$ is the set of descriptors at level $l$ and voxel $m$, the binary vector representation is $\phi(x)$; for the sake of simplicity, we deploy the same dictionary $V$ for quantization at every level. The histogram of descriptors for voxel $(l,m)$ is represented as $\bar{\phi}(X^{l,m})=\frac{1}{|X^{l,m}|}\sum_{x\in X^{l,m}}\phi(x)$. The similarity measure between two corresponding voxels (both labeled $(l,m)$) from two videos $X$ and $Y$ is given as $\bar{\phi}(X^{l,m})^{T}\bar{\phi}(Y^{l,m})$. For a given level, the multiresolution BoF similarity between two videos $X$ and $Y$ is the sum of the similarities of histogram vectors from corresponding voxels in $X$ and $Y$. A linear kernel measuring the multiresolution BoF similarity between two videos is now the sum over multiple dot products spanning the corresponding voxels at all the levels. It is given by:

$$K_B(X,Y)=\sum_{l=0}^{L}\sum_{m=1}^{8^{l}}\bar{\phi}(X^{l,m})^{T}\bar{\phi}(Y^{l,m})=\sum_{l=0}^{L}\sum_{m=1}^{8^{l}}\frac{1}{|X^{l,m}||Y^{l,m}|}\sum_{x\in X^{l,m}}\sum_{y\in Y^{l,m}}\delta(x,y)\qquad(1)$$

Unlike in the standard BoF, the kernel in Equation (1) measures the similarity between two videos at multiple resolutions, maintaining locality in the spatial and temporal dimensions. The greater the number of levels, the finer the comparison between the videos. When $L=0$, Equation (1) reduces to the standard BoF similarity measure for videos described at the beginning of this section.
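A sketch of how interest points might be grouped into voxels under this scheme, assuming each spatio-temporal dimension is halved at every level ($2\times2\times2=8$ children per level); the coordinate ranges here are illustrative.

```python
import numpy as np

def voxel_index(points, extents, level):
    """Map interest points (x, y, t) to voxel ids at a pyramid level.
    Each dimension is split into 2**level equal parts, giving 8**level voxels."""
    splits = 2 ** level
    # Scale each coordinate into [0, splits) and clip the upper boundary case.
    cells = np.minimum((points / extents * splits).astype(int), splits - 1)
    return cells[:, 0] * splits * splits + cells[:, 1] * splits + cells[:, 2]

rng = np.random.default_rng(2)
extents = np.array([320.0, 240.0, 100.0])      # video width, height, frame count
points = rng.uniform(0, extents, size=(200, 3))
for level in range(3):
    ids = voxel_index(points, extents, level)
    assert ids.min() >= 0 and ids.max() < 8 ** level
```

Per-voxel histograms are then built exactly as in the flat BoF case, but only over the descriptors whose ids match that voxel.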

We now introduce the Multiresolution Match Kernels. The dot product in Equation (1) can be expressed through the positive definite function $\delta(x,y)$, as discussed previously. Since $\delta(x,y)$ is a coarsely quantized estimate of the similarity between $x$ and $y$, we replace it with a finite-dimensional kernel function $k(x,y)$ to estimate the similarity between $x$ and $y$ [4]. This modifies Equation (1) to:

$$K_M(X,Y)=\sum_{l=0}^{L}\sum_{m=1}^{8^{l}}\frac{1}{|X^{l,m}||Y^{l,m}|}\sum_{x\in X^{l,m}}\sum_{y\in Y^{l,m}}k(x,y)\qquad(2)$$

Equation (2) can be viewed as a sum of kernels for the same descriptors over multiple levels.
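A direct evaluation of Equation (2) can be sketched as follows; the RBF kernel and voxel grouping are illustrative. Note the nested loops over descriptor pairs, which is exactly the quadratic cost that motivates the feature-map formulation of [4].

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """Gaussian RBF kernel between two descriptors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def mmk_naive(voxels_x, voxels_y, kernel=rbf):
    """Sum of normalized pairwise kernel evaluations over corresponding
    voxels, as in Equation (2); quadratic in the number of descriptors."""
    total = 0.0
    for X, Y in zip(voxels_x, voxels_y):   # corresponding voxels, all levels
        if len(X) == 0 or len(Y) == 0:
            continue                        # an empty voxel contributes nothing
        pair_sum = sum(kernel(x, y) for x in X for y in Y)
        total += pair_sum / (len(X) * len(Y))
    return total

# Toy usage: three corresponding voxels per video, 3-D descriptors.
rng = np.random.default_rng(5)
vox_x = [rng.normal(size=(4, 3)), rng.normal(size=(0, 3)), rng.normal(size=(6, 3))]
vox_y = [rng.normal(size=(5, 3)), rng.normal(size=(2, 3)), rng.normal(size=(6, 3))]
s_xy = mmk_naive(vox_x, vox_y)
```

The kernel is symmetric by construction, so `mmk_naive(vox_x, vox_y)` equals `mmk_naive(vox_y, vox_x)`.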

Let us define $K^{l,m}(X,Y)=\frac{1}{|X^{l,m}||Y^{l,m}|}\sum_{x\in X^{l,m}}\sum_{y\in Y^{l,m}}k(x,y)$. Equation (2) can then be written as $K_M(X,Y)=\sum_{l=0}^{L}\sum_{m=1}^{8^{l}}w_{l}K^{l,m}(X,Y)$, where $w_{l}$ is a weight for the kernel at level $l$. We intend to give more weight to kernel similarities at finer resolutions (higher levels). The kernel $K_M$ can be viewed as a linear combination of kernel functions across levels and voxels, which we call a Multiresolution Match Kernel (MMK). The computational cost of estimating MMKs through Equation (2) is prohibitively high, viz. $O(p^{2}d)$ to estimate the kernel similarity between two videos and $O(n^{2}p^{2}d)$ to estimate the kernel Gram matrix for $n$ videos, where $p$ is the average number of descriptors in a video and $d$ is the dimensionality of the descriptors. To overcome this computational overhead we estimate the feature mapping induced by the finite-dimensional kernel and work in the mapped feature space, as in [4]. We now proceed to estimate the feature representations induced by the kernel $K_M$.

Gönen and Alpaydın stated in [10] that a linear combination of kernels corresponds directly to a concatenation in the feature space. Therefore, the MMK induces a feature space that is a concatenation of the feature spaces induced by the individual kernels $K^{l,m}$. That is, if each $K^{l,m}$ induces a feature space of dimension $D$, the feature space induced by the MMK is their concatenation, of dimension $PD$. We now need to estimate the feature map induced by the kernel $K^{l,m}$. This involves the estimation of the feature map $\tilde{\phi}(x)$ induced by $k(x,y)$. We estimate $\tilde{\phi}$ as in [4]. Given $\tilde{\phi}$, the voxel feature map $\bar{\phi}(X^{l,m})$ is estimated as the mean of all the $\tilde{\phi}(x)$ that belong to voxel $(l,m)$. $K^{l,m}(X,Y)$ is then equal to $\bar{\phi}(X^{l,m})^{T}\bar{\phi}(Y^{l,m})$, with $\bar{\phi}(X^{l,m})=\frac{1}{|X^{l,m}|}\sum_{x\in X^{l,m}}\tilde{\phi}(x)$.
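The correspondence between a weighted sum of kernels and concatenation in feature space can be verified directly: scaling each block by $\sqrt{w_l}$ before concatenating makes a single dot product reproduce the weighted sum. Sizes and weights below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
# Per-voxel feature maps for two videos, plus per-level weights.
maps_x = [rng.normal(size=5) for _ in range(3)]
maps_y = [rng.normal(size=5) for _ in range(3)]
weights = [1.0, 2.0, 4.0]   # heavier weights at finer levels

# Weighted sum of per-voxel linear kernels ...
kernel_sum = sum(w * (fx @ fy) for w, fx, fy in zip(weights, maps_x, maps_y))

# ... equals one dot product in the concatenated feature space, provided
# each block is scaled by sqrt(w) before concatenation.
concat = lambda maps: np.concatenate(
    [np.sqrt(w) * f for w, f in zip(weights, maps)])
assert np.isclose(concat(maps_x) @ concat(maps_y), kernel_sum)
```

This is why a linear classifier on the concatenated maps is equivalent to a kernel machine using the MMK.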

In estimating $\tilde{\phi}$, we determine a set of basis vectors (a dictionary) that spans the space of all descriptors $x$. We denote this by a set of $D$ vectors, $Z=\{z_1,\ldots,z_D\}$. We estimate $Z$ using K-means clustering with $D$ centers. The feature map is given by $\tilde{\phi}(x)=G\,k_Z(x)$, where $k_Z(x)=[k(x,z_1),\ldots,k(x,z_D)]^{T}$ is a vector of kernel evaluations against the basis, and $G$ is a $D\times D$ matrix satisfying $G^{T}G=K_{ZZ}^{-1}$, $K_{ZZ}$ being the matrix with entries $k(z_i,z_j)$ for some positive definite kernel $k$ (for our experiments we use an RBF kernel). The feature map for voxel $(l,m)$ is then $\bar{\phi}(X^{l,m})=\frac{1}{|X^{l,m}|}\sum_{x\in X^{l,m}}\tilde{\phi}(x)$. Please refer to [4] for more details. The computational complexity of estimating the feature maps is now linear in the number of descriptors. The full representation $\Phi(X)$ is obtained by concatenating the $\bar{\phi}(X^{l,m})$ over all voxels; $\Phi(X)$ is the new multiresolution spatio-temporal representation of a video $X$. If the dictionary length is $D$ and the pyramid is at level $L$, $\Phi(X)$ is of dimension $D\sum_{l=0}^{L}8^{l}$ (e.g., for a dictionary of length 40 and spatio-temporal pyramid level 1, the total length of the feature map for a MMK is $40\times(1+8)=360$ - a level-1 pyramid includes levels 0 and 1). This is a significant improvement in making the MMKs practical and efficient.
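The construction of $\tilde{\phi}$ can be sketched as follows, assuming an RBF kernel and random stand-ins for the K-means basis vectors. The key property is that $\tilde{\phi}(x)^{T}\tilde{\phi}(y)$ approximates $k(x,y)$, and does so exactly when $x$ and $y$ are basis vectors.

```python
import numpy as np

def rbf_matrix(A, B, gamma=0.5):
    """Matrix of RBF kernel evaluations between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(4)
Z = rng.normal(size=(10, 4))                      # basis vectors (stand-in for K-means centers)
K_zz = rbf_matrix(Z, Z) + 1e-9 * np.eye(len(Z))   # small jitter keeps the Cholesky stable
G = np.linalg.inv(np.linalg.cholesky(K_zz))       # G^T G = K_zz^{-1}

def feature_map(x):
    """Finite-dimensional map: phi(x) = G k_Z(x), so that
    phi(x)^T phi(y) approximates k(x, y)."""
    return G @ rbf_matrix(x[None, :], Z)[0]

# The approximation is exact on the basis vectors themselves.
approx = np.array([[feature_map(zi) @ feature_map(zj) for zj in Z] for zi in Z])
assert np.allclose(approx, rbf_matrix(Z, Z), atol=1e-6)
```

A voxel's map is then the mean of `feature_map` over its descriptors, and the video representation is the concatenation of these per-voxel means.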

## 4 Experiments and Results

The MSRGesture3D dataset [6] consists of depth gesture videos depicting letters and words from the American Sign Language, viz. *bathroom, blue, finish, green, hungry, milk, past, pig, store, where, j, z*. It is performed by 10 users and has 336 examples. Figure 2 depicts a snapshot from an example of each gesture across different users. In this work, we extract interest points using SIFT [1] from every frame in the depth video. Every frame is a grayscale image where the pixel value denotes the depth at that pixel. The $(x, y, t)$ coordinates of these points are used to group them into voxels ($x, y$ are pixel coordinates, $t$ is the frame number). For this study we experiment with only 3 levels for the spatio-temporal pyramid. We apply MMK upon three kinds of feature descriptors (SIFT itself, Local Binary Patterns (LBP) and HOG) extracted at the aforementioned interest points to validate the performance of the proposed method.

SIFT descriptors were estimated using a window of $16\times16$ pixels around each interest point; each interest point yielded a 128-dimensional descriptor (as in [1]). LBP features [11] were estimated from the same $16\times16$ pixel window around each interest point. For each pixel in the window, an 8-bit LBP binary vector was estimated, and the decimal values of the binary vectors were binned into a histogram with 256 bins, resulting in a 256-dimensional descriptor for each interest point. In the case of HOG, for each of the pixels in the $16\times16$ window around the interest point, the gradient orientations were estimated and binned into an orientation histogram. We apply the soft binning strategy from [12] to distribute the orientations over the bins. The feature descriptor for each interest point is the resulting vector of histogram components.
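As an illustration of the LBP computation, the following sketch assumes the standard 8-neighbor, 8-bit pattern with a 256-bin histogram; the exact neighborhood and binning used in the experiments may differ.

```python
import numpy as np

def lbp_histogram(patch):
    """8-bit LBP over a grayscale patch: each interior pixel is compared with
    its 8 neighbors, the comparison bits form a code in 0-255, and the codes
    are binned into a normalized 256-bin histogram (the patch descriptor)."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = patch[1:-1, 1:-1]
    codes = np.zeros_like(center, dtype=np.int32)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = patch[1 + dy:patch.shape[0] - 1 + dy,
                         1 + dx:patch.shape[1] - 1 + dx]
        codes |= (neighbor >= center).astype(np.int32) << bit
    hist = np.bincount(codes.ravel(), minlength=256)
    return hist / hist.sum()
```

On a constant patch every neighbor comparison succeeds, so all codes are 255 and the histogram collapses into a single bin.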

For each kind of descriptor, we estimated the dictionary of codewords using K-means clustering. We empirically studied MMK with different dictionary sizes and selected the best-performing one (see Figure 3). Feature maps for level $l$ were weighted with $w_l$, giving higher weight to matches at finer resolutions. For the full three-level spatio-temporal pyramid (which includes levels 0 and 1 as well as level 2), the feature map $\Phi(X)$ is the concatenation over all $\sum_{l=0}^{2}8^{l}=73$ voxels. We use the LIBLINEAR classifier (as used by Bo et al. [4]) upon these features for final gesture classification. Table 1 shows the results of our experiments. Here, BoF is the standard Bag-of-Features method, while MMK-1, MMK-2 and MMK-3 are MMK with one, two and three pyramid levels, respectively. These results were obtained using subject-independent cross-validation, where a subset of the subjects was used for training and the remaining subjects for testing, and were averaged over multiple independent runs to address randomness bias.

It is evident that the proposed MMK method outperforms BoF in these experiments. The results also show that accuracies increase at higher resolutions of matching, corroborating the need for multiresolution match kernels. The confusion matrix for SIFT and MMK-3, shown in Figure 4, indicates that the gesture for *Milk* was often misclassified. We believe this is because the gesture involves a clenched fist, whose descriptors are similar to descriptors from other gestures (like *letter-J*). We will investigate ways to overcome this challenge in future work. Existing state-of-the-art methods on the MSRGesture3D dataset [6, 13] report their accuracies using a leave-one-subject-out approach. We therefore also performed a study based on the leave-one-subject-out approach using the proposed MMK-3 method and obtained accuracies with all three descriptors (SIFT, LBP and HOG) that support our inference that the proposed method holds great promise for video-based classification tasks.

| | BoF | MMK-1 | MMK-2 | MMK-3 |
|---|---|---|---|---|
| SIFT | 71.28% | 72.3% | 87.56% | 91.66% |
| LBP | 66.15% | 68.07% | 85.64% | 90.51% |
| HOG | 74.10% | 58.33% | 74.87% | 88.85% |

**Table 1:** Recognition accuracies of BoF and MMK with three feature descriptors.

## 5 Conclusions

We have proposed a Multiresolution Match Kernel framework for efficient and accurate estimation of video gesture similarity. Our method addresses two major drawbacks of the widely used Bag-of-Features method, viz., coarse similarity matching and the loss of spatio-temporal information. Our experiments on the MSRGesture3D dataset showed that MMK performs significantly better than BoF across feature descriptors: SIFT, LBP and HOG. MMK provides a discriminative multiresolution spatio-temporal feature representation for videos that share similar motion information but differ in the sequence of motion, a case where a standard BoF procedure performs poorly. In future work, we plan to extend this method to other video-based classification tasks to study the generalizability of the approach.

## 6 Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 1116360. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

## References

- [1] D. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, vol. 60, no. 2, pp. 91–110, 2004.
- [2] D. Weinland, R. Ronfard, and E. Boyer, “A survey of vision-based methods for action representation, segmentation and recognition,” Computer Vision and Image Understanding, vol. 115, no. 2, pp. 224–241, 2011.
- [3] I. Laptev, “On space-time interest points,” IJCV, vol. 64, no. 2, pp. 107–123, 2005.
- [4] L. Bo and C. Sminchisescu, “Efficient match kernel between sets of features for visual recognition,” NIPS-09, vol. 2, no. 3, 2009.
- [5] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in IEEE CVPR, 2006, vol. 2, pp. 2169–2178.
- [6] A. Kurakin, Z. Zhang, and Z. Liu, “A real time system for dynamic hand gesture recognition with a depth sensor,” in EUSIPCO, 2012.
- [7] K. Grauman and T. Darrell, “The pyramid match kernel: Discriminative classification with sets of image features,” in IEEE ICCV, 2005, vol. 2, pp. 1458–1465.
- [8] D. Xu and S. Chang, “Visual event recognition in news video using kernel methods with multi-level temporal alignment,” in IEEE CVPR, 2007, pp. 1–8.
- [9] J. Choi, W. Jeon, and S. Lee, “Spatio-temporal pyramid matching for sports videos,” in Proceedings of the 1st ACM international conference on Multimedia information retrieval, 2008, pp. 291–297.
- [10] M. Gönen and E. Alpaydın, “Multiple kernel learning algorithms,” JMLR, vol. 12, pp. 2211–2268, 2011.
- [11] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on PAMI, vol. 24, no. 7, pp. 971–987, 2002.
- [12] L. Bo, X. Ren, and D. Fox, “Kernel descriptors for visual recognition,” NIPS-10, vol. 7, 2010.
- [13] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, “Robust 3d action recognition with random occupancy patterns,” Computer Vision-ECCV, pp. 872–885, 2012.
