Zero-shot Motion Pattern Recognition from 4D Point-clouds

Introduction

In this post, we discuss recognizing motion patterns from spatial data in a zero-shot learning scenario. We use a Siamese network architecture to recognize unseen classes of motion patterns, and we adopt a graph-based approach to achieve permutation invariance and encode moving point-clouds into a latent representation in a computationally efficient way.

Point-clouds are acquired from mmWave radars, RGB-D cameras, or LiDARs. They are unstructured sets of points, often in three-dimensional space. Unlike pixels in images, the points in a point-cloud can be permuted without changing the underlying semantic representation, which poses challenges for processing them. In this work, we process point-clouds acquired from mmWave radars to recognize unseen classes of gestures.
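To make this concrete, here is a minimal numpy sketch (illustrative only, not from the paper) showing that a symmetric aggregation such as max-pooling over the points is unaffected by their ordering:

import numpy as np

# A toy point cloud: N points in 3-D space (rows are points).
cloud = np.array([[0.0, 1.0, 2.0],
                  [3.0, 4.0, 5.0],
                  [6.0, 7.0, 8.0]])

# Shuffling the rows yields the same point cloud, semantically.
permuted = cloud[[2, 0, 1]]

# A symmetric (permutation-invariant) aggregation, e.g. max-pooling
# over the point dimension, produces identical outputs for both orderings.
assert np.allclose(cloud.max(axis=0), permuted.max(axis=0))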

Siamese Network

A Siamese neural network is an architecture with multiple identical encoder components g_i: X \rightarrow Y, i \in \{1, 2, \dots\}, each of which computes a latent-space representation r_i for an input sample x_i. The encoders share the exact same architecture and weights, which are updated simultaneously during the training phase.


[Figure: overall structure of a Siamese network]

The overall structure of a Siamese network is shown in the figure above. Since a Siamese network is trained to separate dissimilar pairs in the latent space, it learns discriminative features that can be exploited to realize zero-shot learning: the output is a similarity metric, which can also be computed for pairs of samples from a distribution unknown to the model. Contrastive loss and triplet loss are the loss functions most often used to train Siamese networks.
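As an illustration, the forward pass of a Siamese network can be sketched in a few lines of PyTorch; the encoder g is left abstract here, and all names are illustrative rather than taken from the paper:

import torch
import torch.nn as nn

class SiameseNetwork(nn.Module):
    """Both inputs pass through the same encoder g (shared weights)."""

    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # g: X -> Y, shared across both branches

    def forward(self, x_i, x_j):
        r_i = self.encoder(x_i)  # latent representation of sample i
        r_j = self.encoder(x_j)  # latent representation of sample j
        # Similarity metric: Euclidean distance in the latent space.
        return torch.norm(r_i - r_j, p=2, dim=-1)

Because both branches reference the same encoder module, the weights are shared and updated together, exactly as described above.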

Contrastive Loss

Contrastive loss \mathcal{L}_{c} is a pairwise loss based on the Euclidean distance between encoded representations.

\mathcal{L}_{c} = y \cdot D(r_i, r_j) + (1 - y) \cdot \max\left(0, m - D(r_i, r_j)\right)

Here, y indicates whether gestures i and j belong to the same class (y=1) or not (y=0), and D(r_i, r_j) is the Euclidean distance between the encoded representations of gestures i and j. The margin m prevents the trivial solution \mathcal{L}_c = 0 in which the encoder collapses the hidden representations of all gestures onto a single point.
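Assuming batched representations, the formula above translates directly into PyTorch (the default margin m = 1.0 is a placeholder, not the value used in the paper):

import torch

def contrastive_loss(r_i, r_j, y, m=1.0):
    """y = 1 for same-class pairs, y = 0 for different-class pairs."""
    d = torch.norm(r_i - r_j, p=2, dim=-1)  # D(r_i, r_j)
    # Pull same-class pairs together; push different-class pairs
    # at least a margin m apart.
    return (y * d + (1 - y) * torch.clamp(m - d, min=0)).mean()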

Triplet Loss

Triplet loss minimizes the distance between an anchor gesture and a positive gesture (same class), while increasing the distance between the anchor and a negative gesture (different class).

\mathcal{L}_{t} = \max\left(0, D(r_a, r_p) - D(r_a, r_n) + m\right)

r_a is the encoded anchor gesture, r_p is the encoded positive gesture, r_n is the encoded negative gesture, and m is the margin to prevent a trivial solution.
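The corresponding PyTorch sketch mirrors the contrastive loss above (again, the default margin is a placeholder):

import torch

def triplet_loss(r_a, r_p, r_n, m=1.0):
    d_ap = torch.norm(r_a - r_p, p=2, dim=-1)  # anchor-positive distance
    d_an = torch.norm(r_a - r_n, p=2, dim=-1)  # anchor-negative distance
    # Penalize triplets where the negative is not at least m farther
    # from the anchor than the positive.
    return torch.clamp(d_ap - d_an + m, min=0).mean()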

Point-cloud Encoder

To encode the moving point-cloud into a latent space, we take advantage of a Message Passing Neural Network (MPNN). First, we apply a temporal k-NN algorithm to build a graph from the input point-cloud: each point is connected to its nearest neighbors in the subsequent time frames. Second, we compute edge features from the endpoints of each edge. Finally, the updated representation of each node is calculated by applying a pooling function over the features of the edges incident to that node. A schematic view of the method and the architecture of the model are shown in the figure below.
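The sketch below illustrates the three stages under simplifying assumptions: frames are indexed by consecutive integers, the edge features follow an EdgeConv-style construction (point feature plus relative displacement), and the learned per-edge MLP of the full model is omitted for brevity. All function names are hypothetical:

import torch

def temporal_knn_edges(points, frames, k=3):
    """Connect each point to its k nearest neighbors in the next frame.

    points: (N, 3) coordinates; frames: (N,) frame index per point.
    Returns a (2, E) edge index tensor.
    """
    src, dst = [], []
    for t in frames.unique()[:-1]:
        cur = (frames == t).nonzero(as_tuple=True)[0]
        nxt = (frames == t + 1).nonzero(as_tuple=True)[0]
        if len(cur) == 0 or len(nxt) == 0:
            continue
        d = torch.cdist(points[cur], points[nxt])  # pairwise distances
        nn_idx = d.topk(min(k, len(nxt)), largest=False).indices
        for i, row in zip(cur.tolist(), nn_idx):
            for j in row.tolist():
                src.append(i)
                dst.append(nxt[j].item())
    return torch.tensor([src, dst])

def message_passing(x, edge_index):
    """Build edge features, then max-pool them onto the receiving nodes."""
    src, dst = edge_index
    # EdgeConv-style edge feature: source feature and relative offset.
    # The full model would pass this through a learned MLP first.
    edge_feat = torch.cat([x[src], x[dst] - x[src]], dim=-1)
    out = torch.full((x.size(0), edge_feat.size(-1)), float('-inf'))
    for e, node in enumerate(dst.tolist()):
        out[node] = torch.maximum(out[node], edge_feat[e])
    out[out == float('-inf')] = 0.0  # nodes with no incident edges
    return out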

[Figure: schematic view of the method and the model architecture]

Results

In this section, we discuss the results of our work. The figure below on the left shows results for different loss functions combined with different pair selection strategies; triplet loss with the online pair selection approach performs best. The figure on the right shows the effect of different margin values: a margin of 10 gives the best results for triplet loss with the online pair selection mechanism.

[Figure: accuracy for different loss functions and pair selection strategies]
[Figure: effect of different margin values]

Moreover, the figure below on the left shows the effect of the number of unseen classes on the accuracy of the model: as the number of unseen classes increases, the accuracy decreases. On the right, we compare our model to a baseline and to the Pantomime model; our proposed method outperforms the state of the art by a considerable margin.

[Figure: accuracy as a function of the number of unseen classes]
[Figure: comparison with the baseline and the Pantomime model]

Conclusion

We have introduced our work on a neural network architecture for zero-shot motion recognition from 4D point-clouds, published at MLSP 2021 (full publication). We showed that our model outperforms the state of the art by a considerable margin in terms of average accuracy on unseen classes of gestures. We also examined different loss functions, training schemes, unseen-class scenarios, and margin values through a set of extensive experiments.


Dariush Salami

I am a Ph.D. candidate in networking and machine learning at Aalto University and an Early Stage Researcher (ESR) in the ITN WindMill project. I graduated from the Amirkabir University of Technology in Artificial Intelligence (AI); my master's thesis was on fraud detection in the communication industry. I am now working on Radio Frequency (RF) sensing using mmWave FMCW radars. Machine Learning (ML), graph-based neural networks, machine vision, and Natural Language Processing (NLP) are some of the topics I am interested in.