Graph Neural Networks for Joint Action Recognition, Prediction and Motion Forecasting for Industrial Human-Robot Collaboration

Conference: ISR Europe 2023 - 56th International Symposium on Robotics
09/26/2023 - 09/27/2023 at Stuttgart, Germany

Proceedings: ISR Europe 2023

Pages: 8
Language: English
Type: PDF

Authors:
Lagamtzis, Dimitrios; Schmidt, Fabian; Dang, Thao; Schober, Steffen (University of Applied Sciences, Esslingen, Germany)
Seyler, Jan (Festo SE & Co. KG, Esslingen, Germany)

Abstract:
Reasoning about human intentions, in particular human action recognition, action prediction, and motion forecasting, is becoming increasingly successful, especially with the use of graphs. We want to transfer this success to industrial Human-Robot Collaboration (HRC), in which humans and robots work closely together and interact within defined workspaces using workpieces. It is therefore essential to use all the information that can be extracted from a workspace and to represent it with a structure that is natural and suitable for learning, such as a graph. These graphs need to be constructed in a human-centered manner, as is the case in HRC, and must also contain real-world 3D information and object labels to describe the environment. Reasoning about the human's future motion and understanding which actions the human is performing and is going to perform are strongly correlated, and together they are the key to understanding the human's intention. Therefore, we present a novel Graph Neural Network (GNN) architecture that combines human action recognition, prediction, and motion forecasting for industrial HRC environments. We evaluate our method on the publicly available Collaborative Action dataset, which contains particularly realistic representations of industrial HRC, and compare the results to baseline methods for classifying the human action (recognition and prediction) and for forecasting the human motion. Our experiments show that it is possible to jointly train multiple machine learning problems with a single encoder that learns a rich latent space representation. Moreover, we achieve results comparable to previous work, with the advantage of needing only a single model instead of one for each machine learning problem. In addition, it is now possible to predict actions instead of only recognizing the one currently occurring, as in previous work.
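
The idea of a single graph encoder feeding several task outputs can be illustrated with a minimal sketch. This is not the architecture from the paper, whose details are not given in the abstract. It assumes one round of mean-aggregation message passing with a tanh nonlinearity, mean pooling to a latent vector, three linear heads (recognition, prediction, forecasting), and an equally weighted sum of two cross-entropy losses and one mean squared error. All names (`encode`, `joint_loss`, `heads`), dimensions, and the toy scene are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def encode(node_feats, edges, W_self, W_nbr):
    """Shared encoder (assumed form): one message-passing round, then mean pooling."""
    n, dim = len(node_feats), len(node_feats[0])
    nbrs = [[] for _ in range(n)]
    for i, j in edges:  # treat edges as undirected
        nbrs[i].append(j)
        nbrs[j].append(i)
    hidden = []
    for i in range(n):
        count = max(len(nbrs[i]), 1)
        # Mean of neighbour features (zero vector for isolated nodes).
        agg = [sum(node_feats[j][k] for j in nbrs[i]) / count for k in range(dim)]
        own, msg = matvec(W_self, node_feats[i]), matvec(W_nbr, agg)
        hidden.append([math.tanh(a + b) for a, b in zip(own, msg)])
    # Graph-level latent vector shared by all task heads.
    return [sum(h[k] for h in hidden) / n for k in range(len(hidden[0]))]

def joint_loss(latent, heads, targets, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses computed from one latent vector."""
    p_now = softmax(matvec(heads["recognition"], latent))
    p_next = softmax(matvec(heads["prediction"], latent))
    motion = matvec(heads["forecast"], latent)
    l_rec = -math.log(p_now[targets["current_action"]])
    l_pred = -math.log(p_next[targets["next_action"]])
    l_mot = sum((m - t) ** 2 for m, t in zip(motion, targets["future_motion"])) / len(motion)
    return weights[0] * l_rec + weights[1] * l_pred + weights[2] * l_mot

rng = random.Random(0)
# Toy scene (hypothetical): node 0 is the human hand, nodes 1 and 2 are objects;
# node features are 3D positions.
nodes = [[0.1, 0.2, 0.9], [0.4, 0.2, 0.8], [0.7, 0.5, 0.8]]
edges = [(0, 1), (0, 2)]  # human-centered: every object connects to the human node
HIDDEN, N_ACTIONS, MOTION_DIM = 8, 4, 3
W_self, W_nbr = rand_matrix(HIDDEN, 3, rng), rand_matrix(HIDDEN, 3, rng)
heads = {
    "recognition": rand_matrix(N_ACTIONS, HIDDEN, rng),
    "prediction": rand_matrix(N_ACTIONS, HIDDEN, rng),
    "forecast": rand_matrix(MOTION_DIM, HIDDEN, rng),
}
targets = {"current_action": 1, "next_action": 2, "future_motion": [0.2, 0.2, 0.9]}

latent = encode(nodes, edges, W_self, W_nbr)
loss = joint_loss(latent, heads, targets)
print(len(latent), loss)
```

In a trainable version, gradients of this combined loss would flow through all three heads into the same encoder weights, which is what allows a single model to serve all three problems. The equal loss weights used here are an arbitrary choice for illustration.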