Bringing Online Egocentric Action Recognition into the wild

Politecnico di Torino
IEEE Robotics and Automation Letters (RA-L)
*Indicates equal contribution
Contacts: name.surname@polito.it

Abstract

To enable safe and effective human-robot cooperation, it is crucial to develop models that identify human activities. Egocentric vision is a viable way to address this problem, and many works therefore propose deep learning solutions that infer human actions from first-person videos. Although very promising, most of these works do not consider the major challenges that come with realistic deployment, such as the portability of the model, the need for real-time inference, and robustness to novel domains (i.e., new spaces, users, tasks). In this paper, we set the boundaries that egocentric vision models should respect in realistic applications, defining a novel setting of egocentric action recognition in the wild that encourages researchers to develop novel, application-aware solutions. We also present a new model-agnostic technique that enables the rapid repurposing of existing architectures in this new context, demonstrating the feasibility of deploying a model on a tiny device (Jetson Nano) and of performing the task directly on the edge with very low energy consumption (2.4 W on average at 50 fps).

Results

Frames per second (FPS) processed with the I3D model on different devices. The orange areas show where traditional action recognition models struggle to run online inference on edge devices, because of either latency or hardware constraints. Our goal is to promote research toward models that operate in the green area, so that egocentric models can run online inference on tiny devices.

Quantitative analysis of the effects of using action recognition models on different devices.

Illustration of the streaming inference scenario. $TW_{T_{s}}$ represents a temporal window sliding along the video with stride 1. At each time step, a clip of $T_{s}$ contiguous frames is fed into the network, which comprises a feature extractor $F$ and a classifier $C$ with $n$ classes ($C_{1}, C_{2}, \ldots, C_{n}$). $A$ represents the aggregator that, at each step, updates the output of the network by taking into consideration both the current output and the previous ones. $R$ stands for aggregator cleaning, which is triggered by the sample's last frame.
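The streaming loop described in this caption can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the caption does not specify how $A$ combines outputs, so a running mean of class scores is assumed here, and `RunningMeanAggregator`, `stream_inference`, and `toy_network` are hypothetical names introduced for the example. Frames are represented as plain floats so that the sketch runs without any deep learning library.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple


class RunningMeanAggregator:
    """Hypothetical aggregator A: running mean of per-clip class scores (an assumption)."""

    def __init__(self, n_classes: int) -> None:
        self.n_classes = n_classes
        self.reset()

    def reset(self) -> None:
        """R: clean the aggregator state."""
        self.total = [0.0] * self.n_classes
        self.count = 0

    def update(self, scores: Sequence[float]) -> List[float]:
        """Fold the current output into the previous ones and return the mean."""
        self.count += 1
        self.total = [t + s for t, s in zip(self.total, scores)]
        return [t / self.count for t in self.total]


def stream_inference(
    frames: Iterable[float],
    is_last_frame: Iterable[bool],
    network: Callable[[List[float]], Sequence[float]],
    aggregator: RunningMeanAggregator,
    clip_len: int,
) -> Iterator[Tuple[int, List[float]]]:
    """Slide a window of clip_len frames with stride 1 and aggregate the outputs."""
    window: deque = deque(maxlen=clip_len)  # TW_{T_s}: oldest frame drops out
    for frame, last in zip(frames, is_last_frame):
        window.append(frame)
        if len(window) == clip_len:  # wait until the first full clip is available
            scores = aggregator.update(network(list(window)))
            predicted = max(range(len(scores)), key=scores.__getitem__)
            yield predicted, scores
        if last:  # the sample's last frame triggers R
            aggregator.reset()


def toy_network(clip: List[float]) -> List[float]:
    """Stand-in for F followed by C with n = 2 classes (purely illustrative)."""
    mean = sum(clip) / len(clip)
    return [1.0 - mean, mean]


aggregator = RunningMeanAggregator(n_classes=2)
frames = [0.0, 0.0, 1.0, 1.0, 1.0]
is_last = [False, False, False, False, True]
results = list(stream_inference(frames, is_last, toy_network, aggregator, clip_len=2))
predictions = [p for p, _ in results]
```

In this toy run the aggregated prediction switches class only after several clips support the new class, which is the smoothing effect an aggregator of this kind is meant to provide; the reset on the last frame keeps scores from one sample from leaking into the next.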

Illustration of the proposed two-fold aggregator ($A^{2}$) method. The two aggregators work asynchronously. $\delta$ is a parameter that guarantees this asynchrony: it indicates the frame delay of the DBL activation of one aggregator when the other one detects an anomaly.

BibTeX

@article{goletto2023bringing,
  title={Bringing Online Egocentric Action Recognition into the wild},
  author={Goletto, Gabriele and Planamente, Mirco and Caputo, Barbara and Averta, Giuseppe},
  journal={IEEE Robotics and Automation Letters},
  year={2023},
  publisher={IEEE}
}