HiTIM: Hierarchical Task Information Mining for Few-Shot Action Recognition

by: Li Jiang, Jiahao Yu, Yuanjie Dang, Peng Chen, Ruohong Huan

Format: Article
Published: MDPI AG 2023-04-01

Description

Although existing few-shot action recognition methods have achieved impressive results, they suffer from two major shortcomings. (a) During feature extraction, few-shot tasks are not distinguished from one another and task-irrelevant features are obtained, resulting in the loss of task-specific, critical discriminative information. (b) During feature matching, information critical to the features within the task, i.e., self-information and mutual information, is ignored, so accuracy is degraded by redundant or irrelevant information. To overcome these two limitations, we propose a hierarchical task information mining (HiTIM) approach for few-shot action recognition that incorporates two key components: an inter-task learner ($\mathrm{K}^{\mathrm{inter}}$) and an attention-matching module with an intra-task learner ($\mathrm{K}^{\mathrm{intra}}$). The purpose of $\mathrm{K}^{\mathrm{inter}}$ is to learn the knowledge of different tasks and build a task-related feature space for obtaining task-specific features. The proposed matching module with $\mathrm{K}^{\mathrm{intra}}$ consists of two branches: spatiotemporal self-attention matching (STM) and correlated cross-attention matching (CM), which reinforce key spatiotemporal information in features and mine regions with strong correlations between features, respectively. The shared $\mathrm{K}^{\mathrm{intra}}$ can further optimize STM and CM. In our method, either a 2D convolutional neural network (CNN) or a 3D CNN can be used as the embedding network. In comparative experiments using the two different embeddings on the five-way one-shot and five-way five-shot tasks, the proposed method achieved recognition accuracy that outperformed other state-of-the-art (SOTA) few-shot action recognition methods on the HMDB51 dataset and was comparable to SOTA few-shot action recognition methods on the UCF101 and Kinetics datasets.
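
To make the two matching branches more concrete, the following is a minimal sketch of how a self-attention score and a cross-attention score between a query clip and a support clip might be combined in a few-shot episode. It is not the authors' implementation: the abstract gives no equations, so the plain scaled dot-product attention, the mean pooling, the cosine similarity, the equal 0.5 weighting, and every function name here (`attend`, `match_score`, `classify`) are assumptions chosen for illustration. The learned components (the inter-task and intra-task learners) are omitted entirely, and the toy feature vectors stand in for the output of a CNN embedding.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention: each query row becomes a weighted mix of value rows."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def mean_pool(rows):
    return [sum(r[d] for r in rows) / len(rows) for d in range(len(rows[0]))]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)

def match_score(query, support):
    """Average of a self-attention score and a cross-attention score (assumed weighting)."""
    # Self-attention branch (illustrative stand-in for STM):
    # each clip re-weights its own frame features before comparison.
    stm = cosine(mean_pool(attend(query, query, query)),
                 mean_pool(attend(support, support, support)))
    # Cross-attention branch (illustrative stand-in for CM):
    # each clip is re-expressed through the frames of the other clip.
    cm = cosine(mean_pool(attend(query, support, support)),
                mean_pool(attend(support, query, query)))
    return 0.5 * (stm + cm)

def classify(query, supports):
    """Return the best-scoring support label and all per-class scores."""
    scores = {label: match_score(query, feats) for label, feats in supports.items()}
    return max(scores, key=scores.get), scores

# Toy two-way one-shot episode: one support clip per class, two frames per clip.
supports = {
    "run":  [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "jump": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
}
query = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
label, scores = classify(query, supports)
print(label, scores)
```

A five-way episode would simply add more entries to `supports`; in the five-shot case one would also need some rule for aggregating several support clips per class, which this sketch does not address.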