MS-KARD: A Benchmark for Multimodal Karate Action Recognition

Authors: Yadav, S.K., Deshmukh, A., Gonela, R.V., Kera, S.B., Tiwari, K., Pandey, H.M. and Akbar, S.A.

Journal: Proceedings of the International Joint Conference on Neural Networks

Volume: 2022-July

ISBN: 9781728186719

DOI: 10.1109/IJCNN55064.2022.9892646

Abstract:

Classifying complex human motion sequences is a major research challenge in the domain of human activity recognition. Most popular datasets currently lack a specialized set of classes covering action sequences that are similar in terms of spatial trajectories. To recognize such complex action sequences with high inter-class similarity, such as those in karate, multiple streams are required. To fulfill this need, we propose MS-KARD, a Multi-Stream Karate Action Recognition Dataset that uses multiple vision perspectives as well as sensor data (accelerometer and gyroscope). It includes 1,518 video clips along with their corresponding sensor data. Each video was shot at 30 fps and lasts around one minute, for a total of 2,814,930 frames and 5,623,734 sensor data samples. The dataset covers 23 classes, such as Jodan Zuki and Oi Zuki. The acquisition setup combines two orthogonal web cameras and three wearable inertial sensors, recording vision and inertial data respectively. The dataset aims to aid research on recognizing human actions with similar spatial trajectories. The paper describes the dataset statistics and acquisition setup, and provides baseline performance figures using popular action recognizers. We propose an ensemble-based method, KarateNet, that performs decision-level fusion of the two input modalities (vision and sensor data) to classify actions. For the first stream, RGB frames are extracted from the videos and passed to action recognition networks such as the Temporal Segment Network (TSN) and the Temporal Shift Module (TSM). For the second stream, the sensor data is converted into a 2-D image and fed into a Convolutional Neural Network (CNN). The reported results were obtained by fusing the two streams. We also report ablation results for fusion with various input settings. The dataset and code will be made publicly available.
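
The abstract does not detail how the inertial stream is encoded, so the following is a minimal sketch of one plausible reading: the six sensor channels (tri-axial accelerometer and gyroscope) over a fixed window are stacked into a 2-D array, treated as a one-channel image, and classified by a small CNN in PyTorch. The names (window_to_image, SensorImageCNN), layer sizes, and window length are illustrative assumptions rather than the paper's configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def window_to_image(window: torch.Tensor) -> torch.Tensor:
        """Stack a (num_samples, 6) inertial window (ax, ay, az, gx, gy, gz)
        into a (1, 6, num_samples) single-channel 'image' tensor."""
        return window.t().unsqueeze(0)

    class SensorImageCNN(nn.Module):
        """Small CNN over the sensor 'image'; 23 classes as in MS-KARD."""

        def __init__(self, num_classes: int = 23):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 16, kernel_size=(3, 5), padding=(1, 2))
            self.conv2 = nn.Conv2d(16, 32, kernel_size=(3, 5), padding=(1, 2))
            self.pool = nn.AdaptiveAvgPool2d((3, 8))  # fixed size for any window length
            self.fc = nn.Linear(32 * 3 * 8, num_classes)

        def forward(self, x):                 # x: (batch, 1, 6, num_samples)
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, (1, 2))       # pool along the time axis only
            x = F.relu(self.conv2(x))
            x = self.pool(x)
            return self.fc(x.flatten(1))      # per-class logits

    # Example: an arbitrary 120-sample window of 6-channel inertial data.
    window = torch.randn(120, 6)
    image = window_to_image(window).unsqueeze(0)   # (1, 1, 6, 120)
    logits = SensorImageCNN()(image)
    print(logits.shape)                            # torch.Size([1, 23])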

https://eprints.bournemouth.ac.uk/36996/

Source: Scopus

MS-KARD: A Benchmark for Multimodal Karate Action Recognition

Authors: Yadav, S.K., Deshmukh, A., Gonela, R.V., Kera, S.B., Tiwari, K., Pandey, H.M. and Akbar, S.A.

Journal: 2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN)

ISSN: 2161-4393

DOI: 10.1109/IJCNN55064.2022.9892646

https://eprints.bournemouth.ac.uk/36996/

Source: Web of Science (Lite)

MS-KARD: A Benchmark for Multimodal Karate Action Recognition

Authors: Yadav, S., Deshmukh, A., Gonela, R., Kera, S., Tiwari, K., Pandey, H. and Akbar, S.A.

Conference: IEEE WCCI 2022 International Joint Conference on Neural Networks (IJCNN 2022)

Dates: 18-23 July 2022

Journal: IEEE Xplore

Abstract:

Classifying complex human motion sequences is a major research challenge in the domain of human activity recognition. Most popular datasets currently lack a specialized set of classes covering action sequences that are similar in terms of spatial trajectories. To recognize such complex action sequences with high inter-class similarity, such as those in karate, multiple streams are required. To fulfill this need, we propose MS-KARD, a Multi-Stream Karate Action Recognition Dataset that uses multiple vision perspectives as well as sensor data (accelerometer and gyroscope). It includes 1,518 video clips along with their corresponding sensor data. Each video was shot at 30 fps and lasts around one minute, for a total of 2,814,930 frames and 5,623,734 sensor data samples. The dataset covers 23 classes, such as Jodan Zuki and Oi Zuki. The acquisition setup combines two orthogonal web cameras and three wearable inertial sensors, recording vision and inertial data respectively. The dataset aims to aid research on recognizing human actions with similar spatial trajectories. The paper describes the dataset statistics and acquisition setup, and provides baseline performance figures using popular action recognizers. We propose an ensemble-based method, KarateNet, that performs decision-level fusion of the two input modalities (vision and sensor data) to classify actions. For the first stream, RGB frames are extracted from the videos and passed to action recognition networks such as the Temporal Segment Network (TSN) and the Temporal Shift Module (TSM). For the second stream, the sensor data is converted into a 2-D image and fed into a Convolutional Neural Network (CNN). The reported results were obtained by fusing the two streams. We also report ablation results for fusion with various input settings. The dataset and code will be made publicly available.
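
KarateNet is described as fusing the two modalities at the decision level, but the abstract does not give the fusion rule. A common choice is a weighted average of the per-stream softmax scores; the sketch below illustrates that generic rule, with the function name fuse_decisions and the 0.5 weighting as illustrative assumptions rather than the published setup.

    import torch
    import torch.nn.functional as F

    def fuse_decisions(vision_logits: torch.Tensor,
                       sensor_logits: torch.Tensor,
                       vision_weight: float = 0.5) -> torch.Tensor:
        """Decision-level fusion: convert each stream's logits to class
        probabilities and take a weighted average (illustrative rule)."""
        p_vision = F.softmax(vision_logits, dim=-1)
        p_sensor = F.softmax(sensor_logits, dim=-1)
        return vision_weight * p_vision + (1.0 - vision_weight) * p_sensor

    # Dummy logits for the 23 MS-KARD classes.
    vision_logits = torch.randn(1, 23)   # e.g. output of TSN/TSM on RGB frames
    sensor_logits = torch.randn(1, 23)   # e.g. output of the sensor-image CNN
    fused = fuse_decisions(vision_logits, sensor_logits)
    print(fused.argmax(dim=-1))          # predicted class index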

https://eprints.bournemouth.ac.uk/36996/

Source: Manual

MS-KARD: A Benchmark for Multimodal Karate Action Recognition

Authors: Yadav, S., Deshmukh, A., Gonela, R., Kera, S., Tiwari, K., Pandey, H. and Akbar, S.A.

Conference: IEEE WCCI 2022 International Joint Conference on Neural Networks (IJCNN 2022)

Abstract:

Classifying complex human motion sequences is a major research challenge in the domain of human activity recognition. Most popular datasets currently lack a specialized set of classes covering action sequences that are similar in terms of spatial trajectories. To recognize such complex action sequences with high inter-class similarity, such as those in karate, multiple streams are required. To fulfill this need, we propose MS-KARD, a Multi-Stream Karate Action Recognition Dataset that uses multiple vision perspectives as well as sensor data (accelerometer and gyroscope). It includes 1,518 video clips along with their corresponding sensor data. Each video was shot at 30 fps and lasts around one minute, for a total of 2,814,930 frames and 5,623,734 sensor data samples. The dataset covers 23 classes, such as Jodan Zuki and Oi Zuki. The acquisition setup combines two orthogonal web cameras and three wearable inertial sensors, recording vision and inertial data respectively. The dataset aims to aid research on recognizing human actions with similar spatial trajectories. The paper describes the dataset statistics and acquisition setup, and provides baseline performance figures using popular action recognizers. We propose an ensemble-based method, KarateNet, that performs decision-level fusion of the two input modalities (vision and sensor data) to classify actions. For the first stream, RGB frames are extracted from the videos and passed to action recognition networks such as the Temporal Segment Network (TSN) and the Temporal Shift Module (TSM). For the second stream, the sensor data is converted into a 2-D image and fed into a Convolutional Neural Network (CNN). The reported results were obtained by fusing the two streams. We also report ablation results for fusion with various input settings. The dataset and code will be made publicly available.
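
The vision stream relies on TSN/TSM-style recognizers, which sample frames sparsely: each clip is split into a fixed number of segments and one frame (or short snippet) is drawn from each. The sketch below shows only that sampling step, assuming 8 segments and an index-only interface; it is illustrative and not the paper's training pipeline.

    import random
    from typing import List

    def sample_segment_indices(num_frames: int, num_segments: int = 8,
                               training: bool = True) -> List[int]:
        """TSN-style sparse sampling: split the clip into num_segments equal
        chunks and pick one frame index per chunk (random during training,
        the chunk center at test time)."""
        seg_len = num_frames / num_segments
        indices = []
        for s in range(num_segments):
            start = int(s * seg_len)
            end = max(start, int((s + 1) * seg_len) - 1)
            indices.append(random.randint(start, end) if training
                           else (start + end) // 2)
        return indices

    # Example: a one-minute MS-KARD clip at 30 fps has roughly 1800 frames.
    print(sample_segment_indices(1800, num_segments=8, training=False))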

https://eprints.bournemouth.ac.uk/36996/

Source: BURO EPrints