SIG-Former: monocular surgical instruction generation with transformers

Authors: Zhang, J., Nie, Y., Chang, J. and Zhang, J.J.

Journal: International Journal of Computer Assisted Radiology and Surgery

Volume: 17

Issue: 12

Pages: 2203-2210

eISSN: 1861-6429

ISSN: 1861-6410

DOI: 10.1007/s11548-022-02718-9

Abstract:

Purpose: Automatic surgical instruction generation is a crucial component of intra-operative surgical assistance. However, understanding surgical activities and translating them into human-like sentences is particularly challenging because of the complexity of the surgical environment and the modality gap between images and natural language. To this end, we introduce SIG-Former, a transformer-backboned generation network that predicts surgical instructions from monocular RGB images. Methods: Taking a surgical image as input, we first extract its visual attentive feature map with a fine-tuned ResNet-101 model, then apply transformer attention blocks to model the visual representation, the text embedding and the visual-textual relational feature. To tackle the loss-metric inconsistency between training and inference in sequence generation, we additionally apply a self-critical reinforcement learning approach that directly optimizes the CIDEr score after regular training. Results: We validate the proposed method on the DAISI dataset, which contains 290 clinical procedures from diverse medical subjects. Extensive experiments demonstrate that our method outperforms the baselines and achieves promising results in both quantitative and qualitative evaluations. Conclusion: Our experiments demonstrate that SIG-Former is capable of capturing the dependencies between visual features and textual information. Nevertheless, surgical instruction generation is still at a preliminary stage. Future work includes collecting larger clinical datasets, annotating more reference instructions and preparing models pre-trained on medical images.
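
As a rough illustration of the pipeline the Methods description sketches, the snippet below wires a fine-tuned ResNet-101 feature extractor to a standard transformer encoder-decoder that emits instruction tokens. It is a minimal sketch under assumed choices (module names, model width, layer counts, tokenization), not the authors' implementation.

import torch
import torch.nn as nn
import torchvision

class InstructionGenerator(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=3, max_len=40):
        super().__init__()
        # ResNet-101 backbone without its pooling/classification head,
        # so the output is a 2048-channel spatial feature map.
        backbone = torchvision.models.resnet101(weights="DEFAULT")
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, d_model)       # project CNN features to the model width
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        # images: (B, 3, H, W); tokens: (B, T) word indices of the target instruction
        fmap = self.cnn(images)                                     # (B, 2048, h, w)
        vis_tokens = self.proj(fmap.flatten(2).transpose(1, 2))     # (B, h*w, d_model)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        tgt = self.token_emb(tokens) + self.pos_emb(positions)
        # Causal mask: each position may only attend to earlier words during training.
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        out = self.transformer(src=vis_tokens, tgt=tgt, tgt_mask=causal)
        return self.lm_head(out)                                    # (B, T, vocab_size) logits

The self-critical stage mentioned in the Methods can be pictured as the REINFORCE-style objective below (reusing the torch import above): the reward for a sampled sentence is its CIDEr score minus the CIDEr score of the model's own greedy decoding, so only samples that beat their baseline are reinforced. The helpers sample_decode, greedy_decode and cider_score are hypothetical placeholders, not part of the paper's code.

def scst_loss(model, images, references, sample_decode, greedy_decode, cider_score):
    # Greedy baseline: no gradient is needed for it.
    with torch.no_grad():
        baseline = greedy_decode(model, images)           # list of baseline sentences
    sampled, log_probs = sample_decode(model, images)     # sampled sentences + (B,) summed log-probs
    # Reward advantage of each sample over the model's own greedy output.
    r_sample = torch.tensor([cider_score(s, refs) for s, refs in zip(sampled, references)])
    r_base = torch.tensor([cider_score(b, refs) for b, refs in zip(baseline, references)])
    advantage = (r_sample - r_base).to(log_probs.device)
    # REINFORCE with the greedy result as baseline: raise the likelihood of samples
    # that score higher than the baseline, lower it otherwise.
    return -(advantage * log_probs).mean()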

https://eprints.bournemouth.ac.uk/37330/

Source: Scopus

SIG-Former: monocular surgical instruction generation with transformers.

Authors: Zhang, J., Nie, Y., Chang, J. and Zhang, J.J.

Journal: Int J Comput Assist Radiol Surg

Volume: 17

Issue: 12

Pages: 2203-2210

eISSN: 1861-6429

DOI: 10.1007/s11548-022-02718-9

https://eprints.bournemouth.ac.uk/37330/

Source: PubMed

SIG-Former: monocular surgical instruction generation with transformers

Authors: Zhang, J., Nie, Y., Chang, J. and Zhang, J.J.

Journal: International Journal of Computer Assisted Radiology and Surgery

Volume: 17

Issue: 12

Pages: 2203-2210

eISSN: 1861-6429

ISSN: 1861-6410

DOI: 10.1007/s11548-022-02718-9

https://eprints.bournemouth.ac.uk/37330/

Source: Web of Science (Lite)

SIG-Former: monocular surgical instruction generation with transformers.

Authors: Zhang, J., Nie, Y., Chang, J. and Zhang, J.J.

Journal: International Journal of Computer Assisted Radiology and Surgery

Volume: 17

Issue: 12

Pages: 2203-2210

eISSN: 1861-6429

ISSN: 1861-6410

DOI: 10.1007/s11548-022-02718-9

https://eprints.bournemouth.ac.uk/37330/

Source: Europe PubMed Central

SIG-Former: monocular surgical instruction generation with transformers.

Authors: Zhang, J., Nie, Y., Chang, J. and Zhang, J.J.

Journal: International Journal of Computer Assisted Radiology and Surgery

Volume: 17

Pages: 2203-2210

ISSN: 1861-6410

https://eprints.bournemouth.ac.uk/37330/

Source: BURO EPrints