VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval

Authors: Gong, Y., Cosma, G. and Finke, A.

Source: arXiv