* Equal contribution. † Co-corresponding author. Each image is paired with one or more text instances with polygon-level annotations. The dataset follows a consistent annotation format, detailed in ...
Abstract: Image-text retrieval requires the system to bridge the heterogenous gap between vision and language for accurate retrieval while keeping the network lightweight-enough for efficient ...