TubeDETR is a new architecture for spatio-temporal video grounding that consists of an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and a ...