Video instance segmentation (VIS) is the task of simultaneously classifying, segmenting, and tracking multiple object instances in videos. Modern VIS transformers (VisTR) use a per-clip approach and have shown impressive end-to-end performance, but they suffer from long training times and high computation costs due to their frame-wise dense…