Human-centric Spatio-Temporal Video Grounding


Given an untrimmed video and a description of a target object, the spatio-temporal video grounding (STVG) task aims to localize the spatio-temporal tube of that object. It is a crucial task that requires cross-modal visual-language comprehension.
Inputs: A 20 s video and a description corresponding to a person.
Outputs: The start and end frame numbers, together with the bounding boxes of the target person throughout that clip.
Download the data at DATA.
Participants should register using this Form before testing.


1st prize: CN¥ 20,000
2nd prize: CN¥ 5,000
3rd prize: CN¥ 2,000

Important Dates

Time zone: Beijing, UTC+8

April 1st, 10:00:00, 2021: Training / validation set released
May 6th, 10:00:00, 2021: Testing set released and submission opened
June 4th, 10:00:00, 2021: Submission deadline
June 8th, 10:00:00, 2021: Challenge winners notified
June 20th, 2021: Winners present at CVPR 2021 Workshop

Task Rules

  1. The results should be stored in results.json, with the following format:
     {
       'video_id': {
         'st_frame': st,
         'ed_frame': ed,
         'bbox': {
           'st': [x1, y1, x2, y2],
           'st+1': [x1, y1, x2, y2],
           ...
           'ed': [x1, y1, x2, y2]
         }
       }
     }
  2. You have 10 submission chances in total.
  3. Evaluation can take some time. A failed submission does not count against your submission chances.
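
The submission format described in rule 1 can be produced with a short script. The following is a minimal sketch, not an official tool: the helper names `build_entry` and `save_results`, the video id `video_0001`, and the box values are all hypothetical, and it assumes that frame numbers are written as string keys (JSON object keys are always strings) and that `ed_frame` is inclusive, as the `'ed'` key in the format suggests.

```python
import json


def build_entry(st_frame, boxes):
    """Build one video's entry from a start frame and per-frame boxes.

    boxes[i] is [x1, y1, x2, y2] for frame st_frame + i (assumed layout).
    """
    if not boxes:
        raise ValueError("at least one box is required")
    ed_frame = st_frame + len(boxes) - 1  # end frame is treated as inclusive
    return {
        "st_frame": st_frame,
        "ed_frame": ed_frame,
        # JSON object keys are strings, so frame numbers are stringified.
        "bbox": {
            str(st_frame + i): [float(v) for v in box]
            for i, box in enumerate(boxes)
        },
    }


def save_results(results, path="results.json"):
    """Write the full results dictionary (video id -> entry) to disk."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f)


# Hypothetical video id and boxes, for illustration only.
example_results = {
    "video_0001": build_entry(
        12,
        [[10, 20, 110, 220], [12, 21, 112, 221], [14, 22, 114, 222]],
    ),
}
```

Calling `save_results(example_results)` would then write a `results.json` file containing one entry whose tube spans frames 12 to 14.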

Task Metric

  1. \(vIoU\): \(vIoU = \frac{1}{\left | S_u \right |} \sum_{t \in S_i} IoU(Box^t, Box^{t'})\), where \(S_i\) is the set of frames in the intersection of the predicted and ground-truth tubes, \(S_u\) is the set of frames in their union, and \(Box^t\) and \(Box^{t'}\) are the predicted and ground-truth bounding boxes of frame \(t\). \(vIoU\) directly reflects the spatio-temporal accuracy of a prediction.
  2. \(vIoU@R\) stands for the percentage of samples whose \(vIoU\) is larger than \(R\).
  3. \(mvIoU\) stands for the mean value of \(vIoU\) over all samples.
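
The three metrics above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the official evaluation code: the function names are hypothetical, tubes are represented as dictionaries mapping frame number to `[x1, y1, x2, y2]`, and box areas are computed as plain width times height (no "+1" pixel convention).

```python
def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def viou(pred, gt):
    """vIoU of two tubes; pred and gt map frame number -> box."""
    s_i = pred.keys() & gt.keys()  # frames in both tubes
    s_u = pred.keys() | gt.keys()  # frames in either tube
    if not s_u:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in s_i) / len(s_u)


def viou_at_r(scores, r):
    """Fraction of samples whose vIoU is strictly larger than r."""
    return sum(s > r for s in scores) / len(scores)


def mean_viou(scores):
    """Mean vIoU over all samples."""
    return sum(scores) / len(scores)


# Toy tubes: prediction covers frames 2-5, ground truth covers frames 4-7,
# with identical boxes on the two overlapping frames.
box = [0, 0, 10, 10]
pred_tube = {t: box for t in range(2, 6)}
gt_tube = {t: box for t in range(4, 8)}
toy_score = viou(pred_tube, gt_tube)  # 2 matching frames / 6 union frames
```

In the toy case the boxes match perfectly on the overlapping frames, yet the score is only 2/6, which shows how temporal misalignment alone lowers \(vIoU\).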
Dataset

    Human-centric Spatio-Temporal Video Grounding (HC-STVG) focuses only on humans in videos. We provide 16k annotation-video pairs covering different movie scenes. Specifically, we annotate the description and the full trajectory of the corresponding person (a series of bounding boxes). Notably, every clip includes multiple people, which makes video comprehension more challenging.

    Data Statistics

    Dataset Total Trainval Test Video_len
    HCVG 16685 12000 4665 20s




    title={Human-centric Spatio-Temporal Video Grounding With Visual Transformers},
    author={Tang, Zongheng and Liao, Yue and Liu, Si and Li, Guanbin and Jin, Xiaojie and Jiang, Hongxu and Yu, Qian and Xu, Dong},
    journal={arXiv preprint arXiv:2011.05049},