Make-up Dense Video Captioning

Introduction

Given an untrimmed make-up video, the Make-up Dense Video Captioning (MDVC) task aims to localize and describe a sequence of make-up steps in the target video. This task requires models to both detect and describe fine-grained make-up events in a video.
Inputs: An untrimmed make-up video, ranging from 15s to 1h in length.
Outputs: The temporal boundaries and generated descriptions of the detected make-up events in the video.
Downloads and baselines are available here.
Testing is hosted on CodaLab.
Participants should register via this form before testing.

Prizes

1st prize: ¥10,000
2nd prize: ¥3,000
3rd prize: ¥2,000

Important Dates

Time zone: Beijing, UTC+8

April 25th, 2022: Training / validation set released
June 10th, 2022: Testing set released and submission opened
June 25th, 2022: Submission deadline
June 26th-30th, 2022: Objective evaluation
July 1st, 2022: Evaluation results announced
July 6th, 2022: Paper submission deadline

Task Rules

  1. The results should be stored in results.json, in the following format, where sent, st_time, and ed_time are placeholders for a generated sentence and its start/end times (see the sketch after this list):
    {
      "video_id": [
        {
          "sentence": sent,
          "timestamp": [st_time, ed_time]
        }, ...
      ]
    }
  2. Each team can submit results.json once per day.
  3. The evaluation process can take some time; a failed submission will not count against the daily submission limit.
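
A minimal sketch of serializing model predictions into this format. The predictions dictionary, sentences, and timestamps below are illustrative placeholders, not part of the challenge toolkit; timestamps are assumed to be in seconds, so follow whatever unit the released annotations use.

    import json

    # Hypothetical predictions: for each video id, a list of
    # (sentence, start_time, end_time) tuples produced by a model.
    predictions = {
        "video_0001": [
            ("apply foundation evenly on the face", 12.5, 48.0),
            ("draw the eyebrows with an eyebrow pencil", 50.2, 95.7),
        ],
    }

    # Convert to the required results.json structure.
    results = {
        video_id: [
            {"sentence": sent, "timestamp": [st_time, ed_time]}
            for sent, st_time, ed_time in steps
        ]
        for video_id, steps in predictions.items()
    }

    with open("results.json", "w") as f:
        json.dump(results, f, indent=2)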

Task Metric

We measure both the localization and captioning abilities of models. For localization performance, we compute the average precision (AP) across \(tIoU\) thresholds of \(\{0.3, 0.5, 0.7, 0.9\}\). For dense captioning performance, we calculate BLEU@4, METEOR, and CIDEr over the matched pairs between generated captions and the ground truth across the same \(tIoU\) thresholds.
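
The exact matching protocol is implemented by the evaluation server; the sketch below only illustrates the two ingredients named above, the temporal IoU between a predicted and a ground-truth segment and a greedy one-to-one matching at a given threshold. All function names here are ours, not from the official toolkit.

    def t_iou(pred, gt):
        """Temporal IoU between two [start, end] segments."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = max(pred[1], gt[1]) - min(pred[0], gt[0])
        return inter / union if union > 0 else 0.0

    def match_at_threshold(preds, gts, threshold):
        """Greedily pair each predicted segment with the unused
        ground-truth segment of highest tIoU >= threshold."""
        matched, used = [], set()
        for i, p in enumerate(preds):
            best_j, best_iou = -1, threshold
            for j, g in enumerate(gts):
                if j not in used and t_iou(p, g) >= best_iou:
                    best_j, best_iou = j, t_iou(p, g)
            if best_j >= 0:
                matched.append((i, best_j))
                used.add(best_j)
        return matched

    # Example: at tIoU >= 0.5 both predictions match; the captions of
    # matched pairs would then be scored with BLEU@4, METEOR, and CIDEr.
    preds = [[10.0, 50.0], [60.0, 90.0]]
    gts = [[12.0, 48.0], [55.0, 95.0]]
    print(match_at_threshold(preds, gts, 0.5))  # [(0, 0), (1, 1)]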
Dataset

Make-up instructional videos are naturally more fine-grained than open-domain videos. Different steps share similar backgrounds but contain subtle yet critical differences, such as fine-grained actions, tools, and applied facial areas, all of which can result in different effects on the face.
We use the YouMakeup dataset, which contains 2,800 make-up instructional videos from YouTube, spanning more than 420 hours in total. Each video is annotated with a sequence of steps, including the temporal boundaries, grounded facial areas, and natural language description of each step. There are 30,626 steps in total, with 10.9 steps per video on average. Video length varies from 15s to 1h, with an average of 9 minutes.

Data Statistics

Dataset     Total  Train  Val  Test  Video Length
YouMakeup   2800   1680   280  840   15s-1h

Organizers

Linli Yao, Renmin University of China
Ludan Ruan, Renmin University of China
Shuwei Liu, Renmin University of China
Shunyao Yu, Renmin University of China
Qin Jin, Renmin University of China

Citation

@inproceedings{wang2019youmakeup,
  title={YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension},
  author={Wang, Weiying and Wang, Yongcheng and Chen, Shizhe and Jin, Qin},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  pages={5136--5146},
  year={2019}
}