Dense Video Captioning

Bring your soccer game experience to life with Dense Video Captioning! This technology highlights the most exciting moments of a game and adds engaging commentary, immersing you in the action.

Our task.

Dense Video Captioning consists of generating coherent captions that describe the soccer actions that occurred and localizing each caption with a timestamp.

All of our classes.

Some of the comments have associated classes. We will evaluate on non-labeled comments and on those with the following classes: {corner, substitution, whistle, soccer-ball, time, injury, penalty, y-card, yr-card, r-card, soccer-ball-own, penalty-missed}.
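As an illustration of this selection rule, here is a minimal Python sketch that keeps only the comments that would be evaluated. The record layout is an assumption made for the example: each comment is a dictionary with a hypothetical "label" key that is empty or missing for non-labeled comments. The actual annotation format may differ.

```python
# Classes listed above; non-labeled comments are evaluated as well.
EVALUATED_LABELS = {
    "corner", "substitution", "whistle", "soccer-ball", "time", "injury",
    "penalty", "y-card", "yr-card", "r-card", "soccer-ball-own",
    "penalty-missed",
}


def is_evaluated(comment: dict) -> bool:
    """Return True for non-labeled comments and comments with an evaluated class.

    Assumption: the class is stored under a hypothetical "label" key and is
    empty or absent when the comment has no class.
    """
    label = comment.get("label")
    return not label or label in EVALUATED_LABELS


# Made-up records, for illustration only.
comments = [
    {"label": "corner", "text": "Corner kick for the home side."},
    {"label": "", "text": "The tempo of the game drops."},
    {"label": "some-other-class", "text": "A comment with a class that is not evaluated."},
]
evaluated = [c for c in comments if is_evaluated(c)]
```

Using a set keeps the membership check constant-time and makes the list of evaluated classes easy to compare against the one stated above.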

Our data.

The data consists of 471 videos from soccer broadcast games, available at two resolutions (720p and 224p), with captions. We also provide features extracted at 2 frames per second for easier use, including the feature used by the 2021 challenge winners, Baidu Research. The provided data also includes the original comments and versions in which referees, coaches, players, and teams have been anonymized or identified, as well as team lineups. The challenge set is composed of 42 separate games.
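Because the features are sampled at 2 frames per second, a timestamp in seconds maps to a feature position by a simple multiplication. The sketch below assumes, for illustration only, that the features of one video are stored as a sequence with one entry per sampled frame, starting at time 0; the helper names are hypothetical and not part of the provided tools.

```python
FEATURE_FPS = 2  # sampling rate of the provided features, as stated above


def feature_index(timestamp_s: float, fps: int = FEATURE_FPS) -> int:
    """Index of the feature covering `timestamp_s`.

    Assumption: entry 0 corresponds to time 0 and entries are evenly spaced.
    """
    return int(timestamp_s * fps)


def window_indices(timestamp_s: float, half_window_s: float,
                   num_features: int, fps: int = FEATURE_FPS) -> range:
    """Feature indices within `half_window_s` seconds of a timestamp,
    clipped to the valid range [0, num_features)."""
    start = max(0, feature_index(timestamp_s - half_window_s, fps))
    end = min(num_features, feature_index(timestamp_s + half_window_s, fps) + 1)
    return range(start, end)
```

For example, under these assumptions a caption at 10 seconds with a 15-second half window would cover indices 0 through 50, since the window is clipped at the start of the video.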

Our metric.

We define the following metric for the Dense Video Captioning task. Around each predicted and ground-truth caption, we build a 30-second tolerance window centered on the spotting timestamp (15 seconds before and 15 seconds after). We then compute the standard metrics for generated text between each predicted caption and any ground-truth caption whose time window overlaps with it.

This metric can be derived from the one introduced in ActivityNet Captions: after building the 30-second tolerance windows, we apply the same procedure with a temporal Intersection over Union (tIoU) threshold of tIoU > 0.
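The matching step described above can be sketched in Python as follows. This is only an illustration of the window and tIoU > 0 logic, not the official evaluation code: the function names are hypothetical, timestamps are assumed to be given in seconds, and the text metric itself (for example METEOR) is left out.

```python
from typing import List, Tuple

TOLERANCE_S = 30.0  # total window length: 15 s before and 15 s after


def window(timestamp_s: float, tolerance_s: float = TOLERANCE_S) -> Tuple[float, float]:
    """Tolerance window centered on a caption's spotting timestamp."""
    half = tolerance_s / 2
    return (timestamp_s - half, timestamp_s + half)


def tiou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Temporal Intersection over Union of two (start, end) windows."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    if inter == 0.0:
        return 0.0  # disjoint (or merely touching) windows
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union


def matched_pairs(pred_times: List[float], gt_times: List[float]) -> List[Tuple[int, int]]:
    """(prediction index, ground-truth index) pairs whose windows have tIoU > 0."""
    return [
        (i, j)
        for i, p in enumerate(pred_times)
        for j, g in enumerate(gt_times)
        if tiou(window(p), window(g)) > 0
    ]
```

Since all windows have the same 30-second length, two captions are matched exactly when their timestamps are less than 30 seconds apart. The text metric would then be computed on each matched pair.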

For the challenge, we will focus only on non-labeled captions and on those with the following labels: {corner, substitution, whistle, soccer-ball, time, injury, penalty, y-card, yr-card, r-card, soccer-ball-own, penalty-missed}. The ranking will be based on METEOR.