A Survey of CVPR-Related Challenges

Skywalker

Ego4D

Episodic Memory

Moments queries (MQ)

Challenge link

https://eval.ai/web/challenges/challenge-page/1626/overview

Challenge details

Main task

This task aims to query an egocentric video based on a category of actions. Specifically, it poses the following request ‘Retrieve all the moments that I do X in the video.’, where ‘X’ comes from a pre-defined taxonomy of action categories, such as ‘interact with someone’ or ‘use phone’. Given an input video and a query action category, the goal is to retrieve all the instances of this action category in the video.

In short: given an input video and a query action category, output the start and end time of each instance of that action in the video.
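To make the I/O concrete, here is a minimal Python sketch of one retrieved moment together with the temporal IoU that segment-level detection metrics build on; the field names are illustrative, not the official submission schema.

```python
# Minimal sketch of one MQ prediction and the temporal IoU underlying the
# evaluation. Field names ("video_id", "label", ...) are illustrative,
# not the official submission schema.

def temporal_iou(seg_a, seg_b):
    """IoU of two (start, end) segments, in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

prediction = {
    "video_id": "clip_0001",
    "label": "use_phone",      # query category from the taxonomy
    "segment": (12.4, 18.9),   # predicted (start_time, end_time) in seconds
    "score": 0.87,             # confidence used for ranking
}

print(temporal_iou(prediction["segment"], (13.0, 19.5)))  # ~0.83
```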


Organizer

Ego4D

2023 champion code (currently fourth on the leaderboard): https://github.com/JonnyS1226/ego4d_asl

2023 champion paper: https://arxiv.org/abs/2306.09172

2024 code is not yet public; the repository is at https://github.com/OpenGVLab/EgoVideo

Goal Step

Challenge link

https://eval.ai/web/challenges/challenge-page/2188/overview

Challenge details

Main task

There are three tasks of interest in Ego4D Goal-Step (1) Goal/Step localization (2) Online goal/step detection (3) Step grounding. In this challenge, we focus on Step grounding. Given an untrimmed egocentric video, identify the temporal action segment corresponding to a natural language description of the step. Specifically, predict the (start_time, end_time) for a given keystep description.

In short: this challenge focuses on step grounding, i.e., predicting the start and end time of a given keystep description.
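A minimal sketch of this interface, assuming window and keystep-text embeddings already produced by some video-language encoder (the encoder and all names below are placeholders, not the challenge baseline):

```python
import numpy as np

# Sketch of step grounding: score candidate temporal windows against the
# keystep description and return the best (start_time, end_time). The
# features are assumed to come from some video-language encoder; nothing
# here is the official baseline.

def ground_step(windows, window_feats, text_feat):
    """windows: list of (start, end); window_feats: (N, D), L2-normalized;
    text_feat: (D,), L2-normalized."""
    sims = window_feats @ text_feat       # cosine similarity per window
    return windows[int(np.argmax(sims))]  # predicted (start_time, end_time)

# Toy usage with random features, just to show the shapes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 8))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
query = feats[1] + 0.01 * rng.normal(size=8)
query /= np.linalg.norm(query)
print(ground_step([(0, 10), (10, 25), (25, 40)], feats, query))  # (10, 25)
```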

Organizer

Ego4D

No public code found yet.

EgoSchema

Challenge link

https://eval.ai/web/challenges/challenge-page/2238/overview

Challenge details

Main task

We introduce EgoSchema, a very long-form video question-answering benchmark to evaluate long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5000 human curated multiple choice question answer pairs, spanning over 250 hours of real video data, covering a very broad range of natural human activity and behavior. For each question, EgoSchema requires the correct answer to be selected between five given options based on a three-minute-long video clip. More details at our website and Github page.


In short: given a three-minute video clip, a question, and five options, choose the correct one.
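To pin down the format, here is a hypothetical record and the 5-way accuracy metric; field names are assumptions, not the official release format.

```python
# Hypothetical shape of one EgoSchema item plus the 5-way accuracy metric.
# Field names are assumptions, not the official release format.
sample = {
    "clip_uid": "abc123",   # 3-minute clip derived from Ego4D
    "question": "What is the primary activity the person is engaged in?",
    "options": ["cooking", "cleaning", "gardening", "repairing", "painting"],
    "answer_idx": 0,        # hidden at test time
}

def accuracy(pred_indices, answer_indices):
    """Both arguments are lists of option indices in [0, 4]."""
    correct = sum(p == a for p, a in zip(pred_indices, answer_indices))
    return correct / len(answer_indices)

print(accuracy([0, 2, 2], [0, 1, 2]))  # 0.666...
```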

Organizer

Ego4D

Social Understanding

Looking at me

Challenge link

https://eval.ai/web/challenges/challenge-page/1624/overview

Challenge details

Main task

An egocentric video provides a unique lens for studying social interactions because it captures utterances and nonverbal cues from each participant’s unique view and enables embodied approaches to social understanding. Progress in egocentric social understanding could lead to more capable virtual assistants and social robots. Computational models of social interactions can also provide new tools for diagnosing and treating disorders of socialization and communication such as autism, and could support novel prosthetic technologies for the hearing-impaired.

While the Ego4D dataset can support such a long-term research agenda, our looking-at-me task focuses on identifying communicative acts that are directed towards the camera-wearer, as distinguished from those directed to other social partners: given a video in which the faces of social partners have been localized and identified, classify whether each visible face is looking at the camera wearer.

In short: the ground-truth face tracks are given (each face is tracked across frames with a unique ID, and the time segments in which each visible person looks at the camera wearer are annotated); the model must output a binary label for every frame indicating whether each visible face is looking at the camera wearer.
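A minimal sketch of the per-track interface, assuming face crops come from the ground-truth tracks; `model` stands in for any trained frame-level classifier and the dict layout is illustrative, not the official schema.

```python
# Sketch of the looking-at-me interface: the ground-truth track supplies a
# face region per frame, and the model emits one binary label per frame.

face_track = {
    "track_id": "person_03",
    "frames": [101, 102, 103],
    "boxes": [(40, 60, 120, 140)] * 3,  # (x1, y1, x2, y2) face boxes
}

def classify_track(face_crops, model, threshold=0.5):
    """Return a 0/1 looking-at-me label per frame of one track."""
    return [int(model(crop) >= threshold) for crop in face_crops]

dummy_model = lambda crop: 0.9  # stand-in for a trained per-frame classifier
print(classify_track(face_track["boxes"], dummy_model))  # [1, 1, 1]
```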


Organizer

Ego4D

No public code found yet.

Talking to me

Challenge link

https://eval.ai/web/challenges/challenge-page/2238/overview

Challenge details

Main task

An egocentric video provides a unique lens for studying social interactions because it captures utterances and nonverbal cues from each participant’s unique view and enables embodied approaches to social understanding. Progress in egocentric social understanding could lead to more capable virtual assistants and social robots. Computational models of social interactions can also provide new tools for diagnosing and treating disorders of socialization and communication such as autism, and could support novel prosthetic technologies for the hearing-impaired.

While the Ego4D dataset can support such a long-term research agenda, our initial Social benchmark focuses on multimodal understanding of conversational interactions via attention and speech. Specifically, we focus on identifying communicative acts that are directed towards the camera-wearer, as distinguished from those directed to other social partners: Talking to me (TTM): given a video and audio segment with the same tracked faces and an additional label that identifies speaker status, classify whether each visible face is talking to the camera wearer. The TTM task is defined as a frame-level prediction y, in contrast to audio analysis tasks where labels are often assigned at the level of audio frames or segments. A desired model must be able to make a consolidated decision based on the video and audio cues over the time course of an utterance. For example, if the speaker turns their head to the side momentarily while speaking to the camera-wearer, then a frame where the speaker is looking away would still have y = 1.


In short: given a video and audio segment with tracked faces and a label identifying the speaker status, output a binary label for every frame indicating whether each visible face is talking to the camera wearer.
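The frame-labeling convention can be made concrete with a small sketch: one decision per utterance, broadcast to all frames it spans, so a frame where the speaker briefly looks away still carries y = 1 (my reading of the task definition above, not the official tooling).

```python
# Sketch of the TTM frame-labeling convention: one decision per utterance,
# broadcast to every video frame the utterance spans.

def ttm_frame_labels(num_frames, utterances):
    """utterances: list of (start_frame, end_frame, talking_to_me: bool)."""
    y = [0] * num_frames
    for start, end, ttm in utterances:
        if ttm:
            for f in range(start, min(end + 1, num_frames)):
                y[f] = 1
    return y

print(ttm_frame_labels(10, [(2, 6, True)]))  # [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
```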

Organizer

Ego4D

Forecasting

Short-term object interaction anticipation

Challenge link

https://eval.ai/web/challenges/challenge-page/1623/overview

Challenge details

Main task

This task aims to predict the next human-object interaction happening after a given timestamp. Given an input video, the goal is to anticipate:

  • The spatial positions of the active objects, among those which are in the scene (e.g., bounding boxes around the objects). We consider as the next active object, the next object which will be touched by the user (either with their hands or with a tool) to initiate an interaction;
  • The category of each of the detected next active objects (e.g., “knife”, “tomato”);
  • How each active object will be used, i.e., what action will be performed on the active objects (e.g., “take”, “cut”);
  • When the interaction with each object will begin (e.g., “in 1 second”, “in 0.25 seconds”). This is the time to the first frame in which the user touches the active object (time to contact). This prediction can be useful in scenarios which involve human-machine collaboration. For instance, an assistive system could give an alert if a short time to action is predicted for a potentially dangerous object to touch.

In this task, models are required to make predictions at a specific timestamp, rather than densely throughout the video. The model is allowed to process the video up to a given frame t, at which point it must anticipate the next active objects, and how they will take part in an interaction in Δ seconds, where Δ is unknown. The model can make zero or more predictions. Each prediction indicates the next active object in terms of a noun class, a bounding box, a verb indicating the future action, as well as the time to contact, which estimates how many seconds in the future the interaction with the object will begin. Each prediction also comprises a confidence score used for evaluation.

In short: given an untrimmed video and a timestamp, the model processes the video up to frame t and predicts the human-object interactions occurring after t: the objects involved, the actions performed on them, and the time from t until contact.
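As an illustration, one prediction made at frame t could be structured as below; field names are assumptions, not the official schema.

```python
# Illustrative structure of one short-term anticipation prediction made at
# frame t. A model may emit zero or more such predictions per timestamp.
prediction = {
    "noun": "knife",               # next active object category
    "box": (220, 140, 380, 300),   # (x1, y1, x2, y2) in the frame at time t
    "verb": "take",                # anticipated interaction
    "time_to_contact": 0.75,       # seconds from t until the hand touches it
    "score": 0.91,                 # confidence used by the evaluation
}
```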

Organizer

Ego4D

2024 code (ranked third): https://github.com/KeenyJin/SOIA-DOD

Paper for this code: https://arxiv.org/abs/2407.05713

Long-term activity prediction

Challenge link

https://eval.ai/web/challenges/challenge-page/1598/overview

Challenge details

Main task

This task aims to predict the next Z future actions after a given action. Given an input video up to a particular timestep (corresponding to the last visible action), the goal is to predict a list of action classes [(verb1, noun1), (verb2, noun2) … (verbZ, nounZ)] that follow it. The model should generate K such lists to account for variations in action sequences. The evaluation metric will consider the best of these K. For this task, we set Z=20 and K=5.

In short: given an untrimmed video V up to the last visible action, the model must predict the next Z consecutive future actions, output as verb-noun pairs; it may produce K candidate sequences, and evaluation scores the best of the K (here Z = 20, K = 5).
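A sketch of the output shape and best-of-K scoring, using plain Levenshtein edit distance as a simplified stand-in for the official metric:

```python
# Sketch of the LTA output: K = 5 candidate sequences, each of Z = 20
# (verb, noun) pairs; evaluation keeps the best-scoring candidate.

Z, K = 20, 5

def edit_distance(a, b):
    """Levenshtein distance between two action sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def best_of_k(candidates, gt):
    """candidates: K lists of Z (verb, noun) tuples; score the best one."""
    return min(edit_distance(c, gt) for c in candidates)

gt = [("take", "knife"), ("cut", "tomato")]
cands = [[("take", "knife"), ("cut", "onion")],
         [("wash", "pan"), ("cut", "tomato")]]
print(best_of_k(cands, gt))  # 1: the best candidate differs in one action
```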


Organizer

Ego4D

Code currently fourth on the leaderboard: https://github.com/zeyun-zhong/querymamba

Paper for this code: https://arxiv.org/abs/2407.04184

Object Instance Detection Challenge @ CVPR2025

Challenge link

https://eval.ai/web/challenges/challenge-page/2478/overview

Challenge details

Main task

Instance Detection (InsDet) is a practically important task in robotics applications, e.g., elderly-assistant robots need to fetch specific items from a cluttered kitchen, and micro-fulfillment robots for retail need to pick items from mixed boxes or shelves. Unlike Object Detection (ObjDet), which detects all objects belonging to some predefined classes, InsDet aims to detect specific object instances defined by example images capturing the instance from multiple views.

In short: rather than detecting all objects of predefined classes, detect specific object instances in the test scene, each defined by example images captured from multiple views.
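One natural baseline, sketched below under assumptions (an off-the-shelf image encoder, class-agnostic proposals), is to match proposal embeddings against each instance's multi-view template embeddings; this is not the official method.

```python
import numpy as np

# Baseline sketch: embed the multi-view example images of each instance,
# embed class-agnostic proposals from the test image, and assign each
# proposal to the most similar instance. The encoder producing these
# features is an assumption (e.g. a self-supervised backbone).

def detect_instances(proposal_feats, template_feats, threshold=0.6):
    """proposal_feats: (P, D); template_feats: {name: (V, D)} with V views
    per instance. All features assumed L2-normalized."""
    results = []
    for p, feat in enumerate(proposal_feats):
        # The best-matching view of the best-matching instance wins.
        name, sim = max(
            ((n, float((views @ feat).max())) for n, views in template_feats.items()),
            key=lambda t: t[1],
        )
        if sim >= threshold:
            results.append((p, name, sim))  # proposal index, instance, score
    return results
```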

Organizer

CVPR

