About the Workshop
The recent surge of multimodal large models has brought unprecedented progress in connecting language, vision, audio, and beyond. Yet, despite their impressive performance, most existing systems remain constrained to fixed modality pairs, lacking the flexibility to generalize or reason across arbitrary modality combinations. The first edition of the Any-to-Any Multimodal Learning (A2A-MML) workshop aims to explore the next frontier of multimodal intelligence: developing systems capable of understanding, aligning, transforming, and generating across any set of modalities. We organize the discussion around three foundational pillars.
For the latest papers and datasets, please refer to Awesome-Any-to-Any-Generation. This repository is regularly updated and provides valuable resources for your submission.
Topics and Themes
We welcome all relevant submissions in the area of multimodal learning, with emphasis on any-to-any multimodal intelligence, such as:
- Multimodal Representation Learning
- Multimodal Transformation
- Multimodal Synergistic Collaboration
- Benchmarking and Evaluation for Any-to-Any Multimodal Learning
Other topics include, but are not limited to:
- Unified multimodal foundation and agentic models.
- Representation learning for embodied and interactive systems.
- Integration of underexplored modalities and cognitive perspectives on multimodal perception and reasoning.
Call for Papers
Important Dates
The important dates for the review process are as follows:
| Date | Milestone |
|---|---|
| Mar 07, 2026 AOE | Archival Paper Submission Deadline |
| Apr 03, 2026 AOE | Non-archival Paper Submission Deadline |
| Apr 18, 2026 | All Workshop Paper Notification Date |
| Apr 24, 2026 | Archival/Non-archival Paper Camera-ready |
Paper Submission and Acceptance
We welcome technical, position, or perspective papers related to the topics outlined above. All submissions must be written in English, follow the official CVPR proceedings format, and adhere to the double-blind review policy.
- Archival/Proceedings Track: Submitted papers must be 8 pages, formatted using the CVPR 2026 author guidelines, and will appear in the official CVPR workshop proceedings. If an 8-page submission is not accepted, authors may opt in to be considered for the Non-archival Track instead; to do so, submit only to the Archival Track. The deadline for these submissions is March 07, 2026, 11:59 PM AOE.
- Non-archival Track: Papers submitted under this track can be either short (2–4 pages) or regular (up to 8 pages), including figures and tables. These will not be published with the main conference proceedings. The deadline for these submissions is April 03, 2026, 11:59 PM AOE.
Best Paper Awards will be presented based on reviewer scores and the workshop committee’s evaluation.
All accepted papers will be presented as posters during the workshop. Poster sessions will be conducted onsite with dedicated time for interactive discussions.
Invited Speakers
Tentative Schedule
Date and Location: June 4th, 2026, Room 502
| Time | Schedule | Speaker |
|---|---|---|
| 7:55 - 8:00 | Introduction and Opening Remarks | - |
| 8:00 - 8:30 | Keynote Talk 1 | Zhedong Zheng |
| | Title: When AI Thinks Like Humans: Cognitive Biases and Uncertainty Awareness | |
| | Abstract: In this talk, we explore how large multimodal models exhibit human-like cognitive biases, from the Stroop effect and other-race recognition gaps to "split-brain" functional asymmetries, revealing that AI not only learns patterns but also inherits psychological blind spots. We then introduce an uncertainty-aware framework ("Know We Don't Know") that leverages semantic entropy and multi-modal perturbations to detect and mitigate hallucinations, paving the way for safer, more reliable AI systems. | |
| 8:30 - 9:00 | Keynote Talk 2 | Saining Xie (NYU) |
| 9:00 - 9:30 | Keynote Talk 3 | Georgia Gkioxari (Caltech) |
| | Title: Beyond Image and Language: Building 3D Perception Systems | |
| | Abstract: Vision-language models have transformed visual recognition, but building perception systems that understand and manipulate the 3D world requires moving beyond images, text, and object categories. In this talk, I will discuss recent work on 3D perception systems for reconstruction and editing, focusing on SAM 3D and Steer3D. These systems highlight both the promise and the challenges of learning from 3D data: unlike images and text, 3D data is difficult to obtain at scale, expensive to curate, and often limited in diversity. This motivates the need for efficient training frameworks that can make the most of available data while overcoming its limitations. I will discuss lessons on how to better model and sharpen complex 3D data distributions, and how to leverage pre-trained representations to improve generalization. Together, these works point toward a broader goal: building 3D perception systems that can reconstruct, edit, and reason about the 3D world. | |
| 9:30 - 10:00 | Keynote Talk 4 | Mohit Bansal (UNC Chapel Hill) |
| 10:00 - 10:30 | Poster Session (Interactive) + Virtual Gallery + Coffee | - |
| 10:30 - 11:00 | Keynote Talk 5 | Yossi Gandelsman (TTIC) |
| 11:00 - 11:30 | Keynote Talk 6 | Manling Li (Northwestern) |
| 11:30 - 12:00 | Keynote Talk 7 | Paul Liang (MIT) |
| 12:00 - 12:20 | Oral Presentations | - |
| 12:20 - 12:35 | Keynote Talk 8 | Sponsor (Philo Labs) |
| | Title: Agent and World, in One Model | |
| | Abstract: Most of the field treats agent and world model research as two programs. We don't. A world model is an agent whose policy you called dynamics. An agent is a world model you queried for actions. The cut between them is something you draw. We'll cover three things: 1) How we're improving VLMs' agentic capabilities in video AI. 2) How we're improving video gen models as environments. 3) What happens at the seam, and why we think this is also the right framing for safety. | |
| 12:35 - 12:45 | Closing Remarks + Best Paper Award | TBD |
Organization
Steering Committee
Executive Committee
Sponsors
We are grateful for the generous support from our sponsors, which enables us to host this workshop and foster a vibrant research community around any-to-any multimodal learning.