A2A-MML 2026

About the Workshop

The recent surge of multimodal large models has brought unprecedented progress in connecting language, vision, audio, and beyond. Yet, despite their impressive performance, most existing systems remain constrained to fixed modality pairs, lacking the flexibility to generalize or reason across arbitrary modality combinations. The first edition of the Any-to-Any Multimodal Learning (A2A-MML) workshop aims to explore the next frontier of multimodal intelligence: developing systems capable of understanding, aligning, transforming, and generating across any set of modalities. We organize the discussion around three foundational pillars: multimodal representation learning, multimodal transformation, and multimodal synergistic collaboration.

For the latest papers and datasets, please refer to Awesome-Any-to-Any-Generation. This repository is regularly updated and provides valuable resources for your submission.

Topics and Themes

We welcome all relevant submissions in the area of multimodal learning, with emphasis on any-to-any multimodal intelligence, such as:

Multimodal Representation Learning
Multimodal Transformation
Multimodal Synergistic Collaboration
Benchmarking and Evaluation for Any-to-Any Multimodal Learning

Other topics include, but are not limited to:

Unified multimodal foundation and agentic models.
Representation learning for embodied and interactive systems.
Integration of underexplored modalities and cognitive perspectives on multimodal perception and reasoning.

Invited Speakers

Tentative Schedule

Date and location: June 4th, 2026, Room 502. All posters will be displayed in Hall A.

Time	Schedule	Speaker
7:55 - 8:00	Introduction and Opening Remarks	-
8:00 - 8:30	Keynote Talk 1	Zhedong Zheng (University of Macau)
Title: When AI Thinks Like Humans: Cognitive Biases and Uncertainty Awareness
Abstract: In this talk, we explore how large multimodal models exhibit human-like cognitive biases, from the Stroop effect and other-race recognition gaps to "split-brain" functional asymmetries, revealing that AI not only learns patterns but also inherits psychological blind spots. We then introduce an uncertainty-aware framework ("Know We Don't Know") that leverages semantic entropy and multi-modal perturbations to detect and mitigate hallucinations, paving the way for safer, more reliable AI systems.
8:30 - 9:00	Keynote Talk 2	Saining Xie (NYU)
Title: Representation Space is the New Generative Interface
Abstract: In this talk, I'll introduce a line of work including DiT, REPA, RAE, ScaleRAE, and the latest RAEv2, and discuss how representation spaces are becoming the new interface for generative models. Working in representation space leads to better data efficiency, stronger performance, and opens up new possibilities for bringing different modalities together under a unified framework. I'll share the key ideas behind these works and why representation learning is becoming increasingly central to the future of generative AI.
9:00 - 9:30	Keynote Talk 3	Georgia Gkioxari (Caltech)
Title: Beyond Image and Language: Building 3D Perception Systems
Abstract: Vision-language models have transformed visual recognition, but building perception systems that understand and manipulate the 3D world requires moving beyond images, text, and object categories. In this talk, I will discuss recent work on 3D perception systems for reconstruction and editing, focusing on SAM 3D and Steer3D. These systems highlight both the promise and the challenges of learning from 3D data: unlike images and text, 3D data is difficult to obtain at scale, expensive to curate, and often limited in diversity. This motivates the need for efficient training frameworks that can make the most of available data while overcoming its limitations. I will discuss lessons on how to better model and sharpen complex 3D data distributions, and how to leverage pre-trained representations to improve generalization. Together, these works point toward a broader goal: building 3D perception systems that can reconstruct, edit, and reason about the 3D world.
9:30 - 10:00	Keynote Talk 4	Mohit Bansal (UNC Chapel Hill)
Title: Multimodal Unification, Communication, and Composable Generalization
10:00 - 10:30	Poster Session (Interactive) + Virtual Gallery + Coffee
10:30 - 11:00	Keynote Talk 5	Yossi Gandelsman (TTIC)
Title: A neuro-analysis of vision and language models
11:00 - 11:30	Keynote Talk 6	Manling Li (Northwestern)
Title: Any-View to Any-View: Learning Spatial Intelligence in Multimodal Models
Abstract: A multimodal model sees one view at a time, yet there is only one consistent world behind those views. Spatial intelligence, then, is not handling any single image well, but holding one space steady as the viewpoint changes. This talk frames spatial representation around multiple views, the ability to travel from any view to any view, and considers what it takes to learn. A model can be asked to explore a scene and predict unseen viewpoints from seen ones, to actively construct a coherent spatial belief rather than memorize individual frames, and to recover a complete mental map from only a handful of observations. Together, we argue that a good spatial representation is one not bound to the view it was built from, and that this consistency is a foundation for spatial intelligence in multimodal models.
11:30 - 12:00	Keynote Talk 7	Paul Liang (MIT)
Title: Expanding AI's Senses: Touch, Smell, and Beyond
Abstract: While recent AI advances have been driven by vision and language, humans experience the world through many more senses. In this talk, I will present our recent efforts to bring the senses of touch and smell into AI systems. I will introduce OpenTouch, an egocentric vision-tactile dataset that captures how humans physically interact with the world through touch, and SmellNet, a large-scale dataset for machine olfaction that enables models to recognize and reason about odors. Extending smell perception to generation, we developed AromaGen, an AI-powered wearable system that transmits aromas given photos and descriptions of common foods, demonstrating a new medium for human interaction beyond the digital screen. Together, these projects illustrate emerging opportunities for multisensory AI that can perceive, reason, and generate rich sensory experiences, paving the way toward more embodied and human-centered intelligence.
12:00 - 12:20	Oral Presentations	-
12:00 - 12:10	SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models	Sofian Chaybouti
12:10 - 12:20	UniCanvas: A Diffusion-based Unified Model with Text-in-Image Joint Generation	Zeyuan Yang
12:20 - 12:35	Keynote Talk 8	Sponsor (Philo Labs)
Title: Agent and World, in One Model
Abstract: Most of the field treats agent and world model research as two programs. We don't. A world model is an agent whose policy you called dynamics. An agent is a world model you queried for actions. The cut between them is something you draw. We'll cover three things: 1) How we're improving VLMs' agentic capabilities in video AI. 2) How we're improving video gen models as environments. 3) What happens at the seam, and why we think this is also the right framing for safety.
12:35 - 12:45	Closing Remarks + Best Paper Award	TBD

Call for Papers

Important Dates

The important dates for the review process are as follows:

~~Mar 07, 2026 AOE~~

Archival Paper Submission Deadline

~~Apr 03, 2026 AOE~~

Non-archival Paper Submission Deadline

~~Apr 18, 2026~~

All Workshop Paper Notification Date

~~Apr 24, 2026~~

Archival/Non-archival Paper Camera-ready Deadline

Paper Submission and Acceptance

We welcome technical, position, or perspective papers related to the topics outlined above. All submissions must be written in English, follow the official CVPR proceedings format, and adhere to the double-blind review policy.

Archival/Proceedings Track: Submitted papers must be 8 pages, formatted using the CVPR 2026 author guidelines, and will appear in the official CVPR workshop proceedings. If an 8-page submission is not accepted, authors may opt in to be considered for the non-archival track instead; to do so, submit only to the Archival track. The deadline for these submissions is March 07, 2026, 11:59 PM AOE.
Non-archival Track: Papers submitted under this track can be either short (2–4 pages) or regular (up to 8 pages), including figures and tables. These will not be published with the main conference proceedings. The deadline for these submissions is April 03, 2026, 11:59 PM AOE.

Best Paper Awards will be presented based on reviewer scores and the workshop committee’s evaluation.

All accepted papers will be presented as posters during the workshop. Poster sessions will be conducted onsite with dedicated time for interactive discussions.

Workshop On Any-to-Any Multimodal Learning

Organization

Steering Committee

Executive Committee

Sponsors