Survey 1
Any-to-Any Multimodal Learning aims to build a unified intelligent system that can understand, reason, and generate across arbitrary combinations of modalities. Unlike traditional multimodal models, which are typically limited to fixed input-output modes such as text-to-image or image-to-text, the Any-to-Any framework emphasizes flexible modeling in scenarios with multiple inputs, multiple outputs, and interleaved modalities. The core challenges lie in designing a unified architecture for collaborative multimodal modeling, aligning semantics and controlling information flow across complex interleaved structures, and evaluating a model's generalization and scalability in open, real-world environments. Progress in this direction therefore requires not only innovations in model architecture, but also systematic work on data organization, task design paradigms, and evaluation frameworks.

This page is a central hub for our representative models, benchmarks, workshops, and surveys.
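As a concrete illustration of the interleaved-modality setting described above, the following is a minimal, hypothetical Python sketch (not taken from any specific model or paper; the names Segment, InterleavedSequence, and Modality are assumptions for illustration) of how a single any-to-any request and response might each mix several modalities in order:

```python
# Hypothetical sketch: representing an interleaved multimodal sequence where
# each segment carries a modality tag, so one exchange can mix text, image,
# and audio on both the input and the output side.
from dataclasses import dataclass, field
from typing import List, Literal, Union

Modality = Literal["text", "image", "audio", "video"]

@dataclass
class Segment:
    """One contiguous span of a single modality within an interleaved sequence."""
    modality: Modality
    payload: Union[str, bytes]  # raw text, or encoded media bytes

@dataclass
class InterleavedSequence:
    """An ordered mix of modalities, usable for both inputs and outputs."""
    segments: List[Segment] = field(default_factory=list)

    def modalities(self) -> List[Modality]:
        return [s.modality for s in self.segments]

# Example: an interleaved prompt asking for an image caption plus a spoken
# summary -- multiple input and output modalities in a single exchange.
prompt = InterleavedSequence([
    Segment("text", "Describe this photo, then read the description aloud:"),
    Segment("image", b"<jpeg bytes>"),
])
expected_output = InterleavedSequence([
    Segment("text", "A dog catching a frisbee on a beach."),
    Segment("audio", b"<waveform bytes>"),
])
print(prompt.modalities(), "->", expected_output.modalities())
```

The point of the sketch is only to show that, unlike fixed text-to-image or image-to-text pipelines, the input and output types are not baked into the interface: any ordered combination of modality-tagged segments is a valid sequence.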