In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence.
The proposed tasks require models to integrate information from an image I and a text passage T to perform reasoning, such that neither modality alone is sufficient for correct inference: the fusion of visual and textual context is essential for deriving accurate and consistent conclusions.
Task-I: Truth Evaluation (True/False/Unknown) Question. Given an image I, a text passage T, and an argument A, the model must determine the truth value of the argument based on the combined information from I and T. Specifically, the model outputs the truth value Truth(A) ∈ {True, False, Unknown} and generates a sequence of reasoning steps R = {R1, R2, ..., Rn}, where each Ri represents an individual step that contributes to the final decision. Formally, the input is a triplet (I, T, A), and the output consists of Truth(A) and R.
Task-II: Multiple Choice Question. Given an image I, a text passage T, and candidate arguments {A1, A2, A3, A4}, the model must select the argument that best matches the image and text, denoted as BestArgument(I, T) ∈ {A1, A2, A3, A4}. Additionally, the model must provide detailed reasoning steps R = {R1, R2, ..., Rn}, where each Ri details a step in the reasoning process. Formally, the input is a triplet (I, T, {A1, A2, A3, A4}), and the output consists of BestArgument(I, T) and R.
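The two task formats above can be summarized as a minimal data schema with exact-match scorers. This is an illustrative sketch only: the field names and scoring conventions are our assumptions, not the released dataset's format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TruthEvalInstance:
    """Task-I: input triplet (I, T, A) with a gold three-way label."""
    image_path: str      # image I
    passage: str         # text passage T
    argument: str        # argument A
    label: str           # gold Truth(A) in {"True", "False", "Unknown"}

@dataclass
class MultiChoiceInstance:
    """Task-II: input triplet (I, T, {A1..A4}) with a gold candidate index."""
    image_path: str
    passage: str
    candidates: List[str]   # candidate arguments {A1, A2, A3, A4}
    gold_index: int         # index of BestArgument(I, T)

@dataclass
class ModelOutput:
    """Prediction plus the accompanying reasoning trace R = {R1, ..., Rn}."""
    answer: str                                        # truth value or chosen argument
    reasoning_steps: List[str] = field(default_factory=list)

def score_truth_eval(pred: ModelOutput, inst: TruthEvalInstance) -> bool:
    """Exact match on the three-way truth value."""
    return pred.answer == inst.label

def score_multi_choice(pred: ModelOutput, inst: MultiChoiceInstance) -> bool:
    """Exact match on the selected candidate argument."""
    return pred.answer == inst.candidates[inst.gold_index]
```

Under this schema, the reasoning trace R is carried alongside the answer so that step-level evaluation can be layered on top of the answer-level exact match.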
We collect images from sources including COCO, Flickr30k, nocaps, MIMIC, RVL_CDIP, ScienceQA, and manually collected Traffic Reports. Visual details for each image are extracted with GPT-4o, yielding diverse and fine-grained descriptions. We carefully select non-trivial logical inference rules, such as Modus Ponens and Hypothetical Syllogism, drawn from propositional logic (PL), first-order logic (FOL), and non-monotonic logic (NM), and compose them into meaningful but abstract reasoning chains. These abstract chains are grounded in real-world contexts using the extracted visual features and relevant text retrieved from sources such as healthcare, traffic reports, and Wikipedia. Questions and answers are then generated from the instantiated reasoning chains via rule-based substitution.
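The rule-based substitution step can be sketched as follows: an abstract inference rule (here Modus Ponens, P, P→Q ⊢ Q) is grounded by substituting its placeholder symbols with visual and textual facts. The rule template and the example bindings below are illustrative assumptions, not drawn from the released dataset.

```python
# Abstract Modus Ponens template with placeholder symbols P and Q.
MODUS_PONENS = {
    "premises": ["{P}", "If {P}, then {Q}."],
    "conclusion": "{Q}",
}

def instantiate(rule: dict, bindings: dict) -> dict:
    """Ground an abstract rule by replacing each placeholder symbol
    with a concrete statement from the visual/textual context."""
    def fill(template: str) -> str:
        return template.format(**bindings)
    return {
        "premises": [fill(p) for p in rule["premises"]],
        "conclusion": fill(rule["conclusion"]),
    }

# Hypothetical grounding: P comes from an extracted visual detail,
# Q from retrieved traffic-report text.
grounded = instantiate(MODUS_PONENS, {
    "P": "the traffic light in the image is red",
    "Q": "vehicles must stop at the intersection",
})
```

Chaining several such instantiated rules, with the conclusion of one serving as a premise of the next, yields the multi-step reasoning chains described above.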
To ensure the quality and relevance of the dataset, both automatic and manual quality control procedures are employed. Automatic checks include assessing lexical similarity and commonsense plausibility, while human annotators verify the accuracy of visual details and the real-world relevance of the generated context. Instances that fail these checks are filtered out, ensuring a high-quality, logically sound, and contextually relevant dataset.
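One of the automatic checks can be sketched as a lexical-similarity filter that rejects generated contexts copying their source text near-verbatim. This is a minimal illustration using difflib's sequence matcher; the threshold value is an assumption, not the one used in the actual pipeline.

```python
import difflib

def lexical_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two strings (case-insensitive)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def passes_lexical_check(generated: str, source: str, max_sim: float = 0.9) -> bool:
    """Filter out instances whose generated context is a near-duplicate
    of the retrieved source text; max_sim is an illustrative threshold."""
    return lexical_similarity(generated, source) < max_sim
```

Instances failing this check, or flagged by the commonsense-plausibility and human verification steps, would be removed before inclusion in the dataset.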
Figure: Data construction pipeline and quality control overview (placeholder).