Vision-Language Models Create
Cross-Modal Task Representations

Grace Luo, Trevor Darrell, Amir Bar
UC Berkeley

ICML 2025

TL;DR: We find that conceptually equivalent inputs are mapped to a shared task representation, regardless of modality.
Previous Title: Task Vectors are Cross-Modal




Abstract

Autoregressive vision-language models (VLMs) can handle many tasks within a single model, yet the representations that enable this capability remain opaque. We find that VLMs align conceptually equivalent inputs into a shared task vector, which is invariant to modality (text, image) and format (examples, instruction), and may simplify VLM processing. We measure this alignment via cross-modal transfer--the ability of a task vector derived in one modality to trigger the correct generation in another--on a range of tasks and model architectures. Although the task vector is highly compressed, we find that this single vector outperforms prompting the model with the full task information, a benefit unique to the cross-modal case. Furthermore, we show that task vectors can be transferred from a base language model to its fine-tuned vision-language counterpart, and that they can be derived solely from instructions without the need for examples. Taken together, our findings shed light on how VLMs internally process task information, and how they map different modalities into common semantic representations.


Cross-Modal Patching



Depending on the input instruction or in-context examples, VLMs can dynamically adjust the task executed on the image. This flexibility poses a challenge: each task can be defined in a different way, and memorizing every possible variation is impractical. There needs to exist some form of compression, or representation sharing, to manage this complexity.

We provide evidence of such a shared task representation in VLMs. We study a special token position at the end of the sequence, also known as the task vector (Hendel et al., 2023; Todd et al., 2024). We measure representational alignment via cross-modal patching, where we extract the task vector in one modality (e.g., text examples) and then inject it into another modality (e.g., image queries). Patching induces the model to generate a different, task-specific output, and we find it is effective even across modalities.
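To make the procedure concrete, below is a minimal sketch of cross-modal patching in PyTorch. It assumes a decoder-only VLM whose transformer blocks are exposed as model.layers and that accepts input_ids / generate in the usual Hugging Face style; the layer index, tokenization, and image preprocessing are placeholders rather than the paper's exact implementation.

    import torch

    PATCH_LAYER = 15  # a middle layer; in practice the layer is chosen on held-out data

    @torch.no_grad()
    def get_task_vector(model, example_ids):
        """Run ICL examples through the model and keep the final token's hidden state."""
        captured = {}

        def read_hook(_module, _inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            captured["vec"] = hidden[:, -1, :].clone()  # last token position

        handle = model.layers[PATCH_LAYER].register_forward_hook(read_hook)
        model(input_ids=example_ids)
        handle.remove()
        return captured["vec"]

    @torch.no_grad()
    def patch_and_generate(model, query_inputs, task_vector, max_new_tokens=10):
        """Run an image query, overwriting the prompt's final activation with the task vector."""

        def write_hook(_module, _inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            if hidden.shape[1] > 1:  # prefill pass only; skip single-token decode steps
                hidden[:, -1, :] = task_vector  # in-place edit propagates downstream

        handle = model.layers[PATCH_LAYER].register_forward_hook(write_hook)
        output_ids = model.generate(**query_inputs, max_new_tokens=max_new_tokens)
        handle.remove()
        return output_ids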


Cross-Modal Tasks


The capital city of the country: Greece → Athens
The last word of the official currency of the country: Italy → Euro
The scientific name of the animal’s species in Latin: Gray Wolf → Canis lupus
The term for the baby of the animal: Common Dolphin → calf
The color of the food: Persimmon → orange
The flavor descriptor of the food: Strawberry → sweet

We design six tasks inspired by the text ICL examples proposed in prior work, and add alternative specifications such as instructions and image examples.
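As an illustration, a single task could be laid out with all of its alternative specifications as follows; the field names and image paths are hypothetical, not the dataset's actual format.

    COUNTRY_CAPITAL_TASK = {
        "instruction": "Output the capital city of the country.",
        "text_examples": [("Greece", "Athens"), ("Italy", "Rome"), ("Japan", "Tokyo")],
        "image_examples": [("images/greece.jpg", "Athens"), ("images/italy.jpg", "Rome")],
        "image_queries": [("images/japan.jpg", "Tokyo")],
    }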


Transferring from Text Examples to Image Queries


We compare the task accuracy of few-shot prompting (Prompt) and patching (Patch). Specifically, we take image queries and evaluate the rate at which each method triggers the model to produce the correct task-specific output. In the cross-modal setting, when the task is specified with text examples, prompting struggles whereas patching is highly effective (Text Examples Prompt vs. Patch). In fact, text examples can be just as useful as image examples for inducing the task (Text vs. Image Examples).
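Schematically, the comparison looks like the sketch below, which reuses get_task_vector and patch_and_generate from above; tokenize_examples, build_fewshot_prompt, build_query_prompt, and decode are hypothetical helpers standing in for model-specific prompt construction.

    def evaluate(model, text_examples, image_queries, mode="patch"):
        """Exact-match accuracy of few-shot prompting vs. patching on image queries."""
        if mode == "patch":
            task_vector = get_task_vector(model, tokenize_examples(text_examples))
        correct = 0
        for query in image_queries:
            if mode == "prompt":
                # cross-modal few-shot prompting: text examples prepended to the image query
                inputs = build_fewshot_prompt(text_examples, query.image)
                output_ids = model.generate(**inputs, max_new_tokens=10)
            else:
                # cross-modal patching: unlabeled image query plus the injected task vector
                inputs = build_query_prompt(query.image)
                output_ids = patch_and_generate(model, inputs, task_vector)
            correct += int(decode(output_ids).strip() == query.answer)
        return correct / len(image_queries)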

Finding 1. VLMs struggle with cross-modal few-shot prompting, which is fixed by cross-modal patching.

Transferring from LLMs to VLMs


Since many VLMs are initialized from a pre-trained LLM, we explore the extent to which the task representations are preserved after fine-tuning. We find that, when given the same text examples, the VLM and its corresponding LLM produce task vectors with high cosine similarity. This motivates us to perform inter-model patching, where we apply task vectors derived from text examples in the LLM to image queries in the VLM.
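A sketch of inter-model patching, reusing the helpers above and assuming llm and vlm share tokenization and layer indexing (which holds when the VLM is fine-tuned from that LLM); example_ids and image_query_inputs are assumed to be prepared elsewhere.

    import torch.nn.functional as F

    # same text examples through the base LLM and its fine-tuned VLM counterpart
    vec_llm = get_task_vector(llm, example_ids)
    vec_vlm = get_task_vector(vlm, example_ids)

    # high cosine similarity suggests fine-tuning largely preserves the task representation
    similarity = F.cosine_similarity(vec_llm, vec_vlm, dim=-1)

    # inter-model patching: inject the LLM-derived vector into the VLM's image-query pass
    output_ids = patch_and_generate(vlm, image_query_inputs, vec_llm)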

Finding 2. Task vectors can be patched from the base LLM to its corresponding VLM.

Deriving Task Vectors From Instructions



We also demonstrate that task vectors can be defined not only with examples but also with instructions. To achieve this, we feed the instruction to the model and apply the same cross-modal patching procedure. These instruction-based task vectors can also be ensembled with example-based ones to improve sample efficiency.
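A sketch of deriving and ensembling instruction-based task vectors, again reusing the helpers above; averaging the two vectors is shown as one simple ensembling choice, not necessarily the exact procedure used in the paper.

    # a task vector from an instruction alone (no labeled examples)
    instruction_ids = tokenize("Output the capital city of the country shown in the image.")
    vec_instruction = get_task_vector(vlm, instruction_ids)

    # a task vector from a handful of text examples
    vec_examples = get_task_vector(vlm, tokenize_examples(text_examples))

    # a simple ensemble: average the two vectors, then patch as usual
    vec_ensemble = (vec_instruction + vec_examples) / 2
    output_ids = patch_and_generate(vlm, image_query_inputs, vec_ensemble)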

Finding 3. Beyond examples, task vectors can also be defined with instructions, which are more concise.

Representation Clustering Across Modalities

Finally, we visualize the representations in 2D using t-SNE (van der Maaten & Hinton, 2008), at the same middle layer. We separate the visualization by token position: representations derived from the image and text embeddings in the preceding context (Context Embeddings) vs. the final token (Task Vector). Evidently, the task vector summarizes the noisy context such that it clusters by task (color) and not modality (shape).
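A minimal sketch of such a visualization, assuming the middle-layer representations have already been collected into arrays (reps, task_ids, and modalities are hypothetical variable names):

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # reps: (N, d) middle-layer hidden states; task_ids: (N,) ints;
    # modalities: (N,) numpy array of strings in {"text", "image", "instruction"}
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(reps)

    for modality, marker in [("text", "o"), ("image", "^"), ("instruction", "s")]:
        mask = modalities == modality
        plt.scatter(coords[mask, 0], coords[mask, 1], c=task_ids[mask],
                    marker=marker, cmap="tab10", s=12, label=modality)
    plt.legend()
    plt.show()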

Finding 4. VLMs map text and image examples, as well as instructions, to similar task vectors, despite differences between the image or text embeddings in the context.

Relevant Readings

If you like this work, these other projects might also interest you.

Acknowledgements

We would like to thank Jiahai Feng, Stephanie Fu, Alexander Pan, Alberto Hojel, Lisa Dunlap, Chung Min Kim, Brent Yi, Candace Ross, and Koustuv Sinha for helpful discussions and feedback on the paper.

BibTeX


    @inproceedings{luo2025vlm,
      title={Vision-Language Models Create Cross-Modal Task Representations}, 
      author={Grace Luo and Trevor Darrell and Amir Bar},
      booktitle={ICML},
      year={2025}
    }