Modern methods for fine-tuning Vision Transformers, such as Low-Rank Adaptation (LoRA) and its variants, demonstrate impressive performance. However, these methods ignore the high-dimensional nature of Multi-Head Attention (MHA) weight tensors. To address this limitation, we propose Canonical Rank Adaptation (CaRA). CaRA leverages tensor mathematics: first, it tensorises the transformer into two different tensors, one for the projection layers in MHA and the other for the feed-forward layers; second, the tensorised formulation is fine-tuned with a low-rank adaptation in the Canonical-Polyadic Decomposition (CPD) form. Employing CaRA substantially reduces the number of trainable parameters. Experimentally, CaRA outperforms existing Parameter-Efficient Fine-Tuning (PEFT) methods on visual classification benchmarks such as the Visual Task Adaptation Benchmark (VTAB)-1k and the Fine-Grained Visual Categorization (FGVC) benchmark.
While LoRA exhibits significant advantages over full fine-tuning, previous works have demonstrated that tensor-based fine-tuning is highly efficient. Moreover, given the high-dimensional nature of MHA, utilising tensor representations, especially the CPD form, for the low-rank updates provides a compact and expressive approach while requiring a smaller parameter count. In this paper, we present a novel tensorisation of the ViT, followed by the CaRA representation of the low-rank updates.
Tensorisation:
Our tensorisation approach involves creating two tensors: one for the MHA projection layers and one for the feed-forward layers. This allows us to explicitly represent the low-rank update for the MHA projection layers in higher dimensions. First, we stack the query, key, and value projection matrices, i.e., \(W^Q \in \mathbb{R}^{d_{\text{model}}\times d_k}\), \(W^K \in \mathbb{R}^{d_{\text{model}}\times d_k}\), and \(W^V \in \mathbb{R}^{d_{\text{model}}\times d_v}\), of an individual head \(i\) in a block, resulting in
$$E_i = [W_i^Q, W_i^K, W_i^V] \in \mathbb{R}^{3\times d_{\text{model}}\times d_h},$$
where \(d_h = d_k = d_v\) and the enclosing square brackets denote the stacking operation. Next, we stack the resulting tensors \(E_i\) across all \(h\) heads of a specific transformer block \(j\):
$$L_j = [E_1, E_2, \ldots, E_h] \in \mathbb{R}^{3\times d_{\text{model}}\times h\times d_h}.$$
Finally, we stack these tensors across all \(l\) transformer blocks, resulting in
$$W^{\text{mha}} = [L_1, L_2, \ldots, L_l] \in \mathbb{R}^{3\times l\times d_{\text{model}}\times h\times d_h}.$$
We observe that merging the modes of size \(3\) and \(l\) performs better than the five-dimensional representation above (see Section 5.2 of the paper). Hence, we represent \(W^{\text{mha}}\) as a four-dimensional tensor in \(\mathbb{R}^{3l\times d_{\text{model}}\times h\times d_h}\).
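To make the MHA tensorisation concrete, here is a minimal PyTorch sketch with illustrative ViT-B/16-like dimensions and hypothetical variable names (e.g., `blocks`); none of this is taken from the authors' code:

```python
import torch

# Illustrative ViT-B/16-like dimensions (assumed, not from the paper's code).
l, h, d_model = 12, 12, 768
d_h = d_model // h  # d_h = d_k = d_v = 64

# Hypothetical per-block projection weights W^Q, W^K, W^V of shape
# (d_model, d_model), viewed per head as (d_model, h, d_h).
blocks = [
    {name: torch.randn(d_model, d_model) for name in ("q", "k", "v")}
    for _ in range(l)
]

L_blocks = []
for blk in blocks:
    # L_j: Q/K/V stacked and split over heads -> shape (3, d_model, h, d_h).
    per_type = [blk[name].reshape(d_model, h, d_h) for name in ("q", "k", "v")]
    L_blocks.append(torch.stack(per_type, dim=0))

# Stack over blocks -> (3, l, d_model, h, d_h), then merge the first two
# modes as described above -> (3l, d_model, h, d_h).
W_mha = torch.stack(L_blocks, dim=1).reshape(3 * l, d_model, h, d_h)
print(W_mha.shape)  # torch.Size([36, 768, 12, 64])
```

The final `reshape` merges the stacking axis of size 3 with the block axis \(l\), mirroring the four-dimensional representation described above.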
Similarly, we stack the output projection \(W^O\) and the feed-forward weights \(W^{\text{up}}\) and \(W^{\text{down}}\) across all \(l\) transformer blocks, resulting in a tensor \(W^{\text{ffn}} \in \mathbb{R}^{9l\times d_{\text{model}}\times d_{\text{model}}}\).
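The leading mode of size \(9l\) is consistent with splitting the \(d_{\text{model}}\times 4d_{\text{model}}\) up-projection and the \(4d_{\text{model}}\times d_{\text{model}}\) down-projection into four \(d_{\text{model}}\times d_{\text{model}}\) slices each, plus \(W^O\), giving \(1+4+4=9\) slices per block. The sketch below illustrates this reading; the splitting scheme is our assumption, not something stated above:

```python
import torch

# Assumed dimensions; the shape 9l x d_model x d_model matches splitting the
# 4*d_model MLP dimension into four d_model-sized slices (our reading):
# 1 (W^O) + 4 (W^up slices) + 4 (W^down slices) = 9 slices per block.
l, d_model = 12, 768

ffn_slices = []
for _ in range(l):
    W_O = torch.randn(d_model, d_model)         # attention output projection
    W_up = torch.randn(d_model, 4 * d_model)    # MLP up-projection
    W_down = torch.randn(4 * d_model, d_model)  # MLP down-projection
    ffn_slices.append(W_O)
    ffn_slices.extend(W_up.split(d_model, dim=1))    # four (d_model, d_model) slices
    ffn_slices.extend(W_down.split(d_model, dim=0))  # four (d_model, d_model) slices

W_ffn = torch.stack(ffn_slices, dim=0)
print(W_ffn.shape)  # torch.Size([108, 768, 768]) = (9l, d_model, d_model)
```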
Low-Rank Update:
With the tensorised network in place, we propose a novel low-rank update represented in the CPD format for fine-tuning the ViT. Using the CPD offers benefits in terms of parameter efficiency and stable performance. The low-rank updates for \(W^{\text{mha}}\) and \(W^{\text{ffn}}\) are given as
$$ \delta W^{\text{mha}} = \{ \lambda^A; A^{(1)}, A^{(2)}, A^{(3)}, A^{(4)}\} = \sum_{r=1}^R \lambda_r^A a_r^{(1)} \circ a_r^{(2)} \circ a_r^{(3)} \circ a_r^{(4)},$$
$$ \delta W^{\text{ffn}} = \{ \lambda^B; B^{(1)}, B^{(2)}, B^{(3)}\} = \sum_{r=1}^R \lambda_r^B b_r^{(1)} \circ b_r^{(2)} \circ b_r^{(3)}.$$
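As a minimal illustration of the CPD form (a sketch with assumed mode sizes and variable names, not the authors' implementation), the rank-\(R\) sum of outer products for \(\delta W^{\text{mha}}\) can be materialised with a single einsum, and the number of trainable parameters grows only linearly in \(R\):

```python
import torch

# Assumed mode sizes for the merged MHA tensor (3l, d_model, h, d_h) with a
# ViT-B/16-like backbone; R is the CP rank (both chosen for illustration).
n1, n2, n3, n4 = 36, 768, 12, 64
R = 16

# CP factors: one factor matrix per mode plus the weight vector lambda.
lam = torch.randn(R)
A1, A2, A3, A4 = (torch.randn(n, R) for n in (n1, n2, n3, n4))

# delta W^mha = sum_r lambda_r * a_r^(1) o a_r^(2) o a_r^(3) o a_r^(4)
delta_W_mha = torch.einsum("r,ir,jr,kr,lr->ijkl", lam, A1, A2, A3, A4)
print(delta_W_mha.shape)  # torch.Size([36, 768, 12, 64])

# Trainable parameters: four factor matrices plus lambda, linear in R.
print(R * (n1 + n2 + n3 + n4) + R)  # 14096 for R = 16
```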
Figure 1 depicts our weight update for one of the query projections in a specific transformer block. Additionally, the table below presents a comparative summary of various tensor-based low-rank updates.
VTAB-1k evaluation results with the ViT-B/16 backbone on a wide range of 19 datasets. We average the number of parameters over the group-wise mean values and report both the group-wise mean and the overall mean accuracy.
Evaluation results of the ImageNet-21k pretrained ViT-B/16 on the FGVC benchmark.
Evaluation results of the ImageNet-21k pretrained ViT-L on a series of image classification datasets.
CaRA's robustness to varying rank in terms of the number of trainable parameters (left) and accuracy (right), compared to other tensor-based fine-tuning methods. The number of parameters for the Tucker (FacT-TK) and Tensor-Train (FacT-TT) representations grows faster than for CaRA because their low-rank update formulations contain higher-order tensors, whereas the CaRA representation contains only matrices. This rapid parameter growth negatively impacts the performance of the Tensor-Train and Tucker representations, while CaRA's performance slightly increases with rank.
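To illustrate why the parameter counts diverge, here is a rough back-of-the-envelope sketch (our own illustration; the exact FacT-TK and FacT-TT parametrisations may differ): CP factors cost \(R(n_1+\dots+n_4)+R\) parameters, a Tucker decomposition additionally carries an \(R^4\) core, and a Tensor-Train decomposition carries ranks on its internal cores.

```python
# Back-of-the-envelope parameter counts for an order-4 update tensor
# (illustrative only; actual FacT-TK / FacT-TT parametrisations may differ).
modes = (36, 768, 12, 64)

def cp_params(R, modes):
    # One (n_k x R) factor matrix per mode plus the weight vector lambda.
    return R * sum(modes) + R

def tucker_params(R, modes):
    # Factor matrices plus an R x R x R x R core tensor.
    return R * sum(modes) + R ** len(modes)

def tt_params(R, modes):
    # TT cores of shape (r_{k-1}, n_k, r_k) with boundary ranks fixed to 1.
    ranks = [1] + [R] * (len(modes) - 1) + [1]
    return sum(ranks[k] * modes[k] * ranks[k + 1] for k in range(len(modes)))

for R in (4, 8, 16, 32):
    print(R, cp_params(R, modes), tucker_params(R, modes), tt_params(R, modes))
```

Under these assumptions, the CP count stays linear in \(R\), while the Tucker and Tensor-Train counts grow polynomially, which matches the trend described in the caption.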
We attempt to understand CaRA's capability by studying saliency maps from Integrated Gradients on the FGVC benchmark. We observe concentrated gradients over specific object parts for rank 32, while rank 16 shows more spread-out gradient activations.
@inproceedings{
veeramacheneni2025canonical,
title={Canonical Rank Adaptation: An Efficient Fine-Tuning Strategy for Vision Transformers},
author={Lokesh Veeramacheneni and Moritz Wolter and Hilde Kuehne and Juergen Gall},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=vexHifrbJg}
}