A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models

1 Department of Electronic and Telecommunication Engineering, University of Moratuwa, Sri Lanka
2 Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
The Fourteenth International Conference on Learning Representations, ICLR 2026

Abstract

Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, a lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily improve prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. These methods do not always achieve optimal angular separation between class-wise textual features, and so overlook the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by the corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. Through extensive experiments with various backbones on different datasets, we show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation.

🔥 Contributions

  1. We introduce a numerical optimization method, called A-TPT, for better calibration of test-time prompt tuning for VLMs. This resolves the suboptimal performance of existing leading calibration techniques for test-time prompt tuning.

  2. We introduce a novel angular diversity objective that effectively promotes diversity among textual features, thereby improving the calibration capabilities of VLMs both when N > |D| and when N < |D|. This is accomplished by maximizing the minimum pairwise angular distance between normalized textual features.

  3. We conduct extensive experiments to validate the generalizability of our approach on different datasets, including medical datasets, across various baselines. The results show that A-TPT surpasses state-of-the-art methods in calibration performance. We also provide thorough analyses, including theoretical aspects. Moreover, our approach provides superior calibration compared to the zero-shot CLIP model.
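
The quantity named in the second contribution, the minimum pairwise angular distance between normalized textual features, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`normalize`, `min_pairwise_angle`, `angular_diversity_loss`) are hypothetical, plain Python lists stand in for text-encoder features, and the exact loss form and weighting used by A-TPT are not stated on this page, so the negated minimum angle below is only an assumed stand-in for "maximize the minimum angle".

```python
import math
from itertools import combinations


def normalize(vec):
    """Project a feature vector onto the unit hypersphere (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def min_pairwise_angle(features):
    """Smallest angle (in radians) between any two normalized feature vectors."""
    unit = [normalize(f) for f in features]
    angles = []
    for a, b in combinations(unit, 2):
        cos_sim = sum(x * y for x, y in zip(a, b))
        # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
        cos_sim = max(-1.0, min(1.0, cos_sim))
        angles.append(math.acos(cos_sim))
    return min(angles)


def angular_diversity_loss(features):
    """Assumed loss form: minimizing this value maximizes the minimum angle."""
    return -min_pairwise_angle(features)


# Three hypothetical class features in 2-D: two nearly collinear, one orthogonal.
clustered = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
# Three features spread evenly around the unit circle.
spread = [[1.0, 0.0],
          [-0.5, math.sqrt(3) / 2],
          [-0.5, -math.sqrt(3) / 2]]

print(min_pairwise_angle(clustered))  # small: the closest pair dominates
print(min_pairwise_angle(spread))     # larger: every pair is equally separated
```

Because only the closest pair determines the value, a set of features with high average dispersion can still score poorly if two classes sit nearly on top of each other, which is the distinction the abstract draws against average-dispersion objectives. In practice the same computation would be written over a tensor of prompt-induced text features so that gradients can flow back to the learnable prompts.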

Angular Diversity (AD) vs. Expected Calibration Error (ECE)

[Figure: AD vs. ECE for hard prompts and tuned prompts]

Comparison with C-TPT and O-TPT

[Figure: radar plot and hypersphere optimization sketch]

Results

Comparison on Fine-grained Datasets

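CLIP ViT-B/16 (512-d)

| Method | Metric | ImageNet | DTD | Flowers102 | Food101 | SUN397 | Aircrafts | OxfordPets | Caltech101 | UCF101 | EuroSAT | Stanford Cars | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | Acc. | 66.70 | 44.30 | 67.30 | 83.60 | 62.50 | 23.90 | 88.00 | 92.90 | 65.00 | 41.30 | 65.30 | 63.70 |
| Baseline | ECE | 2.12 | 8.50 | 3.00 | 2.39 | 2.53 | 5.11 | 4.37 | 5.50 | 3.59 | 13.89 | 4.25 | 4.43 |
| TPT | Acc. | 69.00 | 46.70 | 69.00 | 84.70 | 64.50 | 23.40 | 87.10 | 93.80 | 67.30 | 42.40 | 66.30 | 65.00 |
| TPT | ECE | 10.60 | 21.20 | 13.50 | 3.98 | 11.30 | 16.80 | 5.77 | 4.51 | 2.54 | 13.20 | 5.16 | 11.60 |
| C-TPT | Acc. | 68.50 | 46.00 | 69.80 | 83.70 | 64.80 | 24.85 | 88.20 | 93.63 | 65.70 | 43.20 | 65.80 | 64.57 |
| C-TPT | ECE | 3.15 | 11.90 | 5.04 | 3.43 | 5.04 | 4.36 | 1.90 | 4.24 | 2.54 | 13.20 | 1.59 | 5.13 |
| O-TPT | Acc. | 67.33 | 45.68 | 70.07 | 84.13 | 64.23 | 23.64 | 87.95 | 93.95 | 64.16 | 42.84 | 64.53 | 64.41 |
| O-TPT | ECE | 1.96 | 7.88 | 3.87 | 1.46 | 4.93 | 3.68 | 1.90 | 3.80 | 2.34 | 12.98 | 1.78 | 4.23 |
| A-TPT (Ours) | Acc. | 67.70 | 45.51 | 69.22 | 83.64 | 66.04 | 23.76 | 88.33 | 93.87 | 66.16 | 44.06 | 65.78 | 64.92 |
| A-TPT (Ours) | ECE | 1.45 | 4.76 | 3.61 | 1.37 | 3.28 | 3.14 | 1.17 | 2.76 | 2.12 | 3.92 | 1.09 | 2.61 |

CLIP RN50 (1024-d)

| Method | Metric | ImageNet | DTD | Flowers102 | Food101 | SUN397 | Aircrafts | OxfordPets | Caltech101 | UCF101 | EuroSAT | Stanford Cars | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | Acc. | 58.10 | 40.00 | 61.00 | 74.00 | 58.60 | 15.60 | 83.80 | 85.80 | 58.40 | 23.70 | 55.70 | 55.90 |
| Baseline | ECE | 2.09 | 9.91 | 3.19 | 3.11 | 3.54 | 6.45 | 5.91 | 4.33 | 3.05 | 15.40 | 4.70 | 5.61 |
| TPT | Acc. | 60.70 | 41.50 | 62.50 | 74.90 | 61.10 | 17.00 | 84.50 | 87.00 | 59.50 | 28.30 | 58.00 | 57.70 |
| TPT | ECE | 11.40 | 25.70 | 13.40 | 5.25 | 9.24 | 16.10 | 3.65 | 5.04 | 12.40 | 22.50 | 3.76 | 11.70 |
| C-TPT | Acc. | 60.20 | 42.20 | 65.20 | 74.70 | 61.00 | 17.00 | 84.10 | 86.90 | 59.70 | 27.80 | 56.50 | 57.75 |
| C-TPT | ECE | 3.01 | 19.80 | 4.14 | 1.86 | 2.93 | 10.70 | 2.77 | 2.07 | 3.83 | 15.10 | 1.94 | 6.19 |
| O-TPT | Acc. | 58.97 | 41.90 | 65.61 | 74.22 | 60.85 | 16.77 | 83.40 | 86.86 | 58.84 | 28.35 | 56.44 | 57.47 |
| O-TPT | ECE | 3.10 | 16.53 | 2.50 | 1.20 | 3.20 | 8.18 | 3.50 | 2.75 | 2.60 | 14.71 | 1.69 | 5.45 |
| A-TPT (Ours) | Acc. | 58.44 | 40.90 | 64.89 | 74.10 | 60.46 | 14.58 | 83.48 | 86.57 | 60.24 | 32.14 | 57.08 | 57.53 |
| A-TPT (Ours) | ECE | 2.49 | 6.41 | 2.39 | 1.11 | 2.90 | 6.14 | 2.47 | 1.98 | 2.34 | 2.51 | 1.38 | 2.92 |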

Comparison on Natural Distribution Shift Datasets

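CLIP ViT-B/16 (512-d)

| Method | Metric | ImageNet-A | ImageNet-V2 | ImageNet-R | ImageNet-S | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | Acc. | 47.80 | 60.80 | 74.00 | 46.10 | 57.20 |
| Baseline | ECE | 8.61 | 3.01 | 3.58 | 4.95 | 5.04 |
| TPT | Acc. | 52.60 | 63.00 | 76.70 | 47.50 | 59.90 |
| TPT | ECE | 16.40 | 11.10 | 4.36 | 16.10 | 12.00 |
| C-TPT | Acc. | 51.60 | 62.70 | 76.00 | 47.90 | 59.60 |
| C-TPT | ECE | 8.16 | 6.23 | 1.54 | 7.35 | 5.82 |
| O-TPT | Acc. | 49.87 | 61.65 | 72.55 | 47.12 | 57.80 |
| O-TPT | ECE | 7.22 | 3.97 | 1.46 | 6.87 | 4.88 |
| A-TPT (Ours) | Acc. | 50.39 | 60.90 | 74.87 | 46.09 | 58.06 |
| A-TPT (Ours) | ECE | 6.45 | 2.96 | 1.39 | 4.87 | 3.92 |

CLIP RN50 (1024-d)

| Method | Metric | ImageNet-A | ImageNet-V2 | ImageNet-R | ImageNet-S | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | Acc. | 21.70 | 51.40 | 56.00 | 33.30 | 40.60 |
| Baseline | ECE | 21.30 | 3.33 | 2.07 | 3.15 | 7.46 |
| TPT | Acc. | 25.20 | 54.60 | 58.90 | 35.10 | 43.50 |
| TPT | ECE | 31.00 | 13.10 | 9.18 | 13.70 | 16.70 |
| C-TPT | Acc. | 23.40 | 54.70 | 58.00 | 35.10 | 42.80 |
| C-TPT | ECE | 25.40 | 8.58 | 4.57 | 9.70 | 12.10 |
| O-TPT | Acc. | 23.07 | 53.11 | 54.47 | 33.98 | 41.16 |
| O-TPT | ECE | 24.56 | 3.87 | 4.47 | 5.85 | 9.69 |
| A-TPT (Ours) | Acc. | 21.66 | 51.48 | 55.78 | 33.37 | 40.57 |
| A-TPT (Ours) | ECE | 21.14 | 3.10 | 3.96 | 3.09 | 7.82 |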
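
For readers unfamiliar with the ECE metric reported in the results, the following is a minimal sketch of the standard binned Expected Calibration Error. It is not taken from the A-TPT code: the function name `expected_calibration_error`, the choice of 10 equal-width bins, and the bin-edge convention are assumptions, and the function returns a fraction in [0, 1], whereas the ECE figures in the results appear to be on a percentage scale.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over confidence bins.

    confidences: predicted top-class probabilities, each in [0, 1].
    correct:     1 if the prediction was right, 0 otherwise (same length).
    n_bins:      number of equal-width bins (assumed value, not from the source).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical predictions: 90% confident on four samples, but only three are right.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(overconfident)  # gap between 0.9 confidence and 0.75 accuracy
```

The per-bin accuracy and average confidence computed inside the loop are the same two quantities that a reliability diagram plots against each other, so a well-calibrated model yields both a low ECE and bars close to the diagonal.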

Qualitative Results - Reliability Diagrams

CLIP ViT-B/16

[Figure: reliability diagrams for C-TPT, O-TPT, and A-TPT on Food101, DTD, Flower102, Aircraft, UCF101, Stanford Cars, and SUN397]


CLIP RN50

[Figure: reliability diagrams for C-TPT, O-TPT, and A-TPT on Aircraft, UCF101, Stanford Cars, and SUN397]

Acknowledgements

The computational resources for this research were supported by the Accelerating Higher Education Expansion and Development (AHEAD) Operation Grant No. 6026-LK/8743-LK from the Ministry of Higher Education, Sri Lanka, funded by the World Bank and the National Research Council of Sri Lanka Grant No. 19-080.

BibTeX

If you find our work useful in your research, please consider citing us.

@inproceedings{ahamed2026atpt,
    title     = {A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models},
    author    = {Ahamed, Shihab Aaqil and Thanthrige, Udaya SKP and Rodrigo, Ranga and Khan, Muhammad Haris},
    booktitle = {The Fourteenth International Conference on Learning Representations},
    year      = {2026},
    url       = {https://openreview.net/forum?id=VhlSBZebEw}
}
    

Contact

If you have any questions, please feel free to reach out at shihabaaqilahamed@gmail.com