
PFM-DenseBench

To What Extent Do Token-Level Representations from Pathology Foundation Models Improve Dense Prediction?

A Large-Scale Benchmark Evaluating 17 PFMs across 18 Datasets with 5 Adaptation Strategies

17 Foundation Models β€’ 18 Segmentation Datasets β€’ 5 Adaptation Strategies

Weiming Chen*1, Xitong Ling*1, Xidong Wang2, Zhenyang Cai2, Yijia Guo3, Mingxi Fu1, Ziyi Zeng2, Minxi Ouyang1, Jiawen Li1, Yizhi Wang1, Tian Guan1, Benyou Wang#2, Yonghong He#1

* Equal contribution    # Corresponding authors

1Tsinghua University, Shenzhen β€’ 2CUHK, Shenzhen β€’ 3Peking University, Beijing

Why PFM-DenseBench?

Bridging the gap between foundation model evaluation and clinical dense prediction needs

🎯

Dense Prediction Focus

While most PFM benchmarks focus on image-level classification, clinical diagnosis relies on precise pixel-level segmentation. We evaluate PFMs where it matters most.

πŸ“Š

Systematic Evaluation

Unified protocols across 18 datasets enable fair comparisons. No more scattered benchmarks with incompatible setups and metrics.

βš™οΈ

Adaptation Insights

Discover which fine-tuning strategy works best for your task. From frozen encoders to CNN adapters, we reveal what actually drives performance.

πŸ”¬

Multi-Scale Coverage

From nuclei to glands to tissue regionsβ€”our benchmark spans the full spectrum of biological scales relevant to computational pathology.

πŸ’‘

Scaling Law Analysis

Does bigger always mean better? Our findings challenge conventional wisdom and reveal the true drivers of dense prediction performance.

πŸ”„

Reproducible Science

All code, configs, and containers are publicly available. Bootstrap confidence intervals ensure statistically rigorous comparisons.
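
To make the statistical machinery concrete, here is a minimal NumPy sketch of a percentile bootstrap over per-image scores. The resampling unit, replicate count, and the bootstrap_ci helper name are illustrative assumptions, not the benchmark's exact protocol.

# Minimal sketch (assumptions flagged above): percentile-bootstrap 95% CI
# for the mean of per-image Dice scores.
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) for the mean of `scores`."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample images with replacement; record each resample's mean.
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    boot_means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lo, hi

# Toy usage with synthetic per-image scores (not benchmark data).
mean, lo, hi = bootstrap_ci(np.random.uniform(0.7, 0.9, size=200))
print(f"mDice = {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")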

Benchmark Overview

Comprehensive evaluation framework covering diverse datasets, models, and adaptation strategies

PFM-DenseBench Framework Overview

Figure 1. Overview of PFM-DenseBench: A unified benchmark for evaluating Pathology Foundation Models on dense prediction. The framework comprises dataset curation, model and strategy evaluation, and benchmark validation.

πŸ“ 18 Datasets

Nuclear Segmentation

CoNIC2022, PanNuke, CPM15, CPM17, CoNSeP, Kumar, NuCLS, Lizard, TNBC

Gland Segmentation

GlaS, CRAG, RINGS

Tissue Segmentation

BCSS, CoCaHis, COSAS24, EBHI, WSSS4LUAD, Janowczyk

πŸ€– 17 PFMs

Vision-Only Models

UNI, UNI2-h, Virchow, Virchow2, Phikon, Phikon-v2, H-Optimus-0/1, Prov-GigaPath, Hibou-L, Kaiko-L, Lunit, PathOrchestra, Midnight-12k

Vision-Language Models

CONCH, CONCHv1.5, MUSK
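
All of these encoders expose token-level features in essentially the same way, which is what the dense decoders consume. The sketch below shows the generic recipe with a plain timm ViT as a stand-in; the actual PFM checkpoints (UNI, Virchow, etc.) ship their own loaders, input resolutions, and in some cases extra register tokens, so treat this as illustrative only.

# Illustrative only: extract patch tokens from a generic ViT-L/16 and
# reshape them into a 2-D feature map for a dense-prediction decoder.
import timm
import torch

encoder = timm.create_model("vit_large_patch16_224", pretrained=False, num_classes=0)
encoder.eval()

x = torch.randn(1, 3, 224, 224)              # one 224x224 tile
with torch.no_grad():
    tokens = encoder.forward_features(x)     # (1, 1 + 14*14, 1024), CLS first
patch_tokens = tokens[:, 1:, :]              # drop CLS, keep the 196 patch tokens
fmap = patch_tokens.transpose(1, 2).reshape(1, -1, 14, 14)
print(fmap.shape)                            # (1, 1024, 14, 14) -> decoder input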

βš™οΈ 5 Strategies

Frozen: encoder frozen; only the decoder is trained
LoRA: low-rank adaptation of encoder weights (see the sketch after Figure 2)
DoRA: weight-decomposed low-rank adaptation
CNN Adapter: parallel multi-scale convolutional branch
Transformer Adapter: additional transformer blocks
Adaptation Strategies Architecture

Figure 2. Architecture of adaptation strategies. (A) Low-Rank Adaptation (LoRA/DoRA). (B) CNN Adapter with multi-scale convolutional branches. (C) Transformer Adapter with additional transformer blocks.
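
For concreteness, here is a minimal PyTorch sketch of the LoRA idea in panel (A): a frozen linear layer plus a trainable low-rank update scaled by alpha/r. The rank, alpha, and choice of wrapped projection follow common practice rather than the paper's exact configuration; DoRA additionally decomposes the weight into magnitude and direction and is omitted here.

# Minimal LoRA sketch (hyperparameters are assumptions, see above).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# E.g., wrap a ViT attention qkv projection (hypothetical placement).
qkv = LoRALinear(nn.Linear(1024, 3 * 1024))
print(qkv(torch.randn(2, 197, 1024)).shape)  # torch.Size([2, 197, 3072])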

Model Performance

Comprehensive results across all datasets and models (mDice metric)
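
For reference, here is the conventional mDice computation: per-class Dice averaged over foreground classes. Whether averaging is per-image or pooled over pixels, and how classes absent from both prediction and ground truth are handled, are our assumptions rather than the benchmark's exact code.

# Sketch of mDice on integer label maps (averaging details assumed).
import numpy as np

def mdice(pred, gt, num_classes):
    dices = []
    for c in range(1, num_classes):          # class 0 treated as background
        p, g = (pred == c), (gt == c)
        denom = p.sum() + g.sum()
        if denom == 0:                       # class absent in both: skip
            continue
        dices.append(2.0 * np.logical_and(p, g).sum() / denom)
    return float(np.mean(dices)) if dices else float("nan")

# Toy usage with random masks (not benchmark data).
pred = np.random.randint(0, 3, size=(256, 256))
gt = np.random.randint(0, 3, size=(256, 256))
print(f"mDice = {mdice(pred, gt, num_classes=3):.3f}")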

πŸ“ˆ Dataset SOTA Results

Best mDice score achieved on each dataset with the corresponding model and adaptation method


πŸ† Model Rankings

Average rank across all 18 datasets and 5 adaptation methods (lower is better)

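The aggregation itself is easy to reproduce. A hypothetical pandas sketch: rank models within each (dataset, method) cell by mDice, then average each model's ranks; the toy rows below are placeholders, not benchmark results.

# Hypothetical average-rank aggregation (toy data, not real results).
import pandas as pd

df = pd.DataFrame({                          # one row per evaluation cell
    "model":   ["UNI", "Virchow2", "UNI", "Virchow2"],
    "dataset": ["GlaS", "GlaS", "CRAG", "CRAG"],
    "method":  ["LoRA", "LoRA", "LoRA", "LoRA"],
    "mdice":   [0.91, 0.89, 0.87, 0.88],
})
df["rank"] = df.groupby(["dataset", "method"])["mdice"].rank(ascending=False)
print(df.groupby("model")["rank"].mean().sort_values())  # lower is better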

Citation

If you find PFM-DenseBench useful in your research, please cite our paper

@misc{chen2026extenttokenlevelrepresentationspathology,
  title={To What Extent Do Token-Level Representations from Pathology Foundation Models Improve Dense Prediction?},
  author={Weiming Chen and Xitong Ling and Xidong Wang and Zhenyang Cai and Yijia Guo and Mingxi Fu and Ziyi Zeng and Minxi Ouyang and Jiawen Li and Yizhi Wang and Tian Guan and Benyou Wang and Yonghong He},
  year={2026},
  eprint={2602.03887},
  archivePrefix={arXiv},
  primaryClass={eess.IV},
  url={https://arxiv.org/abs/2602.03887},
}