
PFM-DenseBench

To What Extent Do Token-Level Representations from Pathology Foundation Models Improve Dense Prediction?

A Large-Scale Benchmark Evaluating 17 PFMs across 18 Datasets with 5 Adaptation Strategies

17 Foundation Models β€’ 18 Segmentation Datasets β€’ 5 Adaptation Strategies

Weiming Chen*1, Xitong Ling*1, Xidong Wang2, Zhenyang Cai2, Yijia Guo3, Mingxi Fu1, Ziyi Zeng2, Minxi Ouyang1, Jiawen Li1, Yizhi Wang1, Tian Guan1, Benyou Wang#2, Yonghong He#1

* Equal contribution    # Corresponding authors

1Tsinghua University, Shenzhen β€’ 2CUHK, Shenzhen β€’ 3Peking University, Beijing

Why PFM-DenseBench?

Bridging the gap between foundation model evaluation and clinical dense prediction needs

🎯

Dense Prediction Focus

While most PFM benchmarks focus on image-level classification, clinical diagnosis relies on precise pixel-level segmentation. We evaluate PFMs where it matters most.

πŸ“Š

Systematic Evaluation

Unified protocols across 18 datasets enable fair comparisons. No more scattered benchmarks with incompatible setups and metrics.

βš™οΈ

Adaptation Insights

Discover which fine-tuning strategy works best for your task. From frozen encoders to CNN adapters, we reveal what actually drives performance.

πŸ”¬

Multi-Scale Coverage

From nuclei to glands to tissue regionsβ€”our benchmark spans the full spectrum of biological scales relevant to computational pathology.

πŸ’‘

Scaling Law Analysis

Does bigger always mean better? Our findings challenge conventional wisdom and reveal the true drivers of dense prediction performance.

πŸ”„

Reproducible Science

All code, configs, and containers are publicly available. Bootstrap confidence intervals ensure statistically rigorous comparisons.
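
To make the statistical machinery concrete, here is a minimal NumPy sketch of a percentile bootstrap over per-image scores. The resampling unit, replicate count, and the bootstrap_ci helper name are illustrative assumptions, not the benchmark's exact protocol.

# Minimal sketch (assumptions flagged above): percentile-bootstrap 95% CI
# for the mean of per-image Dice scores.
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) for the mean of `scores`."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample images with replacement; record each resample's mean.
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    boot_means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lo, hi

# Toy usage with synthetic per-image scores (not benchmark data).
mean, lo, hi = bootstrap_ci(np.random.uniform(0.7, 0.9, size=200))
print(f"mDice = {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")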

Benchmark Overview

Comprehensive evaluation framework covering diverse datasets, models, and adaptation strategies

PFM-DenseBench Framework Overview

Figure 1. Overview of PFM-DenseBench: A unified benchmark for evaluating Pathology Foundation Models on dense prediction. The framework comprises dataset curation, model and strategy evaluation, and benchmark validation.

πŸ“ 18 Datasets

Nuclear Segmentation

CoNIC2022, PanNuke, CPM15, CPM17, CoNSeP, Kumar, NuCLS, Lizard, TNBC

Gland Segmentation

GlaS, CRAG, RINGS

Tissue Segmentation

BCSS, CoCaHis, COSAS24, EBHI, WSSS4LUAD, Janowczyk

πŸ€– 17 PFMs

Vision-Only Models

UNI, UNI2-h, Virchow, Virchow2, Phikon, Phikon-v2, H-Optimus-0/1, Prov-GigaPath, Hibou-L, Kaiko-L, Lunit, PathOrchestra, Midnight-12k

Vision-Language Models

CONCH, CONCHv1.5, MUSK
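
All of these encoders expose token-level features in essentially the same way, which is what the dense decoders consume. The sketch below shows the generic recipe with a plain timm ViT as a stand-in; the actual PFM checkpoints (UNI, Virchow, etc.) ship their own loaders, input resolutions, and in some cases extra register tokens, so treat this as illustrative only.

# Illustrative only: extract patch tokens from a generic ViT-L/16 and
# reshape them into a 2-D feature map for a dense-prediction decoder.
import timm
import torch

encoder = timm.create_model("vit_large_patch16_224", pretrained=False, num_classes=0)
encoder.eval()

x = torch.randn(1, 3, 224, 224)              # one 224x224 tile
with torch.no_grad():
    tokens = encoder.forward_features(x)     # (1, 1 + 14*14, 1024), CLS first
patch_tokens = tokens[:, 1:, :]              # drop CLS, keep the 196 patch tokens
fmap = patch_tokens.transpose(1, 2).reshape(1, -1, 14, 14)
print(fmap.shape)                            # (1, 1024, 14, 14) -> decoder input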

βš™οΈ 5 Strategies

Frozen: encoder frozen; only the decoder is trained
LoRA: low-rank adaptation of encoder weights (see the sketch after Figure 2)
DoRA: weight-decomposed low-rank adaptation
CNN Adapter: parallel multi-scale convolutional branch
Transformer Adapter: additional transformer blocks
Adaptation Strategies Architecture

Figure 2. Architecture of adaptation strategies. (A) Low-Rank Adaptation (LoRA/DoRA). (B) CNN Adapter with multi-scale convolutional branches. (C) Transformer Adapter with additional transformer blocks.
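
For concreteness, here is a minimal PyTorch sketch of the LoRA idea in panel (A): a frozen linear layer plus a trainable low-rank update scaled by alpha/r. The rank, alpha, and choice of wrapped projection follow common practice rather than the paper's exact configuration; DoRA additionally decomposes the weight into magnitude and direction and is omitted here.

# Minimal LoRA sketch (hyperparameters are assumptions, see above).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# E.g., wrap a ViT attention qkv projection (hypothetical placement).
qkv = LoRALinear(nn.Linear(1024, 3 * 1024))
print(qkv(torch.randn(2, 197, 1024)).shape)  # torch.Size([2, 197, 3072])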

Model Performance

Comprehensive results across all datasets and models (mDice metric)
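
For reference, here is the conventional mDice computation: per-class Dice averaged over foreground classes. Whether averaging is per-image or pooled over pixels, and how classes absent from both prediction and ground truth are handled, are our assumptions rather than the benchmark's exact code.

# Sketch of mDice on integer label maps (averaging details assumed).
import numpy as np

def mdice(pred, gt, num_classes):
    dices = []
    for c in range(1, num_classes):          # class 0 treated as background
        p, g = (pred == c), (gt == c)
        denom = p.sum() + g.sum()
        if denom == 0:                       # class absent in both: skip
            continue
        dices.append(2.0 * np.logical_and(p, g).sum() / denom)
    return float(np.mean(dices)) if dices else float("nan")

# Toy usage with random masks (not benchmark data).
pred = np.random.randint(0, 3, size=(256, 256))
gt = np.random.randint(0, 3, size=(256, 256))
print(f"mDice = {mdice(pred, gt, num_classes=3):.3f}")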

πŸ“ˆ Dataset SOTA Results

Best mDice score achieved on each dataset with the corresponding model and adaptation method


πŸ† Model Rankings

Average rank across all 18 datasets and 5 adaptation methods (lower is better)

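The aggregation itself is easy to reproduce. A hypothetical pandas sketch: rank models within each (dataset, method) cell by mDice, then average each model's ranks; the toy rows below are placeholders, not benchmark results.

# Hypothetical average-rank aggregation (toy data, not real results).
import pandas as pd

df = pd.DataFrame({                          # one row per evaluation cell
    "model":   ["UNI", "Virchow2", "UNI", "Virchow2"],
    "dataset": ["GlaS", "GlaS", "CRAG", "CRAG"],
    "method":  ["LoRA", "LoRA", "LoRA", "LoRA"],
    "mdice":   [0.91, 0.89, 0.87, 0.88],
})
df["rank"] = df.groupby(["dataset", "method"])["mdice"].rank(ascending=False)
print(df.groupby("model")["rank"].mean().sort_values())  # lower is better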

Citation

If you find PFM-DenseBench useful in your research, please cite our paper

@misc{chen2026extenttokenlevelrepresentationspathology,
  title={To What Extent Do Token-Level Representations from Pathology Foundation Models Improve Dense Prediction?},
  author={Weiming Chen and Xitong Ling and Xidong Wang and Zhenyang Cai and Yijia Guo and Mingxi Fu and Ziyi Zeng and Minxi Ouyang and Jiawen Li and Yizhi Wang and Tian Guan and Benyou Wang and Yonghong He},
  year={2026},
  eprint={2602.03887},
  archivePrefix={arXiv},
  primaryClass={eess.IV},
  url={https://arxiv.org/abs/2602.03887},
}