Authors: Kathryn Wantlin, Chenwei Wu, Shih-Cheng Huang, Oishi Banerjee, Farah Dadabhoy, Veeral Vipin Mehta, Ryan Wonhee Han, Fang Cao, Raja R. Narayan, Errol Colak, Adewole Adamson, Laura Heacock, Geoffrey H. Tison, Alex Tamkin, Pranav Rajpurkar
BenchMD: A Benchmark for Unified Learning on
Medical Images and Sensors
Kathryn Wantlin1, 2, Chenwei Wu1, Shih-Cheng Huang3, Oishi Banerjee1, Farah Dadabhoy4,
Veeral Vipin Mehta5, Ryan Wonhee Han3, Fang Cao3, Raja R. Narayan4, Errol Colak6,
Adewole Adamson7, Laura Heacock8, Geoffrey H. Tison9, Alex Tamkin3,*, Pranav Rajpurkar1,*
Abstract
Medical data poses a daunting challenge for AI algorithms: it exists in many
different modalities, experiences frequent distribution shifts, and suffers from a
scarcity of examples and labels. Recent advances, including transformers and self-
supervised learning, promise a more universal approach that can be applied flexibly
across these diverse conditions. To measure and drive progress in this direction,
we present BenchMD: a benchmark that tests how well unified, modality-agnostic
methods, including architectures and training techniques (e.g. self-supervised learning,
ImageNet pretraining), perform on a diverse array of clinically-relevant medical
tasks. BenchMD combines 19 publicly available datasets for 7 medical modalities,
including 1D sensor data, 2D images, and 3D volumetric scans. Our benchmark
reflects real-world data constraints by evaluating methods across a range of dataset
sizes, including challenging few-shot settings that incentivize the use of pretraining.
Finally, we evaluate performance on out-of-distribution data collected at different
hospitals than the training data, representing naturally-occurring distribution shifts
that frequently degrade the performance of medical AI models. Our baseline results
demonstrate that no unified learning technique achieves strong performance across
all modalities, leaving ample room for improvement on the benchmark. Code is
released at: https://github.com/rajpurkarlab/BenchMD .
1 Introduction
Recent advances in transformers and self-supervised learning (SSL) have enabled state-of-the-art
performance across many modalities, including text, images and videos [ 18,41]. A core feature
of these methods is their remarkable versatility: they lessen the need for labeled data, and can be
applied flexibly across modalities, reducing the need to develop custom methods for each application
area [55,56]. Measuring this progress requires benchmarks with breadth, to capture the diversity
of applications and modalities, as well as depth, to ensure external validity by involving experts in
the benchmark-formulation process [47,10,42]. These requirements are especially salient in the
medical domain, where existing benchmarks have been criticized for addressing synthetic tasks with
low clinical relevance [8].
1Harvard University
2Princeton University
3Stanford University
4Massachusetts General Hospital
5Stony Brook University Hospital
6University of Toronto
7University of Texas at Austin
8NYU Langone Health
9University of California, San Francisco
*Equal senior authorship.
Preprint. Under review. arXiv:2304.08486v2 [cs.CV] 26 Jun 2023
Figure 1: The BenchMD benchmark consists of 19 real-world medical datasets across 7 medical
modalities. Successful methods will achieve high performance when evaluated on out-of-distribution
data.
To address this gap, we propose BenchMD, a new benchmark for unified learning across modalities
that is grounded in real-world medical interpretation tasks and distribution shifts. BenchMD evaluates
unified architectures and training techniques (e.g. SSL, ImageNet pretraining), on 19 datasets
for 7 medical modalities. The wide variety of modalities reflects the heterogeneity of medical image
and sensor data, which can be produced by dozens of different technologies [ 62]. Specifically, we
evaluate methods using 1D data from electrocardiogram (ECG) and electroencephalogram (EEG)
sensors, 2D image data from chest X-rays (CXR), mammograms, dermoscopic images, and fundus
images, and 3D volumetric data from low-dose computed tomography (LDCT) scans (see Figure
1). Current methods are often specialized for these different modalities; for example, contrastive
learning techniques typically require modality-specific data augmentations [ 33]. Researchers for
each medical application are similarly focused on designing domain-specific methods, using trial and
error to establish which components (different architectures, self-supervision algorithms, etc.) will be
helpful for their problem. In contrast, we encourage the development of flexible, unified methods
that can be applied out-of-the-box without customization. Succeeding on arbitrary data requires
unified architectures, and there is emerging evidence of promising candidates here [ 55,56,32]. Our
benchmark serves to accelerate progress in this area by enabling researchers to build domain-specific
models from stronger baseline components and encouraging valuable collaboration on methods
that are broadly useful across medical domains and modalities [ 44]. Key to our unified approach,
BenchMD constructs standardized, clinically-impactful tasks for evaluation in each modality, each
validated by experts to have high medical relevance given available label information.
We construct our benchmark to enable advances on two additional fronts. First, label shortages have
historically posed a serious obstacle to model development, so our benchmark tests performance
under severe data scarcity, incentivizing the use of SSL techniques that exploit unlabeled data. In
order to explore the label efficiency of different methods, we assess performance across multiple
settings with different amounts of labeled data. Second, we investigate how models perform under
naturally-occurring distribution shifts, such as when they are trained on data from one hospital and
deployed in another. To this end, we train models on one in-distribution (ID) source dataset and then
test zero-shot transfer performance on out-of-distribution (OOD) data from unseen target datasets
collected at different hospitals. Thus, our definition of a successful “unified” method includes a
universally superior training objective and architecture that is applicable across a range of different
modalities and, in addition, generalizes well across distribution shifts within modalities.
We present BenchMD as an easy-to-use benchmark for assessing performance widely across medical
modalities and distributions. To make using BenchMD simple, we standardize preprocessing steps
and validation metrics (details in the Appendix), so users simply need to plug in new architectures and
training tasks. Additionally, we use only publicly available datasets, allowing users to easily access
BenchMD and replicate results. Using our benchmark, we provide initial baselines that demonstrate
significant variations in performance, with no technique achieving strong results across all modalities
and ImageNet-pretrained baselines exceeding the performance of existing domain-agnostic SSL
methods in some modalities. These results motivate the necessity of further research in universal,
Figure 2: Models for each modality are first trained on a source dataset, using unified methods across
modalities. They are then evaluated on out-of-distribution data from one or more target datasets.
generalizable methods for medical AI, and we discuss possible directions for future work which are
all easily facilitated by the publicly available code and data constructed for BenchMD. We expect our
work will accelerate the development of versatile methods for medicine and provide a valuable tool
for measuring advances in universal methods.
2 Related Work
Unified Techniques Across Modalities: Recent advances in deep learning have produced methods
that enable high performance and can be flexibly applied across modalities. Self-supervised learning
(SSL) techniques such as masked data modeling [ 20] and contrastive learning [ 34,54,55] can be
used to learn from unlabeled datasets across many modalities. Architectures are also increasingly
able to take in different modalities as input, producing models that can interchangeably process 2D
images and 3D videos [ 27,36,3]. Benchmarks like ours can rigorously evaluate these new methods,
assessing their performance on real-world tasks across multiple modalities.
Unified Medical AI: There have been limited efforts to unify architectures and training techniques
across medical image modalities in particular [ 67,64,19]. For example, Zhou et al. recently found
that self-supervised MAE pretraining on medical images offered better performance than ImageNet
pretraining when interpreting chest X-rays, MRI scans, and CT scans [ 66]. Similarly, Azizi et al.
found that a training procedure combining supervised learning on natural images with SSL pretraining
on medical images offered high performance on 6 2D medical image modalities [ 6]. BenchMD tracks
progress in this area and includes an unprecedented range of medical modalities, addressing a range
of 1D, 2D, and 3D medical images and sensors.
Existing Benchmarks for Multiple Modalities: Our work extends the line of thinking in the DABS
benchmarks, which evaluate the performance of modality-agnostic SSL techniques across modalities
ranging from text and genomics to X-rays and wearable sensor data [ 55,56]. We also take inspiration
from the WILDS benchmarks, which evaluate performance on out-of-distribution data within several
modalities [ 32,49,63]. BenchMD combines the strengths of both approaches, creating a benchmark
that is rooted in real-world modalities with direct clinical applicability: evaluating whether techniques
generalize well across modalities as well as to new distributions within each modality. Furthermore,
while some of DABS’s training datasets are unlabeled, we provide labeled source data to facilitate
comparison against non-SSL techniques such as supervised learning. Unlike both DABS and WILDS,
BenchMD tests model performance across settings with few-shot learning, exploring how label
availability affects performance. Our work also differs from both DABS and WILDS because we
focus on real-world medical tasks and cover a broader range of medical modalities, including 3D
volumetric scans.
3 Modalities and Datasets
We have curated a list of high-impact modalities and selected source and target datasets for evaluating
out-of-distribution (OOD) performance. Each modality we present in this benchmark is used to test
for prevalent diseases and significantly contributes to clinician workloads in current practice (see
[68,12,40,24,26,37,2,50]). For each modality, we select a highly-cited, large dataset as the source
dataset, whose training and validation splits we use for pretraining and in-distribution evaluation,
respectively. We also choose labeled target datasets, whose validation sets we use to test performance
on OOD data. The datasets, in addition to all being well-known and publicly accessible, were selected
to cover a range of important distribution shifts for each modality, including varying demographics,
collection technology, and annotation details. Per modality, we specify a task that is both clinically
relevant and unified across source and target datasets. (See the supplementary appendix.)
3.1 Electrocardiograms
A 12-lead electrocardiogram (ECG) measures the three-dimensional electrical activity of the heart
over time using electrodes placed on the skin. Classifying cardiovascular abnormalities from ECGs
is challenging because there are 12 1D channels, each corresponding to a different spatial axis,
and because diagnosis requires distinguishing irregular cardiovascular signals from noisy data. We
perform a single-label, 7-class classification task, with inputs consisting of 5-second recordings of
12-channel ECG signals with a sampling rate of 500Hz. Our set of 7 labels is unified across datasets
and derived at the discretion of medical experts: Normal, Conduction Disturbance, Hypertrophy,
Myocardial Infarction, Ischemic ST-T Changes, Atrial fibrillation/atrial flutter, and Other. We consider
four publicly available 12-lead ECG-waveform datasets: PTB-XL (source) [ 60,61,21], Chapman-
Shaoxing (target)[ 65], Georgia 12-Lead ECG Challenge (target)[ 22], and China Physiological Signal
Challenge (CPSC, target)[ 11]. We see the following distribution shifts: Demographics: PTB-
XL’s data was collected between 1989 and 1996, Chapman-Shaoxing’s in 2020, CPSC’s in 2018,
and Georgia’s in 2020. The Chapman-Shaoxing and CPSC datasets’ patients are based in China,
the Georgia dataset’s in the southeastern United States, and the PTB-XL dataset’s in Germany.
Collection Technology: The PTB-XL dataset used devices provided by Schiller AG, while the
Chapman-Shaoxing dataset was collected using devices from Zhejiang Cachet Jetboom Medical
Devices. Annotation Details: Although we group abnormalities into 7 categories that are consistent
across datasets, different datasets provide varying levels of additional granularity in their labels, with
different label distributions across datasets.
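As a quick check of the input dimensions described above, a 5-second recording sampled at 500Hz yields 2,500 samples per lead. The sketch below cuts a longer multi-lead recording into windows of that length. The function name, the non-overlapping windows, and the choice to drop a short trailing remainder are assumptions made for illustration, not a description of BenchMD's preprocessing.

```python
def split_into_windows(recording, sampling_rate_hz=500, window_seconds=5):
    """Split a multi-lead recording into fixed-length, non-overlapping windows.

    `recording` is a list of leads, each a list of samples of equal length.
    A trailing remainder shorter than one window is dropped (an assumption).
    """
    window_len = sampling_rate_hz * window_seconds  # 500 Hz * 5 s = 2500 samples
    n_samples = len(recording[0])
    windows = []
    for start in range(0, n_samples - window_len + 1, window_len):
        # Each window keeps all leads, sliced over the same time span.
        windows.append([lead[start:start + window_len] for lead in recording])
    return windows


# Synthetic 12-lead, 10-second recording at 500 Hz (5000 samples per lead).
fake_recording = [[0.0] * 5000 for _ in range(12)]
ecg_windows = split_into_windows(fake_recording)
```

Each resulting window has 12 leads of 2,500 samples, matching the 12-channel, 5-second input described in the text.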
3.2 Electroencephalograms
Electroencephalograms (EEG) measure multi-channel 1D signals of electrical activity in the brain and
are used to diagnose sleep and seizure disorders [7]. Noise and intraclass variability in sampling rates,
signal quality, the number of leads used, and the length of the captured rhythm make distinguishing
sleep stages difficult in EEGs. We perform a single-label sleep stage classification task on 30-second
EEG inputs sampled at 125 Hz from the 2 traditional central derivation channels (C3 and C4). We use the American
Academy of Sleep Medicine’s standard 5 labels: Wake, Rapid Eye Movement, Non-REM Stage 1,
Non-REM Stage 2, and Non-REM Stage 3 [17]. We consider two publicly available datasets: the
Sleep Heart Health Study (SHHS) dataset (source) [ 46] and the ISRUC-Sleep dataset (target) [ 31].
We see the following distribution shifts: Demographics: The SHHS dataset was collected between
1995-1998, while the ISRUC dataset was collected between 2009–2013. The SHHS dataset includes
5,804 adults aged 40 and older, while the ISRUC dataset was collected from subjects whose ages
range from 20 years old to 85 years old, with an average age of 51. Moreover, the ISRUC dataset
was collected from a hospital in Coimbra, Portugal, while SHHS was collected by the National Heart
Lung & Blood Institute in the US. Collection Technology: The SHHS dataset was collected at a
sampling rate of 125 Hz while ISRUC was collected at 150 Hz. SHHS was also collected from
studies conducted in patient homes, while ISRUC was collected in a hospital setting.
3.3 Chest X-Rays
Chest X-rays are 2D grayscale projection radiographs of a patient’s heart, lungs, blood vessels,
airways, chest bones, and spine and are crucial for the diagnosis of cardiopulmonary conditions such
as atelectasis and edema. Chest X-ray classification is uniquely challenging compared to natural
image classification since radiographs are grayscale and always have similar frontal or lateral spatial
structures, with relevant abnormalities only occurring in a small region of the image[ 30]. We perform
a single-label classification task on 2D grayscale chest x-rays using 5 prevalent labels: Atelectasis,
Cardiomegaly, Consolidation, Edema, and Pleural Effusion. We utilize three publicly available
datasets: MIMIC-CXR (source)[ 29,28,22], CheXpert (target)[ 26], and VinDr-CXR (target)[ 38]. We
see the following distribution shifts: Demographics: MIMIC-CXR’s images were collected between
2011 and 2016, CheXpert’s between 2002 and 2017, and VinDr-CXR’s between 2018 and 2020.
CheXpert’s data was collected by Stanford University School of Medicine, MIMIC-CXR’s from
the Beth Israel Deaconess Medical Center in Boston, and VinDr-CXR’s from the Hanoi Medical
University Hospital and Hospital 108 in Vietnam. Annotation Details: MIMIC-CXR and CheXpert
use automated natural language labelers, while VinDr-CXR uses radiologist-generated annotations.
3.4 Mammograms
Mammograms consist of 2D grayscale images of the cranio-caudal (CC) view and the mediolateral-
oblique (MLO) view of the left and right breast of a patient (4 images possible per patient), and are
the main imaging tool for the screening and diagnosis of breast cancer[ 9]. Mammograms are high-
resolution images with millions of pixels, and while the breast views are highly standardized, disease
classification depends on abnormalities in small regions of interest, making diagnosis challenging
for AI models. We perform a single-label task of predicting the Breast Imaging Reporting and Data
System (BI-RADS) assessment category (from 1 to 5) for each breast image. We consider two datasets:
VinDr-Mammo (source) [ 39] and CBIS-DDSM (target)[ 35,13,51]. We see the following distribution
shifts: Demographics: VinDr-Mammo was compiled from a pool of mammography examinations
taken between 2018 and 2020, while CBIS-DDSM was compiled from exams conducted between
1988 and 1999. VinDr-Mammo exams were collected by hospitals in Vietnam, while CBIS-DDSM’s
were collected by United States hospitals. Collection Technology: VinDr-Mammo contains full-field
digital mammogram images while CBIS-DDSM contains scanned film mammogram images. Several
different scanners from multiple manufacturers were used to collect the CBIS-DDSM mammograms.
Annotation Details: VinDr-Mammo contains some lesionless images, which still have a breast-level
BI-RADS score. The CBIS-DDSM dataset, however, exclusively contains images with lesions, and
does not annotate breast-level BI-RADS scores. To test our model on this target dataset, for each
breast we use the maximum of lesion-level BI-RADS scores as the breast-level BI-RADS score.
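The breast-level score derivation described above for CBIS-DDSM can be sketched as follows. The tuple layout `(patient_id, laterality, birads)` and the function name are hypothetical and chosen only for illustration; the actual dataset files use their own column names.

```python
from collections import defaultdict


def breast_level_birads(lesions):
    """Return {(patient_id, laterality): max lesion-level BI-RADS score}.

    `lesions` is an iterable of (patient_id, laterality, birads) tuples,
    one per annotated lesion (a hypothetical record layout).
    """
    scores = defaultdict(int)  # BI-RADS scores are non-negative, so 0 is a safe floor
    for patient_id, laterality, birads in lesions:
        key = (patient_id, laterality)
        scores[key] = max(scores[key], birads)
    return dict(scores)


lesion_rows = [
    ("P001", "LEFT", 2),
    ("P001", "LEFT", 4),   # second lesion in the same breast
    ("P001", "RIGHT", 3),
    ("P002", "LEFT", 5),
]
breast_scores = breast_level_birads(lesion_rows)
```

Here the left breast of patient `P001` has two lesions scored 2 and 4, so its breast-level score is 4.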
3.5 Dermoscopic Images
Dermoscopy produces 2D RGB images showing subsurface skin structures in the epidermis, at the
dermoepidermal junction, and in the papillary dermis, and is used to assess cancer in skin lesions.
Performing tasks on dermoscopic images is complicated by intraclass variability in lesion texture,
scale, and color due to presence of different skin colors, hair, veins, and irregular lesion borders
[24]. We perform a single-label classification of 2D RGB dermoscopy images across 5 unified
labels extracted by clinicians: “AKIEC” (includes actinic keratoses, intraepithelial carcinoma, and
squamous cell carcinoma, as all of these are within the continuum of squamous cell carcinoma), “BCC”
(basal cell carcinoma), “MEL” (melanoma), “NEV” (nevus), and “Other diseases” (dermatofibroma,
etc.). We utilize three publicly available datasets: BCN20000 (source) [14], HAM10000 (target) [58], and the
PAD-UFES-20 smartphone image set (target) [43]. We see the following distribution shifts: Demographics:
BCN20000’s images were collected from 2010 to 2016, PAD-UFES-20’s in 2020,
and HAM10000’s over the past 20 years. PAD-UFES-20’s images were collected by hospitals in
Brazil, HAM10000’s in Austria and Australia, and BCN20000’s in Spain. Collection Technology:
BCN20000 and HAM10000 images were collected using dermatoscopes, while PAD-UFES-20
images were collected by smartphone cameras. Annotation Details: Although we grouped the
abnormality annotations across datasets into 5 general categories, the granularity within each label
varies depending on the dataset. For example, the “Other diseases” category for HAM10000 includes
benign keratosis-like lesions while BCN20000’s does not.
3.6 Fundus Images
Eye fundus images are 2D RGB images showing the interior surface of a single eye, including the
retina, fovea, optic disc, macula, and posterior pole, and are crucial for the diagnosis of diabetic
retinopathy (DR). The detection of DR is complicated by spurious correlations with other undetected
conditions such as diabetic macular edema[ 2]. For each 2D RGB fundus image, we perform the
single-label task of predicting the severity of diabetic retinopathy (DR) in an image of each eye. We
use the International Clinical Diabetic Retinopathy (ICDR) classification scale, which classifies DR
on a five-stage severity scale from 0-4[ 48]. We consider three datasets: Messidor-2 (source)[ 15,1],
APTOS 2019 (target)[ 57,4], and the Jinchi Medical University dataset (target)[ 53]. We see the
following distribution shifts: Demographics: Messidor-2 images were collected from 2004 to 2010,
while Jinchi Medical University images were collected between May 2011 and June 2015. The total
collection period for APTOS 2019 is unknown. The Messidor-2 data was collected from French
institutions, APTOS 2019 data from the Aravind Eye Care System in India, and the Jinchi Medical
University data from Japan. The Messidor-2 and Jinchi Medical University datasets consist of high
quality retinal images, while APTOS 2019 exhibits more variation in data quality, including images
with artifacts. Collection Technology: Messidor-2 training images were taken with a Topcon TRC
NW6 non-mydriatic camera. The Jinchi Medical University dataset also uses a non-mydriatic camera,
but a different model (AFC-230). The APTOS dataset contains images taken from both mydriatic
and non-mydriatic cameras, with the full range of camera models unknown. The Jinchi Medical
University dataset was collected in a single-site, exploratory study performed in an institutional
setting, whereas the other datasets contain images taken in clinical settings for diagnostic purposes.
Annotation Details: Jinchi Medical University, unlike Messidor-2 and APTOS, consolidates the
similar ICDR classes 1 and 2 into a single superclass, termed a modified Davis grading.
3.7 Low Dose Computed Tomography Scans
Low dose computed tomography (LDCT) is a procedure that uses an x-ray machine linked with a
computer to create 3D images of a patient’s tissues and organs. LDCT is typically used to detect
early-stage nodules of lung cancer in high-risk patients. The LDCT nodule classification task is
challenging since LDCT scans are 3D images originally recorded in single-channel Hounsfield units
with varying numbers of slices between patients. In addition, while scans have a large field of view
with hundreds of slices, nodules only occupy a small volume of the scan, especially in early cancer
stages[ 25]. Inputs are partitioned by sliding windows, representing 24 CT slices in single channel
Hounsfield units. We perform two binary classification tasks, determining 1) whether a small nodule
(diameter <3mm) exists in the current CT scan window and 2) whether a large nodule (diameter
≥3mm) exists in the current CT scan window. A sliding window is labeled positive for a nodule of
either type if it contains more than 4 consecutive slices with positive labels. The final volume-level
determination of small and large nodule presence is made by aggregating prediction
probabilities from all windows. We utilize two public datasets: LIDC-IDRI (source) [5] and LNDb
(target)[ 45]. We see the following distribution shifts: Demographics: LIDC scans were collected
in 2010, while LNDb scans were collected from 2016-2018. LIDC was collected from academic
centers and medical imaging companies in the United States, while LNDb was collected at the Centro
Hospitalar e Universitário de São João (CHUSJ) in Porto, Portugal. Collection Technology: LIDC
dataset collection involved a variety of scanner manufacturers and models, while the LNDb dataset
was primarily collected by Siemens scanners. LIDC’s data was collected using a mean tube current
of 222.1mA, while LNDb used a mean tube current of 161.9mA. The LIDC dataset includes slice
thicknesses ranging from 0.6mm to 5mm, while the LNDb dataset has excluded CT scans where
intravenous contrast had been used and those with a slice thickness greater than 1mm.
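The window-labeling rule for LDCT scans (24-slice windows, positive when more than 4 consecutive slices are positive) can be sketched as below. The stride between windows and the use of the maximum window probability for volume-level aggregation are assumptions for illustration; the text states only that window probabilities are aggregated, not how.

```python
def longest_positive_run(slice_labels):
    """Length of the longest run of consecutive positive (truthy) slices."""
    longest = current = 0
    for label in slice_labels:
        current = current + 1 if label else 0
        longest = max(longest, current)
    return longest


def label_windows(slice_labels, window_size=24, stride=24, min_run=4):
    """Label each sliding window 1 if it has MORE than `min_run` consecutive
    positive slices, else 0. `stride` is an assumed value."""
    labels = []
    for start in range(0, len(slice_labels) - window_size + 1, stride):
        window = slice_labels[start:start + window_size]
        labels.append(int(longest_positive_run(window) > min_run))
    return labels


def aggregate_volume(window_probs):
    """Volume-level score from window probabilities (max is an assumption)."""
    return max(window_probs)


# 48 slices: the first window holds a run of 5 positives, the second a run of 4.
slices = [0] * 10 + [1] * 5 + [0] * 9 + [0] * 20 + [1] * 4
window_labels = label_windows(slices)
```

With these inputs the first window is labeled positive and the second negative, since a run of exactly 4 slices does not exceed the threshold.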
4 Experiments
We evaluate the performance of five baseline techniques: three SSL algorithms, ImageNet pretraining,
and training from scratch. We then test performance on OOD target datasets using multiple transfer
learning schemes.
4.1 Architecture
Following [ 55], we utilize a modality-agnostic transformer architecture across all experiments. We
use separate 1D, 2D, and 3D embedding modules, which make minimal assumptions about the data
and map all inputs to the same 256-dimensional embedding space, allowing users to mix inputs
with different input dimensions. The encoder is based on a standard vision transformer architecture
Figure 3: The in-distribution and out-of-distribution performance of models across modalities. OOD
performance is averaged across target dataset(s).
[59,16], and we choose 1D, 2D, or 3D patch sizes to keep the resulting sequence length similar
across all datasets. Additional details on the architecture are available in Appendix B.
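To make the patch-size choice concrete, the sketch below computes the token sequence length produced by non-overlapping patches over 1D, 2D, and 3D inputs. Apart from the 2,500-sample ECG length and the 24-slice LDCT window mentioned earlier, every input shape and patch size here is hypothetical; the values actually used are given in the appendix, not in this section.

```python
from math import prod


def sequence_length(input_shape, patch_shape):
    """Number of non-overlapping patches (tokens) covering the input."""
    if len(input_shape) != len(patch_shape):
        raise ValueError("input and patch must have the same number of dimensions")
    for size, patch in zip(input_shape, patch_shape):
        if size % patch != 0:
            raise ValueError("each input dimension must be divisible by its patch size")
    return prod(size // patch for size, patch in zip(input_shape, patch_shape))


# Hypothetical patch sizes chosen so token counts land in a similar range.
tokens_1d = sequence_length((2500,), (25,))               # 1D signal
tokens_2d = sequence_length((224, 224), (16, 16))         # 2D image (hypothetical size)
tokens_3d = sequence_length((24, 224, 224), (8, 32, 32))  # 3D window (hypothetical size)
```

Under these assumed values the three inputs map to 100, 196, and 147 tokens respectively, which illustrates how patch sizes can be tuned per dimensionality to keep sequence lengths comparable.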
4.2 Pretraining
We evaluate the performance of three different SSL algorithms in this benchmark. The first two,
Contrastive Embedding-Mixup (e-Mix) and Shuffled Embedding Prediction (ShED) , follow [ 55].
e-Mix is a contrastive objective that additively mixes a batch of original input embeddings, weighting
them with different coefficients. It then trains an encoder to produce a vector for a mixed embedding
that is close to the original inputs’ embeddings in proportion to their mixing coefficients. ShED
shuffles a fraction (0.85 in our experiments) of embeddings and trains the encoder with a classifier to
predict which embeddings were perturbed. Following the training settings used in [ 56], we also use a
third Masked Autoencoding (MAE) objective, which masks a given fraction (0.75 in our experiments)
of input embeddings and trains models to reconstruct them [ 23]. We standardize the pretraining
process, running it for 100k steps with the Adam optimizer, learning rate 1e-4, weight decay 1e-4,
and momentum 0.9. Beyond SSL, we evaluate two other techniques. First, we consider a scratch
baseline, where the model is not pretrained. In addition, for 2D modalities, we evaluate models
pretrained on ImageNet .
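The perturbations behind the MAE and ShED objectives can be sketched as follows, operating on a list of token positions rather than real embeddings. This is a simplified illustration under stated assumptions: the function names are hypothetical, tokens are assumed distinct, and the exact sampling and labeling details of the implementations in [55,56] may differ.

```python
import random


def mae_mask(num_tokens, mask_fraction=0.75, rng=None):
    """Return a boolean list marking which token positions are masked."""
    if rng is None:
        rng = random.Random()
    num_masked = round(num_tokens * mask_fraction)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def shed_shuffle(tokens, shuffle_fraction=0.85, rng=None):
    """Permute a random subset of positions.

    Returns the shuffled tokens and a per-position flag that is True where
    the token differs from the original (the classifier's target).
    """
    if rng is None:
        rng = random.Random()
    n = len(tokens)
    positions = rng.sample(range(n), round(n * shuffle_fraction))
    sources = positions[:]
    rng.shuffle(sources)  # permutation of the selected positions
    shuffled = list(tokens)
    for dst, src in zip(positions, sources):
        shuffled[dst] = tokens[src]
    perturbed = [shuffled[i] != tokens[i] for i in range(n)]
    return shuffled, perturbed


example_rng = random.Random(0)
example_mask = mae_mask(100, rng=example_rng)
example_tokens = list(range(100))
example_shuffled, example_perturbed = shed_shuffle(example_tokens, rng=example_rng)
```

Note that a selected position can be mapped back to itself by the permutation, so the number of positions flagged as perturbed can be slightly below the nominal fraction.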
4.3 Transfer Learning and Out-of-Distribution Evaluation
We train models for particular tasks through both linear evaluation and finetuning, using labeled
data from our in-distribution source datasets. We then evaluate zero-shot performance on OOD
target datasets. For linear evaluation, we freeze the model backbone and train a linear classifier
head for the modality task using 100% of the source data labels. For finetuning, we run one set of
experiments using 100% of the source labels but also test performance while varying label availability.
For single-label tasks, we run experiments using 8, 64, or 256 labels per class, which we refer to as
small, medium, and large label fractions, respectively. If the source dataset contains a class for which
we have insufficient labels, we simply use all available labels for that class. For multi-label tasks,
we create small/medium/large label sets by iterating through each class label and sampling labeled
examples until we have 8, 64, or 256 labels for that class or have exhausted the available examples
for that class. During both linear evaluation and finetuning, we train for 100 epochs with the Adam
optimizer, learning rate 1e-4, weight decay 1e-4, and momentum 0.9. We evaluate our models using
AUROC score as the metric (taking an unweighted average of per-class scores for multi-class tasks).
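The per-class label budget for single-label tasks can be sketched as below: up to 8, 64, or 256 labeled examples are drawn per class, and a class with fewer examples contributes all of them. The function name and the fixed seed are assumptions for illustration.

```python
import random
from collections import defaultdict


def sample_labels_per_class(labels, labels_per_class, seed=0):
    """Return sorted indices of a subset with at most `labels_per_class`
    examples for each class; smaller classes are kept in full."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) <= labels_per_class:
            chosen.extend(indices)  # insufficient labels: use all available
        else:
            chosen.extend(rng.sample(indices, labels_per_class))
    return sorted(chosen)


toy_labels = ["a"] * 20 + ["b"] * 5
small_subset = sample_labels_per_class(toy_labels, labels_per_class=8)
```

In this toy case class `a` contributes 8 sampled examples and class `b` contributes all 5, for 13 labeled examples in total.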
We then evaluate zero-shot transfer performance on OOD target datasets. After every epoch of linear
evaluation or finetuning, we check the current model checkpoint’s performance on the source dataset’s
validation set in order to perform model selection. We identify the top-performing checkpoint that
achieves the highest average AUROC across tasks on the source validation set and report this AUROC
in Figure 3 under “In Distribution". Next, we perform zero-shot transfer with this top-performing
checkpoint by directly evaluating it on the target dataset(s), without any further training. We report
AUROC on OOD data for each modality, averaged across tasks and target datasets, in Figure 3.
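The metric and model-selection procedure described above can be sketched in plain Python: a one-vs-rest AUROC per class, averaged without weighting, and selection of the epoch with the highest source-validation AUROC. The function names are hypothetical, and the pairwise AUROC computation is a simple reference version rather than the implementation used in the benchmark code.

```python
def binary_auroc(y_true, y_score):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(y_true, class_scores, classes):
    """Unweighted mean of one-vs-rest AUROC over `classes`.

    `class_scores[i][k]` is the score of example i for classes[k].
    """
    per_class = []
    for k, cls in enumerate(classes):
        binary_truth = [1 if t == cls else 0 for t in y_true]
        per_class.append(binary_auroc(binary_truth, [row[k] for row in class_scores]))
    return sum(per_class) / len(per_class)


def select_checkpoint(val_auroc_by_epoch):
    """Pick the epoch whose source-validation AUROC is highest."""
    return max(val_auroc_by_epoch, key=val_auroc_by_epoch.get)


toy_auroc = binary_auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
best_epoch = select_checkpoint({1: 0.71, 2: 0.78, 3: 0.76})
```

The selected checkpoint would then be evaluated once on each target dataset with no further training, and the resulting AUROC values averaged for reporting.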
4.4 Results
Figure 3 shows the performance of different techniques on the ID validation set and on our OOD test
data. Overall, we find that no one method strictly dominates the others.
Does any SSL technique offer high performance across modalities? No. e-Mix typically offers
middling performance on OOD data; it outperforms other techniques on only a few scattered
experiments across ECG data, dermoscopic images and retinal fundus images. ShED is more
promising, with particularly strong performance on ECG data and LDCT scans. However, it fails to
maintain consistent performance across other modalities such as fundus images and mammograms,
where ImageNet pretraining outperforms it across nearly all settings. Similarly, although MAE achieves
strong results on many experiments, especially on EEG data and on dermoscopic images from the
PAD-UFES-20 dataset, it performs poorly elsewhere. Despite being a top performer on EEG data,
MAE achieves poor AUROC scores on ECG data, particularly the Georgia 12-Lead ECG Challenge
and Chapman-Shaoxing datasets. This discrepancy indicates the difficulty of developing a single,
high-performing technique even across different 1D sensor modalities. We see similar inconsistencies
across different 2D image modalities, and future work may explore whether other SSL techniques or
architectures offer more consistent performance.
Does any other technique offer high performance across modalities? No. We investigate the
use of ImageNet pretraining on 2D modalities and find it performs well across several modalities,
typically outperforming other techniques on CXRs, mammograms, and fundus images from the
APTOS dataset. However, SSL methods sometimes outperform ImageNet pretraining, with a
particularly large gap on OOD dermoscopic images. Additionally, while models trained from scratch
are rarely top-performers in any experiment, they still remain competitive, frequently outperforming
at least one other model. The fact that ImageNet pretraining and training from scratch can sometimes
match SSL performance demonstrates the difficulty of using SSL techniques out-of-the-box, without
customization for particular medical modalities. We further explored a two-stage approach on the
2D mammogram and fundus image modalities, pre-training models on ImageNet before performing
SSL using the MAE objective with some hyperparameter tuning, and found that this approach can
yield benefits over either ImageNet pretraining or SSL alone. (See the tables in the supplementary appendix.) Furthermore, while we
standardized training times by performing the same number of iterations across all modalities, future
work may explore other ways to set these and other hyperparameters.
Does label availability affect performance? Yes. Across all techniques and modalities, OOD
performance typically stays the same or improves when more labels are available, though we see
rare exceptions. For instance, MAE performance on mammograms drops when finetuning on 100%
of the data, suggesting that the model may have overfit. Once again, future work may benefit from
dynamically adjusting the finetuning process to prevent overfitting.
How does performance change across distributions? Though we sometimes see promising
generalization performance, there are also cases where performance drops on OOD datasets. On the
100% fine-tuning settings for ECG data, EEG data and mammograms, the top-performing technique
on the in-distribution validation set also achieves the best performance across all OOD datasets,
suggesting that generalization is successful. However, we see other cases of performance degradation
due to distribution shift. For example, training from scratch achieves near-perfect performance
on in-distribution dermoscopic images when fine-tuned with 64 or more data points. However,
these models generalize poorly to OOD dermoscopic data, where training from scratch is never
the top-performing technique on any experiment. Similarly, e-Mix is the top performer on most
in-distribution LDCT experiments, yet it performs worst on all experiments using OOD data. Future
work may use regularization techniques to improve generalization performance.
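The LDCT observation above can be checked directly against the reported numbers. The following is a minimal sketch in plain Python that copies the LDCT rows of Table 1 and counts how often e-Mix is the best method in-distribution and the worst method out-of-distribution. The helper names (`best_method`, `worst_method`) are illustrative and not part of the BenchMD codebase.

```python
# AUROC values copied from the LDCT rows of Table 1.
# Column order: Scratch, e-Mix, ShED, MAE (ImageNet pretraining is not reported for LDCT).
METHODS = ["Scratch", "e-Mix", "ShED", "MAE"]

LIDC_IDRI = {  # in-distribution (source) dataset
    "LE":   [0.607, 0.783, 0.779, 0.760],
    "FT-S": [0.599, 0.646, 0.637, 0.578],
    "FT-M": [0.617, 0.697, 0.717, 0.743],
    "FT-L": [0.791, 0.825, 0.812, 0.810],
    "FT-F": [0.797, 0.821, 0.817, 0.818],
}
LNDB = {  # out-of-distribution (target) dataset
    "LE":   [0.578, 0.495, 0.621, 0.615],
    "FT-S": [0.576, 0.515, 0.574, 0.524],
    "FT-M": [0.622, 0.514, 0.625, 0.660],
    "FT-L": [0.638, 0.620, 0.649, 0.647],
    "FT-F": [0.623, 0.597, 0.661, 0.647],
}


def best_method(scores):
    """Return the method with the highest AUROC in one table row."""
    return METHODS[scores.index(max(scores))]


def worst_method(scores):
    """Return the method with the lowest AUROC in one table row."""
    return METHODS[scores.index(min(scores))]


id_wins = sum(best_method(row) == "e-Mix" for row in LIDC_IDRI.values())
ood_last = sum(worst_method(row) == "e-Mix" for row in LNDB.values())

print(f"e-Mix is best on {id_wins}/{len(LIDC_IDRI)} in-distribution setups")
print(f"e-Mix is worst on {ood_last}/{len(LNDB)} OOD setups")
```

The same pattern applies to any other modality block of Table 1 by swapping in the corresponding rows.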
5 Limitations
While BenchMD generally aims to treat all modalities the same, we follow DABS’s example in
providing different embedding modules for 1D, 2D, and 3D data [55]. Users can replace these
modules with their own, and we are hopeful that future approaches will identify ways to unify even
this step across modalities. Additionally, while SSL has demonstrated promise in the medical domain,
our SSL baselines achieve modest performance and fail to provide consistent benefits over ImageNet
pretraining on 2D modalities. This may be because prior objectives do not generalize well across the
diverse domains we consider. The ImageNet training technique we present is also inherently limited,
as it is only appropriate for 2D images; as pretraining on natural images appears to offer benefits,
future work may extend this approach to 1D and 3D data, such as by incorporating natural video
pretraining as well. Our benchmark also currently allows easy modification of architectures and
dataset types used in the training pipeline, flexibly allowing for future training with joint modality
datasets and domain-agnostic Perceivers or CNNs. Finally, while we endeavor to cover a diverse
range of medical modalities and datasets, it is impossible to fully represent the breadth of data in the
medical domain. To protect patient safety, medical AI models should undergo further validation, such
as through site-specific testing, before being deployed.
6 Conclusion
We present BenchMD, a benchmark for evaluating unified methods across medical image and sensor
modalities. While our initial baselines show some potential, there are ample opportunities for future
work to improve both versatility and performance. Methods that succeed on BenchMD may also
be applicable to many other modalities and distributions and can have real-world impact on clinical
practice. We hope BenchMD will help promote the development of high-performing, generalizable
and label-efficient methods for universal learning.
Acknowledgments and Disclosure of Funding
This project was supported by AWS Promotional Credits and by a Harvard Data Science Institute
Competitive Research Award. AT is supported by an Open Phil AI Fellowship.
References
[1] M. D. Abràmoff, J. C. Folk, D. P. Han, J. D. Walker, D. F. Williams, S. R. Russell, P. Massin,
B. Cochener, P. Gain, L. Tang, et al. Automated analysis of retinal images for detection of
referable diabetic retinopathy. JAMA ophthalmology , 131(3):351–357, 2013. 6
[2] M. D. Abràmoff, Y. Lou, A. Erginay, W. Clarida, R. Amelon, J. C. Folk, and M. Niemeijer.
Improved automated detection of diabetic retinopathy on a publicly available dataset through
integration of deep learning. Invest. Ophthalmol. Vis. Sci., 57(13):5200–5206, Oct. 2016. 4, 6
[3] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican,
M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro,
J. L. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Bińkowski,
R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language model
for few-shot learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh,
editors, Advances in Neural Information Processing Systems, volume 35, pages 23716–23736.
Curran Associates, Inc., 2022. URL
https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf. 3
[4] APTOS 2019 Blindness Detection. APTOS 2019 blindness detection.
https://www.kaggle.com/competitions/aptos2019-blindness-detection/data. Accessed: 2022-11-11. 6
[5] S. G. Armato III, G. McLennan, L. Bidaut, M. F. McNitt-Gray, C. R. Meyer, A. P. Reeves,
B. Zhao, D. R. Aberle, C. I. Henschke, E. A. Hoffman, et al. The lung image database
consortium (lidc) and image database resource initiative (idri): a completed reference database
of lung nodules on ct scans. Medical physics , 38(2):915–931, 2011. 6
[6] S. Azizi, L. Culp, J. Freyberg, B. Mustafa, S. Baur, S. Kornblith, T. Chen, P. MacWilliams,
S. Sara Mahdavi, E. Wulczyn, B. Babenko, M. Wilson, A. Loh, P.-H. C. Chen, Y. Liu, P. Bavishi,
S. M. McKinney, J. Winkens, A. G. Roy, Z. Beaver, F. Ryan, J. Krogue, M. Etemadi,
U. Telang, Y. Liu, L. Peng, G. S. Corrado, D. R. Webster, D. Fleet, G. Hinton, N. Houlsby,
A. Karthikesalingam, M. Norouzi, and V. Natarajan. Robust and efficient medical imaging with
Self-Supervision. ArXiv, May 2022. 3
[7] A. Bandyopadhyay and C. Goldstein. Clinical applications of artificial intelligence in sleep
medicine: a sleep clinician’s perspective. Sleep and Breathing , Mar. 2022. 4
[8] K. Blagec, J. Kraiger, W. Frühwirt, and M. Samwald. Benchmark datasets driving artificial
intelligence development fail to capture the needs of medical professionals. ArXiv , Jan. 2022. 1
[9] S. Boumaraf, X. Liu, C. Ferkous, and X. Ma. A new Computer-Aided diagnosis system with
modified genetic feature selection for BI-RADS classification of breast masses in mammograms.
Biomed Res. Int. , 2020:7695207, May 2020. 5
[10] S. R. Bowman and G. E. Dahl. What will it take to fix benchmarking in natural language
understanding? ArXiv , Apr. 2021. 1
[11] Z. Cai, C. Liu, H. Gao, X. Wang, L. Zhao, Q. Shen, E. Ng, and J. Li. An open-access long-term
wearable ecg database for premature ventricular contractions and supraventricular premature
beat detection. Journal of Medical Imaging and Health Informatics , 10(11):2663–2667, 2020. 4
[12] W. Chiao and M. L. Durr. Trends in sleep studies performed for medicare beneficiaries. The
Laryngoscope , 127(12):2891–2896, 2017. 4
[13] K. Clark, B. Vendt, K. Smith, J. Freymann, J. Kirby, P. Koppel, S. Moore, S. Phillips, D. Maffitt,
M. Pringle, L. Tarbox, and F. Prior. The cancer imaging archive (TCIA): maintaining and
operating a public information repository. J. Digit. Imaging , 26(6):1045–1057, Dec. 2013. 5
[14] M. Combalia, N. C. Codella, V. Rotemberg, B. Helba, V. Vilaplana, O. Reiter, C. Carrera,
A. Barreiro, A. C. Halpern, S. Puig, et al. Bcn20000: Dermoscopic lesions in the wild. arXiv
preprint arXiv:1908.02288, 2019. 5
[15] E. Decencière, X. Zhang, G. Cazuguel, B. Lay, B. Cochener, C. Trone, P. Gain, R. Ordonez,
P. Massin, A. Erginay, et al. Feedback on a publicly distributed image database: the messidor
database. Image Analysis & Stereology , 33(3):231–234, 2014. 6
[16] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani,
M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for
image recognition at scale. arXiv preprint arXiv:2010.11929 , 2020. 7
[17] B. Duce, C. Rego, J. Milosavljevic, and C. Hukins. The AASM recommended and acceptable
EEG montages are comparable for the staging of sleep and scoring of EEG arousals. J. Clin.
Sleep Med. , 10(7):803–809, July 2014. 4
[18] L. Ericsson, H. Gouk, C. C. Loy, and T. M. Hospedales. Self-Supervised representation learning:
Introduction, advances, and challenges. IEEE Signal Process. Mag. , 39(3):42–62, May 2022. 1
[19] F. C. Ghesu, B. Georgescu, A. Mansoor, Y. Yoo, D. Neumann, P. Patel, R. S. Vishwanath, J. M.
Balter, Y. Cao, S. Grbic, and D. Comaniciu. Self-supervised learning from 100 million medical
images. ArXiv, Jan. 2022. 3
[20] R. Girdhar, A. El-Nouby, M. Singh, K. V. Alwala, A. Joulin, and I. Misra. OmniMAE: Single
model masked pretraining on images and videos. ArXiv, June 2022. 3
[21] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E.
Mietus, G. B. Moody, C. K. Peng, and H. E. Stanley. PhysioBank, PhysioToolkit, and PhysioNet:
components of a new research resource for complex physiologic signals. Circulation , 101(23):
E215–20, June 2000. 4
[22] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E.
Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. Physiobank, physiotoolkit, and physionet:
components of a new research resource for complex physiologic signals. circulation , 101(23):
e215–e220, 2000. 4, 5
[23] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. B. Girshick. Masked autoencoders are scalable
vision learners. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 15979–15988, 2021. 7
[24] I. Hoorens, K. Vossaert, S. Lanssens, L. Dierckxsens, G. Argenziano, and L. Brochez. Value
of dermoscopy in a Population-Based screening sample by dermatologists. Dermatol Pract
Concept, 9(3):200–206, July 2019. 4, 5
[25] E. Immonen, J. Wong, M. Nieminen, L. Kekkonen, S. Roine, S. Törnroos, L. Lanca, F. Guan,
and E. Metsälä. The use of deep learning towards dose optimization in low-dose computed
tomography: A scoping review. Radiography , 2021. 6
[26] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo,
R. Ball, K. Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty
labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence ,
volume 33, pages 590–597, 2019. 4, 5, 14
[27] A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira. Perceiver: General
perception with iterative attention. In International conference on machine learning , pages
4651–4664. PMLR, 2021. 3
[28] A. Johnson, T. Pollard, R. Mark, S. Berkowitz, and S. Horng. MIMIC-CXR database, Sept.
2019. 5
[29] A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, R. G.
Mark, and S. Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs
with free-text reports. Scientific data , 6(1):1–8, 2019. 5
[30] A. Ke, W. Ellsworth, O. Banerjee, A. Y. Ng, and P. Rajpurkar. CheXtransfer: performance and
parameter efficiency of ImageNet models for chest X-Ray interpretation. In Proceedings of
the Conference on Health, Inference, and Learning, CHIL ’21, pages 116–124, New York, NY,
USA, Apr. 2021. Association for Computing Machinery. 5
[31] S. Khalighi, T. Sousa, J. M. Santos, and U. Nunes. Isruc-sleep: A comprehensive public dataset
for sleep researchers. Computer methods and programs in biomedicine , 124:180–192, 2016. 4
[32] P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga,
R. L. Phillips, I. Gao, T. Lee, E. David, I. Stavness, W. Guo, B. Earnshaw, I. Haque,
S. M. Beery, J. Leskovec, A. Kundaje, E. Pierson, S. Levine, C. Finn, and P. Liang. WILDS: A
benchmark of in-the-wild distribution shifts. In M. Meila and T. Zhang, editors, Proceedings
of the 38th International Conference on Machine Learning, volume 139 of Proceedings of
Machine Learning Research, pages 5637–5664. PMLR, 2021. 2, 3
[33] R. Krishnan, P. Rajpurkar, and E. J. Topol. Self-supervised learning in medicine and healthcare.
Nat Biomed Eng , Aug. 2022. 2
[34] K. Lee, Y. Zhu, K. Sohn, C.-L. Li, J. Shin, and H. Lee. i-mix: A Domain-Agnostic strategy for
contrastive representation learning. ArXiv , Oct. 2020. 3
[35] R. S. Lee, F. Gimenez, A. Hoogi, K. K. Miyake, M. Gorovoy, and D. L. Rubin. A curated
mammography data set for use in computer-aided detection and diagnosis research. Scientific
data, 4(1):1–9, 2017. 5
[36] V. Likhosherstov, A. Arnab, K. Choromanski, M. Lucic, Y. Tay, A. Weller, and M. Dehghani.
PolyViT: Co-training vision transformers on images, videos and audio. ArXiv, Nov. 2021. 3
[37] D. L. Monticciolo, S. F. Malak, S. M. Friedewald, P. R. Eby, M. S. Newell, L. Moy, S. Destounis,
J. W. T. Leung, R. E. Hendrick, and D. Smetherman. Breast cancer screening recommendations
inclusive of all women at average risk: Update from the ACR and society of breast imaging. J.
Am. Coll. Radiol. , 18(9):1280–1288, Sept. 2021. 4
[38] H. Q. Nguyen, K. Lam, L. T. Le, H. H. Pham, D. Q. Tran, D. B. Nguyen, D. D. Le, C. M. Pham,
H. T. Tong, D. H. Dinh, et al. Vindr-cxr: An open dataset of chest x-rays with radiologist’s
annotations. Scientific Data , 9(1):1–7, 2022. 5
[39] H. T. Nguyen, H. Q. Nguyen, H. H. Pham, K. Lam, L. T. Le, M. Dao, and V. Vu.
Vindr-mammo: A large-scale benchmark dataset for computer-aided diagnosis in full-field digital
mammography. medRxiv, 2022. 5
[40] Institute of Medicine. Sleep Disorders and Sleep Deprivation: An Unmet Public Health
Problem. The National Academies Press, Washington, DC, 2006. ISBN 978-0-309-10111-0.
doi: 10.17226/11617. URL
https://nap.nationalacademies.org/catalog/11617/sleep-disorders-and-sleep-deprivation-an-unmet-public-health-problem. 4
[41] OpenAI. GPT-4 technical report. Arxiv , Mar. 2023. 1
[42] S. Ott, A. Barbosa-Silva, K. Blagec, J. Brauner, and M. Samwald. Mapping global dynamics of
benchmark creation and saturation in artificial intelligence. Nat. Commun. , 13(1):6793, Nov.
2022. 1
[43] A. G. Pacheco, G. R. Lima, A. S. Salomão, B. Krohling, I. P. Biral, G. G. de Angelo, F. C.
Alves Jr, J. G. Esgario, A. C. Simora, P. B. Castro, et al. Pad-ufes-20: A skin lesion dataset
composed of patient data and clinical images collected from smartphones. Data in brief , 32:
106221, 2020. 5
[44] V. S. Parekh, S. Lai, V. Braverman, J. Leal, S. Rowe, J. J. Pillai, and M. A. Jacobs. Cross-Domain
federated learning in medical imaging. ArXiv, Dec. 2021. 2
[45] J. Pedrosa, G. Aresta, C. Ferreira, M. Rodrigues, P. Leitão, A. S. Carvalho, J. Rebelo, E. Negrão,
I. Ramos, A. Cunha, et al. Lndb: a lung nodule database on computed tomography. arXiv
preprint arXiv:1911.08434 , 2019. 6
[46] S. F. Quan, B. V. Howard, C. Iber, J. P. Kiley, F. J. Nieto, G. T. O’Connor, D. M. Rapoport,
S. Redline, J. Robbins, J. M. Samet, et al. The sleep heart health study: design, rationale, and
methods. Sleep , 20(12):1077–1085, 1997. 4
[47] D. Raji, E. Denton, E. M. Bender, A. Hanna, and A. Paullada. Ai and the everything in the
whole wide world benchmark. In J. Vanschoren and S. Yeung, editors, Proceedings of the
Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1. Curran,
2021. URL
https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/084b6fbb10729ed4da8c3d3f5a3ae7c9-Paper-round2.pdf. 1
[48] S. Ramchandre, B. Patil, S. Pharande, K. Javali, and H. Pande. A deep learning approach for
diabetic retinopathy detection using transfer learning. In 2020 IEEE International Conference
for Innovation in Technology (INOCON) , pages 1–5, Nov. 2020. 6
[49] S. Sagawa, P. W. Koh, T. Lee, I. Gao, S. M. Xie, K. Shen, A. Kumar, W. Hu, M. Yasunaga,
H. Marklund, S. Beery, E. David, I. Stavness, W. Guo, J. Leskovec, K. Saenko, T. Hashimoto,
S. Levine, C. Finn, and P. Liang. Extending the WILDS benchmark for unsupervised adaptation.
Arxiv , Dec. 2021. 3
[50] A. Schreuder, E. T. Scholten, B. van Ginneken, and C. Jacobs. Artificial intelligence for
detection and characterization of pulmonary nodules in lung cancer CT screening: ready for
practice? Transl Lung Cancer Res , 10(5):2378–2388, May 2021. 4
[51] K. Smith. Curated breast imaging subset of digital database for screening mammography
(CBIS-DDSM) - the cancer imaging archive (TCIA) public access - cancer imaging archive
wiki. https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=22516629.
Accessed: 2022-11-11. 5
[52] N. Sridhar, A. Shoeb, P. Stephens, A. Kharbouch, D. B. Shimol, J. Burkart, A. Ghoreyshi, and
L. Myers. Deep learning for automated sleep staging using instantaneous heart rate. NPJ digital
medicine , 3(1):1–10, 2020. 14
[53] H. Takahashi, H. Tampo, Y. Arai, Y. Inoue, and H. Kawashima. Applying artificial intelligence
to disease staging: Deep learning for improved staging of diabetic retinopathy. PLoS One,
12(6):e0179790, June 2017. 6, 19
[54] A. Tamkin, M. Wu, and N. Goodman. Viewmaker networks: Learning views for unsupervised
representation learning. arXiv preprint arXiv:2010.07432 , 2020. 3
[55] A. Tamkin, V. Liu, R. Lu, D. E. Fein, C. Schultz, and N. D. Goodman. Dabs: A domain-agnostic
benchmark for self-supervised learning. ArXiv, abs/2111.12062, 2021. 1, 2, 3, 6, 7, 9, 14
[56] A. Tamkin, G. Banerjee, M. Owda, V. Liu, S. Rammoorthy, and N. Goodman. Dabs
2.0: Improved datasets and algorithms for universal self-supervision. In S. Koyejo,
S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural
Information Processing Systems, volume 35, pages 38358–38372. Curran Associates,
Inc., 2022. URL
https://proceedings.neurips.cc/paper_files/paper/2022/file/fa73aca7b2af724fafbd4852957cd3e0-Paper-Datasets_and_Benchmarks.pdf. 1, 2, 3, 7
[57] The 4th Asia Pacific Tele-Ophthalmology Society Symposium. The 4th asia pacific
Tele-Ophthalmology society symposium. https://2019.asiateleophth.org/. Accessed: 2022-11-11. 6
[58] P. Tschandl, C. Rosendahl, and H. Kittler. The ham10000 dataset, a large collection of multi-
source dermatoscopic images of common pigmented skin lesions. Scientific data , 5(1):1–9,
2018. 5
[59] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and
I. Polosukhin. Attention is all you need. Advances in neural information processing systems ,
30, 2017. 7
[60] P. Wagner, N. Strodthoff, R.-D. Bousseljot, D. Kreiseler, F. I. Lunze, W. Samek, and T. Schaeffter.
Ptb-xl, a large publicly available electrocardiography dataset. Scientific data , 7(1):1–15, 2020.
4
[61] P. Wagner, N. Strodthoff, R.-D. Bousseljot, W. Samek, and T. Schaeffter. PTB-XL, a large
publicly available electrocardiography dataset, Nov. 2022. 4
[62] A. B. Wolbarst, P. Capasso, and A. R. Wyant. Medical Imaging: Essentials for Physicians .
John Wiley & Sons, Apr. 2013. 2
[63] H. Yao, C. Choi, B. Cao, Y. Lee, P. W. Koh, and C. Finn. Wild-Time: A benchmark of
in-the-wild distribution shift over time. Oct. 2022. 3
[64] X. Zhai, J. Puigcerver, A. Kolesnikov, P. Ruyssen, C. Riquelme, M. Lucic, J. Djolonga, A. S.
Pinto, M. Neumann, A. Dosovitskiy, L. Beyer, O. Bachem, M. Tschannen, M. Michalski,
O. Bousquet, S. Gelly, and N. Houlsby. A large-scale study of representation learning with the
visual task adaptation benchmark. Arxiv , Oct. 2019. 3
[65] J. Zheng, J. Zhang, S. Danioko, H. Yao, H. Guo, and C. Rakovski. A 12-lead electrocardiogram
database for arrhythmia research covering more than 10,000 patients. Scientific data , 7(1):1–8,
2020. 4
[66] L. Zhou, H. Liu, J. Bae, J. He, D. Samaras, and P. Prasanna. Self pre-training with masked
autoencoders for medical image analysis. Arxiv , Mar. 2022. 3
[67] Z. Zhou, V . Sodha, J. Pang, M. B. Gotway, and J. Liang. Models genesis. Med. Image Anal. , 67:
101840, Jan. 2021. 3
[68] H. Zhu, C. Cheng, H. Yin, X. Li, P. Zuo, J. Ding, F. Lin, J. Wang, B. Zhou, Y. Li, S. Hu,
Y. Xiong, B. Wang, G. Wan, X. Yang, and Y. Yuan. Automatic multilabel electrocardiogram
diagnosis of heart rhythm or conduction abnormalities with deep learning: a cohort study.
Lancet Digit Health , 2(7):e348–e357, July 2020. 4
A Additional Results
See Tables 1 and 2 for our full set of results.
B Methods Details
Our domain-agnostic transformer architecture contains 9.7M parameters, the same as the ImageNet-
pretrained ViT-T we use to compare performance for 2D modalities. The domain-agnostic architecture
is the same as that used in [55], and all architecture implementations can be found in our repository:
https://github.com/rajpurkarlab/BenchMD/tree/main/src/models. We trained in single-GPU mode
with A10G Tensor Core GPUs, and the batch size used in all experiments was 64, with
the exception of LDCT experiments, where we used a batch size of 32. Each pretraining run takes
roughly 30 hours, and each transfer learning run takes up to 1 day, with most experiments finishing in
under 8 hours.
C Dataset Input and Label Preprocessing
C.1 ECG
Preprocessing All data was exhaustively cropped to 10 second segments, sampled at 500 Hz,
making each input a 1D vector of 2500 components with 12 channels. Leftover segments of less than
10 seconds were dropped. We performed substantial relabeling to create a unified machine
learning task across the datasets. Our final 7 classes are: Normal, CD (Conduction
Disturbance), HYP (Hypertrophy), MI (Myocardial Infarction), STTC (ST-T wave Change-ischemia),
A. Fib/ Aflutter (Atrial fibrillation/ Atrial flutter), and Other. Below is a name mapping from each
original dataset to our new label formulation.
Task Standardization The label consolidation for each ECG dataset under the 7-class task is
given in Tables 3-6. The distribution of classes for ECG datasets is shown in Table 7.
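To make the two ECG steps concrete, here is a minimal sketch in plain Python of exhaustive cropping and label consolidation. It is not the BenchMD implementation: the function names are hypothetical, the mapping contains only a handful of PTB-XL codes taken from Table 3, and skipping codes that are absent from the mapping is an assumption of this sketch.

```python
SEGMENT_LEN = 2500  # samples per segment, as stated in the text


def crop_segments(record, segment_len=SEGMENT_LEN):
    """Exhaustively crop a multi-lead record into full-length segments.

    `record` is a list of leads, each a list of samples of equal length.
    Leftover samples that do not fill a whole segment are dropped.
    """
    n_full = len(record[0]) // segment_len
    return [
        [lead[i * segment_len:(i + 1) * segment_len] for lead in record]
        for i in range(n_full)
    ]


CLASSES = ["Normal", "CD", "HYP", "MI", "STTC", "A. Fib/ Aflutter", "Other"]

# A few PTB-XL codes taken from Table 3; the full mapping has many more.
PTBXL_TO_CLASS = {
    "NORM": "Normal",
    "SR": "Normal",
    "CLBBB": "CD",
    "LVH": "HYP",
    "IMI": "MI",
    "ISC_": "STTC",
    "AFIB": "A. Fib/ Aflutter",
    "PVC": "Other",
}


def consolidate(raw_labels, mapping=PTBXL_TO_CLASS):
    """Map raw dataset codes to consolidated classes, in CLASSES order.

    Codes missing from the mapping are skipped (an assumption of this sketch).
    """
    found = {mapping[code] for code in raw_labels if code in mapping}
    return [name for name in CLASSES if name in found]


# Toy record: 2 leads of 7 samples, cropped into segments of 3 samples.
toy_record = [list(range(7)), list(range(10, 17))]
toy_segments = crop_segments(toy_record, segment_len=3)
print(len(toy_segments), consolidate(["AFIB", "NORM"]))
```

The other three datasets would use their own dictionaries built from Tables 4-6, with the same `CLASSES` list as the shared target.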
C.2 EEG
Preprocessing The Sleep Heart Health Study dataset consists of two rounds of polysomnographic
recordings (SHHS-1 and SHHS-2) sampled at 125 Hz, and we only use SHHS-1, containing 5,793
records over two channels (C4-A1 and C3-A2). Recordings are manually classified into one of six
classes (W, N1, N2, N3, N4 and REM). In SHHS, we have an additional stage N4, which we merge
with the N3 stage, matching the five stages of sleep according to the American Academy of Sleep
Medicine (AASM) [52]. Each channel of the EEG recording is a vector of 3750 components (125
Hz × 30-second recording), and each patient has multiple 30-second recording epochs.
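The epoch length and stage merging described above can be sketched as follows. This is an illustrative plain-Python example, not the BenchMD code; `merge_stage` and `split_epochs` are hypothetical helper names, and dropping a trailing partial epoch is an assumption.

```python
FS = 125                        # SHHS sampling rate in Hz
EPOCH_SECONDS = 30              # length of one scored epoch
EPOCH_LEN = FS * EPOCH_SECONDS  # 3750 samples per channel per epoch


def merge_stage(stage):
    """Fold stage N4 into N3 so that labels follow the five AASM stages."""
    return "N3" if stage == "N4" else stage


def split_epochs(channel, epoch_len=EPOCH_LEN):
    """Split one EEG channel (a list of samples) into consecutive epochs.

    A trailing partial epoch is dropped (an assumption of this sketch).
    """
    n_epochs = len(channel) // epoch_len
    return [channel[i * epoch_len:(i + 1) * epoch_len] for i in range(n_epochs)]


stages = ["W", "N1", "N2", "N3", "N4", "REM"]
merged = sorted({merge_stage(s) for s in stages})
print(EPOCH_LEN, merged)
```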
The recordings from the transfer dataset (ISRUC) consist of channels C3 and C4, which were also
segmented into epochs of 30 seconds. The ISRUC dataset was downsampled to 125 Hz from the original
150 Hz to match SHHS.
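The text does not say which resampling method was used. As a rough illustration only, the sketch below resamples by linear interpolation in plain Python, which is a stand-in technique: it applies no anti-aliasing filter, whereas a production pipeline would more likely use a filtered polyphase resampler. The function name `resample_linear` is hypothetical.

```python
def resample_linear(samples, fs_in, fs_out):
    """Resample a 1D signal from fs_in to fs_out by linear interpolation.

    Stand-in method for illustration; no anti-aliasing filter is applied.
    """
    n_out = len(samples) * fs_out // fs_in
    last = len(samples) - 1
    out = []
    for k in range(n_out):
        pos = k * fs_in / fs_out   # fractional position in the input signal
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, last)]
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
    return out


# One 30-second epoch at 150 Hz (4500 samples) becomes 3750 samples at 125 Hz.
ramp = [float(i) for i in range(150 * 30)]
resampled = resample_linear(ramp, fs_in=150, fs_out=125)
print(len(resampled))
```

After resampling, each 30-second epoch has the same 3750-sample length as an SHHS epoch, so both datasets can share one input shape.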
The distribution of classes for EEG datasets is shown in Table 8.
C.3 Chest X-Rays
Preprocessing During training, we load each 3-channel JPG image, transform it to a 1-channel
grayscale image (except for the VinDr-CXR dataset, where we extract the grayscale pixel array
directly from the DICOM file), resize the longer side to 224 pixels while maintaining the image’s
aspect ratio, perform per-channel standardization based on the training set statistics, and pad the
image with zeros to get a 224×224 final grayscale image. We selected the five competition categories
from CheXpert [26] as our classes: Atelectasis, Cardiomegaly, Consolidation, Edema, and Pleural
Effusion.
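The geometry of the resize, standardize, and pad steps can be sketched as below. This is a dependency-free illustration on nested lists rather than the BenchMD implementation: the function names are hypothetical, the actual pixel resampling is left to an image library, and padding on the right and bottom is an assumption, since the text does not say where the zeros are placed.

```python
def resized_shape(height, width, target=224):
    """Shape after scaling the longer side to `target`, keeping the aspect ratio."""
    scale = target / max(height, width)
    return max(1, round(height * scale)), max(1, round(width * scale))


def standardize(img, mean, std):
    """Standardize a single-channel image with training-set statistics."""
    return [[(p - mean) / std for p in row] for row in img]


def pad_to_square(img, target=224):
    """Zero-pad an image (list of rows) to target x target.

    Padding is added on the right and bottom (an assumption of this sketch).
    """
    h, w = len(img), len(img[0])
    padded = [row + [0.0] * (target - w) for row in img]
    padded += [[0.0] * target for _ in range(target - h)]
    return padded


# A 448 x 224 portrait image is resized to 224 x 112, then padded to 224 x 224.
new_h, new_w = resized_shape(448, 224)
toy = pad_to_square(standardize([[2.0], [4.0]], mean=3.0, std=1.0), target=2)
print((new_h, new_w), toy)
```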
For the VinDr-CXR dataset, we perform the standard pixel array extraction from the DICOM files:
1. Extract the single-channel grayscale “pixel_array” from the DICOM file.
Table 1: AUROC achieved for each modality: electrocardiograms (ECG), electroencephalograms
(EEG), chest X-rays (CXR), mammograms (Mammo), dermoscopic images (Derm), fundus images
(Fundus), and low-dose computed tomography (LDCT) scans. The source and target datasets are
labelled accordingly. We consider the performance of each pretraining objective along the columns: no
pretraining (Scratch), e-Mix, ShED, MAE, and ImageNet pretraining (IN). For each modality, we
report results for each transfer learning setup along the rows: linear evaluation (LE) and fine-tuning
in the small, medium, large, and full-data settings (FT-S / FT-M / FT-L / FT-F, respectively). For
each dataset and transfer learning setup, we bold the max AUROC achieved across the different
pre-training methods.
Scratch e-Mix ShED MAE IN Scratch e-Mix ShED MAE IN
ECG PTB-XL (source) Chapman-Shaoxing (target)
LE 0.516 0.604 0.737 0.513 - 0.551 0.569 0.694 0.529 -
FT-S 0.677 0.705 0.733 0.514 - 0.504 0.703 0.723 0.530 -
FT-M 0.687 0.706 0.744 0.514 - 0.650 0.665 0.758 0.535 -
FT-L 0.730 0.716 0.797 0.537 - 0.672 0.744 0.762 0.655 -
FT-F 0.731 0.733 0.797 0.678 - 0.711 0.752 0.806 0.656 -
ECG Georgia (target) CPSC (target)
LE 0.527 0.566 0.692 0.492 - 0.523 0.532 0.687 0.527 -
FT-S 0.497 0.635 0.631 0.493 - 0.494 0.508 0.572 0.527 -
FT-M 0.543 0.685 0.693 0.493 - 0.577 0.573 0.718 0.573 -
FT-L 0.586 0.707 0.696 0.522 - 0.640 0.644 0.736 0.692 -
FT-F 0.654 0.691 0.769 0.572 - 0.691 0.636 0.734 0.724 -
EEG SHHS (source) ISRUC (target)
LE 0.581 0.527 0.537 0.446 - 0.564 0.479 0.525 0.526 -
FT-S 0.551 0.628 0.542 0.681 - 0.462 0.496 0.529 0.506 -
FT-M 0.553 0.673 0.553 0.729 - 0.498 0.479 0.589 0.580 -
FT-L 0.625 0.682 0.603 0.756 - 0.498 0.510 0.591 0.680 -
FT-F 0.675 0.690 0.640 0.758 - 0.552 0.525 0.583 0.673 -
CXR MIMIC (source) CheXpert (target)
LE 0.685 0.676 0.732 0.685 0.669 0.723 0.745 0.760 0.756 0.812
FT-S 0.685 0.677 0.731 0.685 0.730 0.720 0.740 0.732 0.720 0.762
FT-M 0.724 0.724 0.776 0.740 0.787 0.720 0.745 0.745 0.720 0.800
FT-L 0.740 0.730 0.782 0.742 0.790 0.720 0.746 0.770 0.723 0.812
FT-F 0.740 0.744 0.783 0.789 0.792 0.723 0.746 0.770 0.772 0.813
CXR VINDR-CXR (target)
LE 0.546 0.502 0.576 0.567 0.628
FT-S 0.533 0.520 0.551 0.575 0.594
FT-M 0.561 0.530 0.534 0.576 0.628
FT-L 0.562 0.534 0.576 0.576 0.632
FT-F 0.576 0.576 0.632 0.579 0.628
Mammo Vindr-Mammo (source) CBIS-DDSM (target)
LE 0.558 0.507 0.500 0.561 0.606 0.464 0.531 0.499 0.465 0.579
FT-S 0.570 0.541 0.509 0.564 0.544 0.444 0.513 0.528 0.490 0.541
FT-M 0.587 0.552 0.546 0.565 0.597 0.541 0.490 0.502 0.506 0.546
FT-L 0.601 0.579 0.531 0.590 0.594 0.565 0.470 0.520 0.597 0.520
FT-F 0.611 0.588 0.542 0.625 0.645 0.487 0.482 0.510 0.565 0.607
Derm BCN 20000 (source) HAM 10000 (target)
LE 0.638 0.673 0.605 0.794 0.847 0.609 0.657 0.750 0.755 0.846
FT-S 0.638 0.798 0.735 0.666 0.812 0.609 0.814 0.806 0.666 0.809
FT-M 0.997 0.932 0.794 0.777 0.818 0.826 0.790 0.760 0.732 0.823
FT-L 0.998 0.939 0.803 0.853 0.836 0.827 0.805 0.904 0.820 0.825
FT-F 0.998 0.955 0.982 0.987 0.996 0.827 0.966 0.967 0.986 0.877
Derm PAD-UFES-20 (target)
LE 0.487 0.589 0.640 0.642 0.648
FT-S 0.487 0.600 0.588 0.552 0.629
FT-M 0.585 0.591 0.684 0.752 0.647
FT-L 0.585 0.594 0.658 0.758 0.650
FT-F 0.596 0.598 0.656 0.794 0.660
Fundus Messidor-2 (source) APTOS 2019 (target)
LE 0.677 0.660 0.451 0.741 0.741 0.474 0.496 0.496 0.440 0.641
FT-S 0.647 0.625 0.441 0.713 0.679 0.453 0.422 0.552 0.381 0.602
FT-M 0.731 0.739 0.481 0.752 0.800 0.514 0.411 0.539 0.420 0.593
FT-L 0.831 0.870 0.507 0.891 0.890 0.535 0.450 0.573 0.449 0.683
FT-F 0.983 1.000 0.557 1.000 1.000 0.472 0.417 0.564 0.476 0.673
Fundus Jinchi Medical University (target)
LE 0.523 0.532 0.500 0.546 0.587
FT-S 0.583 0.612 0.500 0.495 0.505
FT-M 0.556 0.537 0.500 0.533 0.546
FT-L 0.561 0.653 0.488 0.546 0.679
FT-F 0.570 0.632 0.499 0.529 0.602
LDCT LIDC-IDRI (source) LNDb (target)
LE 0.607 0.783 0.779 0.760 - 0.578 0.495 0.621 0.615 -
FT-S 0.599 0.646 0.637 0.578 - 0.576 0.515 0.574 0.524 -
FT-M 0.617 0.697 0.717 0.743 - 0.622 0.514 0.625 0.660 -
FT-L 0.791 0.825 0.812 0.810 - 0.638 0.620 0.649 0.647 -
FT-F 0.797 0.821 0.817 0.818 - 0.623 0.597 0.661 0.647 -
Table 2: AUROC achieved for the 2D mammogram and fundus image modalities, with additional
results for MAE pre-training (the best-performing SSL algorithm on source data in both
modalities) applied after ImageNet pretraining (IN+MAE). For the IN+MAE columns, we also
performed additional hyperparameter tuning, trying combinations of learning rates (1e-3, 1e-4,
and 1e-5) across the MAE pre-training and transfer learning stages. For each dataset and transfer
learning setup, we bold the max AUROC achieved across the different pre-training methods, with
the exception of the result for Jinchi Medical University in the finetune-small transfer setting,
where e-Mix still outperforms IN+MAE.
Scratch MAE IN+MAE IN Scratch MAE IN+MAE IN Scratch MAE IN+MAE IN
Mammo Vindr-Mammo (source) CBIS-DDSM (target)
LE 0.558 0.561 0.610 0.606 0.464 0.465 0.584 0.579
FT-S 0.570 0.564 0.575 0.544 0.444 0.490 0.516 0.541
FT-M 0.587 0.565 0.605 0.597 0.541 0.506 0.592 0.546
FT-L 0.601 0.590 0.605 0.594 0.565 0.597 0.577 0.520
FT-F 0.611 0.625 0.616 0.645 0.487 0.565 0.492 0.607
Fundus Messidor-2 (source) APTOS 2019 (target) Jinchi Medical University (target)
LE 0.677 0.741 0.751 0.741 0.474 0.440 0.553 0.641 0.523 0.546 0.562 0.587
FT-S 0.647 0.713 0.663 0.679 0.453 0.381 0.429 0.602 0.583 0.495 0.542 0.505
FT-M 0.731 0.752 0.782 0.800 0.514 0.420 0.477 0.593 0.556 0.533 0.675 0.546
FT-L 0.831 0.891 0.882 0.890 0.535 0.449 0.608 0.683 0.561 0.546 0.661 0.679
FT-F 0.983 1.000 1.000 1.000 0.472 0.476 0.621 0.673 0.570 0.529 0.674 0.602
Table 3: PTB-XL Label Mappings
Class Name | PTB-XL Labels Included
Normal | NORM, SARRH, SBRAD, SR, STACH
CD | AVB, 1AVB, 2AVB, 3AVB, CD, CLBBB, CRBBB, ILBBB, IRBBB, IVCB IVCD, LAFB, LAFB/LPFB, LPFB, LPR, PSVT, SVARR, SVTAC, WPW
HYP | HYP, ALAD, LAD, LAO/LAE, LVH, RAD, RHV, RVH, RAO/RAE, SEHYP, VCLVH
MI | AMI, ALMI, ASMI, ILMI, IMI, INJAL, INJIL, INJLA, INVT, IPLMI, IPMI, LMI, MI, PMI
STTC | ANEUR, DIG, EL, ISC_, ISCA, ISCAL, ISCAN, ISCAS, ISCI, ISCIL, ISCIN, ISCLA, LNGQT, NDT, NST_, NT_, STD_, STE_, STTC, TAB_
A. Fib/ Aflutter | AFIB, AFLT
Other | ABQRS, ARAD, AXL, AXR, BIGU, HVOLT, LOWT, LVOLT, PACE, PAC, PRC(S), PVC, QWAVE, SAG, and TRIGU
Table 4: Chapman-Shaoxing Label Mappings
Class Name | Chapman-Shaoxing Labels Included
Normal | NORM, SB, SR, ST
CD | 1AVB, 2AVB2, AVB, AVNRT, AT, CAVB, CLBBB, IIAVBI, IVB, JEB, JPT, Nonspecific BBB, PRIE, PRWP, PWC, SAAWR, SVT, VEB, VET, VPB, VPE, WAVN, WPW
HYP | ALS, ARS, CR, LVH, LVHV, RAH, RAVC, RVH
MI | MILW
STTC | STDD, STE, STTC, STTU, TTW, TWO
A. Fib/ Aflutter | AF, AFIB
Other | ABI, APB, AQW, ERV, FQRS, LVQRSCL, LVQRSLL, PTW, UW, VB
2. Scale the pixel array by the “RescaleSlope” attribute and add the value of the
“RescaleIntercept” attribute to every pixel, if these attributes are available.
3. Rescale the array to pixel values between 0 and 255.
4. Invert the pixels if the “PhotometricInterpretation” attribute is set to “MONOCHROME1.”
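The four extraction steps for VinDr-CXR can be sketched as a single function. This is a dependency-free illustration on nested lists, not the BenchMD code: in practice the pixel array and attributes would be read from the DICOM file with a DICOM library and handled as arrays. The function name `extract_pixels` is hypothetical, and interpreting step 3 as min-max scaling is an assumption.

```python
def extract_pixels(pixel_array, slope=None, intercept=None,
                   photometric="MONOCHROME2"):
    """Apply steps 2-4 to a single-channel pixel array (step 1 is the input).

    `pixel_array` is a list of rows; `slope`, `intercept`, and `photometric`
    stand for the RescaleSlope, RescaleIntercept, and
    PhotometricInterpretation attributes of the DICOM file.
    """
    rows = [[float(p) for p in row] for row in pixel_array]

    # Step 2: apply the rescale attributes when they are available.
    if slope is not None and intercept is not None:
        rows = [[p * slope + intercept for p in row] for row in rows]

    # Step 3: map values to [0, 255]; min-max scaling is assumed here.
    lo = min(min(row) for row in rows)
    hi = max(max(row) for row in rows)
    span = hi - lo
    if span == 0:
        rows = [[0.0 for _ in row] for row in rows]  # constant image
    else:
        rows = [[(p - lo) / span * 255.0 for p in row] for row in rows]

    # Step 4: MONOCHROME1 stores inverted intensities, so flip them back.
    if photometric == "MONOCHROME1":
        rows = [[255.0 - p for p in row] for row in rows]
    return rows


raw = [[0, 50], [100, 200]]
normal = extract_pixels(raw, slope=2.0, intercept=-10.0)
inverted = extract_pixels(raw, slope=2.0, intercept=-10.0,
                          photometric="MONOCHROME1")
print(normal, inverted)
```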
The distribution of classes for chest x-ray datasets is shown in Table 9.
C.4 Mammograms
Preprocessing The mammography data is distributed in the Digital Imaging and Communications
in Medicine (DICOM) file format, so to improve data access speeds during training, we preprocess
the data into JPG format. We first perform the standard pixel array extraction from the DICOM files:
1. Extract the single-channel grayscale “pixel_array” from the DICOM file.
2. Scale the pixel array by the “RescaleSlope” attribute and add the value of the
“RescaleIntercept” attribute to every pixel, if these attributes are available.
From here, we save the pixel arrays as JPGs:
Table 5: Georgia ECG Label Mappings
Class Name | Georgia ECG Labels Included
Normal | Bradycardia, sinus arrhythmia, sinus bradycardia, sinus rhythm, sinus tachycardia
CD | 1st degree av block, 2nd degree av block, accelerated idioventricular rhythm, accelerated junctional rhythm, Atrial pacing pattern, Atrial tachycardia, AV block, Brady Tachy syndrome, Bundle branch block, Cardiac dysrhythmia, complete heart block, complete right bundle branch block, congenital incomplete atrioventricular heart block, diffuse intraventricular block, ectopic rhythm, idioventricular rhythm, incomplete left bundle branch block, incomplete right bundle branch block, junctional escape, junctional premature complex, junctional tachycardia, left anterior fascicular block, left bundle branch block, left posterior fascicular block, mobitz type 2 second degree atrioventricular block, mobitz type i wenckebach atrioventricular block, multifocal atrial tachycardia, paroxysmal supraventricular tachycardia, paroxysmal ventricular tachycardia, partial atrioventricular block 2:1, prolonged pr interval, right bundle branch block, shortened pr interval, sinus node dysfunction, supraventricular bigeminy, supraventricular premature beats, supraventricular tachycardia, ventricular ectopic beats, ventricular escape beat, ventricular escape rhythm, ventricular fibrillation, ventricular flutter, ventricular pacing pattern, ventricular preexcitation, ventricular tachycardia, ventricular trigeminy, wandering atrial pacemaker, wolff parkinson white pattern
HYP | atrial hypertrophy, left atrial abnormality, left atrial enlargement, left atrial hypertrophy, left axis deviation, left ventricular hypertrophy, left ventricular strain, r wave abnormal, right atrial abnormality, right atrial hypertrophy, right axis deviation, right ventricular hypertrophy, ventricular hypertrophy
MI | Acute myocardial infarction, Acute myocardial ischemia, Anterior ischemia, chronic myocardial ischemia, inferior ischaemia, inferior st segment depression, lateral ischaemia, myocardial infarction, myocardial ischemia, old myocardial infarction
STTC | coronary heart disease, electrical alternans, high t voltage, nonspecific st t abnormality, s t changes, st depression, st elevation, st interval abnormal, t wave abnormal, t wave inversion
A. Fib/ Aflutter | Atrial fibrillation, Atrial fibrillation and flutter, Atrial flutter, chronic atrial fibrillation, paroxysmal atrial fibrillation, rapid atrial fibrillation
Other | Abnormal QRS, Atrial bigeminy, Blocked premature atrial contraction, Brugada syndrome, chronic rheumatic pericarditis, decreased qt interval, early repolarization, ecg artefacts, fusion beats, heart failure, indeterminate cardiac axis, isorhythmic dissociation, low qrs voltages, low qrs voltages in the limb leads, low qrs voltages in the precordial leads, non-specific interatrial conduction block, nonspecific intraventricular conduction disorder, pacing rhythm, paired ventricular premature complexes, premature atrial contraction, premature ventricular complexes, premature ventricular contractions, prolonged qt interval, qwave abnormal, suspect arm ecg leads reversed, tall u wave, transient ischemic attack, u wave abnormal, ventricular bigeminy
Table 6: CPSC Label Mappings
Class Name CPSC Labels Included
Normal sinus rhythm
CD 1st degree av block, atrial fibrillation, right bundle branch block, ventricular ectopics
HYP hypertrophy
MI MI
STTC st depression, st elevation
A. Fib/ Aflutter AF, AFIB
Other premature atrial contraction
Table 7: Class distributions for ECG datasets.
PTB-XL (source) Chapman-Shaoxing (target) Georgia (target) CPSC (target)
Class Training Split Validation Split Validation Split Validation Split Validation Split
Normal 9222 (52.77%) 2322 (53.24%) 1129 (55.05%) 725 (35.07%) 190 (13.8%)
Conduction Disturbance 1386 (7.93%) 348 (7.98%) 249 (12.14%) 240 (11.61%) 717 (52.07%)
Myocardial Infarction 1285 (7.35%) 333 (7.64%) 2 (0.098%) 82 (3.97%) 5 (0.36%)
Ischemic ST-T Changes 1661 (9.5%) 420 (9.63%) 260 (12.68%) 437 (21.14%) 213 (15.47%)
Other 1462 (8.37%) 360 (8.25%) 33 (1.61%) 263 (12.72%) 116 (8.42%)
Atrial fibrillation/atrial flutter 475 (2.72%) 103 (2.36%) 232 (11.31%) 2 (0.097%) 131 (9.51%)
Hypertrophy 1985 (11.36%) 475 (10.89%) 146 (7.12%) 318 (15.38%) 5 (0.36%)
Total # Examples 17476 4361 2051 2067 1377
Table 8: Class distributions for EEG datasets.
SHHS (source) ISRUC (target)
Class Training Split Validation Split Validation Split
Wake 1172690 (28.8%) 294869 (29.04%) 4814 (26.44%)
Non-REM Stage 1 152066 (3.74%) 38478 (3.79%) 2490 (13.68%)
Non-REM Stage 2 1668940 (41%) 411170 (40.5%) 5605 (30.78%)
Non-REM Stage 3 478497 (11.75%) 121076 (11.92%) 2944 (16.17%)
REM 598946 (14.71%) 149734 (14.75%) 2175 (11.95%)
Total # Examples 4071139 1015327 18208
Table 9: Class distributions for chest x-ray datasets.
MIMIC (source) CheXpert (target) VINDR-CXR (target)
Class (Multi-label) Training Split Occurrences Validation Split Occurrences Validation Split Occurrences Validation Split Occurrences
Atelectasis 1603 (20.04%) 425 (21.25%) 233 (31.74%) 86 (2.87%)
Cardiomegaly 1589 (19.86%) 445 (22.25%) 219 (29.84%) 309 (10.3%)
Consolidation 409 (5.11%) 108 (5.4%) 62 (8.45%) 96 (3.2%)
Edema 925 (11.56%) 294 (14.7%) 23 (3.13%) 10 (0.33%)
Pleural Effusion 1930 (24.13%) 576 (28.8%) 171 (23.29%) 111 (3.7%)
Total # Examples 8000 2000 734 3000
1. Rescale the array to pixel values between 0 and 255.
2. Invert the pixels if the “PhotometricInterpretation” attribute is set to “MONOCHROME1”.
3. Save the pixel array as a JPEG using the Python Imaging Library (PIL).
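The conversion steps above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: it assumes the pixel array and DICOM attributes have already been read (for instance with a DICOM reader of your choice), it assumes min-max scaling for the 0-255 rescale since the text does not name a method, and the function names `to_uint8` and `save_jpeg` are hypothetical.

```python
import numpy as np


def to_uint8(pixels, slope=None, intercept=None, photometric="MONOCHROME2"):
    """Apply the optional DICOM rescale, map to [0, 255], and invert MONOCHROME1."""
    arr = np.asarray(pixels, dtype=np.float64)
    # Apply RescaleSlope / RescaleIntercept only when both are available.
    if slope is not None and intercept is not None:
        arr = arr * float(slope) + float(intercept)
    # Min-max rescale to [0, 255] (assumed; the text does not specify the method).
    lo, hi = arr.min(), arr.max()
    if hi > lo:
        arr = (arr - lo) / (hi - lo) * 255.0
    else:
        arr = np.zeros_like(arr)  # constant image: nothing to stretch
    # MONOCHROME1 displays low values as bright, so flip to the usual convention.
    if photometric == "MONOCHROME1":
        arr = 255.0 - arr
    return np.round(arr).astype(np.uint8)


def save_jpeg(arr_uint8, path):
    """Save a 2D uint8 array as a grayscale JPEG with PIL."""
    from PIL import Image  # imported lazily so the rest runs without Pillow

    Image.fromarray(arr_uint8).save(path, format="JPEG")
```

For example, `save_jpeg(to_uint8(pixels, photometric="MONOCHROME1"), "scan.jpg")` would write an inverted 8-bit image; the file name here is only a placeholder.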
During training, we load each JPG image, resize the longer side to 224 pixels while maintaining the
image’s aspect ratio, subtract the training set mean, and pad the image with zeros to get
a 224×224 final grayscale image. We discard a handful of datapoints belonging to BI-RADS 0 or
BI-RADS 6, since these classes are not present in both datasets.
The distribution of classes for mammogram datasets is shown in Table 10.
VinDr-Mammo (source) CBIS-DDSM (target)
Class Training Split Validation Split Validation Split
BI-RADS 1 10724 (67.02%) 2682 (67.05%) 2 (0.54%)
BI-RADS 2 3742 (23.38%) 934 (23.35%) 15 (4.10%)
BI-RADS 3 744 (4.65%) 186 (4.65%) 78 (21.36%)
BI-RADS 4 610 (3.81%) 152 (3.8%) 188 (51.50%)
BI-RADS 5 180 (1.12%) 46 (1.15%) 82 (22.46%)
Total # Examples 16000 4000 365
Table 10: Class distributions for mammogram datasets.
C.5 Dermoscopic Images
Preprocessing During training, we load each 3-channel JPG image, resize the longer side to 224
pixels while maintaining the image’s aspect ratio, perform per-channel standardization based on the
training set statistics, and pad the image with zeros to get a 224×224 final RGB image.
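As a hedged sketch of this load-time transform, the snippet below computes the aspect-preserving target size and then standardizes and zero-pads an image that has already been resized. The helper names are hypothetical, the actual resize call is left to whichever image library is in use, and centering the image inside the padded canvas is an assumption, since the text does not say where the padding is placed.

```python
import numpy as np


def resized_shape(height, width, target=224):
    """Scale (height, width) so the longer side equals `target`, keeping the aspect ratio."""
    scale = target / max(height, width)
    return max(1, round(height * scale)), max(1, round(width * scale))


def standardize_and_pad(image, mean, std, target=224):
    """Standardize an already-resized HxWxC image per channel, then zero-pad to target x target."""
    img = (np.asarray(image, dtype=np.float32) - mean) / std
    h, w, c = img.shape
    canvas = np.zeros((target, target, c), dtype=np.float32)
    # Centered placement is an assumption; the padding position is not specified.
    top, left = (target - h) // 2, (target - w) // 2
    canvas[top:top + h, left:left + w] = img
    return canvas
```

Because padding happens after standardization, the padded border equals the training-set mean in the original intensity space.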
Task Standardization We reformulated the labels across the datasets into a unified 5-class
classification task: AKIEC (includes actinic keratoses, intraepithelial carcinoma, and squamous cell
carcinoma, as all of these lie within the continuum of squamous cell carcinoma), BCC (basal cell
carcinoma), MEL (melanoma), NEV (nevus), and Other diseases (dermatofibroma, etc.).
BCN20000 includes annotations for BCC, SCC, ACK, MEL, NEV, Dermatofibroma, Vascular lesion,
and seborrheic keratosis. We grouped SCC and ACK into AKIEC, and grouped Dermatofibroma,
Vascular lesion, and seborrheic keratosis into Other.
HAM10000 includes annotations for BCC, AKIEC, MEL, NV, BKL, Dermatofibroma, and VASC. We
grouped BKL (benign keratosis-like lesions: solar lentigines / seborrheic keratoses), Dermatofibroma,
and VASC into Other.
PAD-UFES-20 includes annotations for BCC, SCC, ACK, MEL, NEV, and Seborrheic Keratosis. We
grouped SCC and ACK into AKIEC, and grouped Seborrheic Keratosis into Other.
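The groupings above amount to a per-dataset lookup table, sketched below. The label spellings are copied from the prose and may differ from the codes used in each dataset's metadata files; the names `LABEL_MAP`, `UNIFIED_CLASSES`, and `unify_label` are illustrative only.

```python
# Source label -> unified class, following the groupings described in the text.
# Label spellings come from the prose; real metadata files may use other codes.
LABEL_MAP = {
    "BCN20000": {
        "BCC": "BCC", "MEL": "MEL", "NEV": "NEV",
        "SCC": "AKIEC", "ACK": "AKIEC",
        "Dermatofibroma": "Other", "Vascular lesion": "Other",
        "seborrheic keratosis": "Other",
    },
    "HAM10000": {
        "BCC": "BCC", "AKIEC": "AKIEC", "MEL": "MEL",
        "NV": "NEV",  # HAM10000 abbreviates nevus as NV
        "BKL": "Other", "Dermatofibroma": "Other", "VASC": "Other",
    },
    "PAD-UFES-20": {
        "BCC": "BCC", "MEL": "MEL", "NEV": "NEV",
        "SCC": "AKIEC", "ACK": "AKIEC",
        "Seborrheic Keratosis": "Other",
    },
}

UNIFIED_CLASSES = ("AKIEC", "BCC", "MEL", "NEV", "Other")


def unify_label(dataset, label):
    """Return the unified 5-class label for a dataset-specific label."""
    return LABEL_MAP[dataset][label]
```

A lookup that raises `KeyError` on unknown labels is a deliberate choice here: an unmapped annotation then fails loudly instead of silently landing in Other.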
The distribution of classes for dermoscopic datasets is shown in Table 11.
C.6 Fundus Images
Preprocessing During training, we load each 3-channel JPG image, resize the longer side to 224
pixels while maintaining the image’s aspect ratio, perform per-channel standardization based on the
training set statistics, and pad the image with zeros to get a 224×224 final RGB image.
BCN 20000 (source) HAM 10000 (target) PAD-UFES-20 (target)
Class Training Split Validation Split Validation Split Validation Split
MEL 3618 (17.85%) 904 (17.84%) 223 (11.13%) 10 (2.18%)
NEV 10300 (50.83%) 2575 (50.83%) 1341 (66.95%) 49 (10.68%)
BCC 2658 (13.12%) 665 (13.13%) 103 (5.14%) 169 (36.82%)
AKIEC 1196 (5.9%) 299 (5.9%) 65 (3.25%) 184 (40.09%)
Other diseases 2493 (12.3%) 623 (12.3%) 271 (13.53%) 47 (10.24%)
Total # Examples 20265 5066 2003 459
Table 11: Class distributions for dermoscopic image datasets.
Task Standardization For each eye fundus image, we formulate a single-label task of predicting the
severity of diabetic retinopathy (DR) using the International Clinical Diabetic Retinopathy
(ICDR) classification scale, which grades DR on a five-stage severity scale from 0 to 4. The five
ratings in order of increasing severity are (0) no apparent retinopathy (NDR), (1) mild nonproliferative
retinopathy (NPDR), (2) moderate NPDR, (3) severe NPDR, and (4) proliferative diabetic retinopathy
(PDR). This is the scale used by the Messidor-2 and APTOS 2019 datasets. The scale can be
simplified into the modified Davis scale of four stages: NDR, simple diabetic retinopathy (SDR),
pre-proliferative retinopathy (PPDR), and PDR, with ICDR rating 0 corresponding to NDR, ICDR
ratings 1 and 2 corresponding to SDR, ICDR rating 3 corresponding to PPDR, and ICDR rating 4
corresponding to PDR. This is the label set used by the Jinchi Medical University dataset [53]. When
testing the performance of our model on this dataset, we first run prediction on the 5-class task.
If the target label is SDR and the predicted label is either 1 or 2, we count it as a correct
prediction when computing AUROC.
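The correspondence between the two scales, and the rule for scoring a 5-class prediction against a Davis-scale target, can be written as a small lookup. This is an illustrative sketch; `ICDR_TO_DAVIS` and `is_correct` are hypothetical names, and how class scores are aggregated for AUROC is not covered here because the text does not specify it.

```python
# ICDR rating (0-4) -> modified Davis stage, as described in the text.
ICDR_TO_DAVIS = {0: "NDR", 1: "SDR", 2: "SDR", 3: "PPDR", 4: "PDR"}


def is_correct(pred_icdr, target_davis):
    """True when a 5-class ICDR prediction falls in the target's Davis stage."""
    return ICDR_TO_DAVIS[pred_icdr] == target_davis
```

Under this mapping, predictions of 1 and 2 both count as correct for an SDR target, while the other three stages keep a one-to-one correspondence.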
The distribution of classes for fundus datasets is shown in Table 12.
Messidor-2 (source) APTOS 2019 (target) Jinchi (target)
Class Training Split Validation Split Validation Split Validation Split
Class 0 813 (58.32%) 204 (58.28%) 361 (49.24%) 1313 (66.01%)
Class 1 216 (15.49%) 54 (15.42%) 74 (10.09%) 423 (21.26%, Classes 1 and 2 combined as SDR)
Class 2 277 (19.87%) 70 (20%) 200 (27.28%) (combined with Class 1)
Class 3 60 (4.30%) 15 (4.28%) 39 (5.32%) 92 (4.62%)
Class 4 28 (2.01%) 7 (2%) 59 (8.04%) 161 (8.09%)
Total # Examples 1394 350 733 1989
Table 12: Class distributions for fundus image datasets.
C.7 LDCT
Preprocessing The LIDC data is distributed in DICOM file format. We perform the following
preprocessing steps:
1. Extract “pixel_array” from the DICOM file.
2. Scale the pixel array by the “RescaleSlope” attribute and add the value of the
“RescaleIntercept” attribute to every pixel.
3. Resize each pixel array to 256×256.
4. Adjust the pixel array to the PE viewing window (window_center=-600, window_width=1500).
We keep only pixel values within the range [window center - window width/2, window
center + window width/2].
5. Rescale the pixels into the range 0-1.
6. Save the pixels in an HDF5 file to improve I/O.
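The rescale and windowing steps above can be sketched as follows. This is an assumption-laden illustration: "keeping" values within the window is interpreted as clipping to the window bounds, which is the common convention but is not stated explicitly, and the function names `to_hounsfield` and `apply_window` are hypothetical.

```python
import numpy as np


def to_hounsfield(pixel_array, slope, intercept):
    """Apply the DICOM rescale: stored value * RescaleSlope + RescaleIntercept."""
    return np.asarray(pixel_array, dtype=np.float32) * slope + intercept


def apply_window(values, center, width):
    """Clip to [center - width/2, center + width/2], then rescale to [0, 1]."""
    lo, hi = center - width / 2, center + width / 2
    clipped = np.clip(np.asarray(values, dtype=np.float32), lo, hi)
    return (clipped - lo) / (hi - lo)
```

With the LIDC settings (`center=-600`, `width=1500`) the window spans -1350 to 150, so a value of -600 maps to 0.5 and anything at or above 150 maps to 1.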
The LNDb data is stored differently, in raw format. We perform the following preprocessing
steps:
1. Adjust the pixel array to the viewing window (window_center=400, window_width=1000). We
keep only pixel values within the range [window center - window width/2, window center +
window width/2].
2. Rescale the pixels into the range 0-1.
3. Resize each pixel array to 256×256.
4. Map the 3D segmentations onto the CT scans by converting real-world coordinates to array-level
coordinates, and generate labels.
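A common way to convert a real-world coordinate into array indices is to subtract the scan origin and divide by the voxel spacing. The sketch below shows that conversion under stated assumptions: the scan is axis-aligned (identity direction matrix), `origin` and `spacing` come from the scan header in the same axis order as the coordinate, and the name `world_to_voxel` is hypothetical. It is not necessarily the exact procedure used for LNDb.

```python
import numpy as np


def world_to_voxel(world, origin, spacing):
    """Convert a physical coordinate (mm) to integer array indices.

    Assumes an axis-aligned scan, with all three arguments in the same axis order.
    """
    world = np.asarray(world, dtype=np.float64)
    origin = np.asarray(origin, dtype=np.float64)
    spacing = np.asarray(spacing, dtype=np.float64)
    return np.rint((world - origin) / spacing).astype(int)
```

Volumes are often indexed as (z, y, x) while physical coordinates are given as (x, y, z), so the result may need to be reversed before it is used to index the array.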
During training, we load a window of 24 slices from a study and center-crop the image to 224×224.
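A minimal sketch of this sampling step is shown below, assuming the study is stored as a (depth, height, width) array. How the starting slice is chosen is not stated in the text, so `start` is left as a caller-supplied argument, and `sample_window` is a hypothetical name.

```python
import numpy as np


def sample_window(volume, start, num_slices=24, crop=224):
    """Take `num_slices` consecutive slices and center-crop each to crop x crop."""
    depth, height, width = volume.shape
    if start < 0 or start + num_slices > depth:
        raise ValueError("slice window falls outside the volume")
    top, left = (height - crop) // 2, (width - crop) // 2
    return volume[start:start + num_slices, top:top + crop, left:left + crop]
```

For the 256×256 slices produced by the preprocessing above, the crop removes a 16-pixel border on each side.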
The distribution of classes for LDCT datasets is shown in Table 13.
LIDC-IDRI (source) LNDb (target)
Class (Multi label) Training Split Occurrences Validation Split Occurrences Validation Split Occurrences
Small Nodule Exists 36 (5.05%) 6 (3.97%) 81 (35.37%)
Large Nodule Exists 346 (48.53%) 84 (55.63%) 203 (88.65%)
Total # Examples 713 151 229
Table 13: Class distributions for LDCT datasets.