Page 1:
Early Concept Drift Detection via Prediction Uncertainty
Pengqian Lu1, Jie Lu1*, Anjin Liu1, Guangquan Zhang1
1Australian Artificial Intelligence Institute (AAII),
University of Technology Sydney, Ultimo, NSW 2007, Australia
{Pengqian.Lu@student., Jie.Lu@, Anjin.Liu@, Guangquan.Zhang@ }uts.edu.au
Abstract
Concept drift, characterized by unpredictable changes in data
distribution over time, poses significant challenges to machine
learning models in streaming data scenarios. Although error
rate-based concept drift detectors are widely used, they often
fail to identify drift in the early stages when the data distribu-
tion changes but error rates remain constant. This paper intro-
duces the Prediction Uncertainty Index (PU-index), derived
from the prediction uncertainty of the classifier, as a superior
alternative to the error rate for drift detection. Our theoretical
analysis demonstrates that: (1) The PU-index can detect drift
even when error rates remain stable. (2) Any change in the
error rate will lead to a corresponding change in the PU-index.
These properties make the PU-index a more sensitive and ro-
bust indicator for drift detection compared to existing methods.
We also propose a PU-index-based Drift Detector (PUDD) that
employs a novel Adaptive PU-index Bucketing algorithm for
detecting drift. Empirical evaluations on both synthetic and
real-world datasets demonstrate PUDD’s efficacy in detecting
drift in structured and image data.
Code — https://github.com/RocStone/PUDD
Introduction
In real-world applications, such as medical triage (Huggard
et al. 2020) or time series forecasting tasks (Miyaguchi
and Kajino 2019), the distribution of data may unpre-
dictably change over time. This phenomenon, termed concept
drift (Yuan et al. 2022), significantly degrades model perfor-
mance. Moreover, drift also manifests between clients and
servers in federated learning tasks (Jiang, Wang, and Dou
2022), or decision making process (Lu et al. 2020) further
complicating the learning process. Error rate-based drift de-
tection is one of the most popular approaches to handling
concept drift due to its efficiency (Lu et al. 2018a). It contin-
uously monitors the classifier’s error rate, issuing an alarm
when this rate exceeds a preset threshold (Raab, Heusinger,
and Schleif 2020; Frias-Blanco et al. 2014).
However, in the early stages of concept drift, a model’s
error rate may remain stable. In such a case, error rate-based
drift signals are unable to indicate the changes in data dis-
tribution. To address this, we explore alternative methods to
*†Corresponding author.
Copyright ©2025, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.detect distribution changes before a drop in error rate occurs.
Upon rethinking the logic behind error rate-based drift sig-
nals, we realize that the distribution of prediction probability
from a given model will change prior to the error rate itself.
In addition, if the error rate does change, then the distribution
of prediction probability must change too.
Prediction probability is just an example of a quantita-
tive measure of predictive uncertainty in models. We believe
that by clearly defining predictive uncertainty and demon-
strating that it is a superior alternative to error rates, we can
significantly enhance the sensitivity and robustness of drift
detection. Therefore, we introduce the Prediction Uncertainty
Index (PU-index), which measures the probability assigned
by a classifier that an instance does not belong to the true
class. Fig. 1 shows an illustrative example.
The objective of this paper is to address two key questions:
(1) Can we identify a more effective drift detection signal
that captures changes in data distribution when the error rate
remains stable? (2) Can we theoretically prove that if the
drift detection signal shows no significant change, then the
model’s error rate will also remain stable? In other words,
we seek to develop a superior drift detection signal that can
detect drift when the error rate cannot, and to prove that if
the new signal fails to detect drift, the error rate will also
fail to do so. We believe that these criteria are essential for
evaluating and comparing different drift detection signals.
The theoretical results provided in this paper show strong
evidence for the superiority of the PU-index as a metric for
concept drift detection. It exhibits at least equivalent sensitiv-
ity to error-based metrics and potentially higher sensitivity in
certain scenarios, rendering it a more robust and comprehen-
sive measure for identifying concept drift.
However, it could be argued that a more sensitive detection
method might result in a higher false alarm rate, potentially
overreacting to minor fluctuations in model performance. If
there is no significant drift in the error rate, why should we
still seek to detect it? To mitigate such concern, we employ
the Chi-square test, a robust statistical significance test, to the
PU-index for drift detection. This approach helps distinguish
between meaningful distributional shifts and inconsequential
variations, thereby maintaining the benefits of increased sen-
sitivity while minimizing false alarms. The p-value obtained
from the Pearson’s Chi-square test serves as a precise control
mechanism for our tolerance to false alarms. By adjusting thearXiv:2412.11158v1 [cs.LG] 15 Dec 2024
Page 2:
Figure 1: Illustrative example of the early stage of concept drift when an error rate-based detector fails to detect concept drift,
but a prediction uncertainty-based detector can. The data around decision boundaries have been highlighted in the middle of
the figure, showing the distribution gap between test sets 1 and 2. Such a gap implies concept drift occurrence. However, in
this case, the error rates of the two test sets are the same. We also provide a theoretical proof in the Appendix to demonstrate
the existence of such a case. By contrast, the distribution of prediction uncertainty has changed. The example implies that a
prediction uncertainty-based detector can detect drift when an error rate-based detector fails.
significance level ( α), we can directly modulate the trade-off
between sensitivity and false positive rate.
In this paper, we introduce PUDD, a drift detector that
uses the Prediction Uncertainty (PU) index to identify con-
cept drift. PUDD employs a sliding window approach to
remove outdated data and split the historical stream into two
samples. It then applies an Adaptive PU-index Bucketing
algorithm to automatically construct histograms that meet
our theoretical conditions. Using these histograms, we apply
Pearson’s Chi-square test to determine if drift has occurred.
Our main contributions are:
1.To the best of our knowledge, this is the first systematic
study that compares two different drift detection signals
with theoretical analysis.
2.We propose a novel drift detection metric called the PU-
index, which is theoretically proven always to outperform
error rate-based drift measurements. This provides crucial
insight into the PU-index as a more sensitive and robust
alternative for concept drift detection.
3.To identify concept drift in streaming data through the PU-
index, we propose a Prediction Uncertainty index base
drift detector. It comprises an Adaptive PU-index Buck-
eting algorithm to build a histogram for the PU-index,
which meets the condition of our theoretical analysis, to
conduct the Pearson’s Chi-square test and detect drift.
Literature Review
In this section, we examine two approaches to concept
drift detection: data distribution-based and error rate-based
methods. The former directly addresses the root cause of
drift—changes in the data distribution—while the latter fo-cuses on variations in the model’s performance, often achiev-
ing higher computational efficiency.
Data Distribution-based Methods
Data distribution-based methods measure shifts in the un-
derlying distribution. For instance, a statistical density es-
timation approach is proposed in (Song et al. 2007), en-
abling the quantification of differences between two samples.
Histogram-based techniques frequently serve to represent dis-
tributions in high-dimensional feature spaces (Liu et al. 2017).
For example, (Boracchi et al. 2018) and (Yonekawa, Saito,
and Kurokawa 2022) introduce hierarchical and dynamically
adjustable strategies to construct histograms, respectively.
Interval formation can also rely on methods like QuadTree
(Coelho, Torres, and de Castro 2023) and K-means clustering
(Liu, Lu, and Zhang 2020). Beyond direct histogram or den-
sity estimation, some methods incorporate contextual factors
(Lu et al. 2018b) or anticipate future distributions. For ex-
ample, (Cobb and Van Looveren 2022) uses a context-based
CoDiTE function (Park et al. 2021) to detect drift, while (Li
et al. 2022) exploits a predictive model for future distribu-
tions. There are also approaches that leverage Graph Neural
Networks to track and adapt to distribution changes directly
(Zhou et al. 2023). Although effective, these strategies can be
computationally expensive in high-dimensional data streams
(Souza et al. 2021). To date, no existing work uses histograms
constructed from prediction uncertainty for drift detection,
leaving a notable gap in the literature. To the best of our
knowledge, there is no previous work that proposes building
a histogram of prediction uncertainty to detect concept drift.
Page 3:
Error rate-based methods
Error rate-based detectors are well-studied and computation-
ally efficient. Approaches such as (Gama et al. 2004), (Baena-
Garcıa et al. 2006), and (Frias-Blanco et al. 2014) monitor
variations in model error rates to detect drift. Adaptive win-
dow resizing is explored in (Bifet and Gavalda 2007), and
forgetting mechanisms are introduced in (Jiao et al. 2022) to
weight classifiers dynamically. More recent strategies apply
Gaussian Mixture Models to compare windows (Yu et al.
2024) or enter reactive states upon detecting alarms (Tah-
masbi et al. 2021). Despite their efficiency, error rate-based
detectors struggle to identify drift when accuracy remains sta-
ble, particularly during its early stages (see Fig. 1). To address
this limitation, we propose a prediction uncertainty-based ap-
proach, capable of detecting shifts even before error rates
degrade. This method enhances early detection capabilities
and complements existing drift detection strategies.
Preliminaries
Pearson’s Chi-square test
The Pearson’s Chi-square test assesses whether two categor-
ical variables are independent. Its null hypothesis assumes
independence, and if the computed p-value falls below a
chosen significance level, the null hypothesis is rejected, in-
dicating potential dependence between the variables.
The test’s reliability depends on having sufficiently large
observed and expected frequencies. As noted in (EP 1978),
the Chi-square test produces valid results when each observed
count exceeds 50 and every expected count exceeds 5. Under
these conditions, the distribution of the test statistic closely
approximates a normal distribution, enhancing the test’s va-
lidity. The test relies on a contingency table. For the cell
in the i-th row and j-th column, Oijdenotes the observed
frequency, and its expected frequency is given by:
Eij=ni×nj
N, (1)
where niandnjare the cumulative frequencies of the re-
spective row and column, and Nis the sum of the table. The
Chi-square test statistic is derived as:
χ2=X
iX
j
O2
ij
Eij!
−N. (2)
Correspondingly, the p-value associated with χ2is:
p= 1−Zχ2
0xw
2−1·e−x
2
2w
2·Γ w
2dx, (3)
where w indicates the degrees of freedom, calculated as:
w= (number of columns −1)×(number of rows −1).(4)
Space Partitioning Algorithms
Space partitioning algorithms are widely studied to build his-
tograms to establish density estimators (Silverman 2018). The
key is to split feature space into partitions and count the in-
stances falling into it to build a histogram. The partitions canbe built by QuanTree (Boracchi et al. 2018), Kernel Quant-
Tree (Stucchi et al. 2023), or Neural Network (Yonekawa,
Saito, and Kurokawa 2022). Particularly, we introduce the
Ei-kMeans space partitioning algorithm (Liu, Lu, and Zhang
2020), which can automatically determine the number and
size of partitions.
Given two samples AandB, Ei-kMeans initializes cen-
troids by iteratively selecting N/K points from a copy of A,
where N=|A|andKis the hyperparameter of the kMeans
algorithm. Each iteration: (1) Select the point in Awith the
largest 1-NN distance as zi. (2) Removes ziand its N/K -
nearest neighbors from A. This process repeats N/K times,
yielding N/K initial centroids. Then the kMeans algorithm
is applied to Awith the initial centroids to derive Kclusters
denoted as {Ci|i∈[1,N
K]}.
To ensure that the number of examples in each cluster
is larger than 5 to be able to conduct the Chi-square test,
an amplify-shrink algorithm is proposed to adjust the num-
ber of examples in each cluster. Let us denote the number
of instances in the clusters as V={Ci||i∈[1,N
K]}. The
distance matrix between the examples in Aand the cen-
ters of each cluster is denoted as Mdist∈RN×K, which
is amplified by: Mdist=Mdist⊙
1·eθ·(V
N−1)
,where
1∈RN×1is an all-ones matrix, θdenotes the hyperparam-
eter controlling the shape of the coefficient function, . Let
Mij
distdenote the amplified distance between i-th data Ai
andj-th center cj, the assigned cluster for Aiis defined as
yi= arg min j=1...KMij
dist.
After the amplify-shrink algorithm, the final clusters are
derived and can be considered as subspaces in the feature
space of A. The numbers of examples of AandBin the
clusters are counted to form a histogram.
Methodology
This section sets up the problem and provides a theoretical
analysis demonstrating the advantages of the PU-index over
error rate-based methods. We then introduce a novel slid-
ing window strategy and an Adaptive PU-index Bucketing
algorithm for concept drift detection and adaptation.
Problem Setup
Formally, we represent the streaming data collected dur-
ing the period [1, t]asD1,t={(xj, yj)|j∈[1, t]}. If
the data is collected in chunks, then the stream includes
a set of chunks D1,t={¯Dj|j∈[1, t]}, where each chunk
¯Dj={(xjk, yjk)|k∈[1, M]}includes Mexamples. Here,
xjkrepresents an instance with ddimensional attributes, yjk
denotes the corresponding label, and Mdenotes the chunk
size. In this paper, we focus only on the data collected in
chunks. If the stream D1,tfollows a distribution P1,t(x, y),
following (Lu et al. 2018a), we claim that a drift occurs at
timet+ 1if
P1,t(x, y)̸=Pt,∞(x, y). (5)
The goal of concept drift detection is to raise an alarm
at time t+ 1 when the distribution of data changes. The
most popular metric for detecting the distribution change is
prediction error. For a classifier f, assuming the number of
Page 4:
Figure 2: The framework of our proposed algorithm. The sliding window strategy has two components, i.e., antiquated data
discard and cutting point exploration as shown on the left. The Adaptive PU-index Bucketing algorithm is shown in the middle.
The drift detection process is shown on the right.
classes is n, the classifier will output a prediction probability
for each class ˆy∈RnandPn
i=1ˆy= 1. The prediction error
of an instance xiis defined as:
ei=I(ˆyj= arg max
jfj(xi)̸=yi), (6)
where I(·)is the indicator function.
As we mentioned earlier, our motivation is that the predic-
tion probability will intuitively change before the prediction
error when drift occurs. In this paper, we measure the predic-
tion probability by the PU-index which is defined as:
ui= 1−fyi(xi), (7)
where fyi(xi)denotes the probability predicted by the classi-
fier that xibelongs to the ground truth class yi.
Theoretical Analysis of Error Rate and PU-index
To rigorously evaluate the efficacy of these two metrics for
concept drift detection, we conduct a theoretical comparison
from two complementary perspectives. (1) When the PU-
index distribution remains stable, potentially failing to detect
concept drift, we investigate whether the error rate distribu-
tion exhibits changes that could indicate drift. (2) Conversely,
when the error rate distribution remains constant, we examine
whether the PU-index distribution demonstrates changes that
might reveal underlying changes in the data stream.
Theorem 1. LetW1andW2be two windows of a data stream
in a multi-class classification problem. If their respective
PU-index histograms H1andH2are identical, where the
histograms are constructed such that the first bin contains
all misclassified instances and the remaining bins partition
the misclassified instances, then the error rates and error
standard deviations of W1andW2are equal.Theorem 2. Given a multi-class classification problem, if
two windows have equal error standard deviations or error
rates, their PU-index histograms, where the first bin contains
all correctly classified instances and the remaining bins par-
tition the misclassified instances, may not have identical bin
proportions.
Due to the page limit, the proofs are provided in the Ap-
pendix. These theorems lead to the following conclusions: (1)
Theorem 1 demonstrates that when the PU-index distribution
remains stable, the error rate and the error standard deviation
also remain constant. This implies that if the PU-index fails
to detect concept drift, error-based metrics will also fail to
detect it. (2) Theorem 2 establishes that even when error rates
and error standard deviations are equal between two win-
dows, the PU-index distributions may differ. This suggests
that the PU-index has the potential to detect subtle changes
in the data distribution that are not captured by traditional
error-based metrics. These findings show that the PU-index
offers at least the same sensitivity as error-based metrics and
potentially higher sensitivity in certain scenarios, making
it a more robust and comprehensive measure for detecting
concept drift in streaming data environments.
Sliding Window Strategy and Adaptive PU-index
Bucketing Algorithm
To detect concept drift using the PU-index without making it
overly sensitive, we apply the Chi-square test to examine the
PU-index distribution. The hypotheses are:
Null Hypothesis ( H0): The PU-index distribution does
not change over time, indicating no concept drift.
Alternative Hypothesis ( H1): The PU-index distribution
changes over time, indicating concept drift.
If the Chi-square statistic exceeds the critical value at the
chosen significance level, we reject H0and conclude that
Page 5:
concept drift has occurred. Otherwise, we detect no signif-
icant drift. Thus, detecting concept drift using Chi-square
involves two key steps: (1) partitioning the data stream into
two windows and (2) constructing a histogram of the col-
lected PU-index values.
We adopt a sliding window strategy to handle online
streaming data. Let D1,t={¯Dj|j∈[1, t]}denote PU-index
chunks collected from the start of the stream, potentially
from different distributions. For instance, suppose D1,t1and
Dt1,tdiffer in distribution. If a new chunk ¯Dt+1matches the
distribution of Dt1,t, no drift should be detected. However,
keeping antiquated data D1,t1could trigger a false alarm,
since D1,t1andDt1,t+1differ in distribution.
To solve this ”antiquated distribution” problem, we dis-
card outdated data after detecting a drift at time t1. Subse-
quent drift detection uses only Dt1,t+1, avoiding false alarms
caused by old data. After discarding antiquated data, we
must determine how to form two windows on the current
substream. We do this by exploring all possible cutting points
r∈[t1, t+ 1]. Thus, Dt1,t+1is split into Dt1,randDr,t+1
for the Adaptive PU-index Bucketing algorithm. The sliding
window is illustrated on the left side of Fig. 2.
Theoretical analysis shows that misclassified instances’
PU-indices must be grouped into the same bin. For counter-
part, we use Ei-kMeans to form bins that meet the Chi-square
test requirements. We call this the Adaptive PU-index Buck-
eting algorithm, illustrated in the middle of Fig. 2.
PU-index based Drift Detector
In this subsection, we introduce the overall of our method.
Firstly, given a substream containing the recent chunks’ PU-
index ut1,t, we explore all cutting points r∈[t1, t]. Based
on the cutting points, we have t−t1window pairs, denoted
asut1,r, andur,t. Then we defined the PU-index pairs for
correctly and wrongly classified instances as:
uC
t1,r={u|ui∈ut1,r∧ˆyi=yi}, (8)
uC
r,t={u|ui∈ur,t∧ˆyi=yi}, (9)
uM
t1,r={u|ui∈ut1,r∧ˆyi̸=yi}, (10)
uM
r,t={u|ui∈ur,t∧ˆyi̸=yi}. (11)
These four equations denote the PU-index of correctly clas-
sified instances on the first and second window, and the PU-
index of misclassified instances on the first and second win-
dow respectively. Next, we compute the contingency table
T∈R2×(K+1), where Kis a hyperparameter in Ei-kMeans.
To calculate T, firstly, we apply the Adaptive PU-index Buck-
eting algorithm on uC
t1,rto build a histogram. Then we count
the instances from uC
t1,rfalling into the bins of the histogram
and fill them into T1iwhere i∈[1, K]. Likewise, we count
the examples of uC
r,tfalling into the previously obtained bins
and fill them into T2i. Finally, we fill the T1,K+1, andT2,K+1
by the size of uM
t1,randuM
r,t. Then we define the expected
frequency of Tijas:
Eij=PK+1
j=1Tij×P2
i=1TijP
ijTij. (12)
Figure 3: Illustrative example of generating CIFAR-10-CD
through the Markov process. The classes marked in red rep-
resent the user’s initial interest and are considered positive
labels. All other classes are considered negative labels.
The Chi-square test statistic is defined as:
χ2=X
iX
j
T2
ij
Eij!
−X
ijTij. (13)
Finally, the p-value is computed by:
p= 1−Zχ2
0xK
2−1·e−x
2
2K
2·Γ K
2dx, (14)
Based on the Equation (12-14), we can compute the p-value
for each window pair Dt1,r, and Dr,t. If the minimum p-
value among all window pairs is smaller than a predefined
threshold σ, we raise a drift detected alarm. It is important
to clarify that the specified threshold controls the Type I
error rate for each individual window pair test rather than the
overall Type I error across the entire substream. Consequently,
we do not employ multiple comparison adjustments since our
statistical guarantees apply at the single-test level rather than
the family-wise level. The pseudo-code and time complexity
analysis is provided in the Appendix.
Experiments
In this section, we introduce the settings and results of the
experiments in our paper. The details of implementation,
datasets, baselines, and the critical difference diagrams (Is-
mail Fawaz et al. 2019) for the experiments in this paper are
introduced in the Appendix.
Datasets and Baselines
We propose CIFAR-10-CD, a synthetic concept drift im-
age dataset with an illustrative example in Fig. 3, to sim-
ulate user interests changing via a Markov process. Initially,
three CIFAR-10 classes are marked positive, with interest
shifts occurring probabilistically (e.g. 1% chance of Plane to
Horse transfer). Our experiments utilize 3 real-world datasets
(airline(Ikonomovska 2011), elec2(Harries 1999), powersup-
ply(Dau et al. 2019)) and 4 synthetic sets (sine(Gama et al.
Page 6:
Incremental Training Training only at Initialization or Adaptation
Classifier ddm name airline-I elec2-I mixed-I ps-I sea0-I sine-I airline-O elec2-O mixed-O ps-O sea0-O sine-O
DNNADWIN 61.65 71.94 78.45 71.12 97.70 79.27 59.36 69.71 84.41 65.95 95.18 86.59
DDM 61.29 71.02 80.72 71.37 96.86 86.41 56.79 69.51 83.03 69.35 93.55 80.32
EDDM 62.13 70.39 78.01 65.21 96.89 75.34 61.17 69.22 71.16 67.57 92.18 76.54
HDDM-A 62.70 71.16 76.70 68.97 97.84 81.65 59.13 68.98 84.27 67.81 95.11 86.82
HDDM-W 61.50 71.23 77.07 68.74 97.84 85.43 62.42 68.48 84.32 66.89 92.24 86.57
KSWIN 63.02 70.56 78.58 70.80 97.88 79.00 61.34 69.08 84.43 66.46 91.60 83.66
PH 62.15 72.25 79.49 70.86 97.73 78.07 60.36 68.24 84.36 68.83 95.21 86.88
PUDD-1 63.31 74.92 77.39 72.25 97.94 86.19 60.90 69.35 82.65 71.47 94.89 83.39
PUDD-3 63.21 74.93 80.05 72.23 98.04 85.12 60.16 68.98 84.65 70.37 95.99 84.97
PUDD-5 63.35 74.92 82.81 72.24 98.23 82.51 60.19 68.68 84.90 70.20 96.29 85.09
GNBADWIN 50.17 68.90 83.95 70.06 94.18 82.49 54.66 68.30 83.62 68.78 93.93 82.45
DDM 52.94 67.75 83.82 69.63 93.97 82.07 52.43 67.60 83.59 67.52 93.48 81.90
EDDM 62.72 67.73 83.19 70.04 94.06 83.12 54.11 67.64 74.65 70.04 94.06 73.88
HDDM-A 52.80 67.73 83.92 70.87 94.28 83.25 55.62 67.73 83.66 71.24 93.96 83.36
HDDM-W 48.66 67.73 83.91 71.06 92.11 82.83 48.62 67.71 83.60 69.53 91.63 83.25
KSWIN 49.84 67.87 83.92 71.23 91.72 81.88 48.83 67.63 83.59 67.99 90.02 81.32
PH 49.35 70.12 83.88 70.36 94.34 83.54 49.02 70.04 83.61 68.67 94.12 83.10
PUDD-1 53.57 70.85 82.99 71.88 94.61 83.12 51.05 62.76 79.23 71.13 94.25 81.48
PUDD-3 53.03 70.85 83.92 71.59 94.81 83.39 49.45 59.32 83.58 71.20 94.60 83.80
PUDD-5 52.16 70.69 84.12 71.59 94.85 83.43 54.37 59.44 83.96 70.40 94.62 83.38
VFDTADWIN 60.39 73.98 84.30 71.69 94.77 87.11 61.22 74.29 83.54 68.78 92.96 85.74
DDM 60.16 74.82 84.14 70.68 94.86 86.50 59.28 74.75 82.31 67.53 93.63 82.23
EDDM 61.19 73.81 83.15 70.06 94.17 85.59 62.31 73.81 73.73 70.06 93.60 77.72
HDDM-A 60.95 73.90 84.40 70.84 95.20 87.53 60.29 73.83 83.66 71.24 93.93 85.21
HDDM-W 61.11 73.80 84.40 70.99 93.50 87.45 61.92 73.73 83.59 69.54 91.76 85.28
KSWIN 61.30 74.10 84.42 71.27 93.40 86.22 62.06 74.10 83.57 67.51 89.84 82.60
PH 60.95 73.70 83.56 70.88 94.69 87.15 60.97 73.99 83.30 68.67 93.85 85.19
PUDD-1 61.38 73.86 84.25 71.77 95.13 87.33 61.16 69.79 82.13 71.13 94.10 82.21
PUDD-3 61.57 73.84 83.94 71.79 95.21 87.42 59.90 69.79 84.04 71.20 94.63 85.81
PUDD-5 61.57 73.64 84.01 71.79 95.24 87.63 57.04 71.83 84.16 70.40 94.56 86.01
Table 1: Comparative analysis against classic drift detectors across 3 synthetic and 3 real-world datasets. The top 3 results are
highlighted in bold and the top 1 results are in both bold and underlined. PUDD- xrepresents the threshold set as 10−xfor our
method. The ps is short of powersupply dataset. Results for dataset sea10 and sea20 is provided in Appendix.
2004), mixed(Gama et al. 2004), CIFAR-10-CD, sea vari-
ants(Bifet et al. 2010)). We compare against 7 classic detec-
tors (ADWIN(Bifet and Gavalda 2007), DDM(Gama et al.
2004), EDDM(Baena-Garcıa et al. 2006), HDDM-A(Frias-
Blanco et al. 2014), HDDM-W(Frias-Blanco et al. 2014),
KSWIN(Raab, Heusinger, and Schleif 2020), PH(Sebasti ˜ao
and Fernandes 2017)) and 5 SOTA methods (MCDD(Wan,
Liang, and Yoon 2024), AMF(Mourtada, Ga ¨ıffas, and Scor-
net 2021), IWE(Jiao et al. 2022), NS(Wang et al. 2021), and
ADLTER(Wang et al. 2022)).
Comparison with Baselines and Ablation Studies
In this subsection, we compare our method with 7 classic drift
detectors and 5 SOTA methods on 9 datasets (including a vari-
ant of the SEA dataset). Due to page constraints, results for
SEA10 and SEA20 appear in the Appendix. We evaluate all
methods using three classifiers—DNN (architecture detailed
in the Appendix), Gaussian Naive Bayes (GNB) (Virtanen
et al. 2020), and VFDT (Hulten, Spencer, and Domingos
2001)—under two training regimes: incremental (dataset-I)and one-time training at an alarm (dataset-O).
Results for the comparison with classic detectors are pre-
sented in Table 1, and those for SOTA methods are given in
Table 2. For CIFAR-10-CD, due to its learning complexity,
we only use incremental training and report results in Fig. 4.
Our method is denoted as PUDD-X , where X represents the
exponent in 10−X. Based on these experiments, we derive
6 observations. We introduce 4 of them here and leave the
remaining 2 in the Appendix.
Observation 1: our method shows stronger perfor-
mance compared to classic drift detectors as evidenced
by the results presented in Table 1, Fig. 4, and additional
results in Appenidx. In incremental learning settings, PUDD
ranks first in 17 out of 24 cases across different datasets and
classifiers, and it is in the top 3 in 20 out of 24 cases. When
trained only initially or with adaptation, it still performs well.
It achieves first rank in 15 cases and top 3 in 19 cases. This
shows that PUDD is particularly effective with incremental
training. Results in Fig. 4 show PUDD outperformed all the
baselines, which demonstrates the superiority of our method
Page 7:
ddm name airline elec2 mixed ps sea0 sine
AMF 38.56 66.24 49.49 69.63 93.67 49.52
IWE 38.02 68.90 49.47 64.10 93.14 49.51
NS 67.91 76.42 81.09 72.39 93.54 91.01
ADLTER 70.00 76.10 87.63 72.48 93.40 92.18
MCD-DD 63.65 69.81 86.68 71.66 97.66 90.21
PUDD-1 63.78 77.28 89.51 72.68 98.47 94.52
PUDD-3 64.62 76.77 89.47 72.79 98.44 94.76
PUDD-5 64.45 76.92 89.37 72.74 98.49 90.90
Table 2: Comparison with SOTA methods. The dataset ps is
short for powersupply. The results for dataset sea10 and sea20
is provided in Appnedix. The table shows that our methods
PUDD in achieved top-1 in 5 out of 6 datasets, implying the
effectiveness of PUDD compared with SOTA methods.
in detecting the concept drift in the image dataset.
Observation 2: PUDD performs better with a smaller
threshold as revealed in Table 1 and additional results in Ap-
pendix. In incremental learning scenarios, PUDD-1, PUDD-3,
and PUDD-5 achieve top 1 in 5, 5, and 8 cases respectively.
When tested in training only once until alarm way, PUDD-1,
PUDD-3, and PUDD-5 achieved top 1 in 2, 6, and 8 cases re-
spectively. PUDD consistently shows improved performance
at lower thresholds in both scenarios. As detailed in the Sen-
sitivity of PU-index section, a drift alarm triggers when the p-
value is below the threshold, with lower thresholds indicating
stricter conditions for alarm detection. Therefore, PUDD’s
better performance with smaller thresholds suggests a high
sensitivity to drift.
Observation 3: PUDD shows very competitive perfor-
mance compared to SOTA methods. As shown in Table 2
and additional results in Appendix, our method attains the
top rank in 7 out of 8 cases. On certain datasets, this im-
provement is particularly pronounced. For instance, PUDD-5
achieves a 98.49% accuracy, which is 2.8% higher than the
best SOTA method. The only exception occurs in the airline
dataset, where NS and ADLTER outperform PUDD.
This discrepancy can be explained by the airline dataset’s
tabular nature and its numerous attributes, which are more
effectively modeled through tree-based ensemble learning
utilized by these SOTA methods. Moreover, these methods
adapt to drift by adjusting ensembles rather than discarding
and retraining them. In contrast, PUDD relies on retraining
classifiers solely on recent data, which may not be suitable for
attribute-rich datasets like the airline dataset. Nevertheless,
for all other datasets, the results confirm that PUDD surpasses
SOTA methods, thereby underscoring its overall superiority.
Observation 4: The Adaptive PU-index Bucketing algo-
rithm outperforms Ei-kMeans. Figure 5 shows that PUDD
surpasses Ei-kMeans across various datasets, classifier train-
ing methods, and threshold settings. The critical difference
diagram in the Appendix shows that improvements at thresh-
olds10−3and10−5are statistically significant.
In summary, these results confirm the theoretical bene-
fits of the PU-index for drift detection. PUDD outperforms
both classic and SOTA detectors, and the Adaptive PU-index
Bucketing algorithm shows significant improvements over
Figure 4: Comparison with baselines on CIFAR-10-CD, ex-
cluding methods unable to detect drift in image datasets.
Figure 5: Accuracy comparison between PUDD (using Adap-
tive PU-index Bucketing) and Ei-kMeans (EK). We show
average accuracy across 9 datasets using 3 classifiers.
Ei-kMeans. This validates the PU-index as a sensitive, robust
indicator capable of detecting drift even when error rates
remain unchanged, thereby overcoming a major shortcoming
of error rate-based approaches.
Conclusion and Future Work
In our study, we demonstrated that the PU-index, as opposed
to the error rate, is a more effective measure for detecting
concept drift in machine learning models. We utilized the
Adaptive PU-index bucketing algorithm to partition the PU-
index and the Chi-square test to detect concept drift. We also
introduced a technique for inducing concept drift in image
datasets by simulating changes in user interest. We validated
our method through experiments on both synthetic and real-
world datasets. Future work should focus on automating drift
alarm threshold determination, as current methods rely on
manual settings that may not remain optimal over time. Our
research also uncovers a method for generating multi-stream
concept drift in image datasets by emulating shifts in user
interests using Markov matrices, offering valuable insights
for research in multistream concept drift learning.
Acknowledgments
This work was supported by the Australian Research
Council through the Laureate Fellow Project under Grant
FL190100149.
Page 8:
References
Baena-Garcıa, M.; del Campo- ´Avila, J.; Fidalgo, R.; Bifet,
A.; Gavalda, R.; and Morales-Bueno, R. 2006. Early drift de-
tection method. In Fourth international workshop on knowl-
edge discovery from data streams , volume 6, 77–86. Citeseer.
Bifet, A.; and Gavalda, R. 2007. Learning from time-
changing data with adaptive windowing. In Proceedings
of the 2007 SIAM international conference on data mining ,
443–448. SIAM.
Bifet, A.; Holmes, G.; Pfahringer, B.; Kranen, P.; Kremer, H.;
Jansen, T.; and Seidl, T. 2010. Moa: Massive online analysis,
a framework for stream classification and clustering. In
Proceedings of the first workshop on applications of pattern
analysis , 44–50. PMLR.
Boracchi, G.; Carrera, D.; Cervellera, C.; and Maccio, D.
2018. QuantTree: Histograms for change detection in multi-
variate data streams. In International Conference on Machine
Learning , 639–648. PMLR.
Cobb, O.; and Van Looveren, A. 2022. Context-aware drift
detection. In International conference on machine learning ,
4087–4111. PMLR.
Coelho, R. A.; Torres, L. C. B.; and de Castro, C. L. 2023.
Concept Drift Detection with Quadtree-based Spatial Map-
ping of Streaming Data. Information Sciences .
Dau, H. A.; Bagnall, A.; Kamgar, K.; Yeh, C.-C. M.; Zhu,
Y .; Gharghabi, S.; Ratanamahatana, C. A.; and Keogh, E.
2019. The UCR time series archive. IEEE/CAA Journal of
Automatica Sinica , 6(6): 1293–1305.
EP, B. G. 1978. Statistics for Experimenters: An Introduction
to Design. Data Analysis, and Model Building , 43–48.
Frias-Blanco, I.; del Campo- ´Avila, J.; Ramos-Jimenez, G.;
Morales-Bueno, R.; Ortiz-Diaz, A.; and Caballero-Mota, Y .
2014. Online and non-parametric drift detection methods
based on Hoeffding’s bounds. IEEE Transactions on Knowl-
edge and Data Engineering , 27(3): 810–823.
Gama, J.; Medas, P.; Castillo, G.; and Rodrigues, P. 2004.
Learning with drift detection. In Advances in Artificial
Intelligence–SBIA 2004: 17th Brazilian Symposium on Arti-
ficial Intelligence, Sao Luis, Maranhao, Brazil, September
29-Ocotber 1, 2004. Proceedings 17 , 286–295. Springer.
Harries, M. 1999. Splice-2 comparative evaluation: electricity
pricing (technical report unsw-cse-tr-9905). Artificial Intelli-
gence Group, School of Computer Science and Engineering,
The University of New South Wales, Sydney , 2052.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual
learning for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition , 770–
778.
Huggard, H.; Koh, Y . S.; Dobbie, G.; and Zhang, E. 2020.
Detecting concept drift in medical triage. In Proceedings of
the 43rd International ACM SIGIR Conference on Research
and Development in Information Retrieval , 1733–1736.
Hulten, G.; Spencer, L.; and Domingos, P. 2001. Mining time-
changing data streams. In Proceedings of the seventh ACM
SIGKDD international conference on Knowledge discovery
and data mining , 97–106.Ikonomovska, E. 2011. Airline dataset. URL http://kt. ijs.
si/elena ikonomovska/data. html.(Accessed on 02/06/2020) .
Ismail Fawaz, H.; Forestier, G.; Weber, J.; Idoumghar, L.;
and Muller, P.-A. 2019. Deep learning for time series classi-
fication: a review. Data Mining and Knowledge Discovery ,
33(4): 917–963.
Jiang, M.; Wang, Z.; and Dou, Q. 2022. Harmofl: Harmo-
nizing local and global drifts in federated learning on het-
erogeneous medical images. In Proceedings of the AAAI
Conference on Artificial Intelligence , volume 36, 1087–1095.
Jiao, B.; Guo, Y .; Yang, C.; Pu, J.; Zheng, Z.; and Gong, D.
2022. Incremental Weighted Ensemble for Data Streams with
Concept Drift. IEEE Transactions on Artificial Intelligence .
LeCun, Y .; Bengio, Y .; and Hinton, G. 2015. Deep learning.
nature , 521(7553): 436–444.
Li, W.; Yang, X.; Liu, W.; Xia, Y .; and Bian, J. 2022. Ddg-
da: Data distribution generation for predictable concept drift
adaptation. In Proceedings of the AAAI Conference on Artifi-
cial Intelligence , volume 36, 4092–4100.
Liu, A.; Lu, J.; and Zhang, G. 2020. Concept drift detec-
tion via equal intensity k-means space partitioning. IEEE
transactions on cybernetics , 51(6): 3198–3211.
Liu, A.; Song, Y .; Zhang, G.; and Lu, J. 2017. Regional
concept drift detection and density synchronized drift adap-
tation. In IJCAI International Joint Conference on Artificial
Intelligence .
Lu, J.; Liu, A.; Dong, F.; Gu, F.; Gama, J.; and Zhang, G.
2018a. Learning under concept drift: A review. IEEE trans-
actions on knowledge and data engineering , 31(12): 2346–
2363.
Lu, J.; Liu, A.; Song, Y .; and Zhang, G. 2020. Data-driven
decision support under concept drift in streamed big data.
Complex & intelligent systems , 6(1): 157–163.
Lu, J.; Xuan, J.; Zhang, G.; and Luo, X. 2018b. Structural
property-aware multilayer network embedding for latent fac-
tor analysis. Pattern Recognition , 76: 228–241.
Miyaguchi, K.; and Kajino, H. 2019. Cogra: Concept-drift-
aware stochastic gradient descent for time-series forecasting.
InProceedings of the AAAI Conference on Artificial Intelli-
gence , volume 33, 4594–4601.
Montiel, J.; Halford, M.; Mastelini, S. M.; Bolmier, G.;
Sourty, R.; Vaysse, R.; Zouitine, A.; Gomes, H. M.; Read,
J.; Abdessalem, T.; et al. 2021. River: machine learning
for streaming data in python. Journal of Machine Learning
Research , 22(110): 1–8.
Mourtada, J.; Ga ¨ıffas, S.; and Scornet, E. 2021. AMF: Ag-
gregated Mondrian forests for online learning. Journal of the
Royal Statistical Society Series B: Statistical Methodology ,
83(3): 505–533.
Park, J.; Shalit, U.; Sch ¨olkopf, B.; and Muandet, K. 2021.
Conditional distributional treatment effect with kernel con-
ditional mean embeddings and U-statistic regression. In
International Conference on Machine Learning , 8401–8412.
PMLR.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V .;
Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss,
Page 9:
R.; Dubourg, V .; Vanderplas, J.; Passos, A.; Cournapeau, D.;
Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-
learn: Machine Learning in Python. Journal of Machine
Learning Research , 12: 2825–2830.
Raab, C.; Heusinger, M.; and Schleif, F.-M. 2020. Reactive
soft prototype computing for concept drift streams. Neuro-
computing , 416: 340–351.
Sebasti ˜ao, R.; and Fernandes, J. M. 2017. Supporting the
page-hinkley test with empirical mode decomposition for
change detection. In International Symposium on Method-
ologies for Intelligent Systems , 492–498. Springer.
Silverman, B. W. 2018. Density Estimation for Statistics
and Data Analysis . Monographs on statistics and applied
probability. Boca Raton, FL: CRC Press. ISBN 1-351-45617-
2.
Song, X.; Wu, M.; Jermaine, C.; and Ranka, S. 2007. Statisti-
cal change detection for multi-dimensional data. In Proceed-
ings of the 13th ACM SIGKDD international conference on
Knowledge discovery and data mining , 667–676.
Souza, V . M.; Parmezan, A. R.; Chowdhury, F. A.; and
Mueen, A. 2021. Efficient unsupervised drift detector for
fast and high-dimensional data streams. Knowledge and
Information Systems , 63: 1497–1527.
Stucchi, D.; Rizzo, P.; Folloni, N.; and Boracchi, G. 2023.
Kernel quanttree. In International Conference on Machine
Learning , 32677–32697. PMLR.
Tahmasbi, A.; Jothimurugesan, E.; Tirthapura, S.; and Gib-
bons, P. B. 2021. Driftsurf: Stable-state/reactive-state learn-
ing under concept drift. In International Conference on
Machine Learning , 10054–10064. PMLR.
Valiant, L. G. 1984. A theory of the learnable. In An-
nual ACM Symposium on Theory of Computing: Proceed-
ings of the sixteenth annual ACM symposium on Theory of
computing , 436–445. New York, NY , USA: ACM. ISBN
0897911334.
Virtanen, P.; Gommers, R.; Oliphant, T. E.; Haberland,
M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.;
Weckesser, W.; Bright, J.; van der Walt, S. J.; Brett, M.;
Wilson, J.; Millman, K. J.; Mayorov, N.; Nelson, A. R. J.;
Jones, E.; Kern, R.; Larson, E.; Carey, C. J.; Polat, ˙I.; Feng,
Y .; Moore, E. W.; VanderPlas, J.; Laxalde, D.; Perktold, J.;
Cimrman, R.; Henriksen, I.; Quintero, E. A.; Harris, C. R.;
Archibald, A. M.; Ribeiro, A. H.; Pedregosa, F.; van Mulbregt,
P.; and SciPy 1.0 Contributors. 2020. SciPy 1.0: Fundamen-
tal Algorithms for Scientific Computing in Python. Nature
Methods , 17: 261–272.
Wan, K.; Liang, Y .; and Yoon, S. 2024. Online Drift Detection
with Maximum Concept Discrepancy. In Proceedings of the
30th ACM SIGKDD Conference on Knowledge Discovery
and Data Mining .
Wang, K.; Lu, J.; Liu, A.; Song, Y .; Xiong, L.; and Zhang, G.
2022. Elastic gradient boosting decision tree with adaptive
iterations for concept drift adaptation. Neurocomputing , 491:
288–304.
Wang, K.; Lu, J.; Liu, A.; Zhang, G.; and Xiong, L. 2021.
Evolving gradient boost: A pruning scheme based on lossimprovement ratio for learning under concept drift. IEEE
Transactions on Cybernetics .
Yonekawa, K.; Saito, K.; and Kurokawa, M. 2022. RIDEN:
Neural-based Uniform Density Histogram for Distribution
Shift Detection. In Proceedings of the Second International
Conference on AI-ML Systems , 1–9.
Yu, E.; Lu, J.; Zhang, B.; and Zhang, G. 2024. Online Boost-
ing Adaptive Learning under Concept Drift for Multistream
Classification. In Proceedings of the AAAI Conference on
Artificial Intelligence , volume 38, 16522–16530.
Yuan, L.; Li, H.; Xia, B.; Gao, C.; Liu, M.; Yuan, W.; and
You, X. 2022. Recent Advances in Concept Drift Adaptation
Methods for Deep Learning. In IJCAI , 5654–5661.
Zhou, M.; Lu, J.; Song, Y .; and Zhang, G. 2023. Multi-Stream
Concept Drift Self-Adaptation Using Graph Neural Network.
IEEE Transactions on Knowledge and Data Engineering ,
35(12): 12828–12841.
Page 10:
Proof of Identical Error Rate when Concept Drift Occurs
Definition 1. LetXbe an input space and Ybe an output space. A concept is a function c:X → Y that assigns to each input
x∈ X a corresponding label c(x)∈ Y. A classifier h:X → Y is a learned hypothesis that attempts to approximate a given
concept.
We assume a data-generating distribution Ddefined over X, with x∼D. At any time, the true concept cdetermines the label
for each x. If concept drift occurs, the concept changes from an old concept coldto a new concept cnew, potentially altering the
mapping from XtoY. We assume the input distribution Dremains fixed, i.e., p(x)is unchanged, but cold(x)andcnew(x)may
differ on some subset of X.
Definition 2. A loss function L:Y × Y → R≥0measures the discrepancy between the predicted label h(x)and the true label
c(x). For example:
•0-1 loss:
L(h(x), c(x)) =(0,ifh(x) =c(x)
1,otherwise.
•MSE loss:
L(h(x), c(x)) = ( h(x)−c(x))2.
The error rate or generalization error of hunder concept cand distribution Dis given by:
Ex∼D[L(h(x), c(x))].
Theorem 3. Concept drift can occur even if the error rate under a given loss function remains unchanged. More formally, there
exist scenarios such that cold(x)̸=cnew(x)for some x∈ X, yet:
Ex∼D[L(h(x), cnew(x))] =Ex∼D[L(h(x), cold(x))]
holds for both 0-1 and MSE losses.
Proof. Consider a classifier h:X → Y and an unchanged distribution D. Let coldandcnewbe the old and new concepts,
respectively, and define:
Xsame ={x|cold(x) =cnew(x)},
Xdiff={x|cold(x)̸=cnew(x)}.
Case 1: 0-1 Loss. We have:
Ex∼D[L(h(x), cnew(x))] =Ex∼D[L(h(x), cnew(x))·1Xsame] +Ex∼D[L(h(x), cnew(x))·1Xdiff]
=Ex∼D[L(h(x), cold(x))·1Xsame] +Ex∼D[L(h(x), cnew(x))·1Xdiff]
=Ex∼D[L(h(x), cold(x))] +Ex∼D[(L(h(x), cnew(x))− L(h(x), cold(x)))·1Xdiff].
IfEx∼D[(L(h(x), cnew(x))− L(h(x), cold(x)))·1Xdiff] = 0 , then:
Ex∼D[L(h(x), cnew(x))] =Ex∼D[L(h(x), cold(x))],
even though Xdiff̸=∅. Thus, concept drift can occur with no change in error rate under 0-1 loss.
Case 2: MSE Loss. Let∆(x) =cnew(x)−cold(x). Then:
Ex∼D[(h(x)−cnew(x))2] =Ex∼D[(h(x)−cold(x)−∆(x))2]
=Ex∼D[(h(x)−cold(x))2] +Ex∼D[∆(x)2+ 2∆( x)(cold(x)−h(x))].
IfEx∼D[∆(x)2+ 2∆( x)(cold(x)−h(x))] = 0 , then:
Ex∼D[(h(x)−cnew(x))2] =Ex∼D[(h(x)−cold(x))2],
despite the existence of ∆(x)̸= 0for some x. Hence, under MSE loss, concept drift can also remain undetected by the error rate.
In both cases, we find scenarios where concept drift does not manifest as a change in expected error, completing the proof.
Page 11:
Table of notation
Symbol Description
D1,t Streaming data collected during the
period [1, t]
(xj, yj) Instance jwith features xjand label
yj¯Dj Chunk jof data
M Chunk size
P1,t(x, y)Distribution of data in the period
[1, t]
f Classifier
ˆy Predicted class probabilities
n Number of classes
ei Prediction error of instance xi
I(·) Indicator function
ui Uncertainty of instance xi
fyi(xi) Predicted probability of instance xi
belonging to its true class yi
Oij Observed frequency in the i-th row
andj-th column of the contingency
table
Eij Expected frequency for the cell at
thei-th row and j-th column
Hi Histogram built for window i
N Total observed frequencies
χ2Chi-square test statistic
p p-value associated with χ2
w Degrees of freedom
K Hyperparameter in Ei-kMeans
σ Predefined threshold for the Chi-
square test
Table 3: Notation table
Proof of Theorem 1
Proof. LetWbe a window containing a subset of instances
from the data stream D1,t. LetY={1,2, ..., n}be the set of
class labels, where nis the number of classes. For an instance
xiinW, letˆyi= arg max j∈Yfj(xi)be the predicted class.
Define the partition B={B0, B1, ..., B m}ofWas follows:
B0={(xi, yi)∈W|ˆyi=yi}
m[
j=1Bj={(xi, yi)∈W|ˆyi̸=yi}
where Bjare pairwise disjoint, i.e., Bi∩Bj=∅fori̸=
j. Moreover,Sm
j=1Bjcovers the entire range of PU-index
values [0, 1] for misclassified instances
LetHWbe the histogram of Wconstructed based on this
partition, where each bin corresponds to a set in B. Given
HW1=HW2, it implies
|B(1)
j|
|W1|=|B(2)
j|
|W2|,∀j∈ {0,1, ..., m},where | · |denotes set cardinality. The error rate for window
Wis defined as:
¯eW=1
|W|X
(xi,yi)∈Wei=1
|W|mX
j=1|Bj|= 1−|B0|
|W|
Therefore, ¯eW1= 1−|B(1)
0|
|W1|= 1−|B(2)
0|
|W2|= ¯eW2, which
implies that the error rate of W1andW2are equal.
The error standard deviation for window Wis defined as:
σW=vuut1
|W|X
(xi,yi)∈W(ei−¯eW)2
=vuut1
|W|X
(xi,yi)∈W(e2
i−2ei·¯eW+ ¯e2
W)
=vuut1
|W|X
(xi,yi)∈We2
i−2¯eW·1
|W|X
(xi,yi)∈Wei+ ¯e2
W
=vuut1
|W|X
(xi,yi)∈Wei−2¯e2
W+ ¯e2
W
=q
¯eW−¯e2
W
(15)
Since ¯eW1= ¯eW2, we have σW1=σW2.
Therefore, the error rates and error standard deviations of
W1andW2are equal.
Proof of Theorem 2
Proof. We prove the theorem by providing a counterexample
in binary classification case. Consider two windows W1and
W2:
•InW1, 50% of instances have uncertainty in (0.9,1.0],
and the remaining 50% have uncertainty in [0.0,0.1].
•InW2, 50% of instances have uncertainty in (0.8,0.9],
and the remaining 50% have uncertainty in (0.1,0.2].
LetB1={b:b⊆[0,0.5]}andB2={b:b⊆(0.5,1]}
denote the sets of histogram bins below and above the uncer-
tainty value of 0.5, respectively.
In both windows, 50% of instances fall into bins in B2
(uncertainty >0.5) and are misclassified, while the remaining
50% fall into bins in B1(uncertainty ≤0.5) and are correctly
classified. Thus, the error rates of both windows are equal:
¯e1=1
n(1)X
b∈B2n(1)
b= 0.5, (16)
¯e2=1
n(2)X
b∈B2n(2)
b= 0.5. (17)
where n(1)andn(2)are the total numbers of instances in
W1andW2, respectively, and n(1)
bandn(2)
bdenote the num-
ber of instances falling into bin binW1andW2, respectively.
In this case, the error rates of the two windows are equal.
Page 12:
Consequently, the error standard deviations of both win-
dows are also equal:
std1=vuut1
n(1)n(1)X
i=1(e(1)
i−¯e1)2
=p
¯e1(1−¯e1) = 0 .5, (18)
std2=vuut1
n(2)n(2)X
i=1(e(2)
i−¯e2)2
=p
¯e2(1−¯e2) = 0 .5. (19)
However, the bin proportions of the uncertainty histograms
H1andH2are different:
•InH1, 50% of instances fall into the bin (0.9,1.0], and
50% fall into the bin [0.0,0.1].
•InH2, 50% of instances fall into the bin (0.8,0.9], and
50% fall into the bin (0.1,0.2].
Therefore, even though the error standard deviations or
error rates of W1andW2are equal, their uncertainty his-
tograms have different bin proportions, which contradicts the
statement that equal error standard deviations imply identical
bin proportions in the uncertainty histograms.
Pseudo code and time complexity of PUDD
Algorithm 1: Adaptive PU-index Bucketing algorithm drift
detector based on PU-index.
1:Input : Chi-square test threshold σ, PU-index of recent
examples ut1,tderived from sliding window strategy,
hyperparameter K.
2:forr=t1, ..., t do
3: Derive uC
t1,r,uC
r,t,uM
t1,r,uM
r,tfrom equation (1-4).
4: ifMean( uM
t1,r)≥Mean( uM
r,t)then
5: Initialize Kcentroids Cbased on Ei-kMeans cen-
troids initialization algorithm on uC
t1,r.
6: Build clusters based on Ei-kMeans partitioning al-
gorithm on uC
t1,rbased on initial centroids C.
7: For each cluster, count the frequencies of uC
t1,rand
uC
t1,r. Subsequently, calculate the sizes of uM
t1,rand
uM
r,t. Utilize these values to populate the contin-
gency table accordingly.
8: Conduct a Chi-square test on the contingency table
and derive p-value pby Eq (14).
9: P=P∪ {p}.
10: end if
11:end for
12:ifP̸=∅andminP < σ then
13: Raise drift detected alarm.
14:end if
The algorithm framework of the PU-index-based drift de-
tector is provided in Algorithm 1. Assuming the current time
step is tand the last alarm is raised at time step t1, thus therearet−t1chunks in the current sub-stream. The first step is to
split these chunks into two windows. In line 2, we search all
the cutting points from the index t1tot. Then we apply the
Adaptive PU-index Bucketing-based drift detector on the de-
rived two samples from line 3-10. Notice that this brutal-force
search can be processed simultaneously by multiprocess pro-
cessing in implementation. In line 4, we compare the mean of
PU-index on the two samples uM
t1,randuM
r,t, and we skip the
subsequent p-value computation if the mean of uM
t1,ris larger
than that of uM
r,tto speed up the algorithm. From line 5-9, we
built clusters and a contingency table to derive the p-value
for each cutting point. At line 12, we investigate whether the
minimum p-value is smaller than the predefined threshold. If
so, the null hypothesis test is rejected and we claim the con-
cept drift has occurred. The time complexity for the lines 3 is
O(n), where nis the number of examples in the sliding win-
dow. The time complexity for the lines 5-9 is O(K2nlogn)
as introduced in (Liu, Lu, and Zhang 2020). Thus, the total
time complexity of our algorithm is O(K2nlogn).
Pseudo code for online instance based PUDD
Firstly, we introduce how to incrementally update the key
variables, clusters, and contingency tables. Then we provide
the pseudo code of online instance based PUDD algorithm.
Incremental Updates of Sets. At any time t, for each
candidate cut point r∈[t1, t], we have the following sets:
uC
t1,r(t),uM
t1,r(t),uC
r,t(t), and uM
r,t(t), derived from the PU-
index.
Here, uC
t1,r(t)anduM
t1,r(t)represent the correctly and mis-
classified instances, respectively, in the substream from t1to
r, while uC
r,t(t)anduM
r,t(t)represent the correctly and mis-
classified instances, respectively, in the substream from r+ 1
tot.
When time advances to t+ 1, we receive a new instance
ut+1with prediction ˆyt+1and true label yt+1. We update
the sets for all r∈[t1, t+ 1] as follows.
uC
t1,r(t+1) =
uC
t1,r(t), r < t + 1
uC
t1,r(t)∪ {ut+1}, r=t+ 1andˆyt+1=yt+1
uC
t1,r(t), r =t+ 1andˆyt+1̸=yt+1
(20)
uM
t1,r(t+1) =
uM
t1,r(t), r < t + 1
uM
t1,r(t), r =t+ 1andˆyt+1=yt+1
uM
t1,r(t)∪ {ut+1}, r=t+ 1andˆyt+1̸=yt+1
(21)
uC
r,t+1(t+1) =
uC
r,t(t)∪ {ut+1}, r < t + 1andˆyt+1=yt+1
uC
r,t(t), r < t + 1andˆyt+1̸=yt+1
∅, r =t+ 1
(22)
Page 13:
uM
r,t+1(t+1) =
uM
r,t(t)∪ {ut+1}, r < t + 1andˆyt+1̸=yt+1
uM
r,t(t), r < t + 1andˆyt+1=yt+1
∅, r =t+ 1
(23)
In these definitions, if r < t + 1, the first window ut1,r
remains unchanged, and we only incrementally update the
second window ur,t+1by adding the new instance ut+1ac-
cording to its classification correctness. If r=t+ 1, a new
cut point is introduced, and the first window ut1,ris aug-
mented by ut+1if it is correctly classified, while the second
window ur,t+1is initially empty.
Incremental Updates of Clusters. For each r∈[t1, t], at
timet, we have clusters:
Ct1,r(t) ={C1(t), C2(t), . . . , C K(t)}, (24)
obtained by applying Ei-kMeans initialization and subse-
quent kMeans clustering on uC
t1,r(t).
When moving to time t+ 1, the incremental update is:
Ct1,r(t+ 1) =(Ct1,r(t), r < t + 1
kMeans (uC
t1,r(t+ 1), Cinit), r=t+ 1
(25)
Ifr < t + 1, no new instances affect the first window
ut1,r, so we can directly use the previously cached clusters.
Ifr=t+ 1, the arrival of ut+1modifies uC
t1,r(t+ 1), requir-
ing a fresh Ei-kMeans initialization and kMeans run. Once
computed, Ct1,r(t+ 1) is cached for future reuse.
Incremental Updates of Contingency Tables. For a given
r∈[t1, t], at time t, we have a contingency table:
Tt1,r(t)∈R2×(K+1),
where the first row corresponds to (uC
t1,r(t), uM
t1,r(t))and
the second row corresponds to (uC
r,t(t), uM
r,t(t)). The first K
columns represent the counts of correctly classified instances
in each of the Kclusters, and the last column stores the
counts of misclassified instances.
When time advances to t+ 1, we update Tt1,r(t)as:
Tt1,r(t+ 1) =(Tt1,r(t) + ∆ T, r < t + 1
Compute Contingency Table , r=t+ 1
(26)
The term ∆Taccounts for the incremental contribution of
the new instance ut+1. To compute ∆T, we first determine
whether ut+1falls into the first or second window, depending
onr. Next, we check if ut+1is correctly classified or not.
If it is correctly classified, we identify its assigned cluster
using Ct1,r(t+ 1) , and increment the corresponding cluster
count in the table. If it is misclassified, we increment the
misclassified count in the last column. This process involves
only a single increment operation per new instance, making
it highly efficient.
Ifr=t+ 1, since the composition of uC
t1,r(t+ 1) (and
possibly uM
t1,r(t+ 1) ) changes due to the arrival of ut+1,
we must fully recompute the contingency table according toAlgorithm 2: Online Instance based PU-index Bucketing-
based Drift Detector
Require: Current t+ 1, last alarm t1, threshold σ,K,ut1,t,
cached results.
1:P← ∅
2:forr=t1→t+ 1do
3: Update uC
t1,r(t+ 1), uM
t1,r(t+ 1), uC
r,t+1(t+ 1),
uM
r,t+1(t+ 1) using Eqs. (20)–(23)
4: Compute Ct1,r(t+ 1) using Eq. (25)
5: Compute Tt1,r(t+ 1) using Eq. (26)
6: Compute Chi-square p-value pr
7: P=P∪ {pr}
8:end for
9:ifP̸=∅andmin(P)< σ then
10: Raise drift alarm
11:end if
uC
t1,r(t+ 1),uM
t1,r(t+ 1),uC
r,t+1(t+ 1),uM
r,t+1(t+ 1). After
recomputation, Tt1,r(t+ 1) is cached.
Time complexity analysis. By leveraging caching and
incremental updates as defined in Eqs. (20)–(23), (25), and
(26), our online instance-based PUDD algorithm achieves ef-
ficient updates with minimal computational overhead. When
previously computed results are available, there is no need
for a full recomputation of clusters or contingency tables,
thus enabling near-constant-time updates at most time steps.
The pseudo code of the online instance-based PUDD algo-
rithm is shown in Algorithm 2. Consider the computational
cost at each relevant line (as annotated in the algorithm):
•Updating uC
t1,r(t+ 1) , uM
t1,r(t+ 1) , uC
r,t+1(t+
1), uM
r,t+1(t+ 1) using Eqs. (20)–(23) requires O(1)time
(line 3).
•Recomputing clusters (line 4) requires O(1)time if r̸=
t+ 1due to cached reuse, and O(K2nlogn)otherwise,
where nis the number of examples from t1tot+ 1.
•Incrementally updating the contingency table (line 5) is
O(K), as it involves placing a single new instance into an
existing bin structure.
•Computing the Chi-square statistic and deriving the p-
value (line 6) is O(2K).
In total, the time complexity is:
O(K2nlogn+ (t−t1)(1 + 3 K)).
Since the for-loop can be parallelized via multi-threading,
the complexity is effectively reduced to:
O(K2nlogn+ 3K),
for the incremental component. This parallelization, com-
bined with caching and incremental updates, enables efficient
online PU-index-based drift detection with substantially re-
duced overhead while maintaining strong detection capabili-
ties.
Page 14:
Implementation details
Our experiment consists of four parts: comparison with clas-
sic drift detectors, comparison with SOTA methods, compari-
son with baselines on the CIFAR-10-CD dataset, and ablation
study.
The experimental setup was established on a server run-
ning Ubuntu 18.04, which was outfitted with an NVIDIA
A100 GPU and possessed 200GB of RAM. The performance
of drift detectors was evaluated across three classifiers: a
Gaussian Naive Bayes classifier (GNB), a Hoeffding Tree
Classifier (Very Fast Decision Tree, VFDT) (Hulten, Spencer,
and Domingos 2001), and a Deep Neural Network (DNN)
(LeCun, Bengio, and Hinton 2015). GNB was implemented
using the widely-used scikit-learn package (Pedregosa et al.
2011), VFDT via the River package (Montiel et al. 2021),
and the DNN, a three-layer neural network, through the Py-
Torch package. The DNN consists of a multi-layer perceptron
(MLP) followed by an output layer. For most datasets, the
DNN comprises two hidden layers, each with 64 neurons and
ReLU activation. The input dimension is dataset-dependent.
For the airline dataset, we employ three hidden layers (512,
256, and 64 neurons respectively), all using ReLU activation.
The input dimension for this specific case is 679. The final
layer is a fully connected layer mapping from 64 neurons to
the output dimension, which varies based on the datasets.
All experiments have been repeated 100 times with differ-
ent random seeds and average accuracy is reported in this
paper. The bins of the histogram, i.e., k, built by Adaptive
PU-index bucketing are automatically determined by the Ei-
kMeans algorithm. We initialize it as 5. We first introduce
the experiment of classic drift detectors. For each time step,
classifiers receive a data chunk for prediction and then we
feed the error rate into the drift detector for drift detection
and adaptation. For a fair comparison, GNB and VFDT were
adapted by retraining on the current chunk, while DNN adap-
tation involved resetting the last layer. The evaluation was
conducted using two systems: Incremental way (test-and-
train) and Train Once Until Alarm way (training the classifier
only upon new concept detection or initialization). Moreover,
the Adam optimizer was employed with an initial learning
rate of 0.01 for the training of DDN, and the training involved
100 epochs. The setting of the ablation study is the same as
above.
For comparison with SOTA methods, considering that all
SOTA approaches, except MCDD, are based on ensemble
learning, such as AMF and IWE using 10 classifiers, and
others allowing up to 10,000 base classifiers, we ensured
fairness by also using an ensemble approach for PUDD and
MCDD. The classifier of our method comprises 5 DNNs as
base classifiers. The final prediction is derived through soft
voting based on the individual predictions of these classifiers.
Additionally, we combine the prediction uncertainty from
all 5 base classifiers for effective drift detection. We also
use the Adam optimizer with a learning rate of 0.01 for
this experiment but with an increased epoch number of 100.
The adaptation approach of this experiment is resetting the
classifier when a drift is detected. The increased training
epoch number and the adaptation approach of the resetting
model are aimed at enhancing the learning capacity and fine-Datasets Sea-N Sine Mixed Elec2 Airline Powersupply CIFAR-10-CD
# of feature 3 2 4 8 679 4 32x32
# of class 2 2 2 2 2 2 10
# of dataset 100k 100k 100k 45k 58k 29k 50k
chunk size 1000 1000 1000 1000 1000 1000 100
Table 4: Statistics of datasets.
tuning the performance of the Deep Neural Networks within
our ensemble framework.
For the proposed CIFAR-10-CD dataset, which is an image
dataset with concept drift, we utilized ResNet-18 (He et al.
2016) as the classifier model. The optimization was carried
out using the Stochastic Gradient Descent (SGD) optimizer
with a learning rate set to 0.01. We only evaluate incremen-
tally for CIFAR-10-CD datasets due to their high learning
difficulty and the training epoch is set as 5.
Datasets introduction
1.Elec2(Harries 1999): A real-world dataset from the Aus-
tralian electricity market with 45,312 instances (only the
first 45,000 were used and split into 45 chunks). It in-
cludes features like electricity demand and a binary class
label for price direction.
2.Airline(Ikonomovska 2011): This real-world dataset in-
cludes 581,0462 (we only use the first 58,000 instances)
flight schedule records with binary labels for flight de-
lays, containing concept drift due to changes in days, and
times. The dataset is split into 58 chunks. The dataset
is one-hot encoded and the dimension increases to 679
following (Tahmasbi et al. 2021). The dataset can be ob-
tained through the official GitHub repository of the work.
3.PowerSupply(Dau et al. 2019): A real-world dataset of
29,928 (only the first 29,000 instances are used and split
into 29 chunks) hourly power supply records, exhibiting
concept drift potentially due to seasonal changes or differ-
ences between weekdays and weekends.
4.SEA(Bifet et al. 2010): A synthetic dataset from MOA
(Bifet et al. 2010) with 100,000 data points divided into
100 chunks, demonstrating concept drift by changing
thresholds of classification function and including noise
variants. We alter the threshold every tenth chunk.
5.SINE(Gama et al. 2004): Characterized by two attributes,
this dataset involves abrupt concept drift with 100,000
samples (divided into 100 chunks), where labels are deter-
mined by a sine function and altered every tenth chunk.
6.Mixed(Gama et al. 2004): Designed for studying abrupt
concept drift, this dataset comprises 100,000 boolean and
numeric examples. We alter its classification function
every ten chunks to simulate concept drift.
7.CIFAR-10-CD: A variant of the CIFAR-10 image dataset,
modified to simulate concept drift in user interests. The
50,000 images are divided into 100 chunks. The labels of
interests change according to the Markov process.
Page 15:
Baselines introduction
1.DDM(Gama et al. 2004): This method monitors the on-
line error rate of a learning algorithm and declares a new
context when the error is higher than a predefined warning
and drift level.
2.EDDM(Baena-Garcıa et al. 2006): EDDM detects concept
drift, especially in slow and gradual changes, by measur-
ing the distance between classification errors instead of
the error rate.
3.HDDM-A/W(Frias-Blanco et al. 2014): HDDM-A fo-
cuses on abrupt changes using a moving average, while
HDDM-W detects gradual changes with weighted moving
averages, both maintaining the efficiency of the original
HDDM.
4.KSWIN(Raab, Heusinger, and Schleif 2020):
Kolmogorov-Smirnov Windowing uses a sliding
window and the Kolmogorov-Smirnov test to detect
distribution changes in data streams.
5.PH(Sebasti ˜ao and Fernandes 2017): This method en-
hances the Page Hinkley Test by using a data-dependent
second-order intrinsic mode function for more robust
change detection without the need for manual parame-
ter tuning.
6.ADWIN(Bifet and Gavalda 2007): ADWIN dynamically
manages variable-sized data windows for concept drift
in data streams, automatically adjusting the window size
based on detected changes.
7.IWE(Jiao et al. 2022): This method adapts to new con-
cepts by incrementally adjusting the weights of historical
classifiers, using a variable-size window for rapid adapta-
tion to changing data patterns.
8.AMF(Mourtada, Ga ¨ıffas, and Scornet 2021): An online
variant of the Random Forest algorithm, AMF uses Mon-
drian Forests for real-time learning and adaptation. It
prunes the decision trees to adjust to data pattern changes.
9.ADLTER(Wang et al. 2022): It introduces an Adaptive
Iterations method for Gradient Boosting Decision Tree.
The method dynamically adjusts the number of iterations
in response to drift severity. It also retrains classifiers
when necessary.
10.NS(Wang et al. 2021): This method uses a Loss Improve-
ment Ratio to evaluate GBDT weak learners. It proposed
both Naive and Statistical Pruning strategies to adjust the
model by pruning less effective learners in response to
concept drift.
11.MCDD (Wan, Liang, and Yoon 2024): A novel concept
drift detection method that uses maximum concept dis-
crepancy and contrastive learning of concept embeddings
to adaptively identify various forms of concept drift in
high-dimensional data streams.
Observations of PUDD performance
Observation 5: increment learning is generally better than
training only once until alarm way from Table 1. This
observation aligns with the PAC theory (Valiant 1984), which
posits that the error of the classifier is inversely proportionalIncremental Training The other case
Classifier ddm name sea10-I sea20-I sea10-O sea20-O
DNNADWIN 87.89 78.15 84.85 75.04
DDM 88.04 78.12 83.88 74.30
EDDM 83.63 77.32 77.47 72.28
HDDM-A 87.90 77.81 85.58 75.74
HDDM-W 87.92 77.61 84.72 76.04
KSWIN 87.97 78.09 82.31 74.54
PH 87.79 77.92 85.92 75.85
PUDD-1 87.86 77.50 85.91 76.00
PUDD-3 87.80 77.33 86.02 76.35
PUDD-5 87.80 77.38 86.24 76.38
GNBADWIN 86.24 76.90 85.43 75.63
DDM 85.47 76.26 84.58 74.72
EDDM 86.26 76.67 85.78 75.91
HDDM-A 86.63 77.22 85.82 76.25
HDDM-W 85.70 77.44 85.05 76.70
KSWIN 84.34 76.19 82.84 74.73
PH 86.89 77.57 86.18 76.67
PUDD-1 87.24 78.08 86.53 76.89
PUDD-3 87.28 77.86 86.61 77.02
PUDD-5 87.18 77.68 86.55 76.67
VFDTADWIN 85.50 76.05 85.58 75.86
DDM 84.89 75.87 84.62 74.96
EDDM 86.13 76.59 85.97 76.48
HDDM-A 85.62 76.22 85.83 76.41
HDDM-W 85.34 76.44 85.15 76.73
KSWIN 84.53 75.86 82.64 74.66
PH 85.67 76.41 86.15 76.57
PUDD-1 86.35 77.02 86.30 76.85
PUDD-3 86.25 76.90 86.57 76.86
PUDD-5 86.19 76.83 86.19 76.29
Table 5: Comparative analysis against classic drift detectors
across 2 synthetic datasets. The top 3 results are highlighted
in bold and the top 1 results are in both bold and underlined.
PUDD- xrepresents the threshold set as 10−xfor our method.
to the number of training data. The reason is that this training
approach allows classifiers to learn continuously and adapt
to new data. This way, the classifiers produce prediction
uncertainty that better reflects the current concept, leading
to more accurate drift detection. The only exception is that
DNN shows better performance on mixed datasets under
the setting of trained only once until alarm way, e.g. when
the threshold is set as 10−5, the test accuracy is 82.81%
for incremental training vs 84.90% for the counterpart. The
potential reason is the dataset contains both boolean and
numeric datasets, which requires data preprocessing for DNN
to conduct continual learning. However, our experiments aim
to compare PUDD with baselines to verify its effectiveness,
thus we did not specifically design data preprocessing for the
dataset. Moreover, the overall experiment result confirms the
advantage of incremental learning.
Observation 6: our method demonstrates higher en-
hanced performance with DNNs compared to classic ma-
chine learning classifiers, e.g. GNB and VFDT . For exam-
ple, under the incremental training setting, our approach with
Page 16:
ddm name sea10-I sea20-I
AMF 83.70 73.41
IWE 84.73 74.33
NS 84.39 76.00
ADLTER 85.89 76.48
MCD-DD 87.22 77.25
PUDD-1 87.72 76.93
PUDD-3 87.67 77.22
PUDD-5 87.74 77.32
Table 6: Comparison with SOTA methods. The table shows
that our methods PUDD in achieved top-1 in 2 out of 2
datasets, implying the effectiveness of PUDD compared with
SOTA methods.
DNNs surpasses the other two classifiers in 6 out of 8 datasets.
In comparison, ADWIN combined with DNN only exceeds
the performance of the other classifiers in 4 datasets. This
observation suggests that the PU-index generated by DNNs
provides more detailed insights than those from GNB and
VFDT. Such granularity acts as a more informative indicator
for drift detection, and our method, PUDD, capitalizes on this
feature more effectively than the other baseline approaches.
Critical difference diagrams of our experiments
Figure 6: Critical difference diagrams of the rank of PUDD
and classic drift detectors. The figure shows that the PUDD
with threshold 10−3and10−5outperform all classic drift
detectors and the result is statistically significant.
Figure 7: Critical difference diagrams of the rank of PUDD
and SOTA methods. The figure shows that the PUDD out-
performs all SOTA methods and the result is statistically
significant.
Figure 8: Critical difference diagrams of the rank of PUDD
(based on Adaptive PU-index bucketing) and EK (original
Ei-kMeans partitioning). The suffix of each method indicates
the logarithm of the threshold value. The figure shows that
the improvement of the Adaptive PU-index Bucketing with
threshold 10−3and10−5are statistically significant.