loader
Generating audio...

arxiv

Paper 2412.11158

Early Concept Drift Detection via Prediction Uncertainty

Authors: Pengqian Lu, Jie Lu, Anjin Liu, Guangquan Zhang

Published: 2024-12-15

Abstract:

Concept drift, characterized by unpredictable changes in data distribution over time, poses significant challenges to machine learning models in streaming data scenarios. Although error rate-based concept drift detectors are widely used, they often fail to identify drift in the early stages when the data distribution changes but error rates remain constant. This paper introduces the Prediction Uncertainty Index (PU-index), derived from the prediction uncertainty of the classifier, as a superior alternative to the error rate for drift detection. Our theoretical analysis demonstrates that: (1) The PU-index can detect drift even when error rates remain stable. (2) Any change in the error rate will lead to a corresponding change in the PU-index. These properties make the PU-index a more sensitive and robust indicator for drift detection compared to existing methods. We also propose a PU-index-based Drift Detector (PUDD) that employs a novel Adaptive PU-index Bucketing algorithm for detecting drift. Empirical evaluations on both synthetic and real-world datasets demonstrate PUDD's efficacy in detecting drift in structured and image data.

Paper Content: on Alphaxiv
Page 1: Early Concept Drift Detection via Prediction Uncertainty Pengqian Lu1, Jie Lu1*, Anjin Liu1, Guangquan Zhang1 1Australian Artificial Intelligence Institute (AAII), University of Technology Sydney, Ultimo, NSW 2007, Australia {Pengqian.Lu@student., Jie.Lu@, Anjin.Liu@, Guangquan.Zhang@ }uts.edu.au Abstract Concept drift, characterized by unpredictable changes in data distribution over time, poses significant challenges to machine learning models in streaming data scenarios. Although error rate-based concept drift detectors are widely used, they often fail to identify drift in the early stages when the data distribu- tion changes but error rates remain constant. This paper intro- duces the Prediction Uncertainty Index (PU-index), derived from the prediction uncertainty of the classifier, as a superior alternative to the error rate for drift detection. Our theoretical analysis demonstrates that: (1) The PU-index can detect drift even when error rates remain stable. (2) Any change in the error rate will lead to a corresponding change in the PU-index. These properties make the PU-index a more sensitive and ro- bust indicator for drift detection compared to existing methods. We also propose a PU-index-based Drift Detector (PUDD) that employs a novel Adaptive PU-index Bucketing algorithm for detecting drift. Empirical evaluations on both synthetic and real-world datasets demonstrate PUDD’s efficacy in detecting drift in structured and image data. Code — https://github.com/RocStone/PUDD Introduction In real-world applications, such as medical triage (Huggard et al. 2020) or time series forecasting tasks (Miyaguchi and Kajino 2019), the distribution of data may unpre- dictably change over time. This phenomenon, termed concept drift (Yuan et al. 2022), significantly degrades model perfor- mance. Moreover, drift also manifests between clients and servers in federated learning tasks (Jiang, Wang, and Dou 2022), or decision making process (Lu et al. 2020) further complicating the learning process. Error rate-based drift de- tection is one of the most popular approaches to handling concept drift due to its efficiency (Lu et al. 2018a). It contin- uously monitors the classifier’s error rate, issuing an alarm when this rate exceeds a preset threshold (Raab, Heusinger, and Schleif 2020; Frias-Blanco et al. 2014). However, in the early stages of concept drift, a model’s error rate may remain stable. In such a case, error rate-based drift signals are unable to indicate the changes in data dis- tribution. To address this, we explore alternative methods to *†Corresponding author. Copyright ©2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.detect distribution changes before a drop in error rate occurs. Upon rethinking the logic behind error rate-based drift sig- nals, we realize that the distribution of prediction probability from a given model will change prior to the error rate itself. In addition, if the error rate does change, then the distribution of prediction probability must change too. Prediction probability is just an example of a quantita- tive measure of predictive uncertainty in models. We believe that by clearly defining predictive uncertainty and demon- strating that it is a superior alternative to error rates, we can significantly enhance the sensitivity and robustness of drift detection. Therefore, we introduce the Prediction Uncertainty Index (PU-index), which measures the probability assigned by a classifier that an instance does not belong to the true class. Fig. 1 shows an illustrative example. The objective of this paper is to address two key questions: (1) Can we identify a more effective drift detection signal that captures changes in data distribution when the error rate remains stable? (2) Can we theoretically prove that if the drift detection signal shows no significant change, then the model’s error rate will also remain stable? In other words, we seek to develop a superior drift detection signal that can detect drift when the error rate cannot, and to prove that if the new signal fails to detect drift, the error rate will also fail to do so. We believe that these criteria are essential for evaluating and comparing different drift detection signals. The theoretical results provided in this paper show strong evidence for the superiority of the PU-index as a metric for concept drift detection. It exhibits at least equivalent sensitiv- ity to error-based metrics and potentially higher sensitivity in certain scenarios, rendering it a more robust and comprehen- sive measure for identifying concept drift. However, it could be argued that a more sensitive detection method might result in a higher false alarm rate, potentially overreacting to minor fluctuations in model performance. If there is no significant drift in the error rate, why should we still seek to detect it? To mitigate such concern, we employ the Chi-square test, a robust statistical significance test, to the PU-index for drift detection. This approach helps distinguish between meaningful distributional shifts and inconsequential variations, thereby maintaining the benefits of increased sen- sitivity while minimizing false alarms. The p-value obtained from the Pearson’s Chi-square test serves as a precise control mechanism for our tolerance to false alarms. By adjusting thearXiv:2412.11158v1 [cs.LG] 15 Dec 2024 Page 2: Figure 1: Illustrative example of the early stage of concept drift when an error rate-based detector fails to detect concept drift, but a prediction uncertainty-based detector can. The data around decision boundaries have been highlighted in the middle of the figure, showing the distribution gap between test sets 1 and 2. Such a gap implies concept drift occurrence. However, in this case, the error rates of the two test sets are the same. We also provide a theoretical proof in the Appendix to demonstrate the existence of such a case. By contrast, the distribution of prediction uncertainty has changed. The example implies that a prediction uncertainty-based detector can detect drift when an error rate-based detector fails. significance level ( α), we can directly modulate the trade-off between sensitivity and false positive rate. In this paper, we introduce PUDD, a drift detector that uses the Prediction Uncertainty (PU) index to identify con- cept drift. PUDD employs a sliding window approach to remove outdated data and split the historical stream into two samples. It then applies an Adaptive PU-index Bucketing algorithm to automatically construct histograms that meet our theoretical conditions. Using these histograms, we apply Pearson’s Chi-square test to determine if drift has occurred. Our main contributions are: 1.To the best of our knowledge, this is the first systematic study that compares two different drift detection signals with theoretical analysis. 2.We propose a novel drift detection metric called the PU- index, which is theoretically proven always to outperform error rate-based drift measurements. This provides crucial insight into the PU-index as a more sensitive and robust alternative for concept drift detection. 3.To identify concept drift in streaming data through the PU- index, we propose a Prediction Uncertainty index base drift detector. It comprises an Adaptive PU-index Buck- eting algorithm to build a histogram for the PU-index, which meets the condition of our theoretical analysis, to conduct the Pearson’s Chi-square test and detect drift. Literature Review In this section, we examine two approaches to concept drift detection: data distribution-based and error rate-based methods. The former directly addresses the root cause of drift—changes in the data distribution—while the latter fo-cuses on variations in the model’s performance, often achiev- ing higher computational efficiency. Data Distribution-based Methods Data distribution-based methods measure shifts in the un- derlying distribution. For instance, a statistical density es- timation approach is proposed in (Song et al. 2007), en- abling the quantification of differences between two samples. Histogram-based techniques frequently serve to represent dis- tributions in high-dimensional feature spaces (Liu et al. 2017). For example, (Boracchi et al. 2018) and (Yonekawa, Saito, and Kurokawa 2022) introduce hierarchical and dynamically adjustable strategies to construct histograms, respectively. Interval formation can also rely on methods like QuadTree (Coelho, Torres, and de Castro 2023) and K-means clustering (Liu, Lu, and Zhang 2020). Beyond direct histogram or den- sity estimation, some methods incorporate contextual factors (Lu et al. 2018b) or anticipate future distributions. For ex- ample, (Cobb and Van Looveren 2022) uses a context-based CoDiTE function (Park et al. 2021) to detect drift, while (Li et al. 2022) exploits a predictive model for future distribu- tions. There are also approaches that leverage Graph Neural Networks to track and adapt to distribution changes directly (Zhou et al. 2023). Although effective, these strategies can be computationally expensive in high-dimensional data streams (Souza et al. 2021). To date, no existing work uses histograms constructed from prediction uncertainty for drift detection, leaving a notable gap in the literature. To the best of our knowledge, there is no previous work that proposes building a histogram of prediction uncertainty to detect concept drift. Page 3: Error rate-based methods Error rate-based detectors are well-studied and computation- ally efficient. Approaches such as (Gama et al. 2004), (Baena- Garcıa et al. 2006), and (Frias-Blanco et al. 2014) monitor variations in model error rates to detect drift. Adaptive win- dow resizing is explored in (Bifet and Gavalda 2007), and forgetting mechanisms are introduced in (Jiao et al. 2022) to weight classifiers dynamically. More recent strategies apply Gaussian Mixture Models to compare windows (Yu et al. 2024) or enter reactive states upon detecting alarms (Tah- masbi et al. 2021). Despite their efficiency, error rate-based detectors struggle to identify drift when accuracy remains sta- ble, particularly during its early stages (see Fig. 1). To address this limitation, we propose a prediction uncertainty-based ap- proach, capable of detecting shifts even before error rates degrade. This method enhances early detection capabilities and complements existing drift detection strategies. Preliminaries Pearson’s Chi-square test The Pearson’s Chi-square test assesses whether two categor- ical variables are independent. Its null hypothesis assumes independence, and if the computed p-value falls below a chosen significance level, the null hypothesis is rejected, in- dicating potential dependence between the variables. The test’s reliability depends on having sufficiently large observed and expected frequencies. As noted in (EP 1978), the Chi-square test produces valid results when each observed count exceeds 50 and every expected count exceeds 5. Under these conditions, the distribution of the test statistic closely approximates a normal distribution, enhancing the test’s va- lidity. The test relies on a contingency table. For the cell in the i-th row and j-th column, Oijdenotes the observed frequency, and its expected frequency is given by: Eij=ni×nj N, (1) where niandnjare the cumulative frequencies of the re- spective row and column, and Nis the sum of the table. The Chi-square test statistic is derived as: χ2=X iX j O2 ij Eij! −N. (2) Correspondingly, the p-value associated with χ2is: p= 1−Zχ2 0xw 2−1·e−x 2 2w 2·Γw 2dx, (3) where w indicates the degrees of freedom, calculated as: w= (number of columns −1)×(number of rows −1).(4) Space Partitioning Algorithms Space partitioning algorithms are widely studied to build his- tograms to establish density estimators (Silverman 2018). The key is to split feature space into partitions and count the in- stances falling into it to build a histogram. The partitions canbe built by QuanTree (Boracchi et al. 2018), Kernel Quant- Tree (Stucchi et al. 2023), or Neural Network (Yonekawa, Saito, and Kurokawa 2022). Particularly, we introduce the Ei-kMeans space partitioning algorithm (Liu, Lu, and Zhang 2020), which can automatically determine the number and size of partitions. Given two samples AandB, Ei-kMeans initializes cen- troids by iteratively selecting N/K points from a copy of A, where N=|A|andKis the hyperparameter of the kMeans algorithm. Each iteration: (1) Select the point in Awith the largest 1-NN distance as zi. (2) Removes ziand its N/K - nearest neighbors from A. This process repeats N/K times, yielding N/K initial centroids. Then the kMeans algorithm is applied to Awith the initial centroids to derive Kclusters denoted as {Ci|i∈[1,N K]}. To ensure that the number of examples in each cluster is larger than 5 to be able to conduct the Chi-square test, an amplify-shrink algorithm is proposed to adjust the num- ber of examples in each cluster. Let us denote the number of instances in the clusters as V={Ci||i∈[1,N K]}. The distance matrix between the examples in Aand the cen- ters of each cluster is denoted as Mdist∈RN×K, which is amplified by: Mdist=Mdist⊙ 1·eθ·(V N−1) ,where 1∈RN×1is an all-ones matrix, θdenotes the hyperparam- eter controlling the shape of the coefficient function, . Let Mij distdenote the amplified distance between i-th data Ai andj-th center cj, the assigned cluster for Aiis defined as yi= arg min j=1...KMij dist. After the amplify-shrink algorithm, the final clusters are derived and can be considered as subspaces in the feature space of A. The numbers of examples of AandBin the clusters are counted to form a histogram. Methodology This section sets up the problem and provides a theoretical analysis demonstrating the advantages of the PU-index over error rate-based methods. We then introduce a novel slid- ing window strategy and an Adaptive PU-index Bucketing algorithm for concept drift detection and adaptation. Problem Setup Formally, we represent the streaming data collected dur- ing the period [1, t]asD1,t={(xj, yj)|j∈[1, t]}. If the data is collected in chunks, then the stream includes a set of chunks D1,t={¯Dj|j∈[1, t]}, where each chunk ¯Dj={(xjk, yjk)|k∈[1, M]}includes Mexamples. Here, xjkrepresents an instance with ddimensional attributes, yjk denotes the corresponding label, and Mdenotes the chunk size. In this paper, we focus only on the data collected in chunks. If the stream D1,tfollows a distribution P1,t(x, y), following (Lu et al. 2018a), we claim that a drift occurs at timet+ 1if P1,t(x, y)̸=Pt,∞(x, y). (5) The goal of concept drift detection is to raise an alarm at time t+ 1 when the distribution of data changes. The most popular metric for detecting the distribution change is prediction error. For a classifier f, assuming the number of Page 4: Figure 2: The framework of our proposed algorithm. The sliding window strategy has two components, i.e., antiquated data discard and cutting point exploration as shown on the left. The Adaptive PU-index Bucketing algorithm is shown in the middle. The drift detection process is shown on the right. classes is n, the classifier will output a prediction probability for each class ˆy∈RnandPn i=1ˆy= 1. The prediction error of an instance xiis defined as: ei=I(ˆyj= arg max jfj(xi)̸=yi), (6) where I(·)is the indicator function. As we mentioned earlier, our motivation is that the predic- tion probability will intuitively change before the prediction error when drift occurs. In this paper, we measure the predic- tion probability by the PU-index which is defined as: ui= 1−fyi(xi), (7) where fyi(xi)denotes the probability predicted by the classi- fier that xibelongs to the ground truth class yi. Theoretical Analysis of Error Rate and PU-index To rigorously evaluate the efficacy of these two metrics for concept drift detection, we conduct a theoretical comparison from two complementary perspectives. (1) When the PU- index distribution remains stable, potentially failing to detect concept drift, we investigate whether the error rate distribu- tion exhibits changes that could indicate drift. (2) Conversely, when the error rate distribution remains constant, we examine whether the PU-index distribution demonstrates changes that might reveal underlying changes in the data stream. Theorem 1. LetW1andW2be two windows of a data stream in a multi-class classification problem. If their respective PU-index histograms H1andH2are identical, where the histograms are constructed such that the first bin contains all misclassified instances and the remaining bins partition the misclassified instances, then the error rates and error standard deviations of W1andW2are equal.Theorem 2. Given a multi-class classification problem, if two windows have equal error standard deviations or error rates, their PU-index histograms, where the first bin contains all correctly classified instances and the remaining bins par- tition the misclassified instances, may not have identical bin proportions. Due to the page limit, the proofs are provided in the Ap- pendix. These theorems lead to the following conclusions: (1) Theorem 1 demonstrates that when the PU-index distribution remains stable, the error rate and the error standard deviation also remain constant. This implies that if the PU-index fails to detect concept drift, error-based metrics will also fail to detect it. (2) Theorem 2 establishes that even when error rates and error standard deviations are equal between two win- dows, the PU-index distributions may differ. This suggests that the PU-index has the potential to detect subtle changes in the data distribution that are not captured by traditional error-based metrics. These findings show that the PU-index offers at least the same sensitivity as error-based metrics and potentially higher sensitivity in certain scenarios, making it a more robust and comprehensive measure for detecting concept drift in streaming data environments. Sliding Window Strategy and Adaptive PU-index Bucketing Algorithm To detect concept drift using the PU-index without making it overly sensitive, we apply the Chi-square test to examine the PU-index distribution. The hypotheses are: Null Hypothesis ( H0): The PU-index distribution does not change over time, indicating no concept drift. Alternative Hypothesis ( H1): The PU-index distribution changes over time, indicating concept drift. If the Chi-square statistic exceeds the critical value at the chosen significance level, we reject H0and conclude that Page 5: concept drift has occurred. Otherwise, we detect no signif- icant drift. Thus, detecting concept drift using Chi-square involves two key steps: (1) partitioning the data stream into two windows and (2) constructing a histogram of the col- lected PU-index values. We adopt a sliding window strategy to handle online streaming data. Let D1,t={¯Dj|j∈[1, t]}denote PU-index chunks collected from the start of the stream, potentially from different distributions. For instance, suppose D1,t1and Dt1,tdiffer in distribution. If a new chunk ¯Dt+1matches the distribution of Dt1,t, no drift should be detected. However, keeping antiquated data D1,t1could trigger a false alarm, since D1,t1andDt1,t+1differ in distribution. To solve this ”antiquated distribution” problem, we dis- card outdated data after detecting a drift at time t1. Subse- quent drift detection uses only Dt1,t+1, avoiding false alarms caused by old data. After discarding antiquated data, we must determine how to form two windows on the current substream. We do this by exploring all possible cutting points r∈[t1, t+ 1]. Thus, Dt1,t+1is split into Dt1,randDr,t+1 for the Adaptive PU-index Bucketing algorithm. The sliding window is illustrated on the left side of Fig. 2. Theoretical analysis shows that misclassified instances’ PU-indices must be grouped into the same bin. For counter- part, we use Ei-kMeans to form bins that meet the Chi-square test requirements. We call this the Adaptive PU-index Buck- eting algorithm, illustrated in the middle of Fig. 2. PU-index based Drift Detector In this subsection, we introduce the overall of our method. Firstly, given a substream containing the recent chunks’ PU- index ut1,t, we explore all cutting points r∈[t1, t]. Based on the cutting points, we have t−t1window pairs, denoted asut1,r, andur,t. Then we defined the PU-index pairs for correctly and wrongly classified instances as: uC t1,r={u|ui∈ut1,r∧ˆyi=yi}, (8) uC r,t={u|ui∈ur,t∧ˆyi=yi}, (9) uM t1,r={u|ui∈ut1,r∧ˆyi̸=yi}, (10) uM r,t={u|ui∈ur,t∧ˆyi̸=yi}. (11) These four equations denote the PU-index of correctly clas- sified instances on the first and second window, and the PU- index of misclassified instances on the first and second win- dow respectively. Next, we compute the contingency table T∈R2×(K+1), where Kis a hyperparameter in Ei-kMeans. To calculate T, firstly, we apply the Adaptive PU-index Buck- eting algorithm on uC t1,rto build a histogram. Then we count the instances from uC t1,rfalling into the bins of the histogram and fill them into T1iwhere i∈[1, K]. Likewise, we count the examples of uC r,tfalling into the previously obtained bins and fill them into T2i. Finally, we fill the T1,K+1, andT2,K+1 by the size of uM t1,randuM r,t. Then we define the expected frequency of Tijas: Eij=PK+1 j=1Tij×P2 i=1TijP ijTij. (12) Figure 3: Illustrative example of generating CIFAR-10-CD through the Markov process. The classes marked in red rep- resent the user’s initial interest and are considered positive labels. All other classes are considered negative labels. The Chi-square test statistic is defined as: χ2=X iX j T2 ij Eij! −X ijTij. (13) Finally, the p-value is computed by: p= 1−Zχ2 0xK 2−1·e−x 2 2K 2·ΓK 2dx, (14) Based on the Equation (12-14), we can compute the p-value for each window pair Dt1,r, and Dr,t. If the minimum p- value among all window pairs is smaller than a predefined threshold σ, we raise a drift detected alarm. It is important to clarify that the specified threshold controls the Type I error rate for each individual window pair test rather than the overall Type I error across the entire substream. Consequently, we do not employ multiple comparison adjustments since our statistical guarantees apply at the single-test level rather than the family-wise level. The pseudo-code and time complexity analysis is provided in the Appendix. Experiments In this section, we introduce the settings and results of the experiments in our paper. The details of implementation, datasets, baselines, and the critical difference diagrams (Is- mail Fawaz et al. 2019) for the experiments in this paper are introduced in the Appendix. Datasets and Baselines We propose CIFAR-10-CD, a synthetic concept drift im- age dataset with an illustrative example in Fig. 3, to sim- ulate user interests changing via a Markov process. Initially, three CIFAR-10 classes are marked positive, with interest shifts occurring probabilistically (e.g. 1% chance of Plane to Horse transfer). Our experiments utilize 3 real-world datasets (airline(Ikonomovska 2011), elec2(Harries 1999), powersup- ply(Dau et al. 2019)) and 4 synthetic sets (sine(Gama et al. Page 6: Incremental Training Training only at Initialization or Adaptation Classifier ddm name airline-I elec2-I mixed-I ps-I sea0-I sine-I airline-O elec2-O mixed-O ps-O sea0-O sine-O DNNADWIN 61.65 71.94 78.45 71.12 97.70 79.27 59.36 69.71 84.41 65.95 95.18 86.59 DDM 61.29 71.02 80.72 71.37 96.86 86.41 56.79 69.51 83.03 69.35 93.55 80.32 EDDM 62.13 70.39 78.01 65.21 96.89 75.34 61.17 69.22 71.16 67.57 92.18 76.54 HDDM-A 62.70 71.16 76.70 68.97 97.84 81.65 59.13 68.98 84.27 67.81 95.11 86.82 HDDM-W 61.50 71.23 77.07 68.74 97.84 85.43 62.42 68.48 84.32 66.89 92.24 86.57 KSWIN 63.02 70.56 78.58 70.80 97.88 79.00 61.34 69.08 84.43 66.46 91.60 83.66 PH 62.15 72.25 79.49 70.86 97.73 78.07 60.36 68.24 84.36 68.83 95.21 86.88 PUDD-1 63.31 74.92 77.39 72.25 97.94 86.19 60.90 69.35 82.65 71.47 94.89 83.39 PUDD-3 63.21 74.93 80.05 72.23 98.04 85.12 60.16 68.98 84.65 70.37 95.99 84.97 PUDD-5 63.35 74.92 82.81 72.24 98.23 82.51 60.19 68.68 84.90 70.20 96.29 85.09 GNBADWIN 50.17 68.90 83.95 70.06 94.18 82.49 54.66 68.30 83.62 68.78 93.93 82.45 DDM 52.94 67.75 83.82 69.63 93.97 82.07 52.43 67.60 83.59 67.52 93.48 81.90 EDDM 62.72 67.73 83.19 70.04 94.06 83.12 54.11 67.64 74.65 70.04 94.06 73.88 HDDM-A 52.80 67.73 83.92 70.87 94.28 83.25 55.62 67.73 83.66 71.24 93.96 83.36 HDDM-W 48.66 67.73 83.91 71.06 92.11 82.83 48.62 67.71 83.60 69.53 91.63 83.25 KSWIN 49.84 67.87 83.92 71.23 91.72 81.88 48.83 67.63 83.59 67.99 90.02 81.32 PH 49.35 70.12 83.88 70.36 94.34 83.54 49.02 70.04 83.61 68.67 94.12 83.10 PUDD-1 53.57 70.85 82.99 71.88 94.61 83.12 51.05 62.76 79.23 71.13 94.25 81.48 PUDD-3 53.03 70.85 83.92 71.59 94.81 83.39 49.45 59.32 83.58 71.20 94.60 83.80 PUDD-5 52.16 70.69 84.12 71.59 94.85 83.43 54.37 59.44 83.96 70.40 94.62 83.38 VFDTADWIN 60.39 73.98 84.30 71.69 94.77 87.11 61.22 74.29 83.54 68.78 92.96 85.74 DDM 60.16 74.82 84.14 70.68 94.86 86.50 59.28 74.75 82.31 67.53 93.63 82.23 EDDM 61.19 73.81 83.15 70.06 94.17 85.59 62.31 73.81 73.73 70.06 93.60 77.72 HDDM-A 60.95 73.90 84.40 70.84 95.20 87.53 60.29 73.83 83.66 71.24 93.93 85.21 HDDM-W 61.11 73.80 84.40 70.99 93.50 87.45 61.92 73.73 83.59 69.54 91.76 85.28 KSWIN 61.30 74.10 84.42 71.27 93.40 86.22 62.06 74.10 83.57 67.51 89.84 82.60 PH 60.95 73.70 83.56 70.88 94.69 87.15 60.97 73.99 83.30 68.67 93.85 85.19 PUDD-1 61.38 73.86 84.25 71.77 95.13 87.33 61.16 69.79 82.13 71.13 94.10 82.21 PUDD-3 61.57 73.84 83.94 71.79 95.21 87.42 59.90 69.79 84.04 71.20 94.63 85.81 PUDD-5 61.57 73.64 84.01 71.79 95.24 87.63 57.04 71.83 84.16 70.40 94.56 86.01 Table 1: Comparative analysis against classic drift detectors across 3 synthetic and 3 real-world datasets. The top 3 results are highlighted in bold and the top 1 results are in both bold and underlined. PUDD- xrepresents the threshold set as 10−xfor our method. The ps is short of powersupply dataset. Results for dataset sea10 and sea20 is provided in Appendix. 2004), mixed(Gama et al. 2004), CIFAR-10-CD, sea vari- ants(Bifet et al. 2010)). We compare against 7 classic detec- tors (ADWIN(Bifet and Gavalda 2007), DDM(Gama et al. 2004), EDDM(Baena-Garcıa et al. 2006), HDDM-A(Frias- Blanco et al. 2014), HDDM-W(Frias-Blanco et al. 2014), KSWIN(Raab, Heusinger, and Schleif 2020), PH(Sebasti ˜ao and Fernandes 2017)) and 5 SOTA methods (MCDD(Wan, Liang, and Yoon 2024), AMF(Mourtada, Ga ¨ıffas, and Scor- net 2021), IWE(Jiao et al. 2022), NS(Wang et al. 2021), and ADLTER(Wang et al. 2022)). Comparison with Baselines and Ablation Studies In this subsection, we compare our method with 7 classic drift detectors and 5 SOTA methods on 9 datasets (including a vari- ant of the SEA dataset). Due to page constraints, results for SEA10 and SEA20 appear in the Appendix. We evaluate all methods using three classifiers—DNN (architecture detailed in the Appendix), Gaussian Naive Bayes (GNB) (Virtanen et al. 2020), and VFDT (Hulten, Spencer, and Domingos 2001)—under two training regimes: incremental (dataset-I)and one-time training at an alarm (dataset-O). Results for the comparison with classic detectors are pre- sented in Table 1, and those for SOTA methods are given in Table 2. For CIFAR-10-CD, due to its learning complexity, we only use incremental training and report results in Fig. 4. Our method is denoted as PUDD-X , where X represents the exponent in 10−X. Based on these experiments, we derive 6 observations. We introduce 4 of them here and leave the remaining 2 in the Appendix. Observation 1: our method shows stronger perfor- mance compared to classic drift detectors as evidenced by the results presented in Table 1, Fig. 4, and additional results in Appenidx. In incremental learning settings, PUDD ranks first in 17 out of 24 cases across different datasets and classifiers, and it is in the top 3 in 20 out of 24 cases. When trained only initially or with adaptation, it still performs well. It achieves first rank in 15 cases and top 3 in 19 cases. This shows that PUDD is particularly effective with incremental training. Results in Fig. 4 show PUDD outperformed all the baselines, which demonstrates the superiority of our method Page 7: ddm name airline elec2 mixed ps sea0 sine AMF 38.56 66.24 49.49 69.63 93.67 49.52 IWE 38.02 68.90 49.47 64.10 93.14 49.51 NS 67.91 76.42 81.09 72.39 93.54 91.01 ADLTER 70.00 76.10 87.63 72.48 93.40 92.18 MCD-DD 63.65 69.81 86.68 71.66 97.66 90.21 PUDD-1 63.78 77.28 89.51 72.68 98.47 94.52 PUDD-3 64.62 76.77 89.47 72.79 98.44 94.76 PUDD-5 64.45 76.92 89.37 72.74 98.49 90.90 Table 2: Comparison with SOTA methods. The dataset ps is short for powersupply. The results for dataset sea10 and sea20 is provided in Appnedix. The table shows that our methods PUDD in achieved top-1 in 5 out of 6 datasets, implying the effectiveness of PUDD compared with SOTA methods. in detecting the concept drift in the image dataset. Observation 2: PUDD performs better with a smaller threshold as revealed in Table 1 and additional results in Ap- pendix. In incremental learning scenarios, PUDD-1, PUDD-3, and PUDD-5 achieve top 1 in 5, 5, and 8 cases respectively. When tested in training only once until alarm way, PUDD-1, PUDD-3, and PUDD-5 achieved top 1 in 2, 6, and 8 cases re- spectively. PUDD consistently shows improved performance at lower thresholds in both scenarios. As detailed in the Sen- sitivity of PU-index section, a drift alarm triggers when the p- value is below the threshold, with lower thresholds indicating stricter conditions for alarm detection. Therefore, PUDD’s better performance with smaller thresholds suggests a high sensitivity to drift. Observation 3: PUDD shows very competitive perfor- mance compared to SOTA methods. As shown in Table 2 and additional results in Appendix, our method attains the top rank in 7 out of 8 cases. On certain datasets, this im- provement is particularly pronounced. For instance, PUDD-5 achieves a 98.49% accuracy, which is 2.8% higher than the best SOTA method. The only exception occurs in the airline dataset, where NS and ADLTER outperform PUDD. This discrepancy can be explained by the airline dataset’s tabular nature and its numerous attributes, which are more effectively modeled through tree-based ensemble learning utilized by these SOTA methods. Moreover, these methods adapt to drift by adjusting ensembles rather than discarding and retraining them. In contrast, PUDD relies on retraining classifiers solely on recent data, which may not be suitable for attribute-rich datasets like the airline dataset. Nevertheless, for all other datasets, the results confirm that PUDD surpasses SOTA methods, thereby underscoring its overall superiority. Observation 4: The Adaptive PU-index Bucketing algo- rithm outperforms Ei-kMeans. Figure 5 shows that PUDD surpasses Ei-kMeans across various datasets, classifier train- ing methods, and threshold settings. The critical difference diagram in the Appendix shows that improvements at thresh- olds10−3and10−5are statistically significant. In summary, these results confirm the theoretical bene- fits of the PU-index for drift detection. PUDD outperforms both classic and SOTA detectors, and the Adaptive PU-index Bucketing algorithm shows significant improvements over Figure 4: Comparison with baselines on CIFAR-10-CD, ex- cluding methods unable to detect drift in image datasets. Figure 5: Accuracy comparison between PUDD (using Adap- tive PU-index Bucketing) and Ei-kMeans (EK). We show average accuracy across 9 datasets using 3 classifiers. Ei-kMeans. This validates the PU-index as a sensitive, robust indicator capable of detecting drift even when error rates remain unchanged, thereby overcoming a major shortcoming of error rate-based approaches. Conclusion and Future Work In our study, we demonstrated that the PU-index, as opposed to the error rate, is a more effective measure for detecting concept drift in machine learning models. We utilized the Adaptive PU-index bucketing algorithm to partition the PU- index and the Chi-square test to detect concept drift. We also introduced a technique for inducing concept drift in image datasets by simulating changes in user interest. We validated our method through experiments on both synthetic and real- world datasets. Future work should focus on automating drift alarm threshold determination, as current methods rely on manual settings that may not remain optimal over time. Our research also uncovers a method for generating multi-stream concept drift in image datasets by emulating shifts in user interests using Markov matrices, offering valuable insights for research in multistream concept drift learning. Acknowledgments This work was supported by the Australian Research Council through the Laureate Fellow Project under Grant FL190100149. Page 8: References Baena-Garcıa, M.; del Campo- ´Avila, J.; Fidalgo, R.; Bifet, A.; Gavalda, R.; and Morales-Bueno, R. 2006. Early drift de- tection method. In Fourth international workshop on knowl- edge discovery from data streams , volume 6, 77–86. Citeseer. Bifet, A.; and Gavalda, R. 2007. Learning from time- changing data with adaptive windowing. In Proceedings of the 2007 SIAM international conference on data mining , 443–448. SIAM. Bifet, A.; Holmes, G.; Pfahringer, B.; Kranen, P.; Kremer, H.; Jansen, T.; and Seidl, T. 2010. Moa: Massive online analysis, a framework for stream classification and clustering. In Proceedings of the first workshop on applications of pattern analysis , 44–50. PMLR. Boracchi, G.; Carrera, D.; Cervellera, C.; and Maccio, D. 2018. QuantTree: Histograms for change detection in multi- variate data streams. In International Conference on Machine Learning , 639–648. PMLR. Cobb, O.; and Van Looveren, A. 2022. Context-aware drift detection. In International conference on machine learning , 4087–4111. PMLR. Coelho, R. A.; Torres, L. C. B.; and de Castro, C. L. 2023. Concept Drift Detection with Quadtree-based Spatial Map- ping of Streaming Data. Information Sciences . Dau, H. A.; Bagnall, A.; Kamgar, K.; Yeh, C.-C. M.; Zhu, Y .; Gharghabi, S.; Ratanamahatana, C. A.; and Keogh, E. 2019. The UCR time series archive. IEEE/CAA Journal of Automatica Sinica , 6(6): 1293–1305. EP, B. G. 1978. Statistics for Experimenters: An Introduction to Design. Data Analysis, and Model Building , 43–48. Frias-Blanco, I.; del Campo- ´Avila, J.; Ramos-Jimenez, G.; Morales-Bueno, R.; Ortiz-Diaz, A.; and Caballero-Mota, Y . 2014. Online and non-parametric drift detection methods based on Hoeffding’s bounds. IEEE Transactions on Knowl- edge and Data Engineering , 27(3): 810–823. Gama, J.; Medas, P.; Castillo, G.; and Rodrigues, P. 2004. Learning with drift detection. In Advances in Artificial Intelligence–SBIA 2004: 17th Brazilian Symposium on Arti- ficial Intelligence, Sao Luis, Maranhao, Brazil, September 29-Ocotber 1, 2004. Proceedings 17 , 286–295. Springer. Harries, M. 1999. Splice-2 comparative evaluation: electricity pricing (technical report unsw-cse-tr-9905). Artificial Intelli- gence Group, School of Computer Science and Engineering, The University of New South Wales, Sydney , 2052. He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , 770– 778. Huggard, H.; Koh, Y . S.; Dobbie, G.; and Zhang, E. 2020. Detecting concept drift in medical triage. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval , 1733–1736. Hulten, G.; Spencer, L.; and Domingos, P. 2001. Mining time- changing data streams. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining , 97–106.Ikonomovska, E. 2011. Airline dataset. URL http://kt. ijs. si/elena ikonomovska/data. html.(Accessed on 02/06/2020) . Ismail Fawaz, H.; Forestier, G.; Weber, J.; Idoumghar, L.; and Muller, P.-A. 2019. Deep learning for time series classi- fication: a review. Data Mining and Knowledge Discovery , 33(4): 917–963. Jiang, M.; Wang, Z.; and Dou, Q. 2022. Harmofl: Harmo- nizing local and global drifts in federated learning on het- erogeneous medical images. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 36, 1087–1095. Jiao, B.; Guo, Y .; Yang, C.; Pu, J.; Zheng, Z.; and Gong, D. 2022. Incremental Weighted Ensemble for Data Streams with Concept Drift. IEEE Transactions on Artificial Intelligence . LeCun, Y .; Bengio, Y .; and Hinton, G. 2015. Deep learning. nature , 521(7553): 436–444. Li, W.; Yang, X.; Liu, W.; Xia, Y .; and Bian, J. 2022. Ddg- da: Data distribution generation for predictable concept drift adaptation. In Proceedings of the AAAI Conference on Artifi- cial Intelligence , volume 36, 4092–4100. Liu, A.; Lu, J.; and Zhang, G. 2020. Concept drift detec- tion via equal intensity k-means space partitioning. IEEE transactions on cybernetics , 51(6): 3198–3211. Liu, A.; Song, Y .; Zhang, G.; and Lu, J. 2017. Regional concept drift detection and density synchronized drift adap- tation. In IJCAI International Joint Conference on Artificial Intelligence . Lu, J.; Liu, A.; Dong, F.; Gu, F.; Gama, J.; and Zhang, G. 2018a. Learning under concept drift: A review. IEEE trans- actions on knowledge and data engineering , 31(12): 2346– 2363. Lu, J.; Liu, A.; Song, Y .; and Zhang, G. 2020. Data-driven decision support under concept drift in streamed big data. Complex & intelligent systems , 6(1): 157–163. Lu, J.; Xuan, J.; Zhang, G.; and Luo, X. 2018b. Structural property-aware multilayer network embedding for latent fac- tor analysis. Pattern Recognition , 76: 228–241. Miyaguchi, K.; and Kajino, H. 2019. Cogra: Concept-drift- aware stochastic gradient descent for time-series forecasting. InProceedings of the AAAI Conference on Artificial Intelli- gence , volume 33, 4594–4601. Montiel, J.; Halford, M.; Mastelini, S. M.; Bolmier, G.; Sourty, R.; Vaysse, R.; Zouitine, A.; Gomes, H. M.; Read, J.; Abdessalem, T.; et al. 2021. River: machine learning for streaming data in python. Journal of Machine Learning Research , 22(110): 1–8. Mourtada, J.; Ga ¨ıffas, S.; and Scornet, E. 2021. AMF: Ag- gregated Mondrian forests for online learning. Journal of the Royal Statistical Society Series B: Statistical Methodology , 83(3): 505–533. Park, J.; Shalit, U.; Sch ¨olkopf, B.; and Muandet, K. 2021. Conditional distributional treatment effect with kernel con- ditional mean embeddings and U-statistic regression. In International Conference on Machine Learning , 8401–8412. PMLR. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V .; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, Page 9: R.; Dubourg, V .; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit- learn: Machine Learning in Python. Journal of Machine Learning Research , 12: 2825–2830. Raab, C.; Heusinger, M.; and Schleif, F.-M. 2020. Reactive soft prototype computing for concept drift streams. Neuro- computing , 416: 340–351. Sebasti ˜ao, R.; and Fernandes, J. M. 2017. Supporting the page-hinkley test with empirical mode decomposition for change detection. In International Symposium on Method- ologies for Intelligent Systems , 492–498. Springer. Silverman, B. W. 2018. Density Estimation for Statistics and Data Analysis . Monographs on statistics and applied probability. Boca Raton, FL: CRC Press. ISBN 1-351-45617- 2. Song, X.; Wu, M.; Jermaine, C.; and Ranka, S. 2007. Statisti- cal change detection for multi-dimensional data. In Proceed- ings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining , 667–676. Souza, V . M.; Parmezan, A. R.; Chowdhury, F. A.; and Mueen, A. 2021. Efficient unsupervised drift detector for fast and high-dimensional data streams. Knowledge and Information Systems , 63: 1497–1527. Stucchi, D.; Rizzo, P.; Folloni, N.; and Boracchi, G. 2023. Kernel quanttree. In International Conference on Machine Learning , 32677–32697. PMLR. Tahmasbi, A.; Jothimurugesan, E.; Tirthapura, S.; and Gib- bons, P. B. 2021. Driftsurf: Stable-state/reactive-state learn- ing under concept drift. In International Conference on Machine Learning , 10054–10064. PMLR. Valiant, L. G. 1984. A theory of the learnable. In An- nual ACM Symposium on Theory of Computing: Proceed- ings of the sixteenth annual ACM symposium on Theory of computing , 436–445. New York, NY , USA: ACM. ISBN 0897911334. Virtanen, P.; Gommers, R.; Oliphant, T. E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; van der Walt, S. J.; Brett, M.; Wilson, J.; Millman, K. J.; Mayorov, N.; Nelson, A. R. J.; Jones, E.; Kern, R.; Larson, E.; Carey, C. J.; Polat, ˙I.; Feng, Y .; Moore, E. W.; VanderPlas, J.; Laxalde, D.; Perktold, J.; Cimrman, R.; Henriksen, I.; Quintero, E. A.; Harris, C. R.; Archibald, A. M.; Ribeiro, A. H.; Pedregosa, F.; van Mulbregt, P.; and SciPy 1.0 Contributors. 2020. SciPy 1.0: Fundamen- tal Algorithms for Scientific Computing in Python. Nature Methods , 17: 261–272. Wan, K.; Liang, Y .; and Yoon, S. 2024. Online Drift Detection with Maximum Concept Discrepancy. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining . Wang, K.; Lu, J.; Liu, A.; Song, Y .; Xiong, L.; and Zhang, G. 2022. Elastic gradient boosting decision tree with adaptive iterations for concept drift adaptation. Neurocomputing , 491: 288–304. Wang, K.; Lu, J.; Liu, A.; Zhang, G.; and Xiong, L. 2021. Evolving gradient boost: A pruning scheme based on lossimprovement ratio for learning under concept drift. IEEE Transactions on Cybernetics . Yonekawa, K.; Saito, K.; and Kurokawa, M. 2022. RIDEN: Neural-based Uniform Density Histogram for Distribution Shift Detection. In Proceedings of the Second International Conference on AI-ML Systems , 1–9. Yu, E.; Lu, J.; Zhang, B.; and Zhang, G. 2024. Online Boost- ing Adaptive Learning under Concept Drift for Multistream Classification. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 38, 16522–16530. Yuan, L.; Li, H.; Xia, B.; Gao, C.; Liu, M.; Yuan, W.; and You, X. 2022. Recent Advances in Concept Drift Adaptation Methods for Deep Learning. In IJCAI , 5654–5661. Zhou, M.; Lu, J.; Song, Y .; and Zhang, G. 2023. Multi-Stream Concept Drift Self-Adaptation Using Graph Neural Network. IEEE Transactions on Knowledge and Data Engineering , 35(12): 12828–12841. Page 10: Proof of Identical Error Rate when Concept Drift Occurs Definition 1. LetXbe an input space and Ybe an output space. A concept is a function c:X → Y that assigns to each input x∈ X a corresponding label c(x)∈ Y. A classifier h:X → Y is a learned hypothesis that attempts to approximate a given concept. We assume a data-generating distribution Ddefined over X, with x∼D. At any time, the true concept cdetermines the label for each x. If concept drift occurs, the concept changes from an old concept coldto a new concept cnew, potentially altering the mapping from XtoY. We assume the input distribution Dremains fixed, i.e., p(x)is unchanged, but cold(x)andcnew(x)may differ on some subset of X. Definition 2. A loss function L:Y × Y → R≥0measures the discrepancy between the predicted label h(x)and the true label c(x). For example: •0-1 loss: L(h(x), c(x)) =(0,ifh(x) =c(x) 1,otherwise. •MSE loss: L(h(x), c(x)) = ( h(x)−c(x))2. The error rate or generalization error of hunder concept cand distribution Dis given by: Ex∼D[L(h(x), c(x))]. Theorem 3. Concept drift can occur even if the error rate under a given loss function remains unchanged. More formally, there exist scenarios such that cold(x)̸=cnew(x)for some x∈ X, yet: Ex∼D[L(h(x), cnew(x))] =Ex∼D[L(h(x), cold(x))] holds for both 0-1 and MSE losses. Proof. Consider a classifier h:X → Y and an unchanged distribution D. Let coldandcnewbe the old and new concepts, respectively, and define: Xsame ={x|cold(x) =cnew(x)}, Xdiff={x|cold(x)̸=cnew(x)}. Case 1: 0-1 Loss. We have: Ex∼D[L(h(x), cnew(x))] =Ex∼D[L(h(x), cnew(x))·1Xsame] +Ex∼D[L(h(x), cnew(x))·1Xdiff] =Ex∼D[L(h(x), cold(x))·1Xsame] +Ex∼D[L(h(x), cnew(x))·1Xdiff] =Ex∼D[L(h(x), cold(x))] +Ex∼D[(L(h(x), cnew(x))− L(h(x), cold(x)))·1Xdiff]. IfEx∼D[(L(h(x), cnew(x))− L(h(x), cold(x)))·1Xdiff] = 0 , then: Ex∼D[L(h(x), cnew(x))] =Ex∼D[L(h(x), cold(x))], even though Xdiff̸=∅. Thus, concept drift can occur with no change in error rate under 0-1 loss. Case 2: MSE Loss. Let∆(x) =cnew(x)−cold(x). Then: Ex∼D[(h(x)−cnew(x))2] =Ex∼D[(h(x)−cold(x)−∆(x))2] =Ex∼D[(h(x)−cold(x))2] +Ex∼D[∆(x)2+ 2∆( x)(cold(x)−h(x))]. IfEx∼D[∆(x)2+ 2∆( x)(cold(x)−h(x))] = 0 , then: Ex∼D[(h(x)−cnew(x))2] =Ex∼D[(h(x)−cold(x))2], despite the existence of ∆(x)̸= 0for some x. Hence, under MSE loss, concept drift can also remain undetected by the error rate. In both cases, we find scenarios where concept drift does not manifest as a change in expected error, completing the proof. Page 11: Table of notation Symbol Description D1,t Streaming data collected during the period [1, t] (xj, yj) Instance jwith features xjand label yj¯Dj Chunk jof data M Chunk size P1,t(x, y)Distribution of data in the period [1, t] f Classifier ˆy Predicted class probabilities n Number of classes ei Prediction error of instance xi I(·) Indicator function ui Uncertainty of instance xi fyi(xi) Predicted probability of instance xi belonging to its true class yi Oij Observed frequency in the i-th row andj-th column of the contingency table Eij Expected frequency for the cell at thei-th row and j-th column Hi Histogram built for window i N Total observed frequencies χ2Chi-square test statistic p p-value associated with χ2 w Degrees of freedom K Hyperparameter in Ei-kMeans σ Predefined threshold for the Chi- square test Table 3: Notation table Proof of Theorem 1 Proof. LetWbe a window containing a subset of instances from the data stream D1,t. LetY={1,2, ..., n}be the set of class labels, where nis the number of classes. For an instance xiinW, letˆyi= arg max j∈Yfj(xi)be the predicted class. Define the partition B={B0, B1, ..., B m}ofWas follows: B0={(xi, yi)∈W|ˆyi=yi} m[ j=1Bj={(xi, yi)∈W|ˆyi̸=yi} where Bjare pairwise disjoint, i.e., Bi∩Bj=∅fori̸= j. Moreover,Sm j=1Bjcovers the entire range of PU-index values [0, 1] for misclassified instances LetHWbe the histogram of Wconstructed based on this partition, where each bin corresponds to a set in B. Given HW1=HW2, it implies |B(1) j| |W1|=|B(2) j| |W2|,∀j∈ {0,1, ..., m},where | · |denotes set cardinality. The error rate for window Wis defined as: ¯eW=1 |W|X (xi,yi)∈Wei=1 |W|mX j=1|Bj|= 1−|B0| |W| Therefore, ¯eW1= 1−|B(1) 0| |W1|= 1−|B(2) 0| |W2|= ¯eW2, which implies that the error rate of W1andW2are equal. The error standard deviation for window Wis defined as: σW=vuut1 |W|X (xi,yi)∈W(ei−¯eW)2 =vuut1 |W|X (xi,yi)∈W(e2 i−2ei·¯eW+ ¯e2 W) =vuut1 |W|X (xi,yi)∈We2 i−2¯eW·1 |W|X (xi,yi)∈Wei+ ¯e2 W =vuut1 |W|X (xi,yi)∈Wei−2¯e2 W+ ¯e2 W =q ¯eW−¯e2 W (15) Since ¯eW1= ¯eW2, we have σW1=σW2. Therefore, the error rates and error standard deviations of W1andW2are equal. Proof of Theorem 2 Proof. We prove the theorem by providing a counterexample in binary classification case. Consider two windows W1and W2: •InW1, 50% of instances have uncertainty in (0.9,1.0], and the remaining 50% have uncertainty in [0.0,0.1]. •InW2, 50% of instances have uncertainty in (0.8,0.9], and the remaining 50% have uncertainty in (0.1,0.2]. LetB1={b:b⊆[0,0.5]}andB2={b:b⊆(0.5,1]} denote the sets of histogram bins below and above the uncer- tainty value of 0.5, respectively. In both windows, 50% of instances fall into bins in B2 (uncertainty >0.5) and are misclassified, while the remaining 50% fall into bins in B1(uncertainty ≤0.5) and are correctly classified. Thus, the error rates of both windows are equal: ¯e1=1 n(1)X b∈B2n(1) b= 0.5, (16) ¯e2=1 n(2)X b∈B2n(2) b= 0.5. (17) where n(1)andn(2)are the total numbers of instances in W1andW2, respectively, and n(1) bandn(2) bdenote the num- ber of instances falling into bin binW1andW2, respectively. In this case, the error rates of the two windows are equal. Page 12: Consequently, the error standard deviations of both win- dows are also equal: std1=vuut1 n(1)n(1)X i=1(e(1) i−¯e1)2 =p ¯e1(1−¯e1) = 0 .5, (18) std2=vuut1 n(2)n(2)X i=1(e(2) i−¯e2)2 =p ¯e2(1−¯e2) = 0 .5. (19) However, the bin proportions of the uncertainty histograms H1andH2are different: •InH1, 50% of instances fall into the bin (0.9,1.0], and 50% fall into the bin [0.0,0.1]. •InH2, 50% of instances fall into the bin (0.8,0.9], and 50% fall into the bin (0.1,0.2]. Therefore, even though the error standard deviations or error rates of W1andW2are equal, their uncertainty his- tograms have different bin proportions, which contradicts the statement that equal error standard deviations imply identical bin proportions in the uncertainty histograms. Pseudo code and time complexity of PUDD Algorithm 1: Adaptive PU-index Bucketing algorithm drift detector based on PU-index. 1:Input : Chi-square test threshold σ, PU-index of recent examples ut1,tderived from sliding window strategy, hyperparameter K. 2:forr=t1, ..., t do 3: Derive uC t1,r,uC r,t,uM t1,r,uM r,tfrom equation (1-4). 4: ifMean( uM t1,r)≥Mean( uM r,t)then 5: Initialize Kcentroids Cbased on Ei-kMeans cen- troids initialization algorithm on uC t1,r. 6: Build clusters based on Ei-kMeans partitioning al- gorithm on uC t1,rbased on initial centroids C. 7: For each cluster, count the frequencies of uC t1,rand uC t1,r. Subsequently, calculate the sizes of uM t1,rand uM r,t. Utilize these values to populate the contin- gency table accordingly. 8: Conduct a Chi-square test on the contingency table and derive p-value pby Eq (14). 9: P=P∪ {p}. 10: end if 11:end for 12:ifP̸=∅andminP < σ then 13: Raise drift detected alarm. 14:end if The algorithm framework of the PU-index-based drift de- tector is provided in Algorithm 1. Assuming the current time step is tand the last alarm is raised at time step t1, thus therearet−t1chunks in the current sub-stream. The first step is to split these chunks into two windows. In line 2, we search all the cutting points from the index t1tot. Then we apply the Adaptive PU-index Bucketing-based drift detector on the de- rived two samples from line 3-10. Notice that this brutal-force search can be processed simultaneously by multiprocess pro- cessing in implementation. In line 4, we compare the mean of PU-index on the two samples uM t1,randuM r,t, and we skip the subsequent p-value computation if the mean of uM t1,ris larger than that of uM r,tto speed up the algorithm. From line 5-9, we built clusters and a contingency table to derive the p-value for each cutting point. At line 12, we investigate whether the minimum p-value is smaller than the predefined threshold. If so, the null hypothesis test is rejected and we claim the con- cept drift has occurred. The time complexity for the lines 3 is O(n), where nis the number of examples in the sliding win- dow. The time complexity for the lines 5-9 is O(K2nlogn) as introduced in (Liu, Lu, and Zhang 2020). Thus, the total time complexity of our algorithm is O(K2nlogn). Pseudo code for online instance based PUDD Firstly, we introduce how to incrementally update the key variables, clusters, and contingency tables. Then we provide the pseudo code of online instance based PUDD algorithm. Incremental Updates of Sets. At any time t, for each candidate cut point r∈[t1, t], we have the following sets: uC t1,r(t),uM t1,r(t),uC r,t(t), and uM r,t(t), derived from the PU- index. Here, uC t1,r(t)anduM t1,r(t)represent the correctly and mis- classified instances, respectively, in the substream from t1to r, while uC r,t(t)anduM r,t(t)represent the correctly and mis- classified instances, respectively, in the substream from r+ 1 tot. When time advances to t+ 1, we receive a new instance ut+1with prediction ˆyt+1and true label yt+1. We update the sets for all r∈[t1, t+ 1] as follows. uC t1,r(t+1) =  uC t1,r(t), r < t + 1 uC t1,r(t)∪ {ut+1}, r=t+ 1andˆyt+1=yt+1 uC t1,r(t), r =t+ 1andˆyt+1̸=yt+1 (20) uM t1,r(t+1) =  uM t1,r(t), r < t + 1 uM t1,r(t), r =t+ 1andˆyt+1=yt+1 uM t1,r(t)∪ {ut+1}, r=t+ 1andˆyt+1̸=yt+1 (21) uC r,t+1(t+1) =  uC r,t(t)∪ {ut+1}, r < t + 1andˆyt+1=yt+1 uC r,t(t), r < t + 1andˆyt+1̸=yt+1 ∅, r =t+ 1 (22) Page 13: uM r,t+1(t+1) =  uM r,t(t)∪ {ut+1}, r < t + 1andˆyt+1̸=yt+1 uM r,t(t), r < t + 1andˆyt+1=yt+1 ∅, r =t+ 1 (23) In these definitions, if r < t + 1, the first window ut1,r remains unchanged, and we only incrementally update the second window ur,t+1by adding the new instance ut+1ac- cording to its classification correctness. If r=t+ 1, a new cut point is introduced, and the first window ut1,ris aug- mented by ut+1if it is correctly classified, while the second window ur,t+1is initially empty. Incremental Updates of Clusters. For each r∈[t1, t], at timet, we have clusters: Ct1,r(t) ={C1(t), C2(t), . . . , C K(t)}, (24) obtained by applying Ei-kMeans initialization and subse- quent kMeans clustering on uC t1,r(t). When moving to time t+ 1, the incremental update is: Ct1,r(t+ 1) =(Ct1,r(t), r < t + 1 kMeans (uC t1,r(t+ 1), Cinit), r=t+ 1 (25) Ifr < t + 1, no new instances affect the first window ut1,r, so we can directly use the previously cached clusters. Ifr=t+ 1, the arrival of ut+1modifies uC t1,r(t+ 1), requir- ing a fresh Ei-kMeans initialization and kMeans run. Once computed, Ct1,r(t+ 1) is cached for future reuse. Incremental Updates of Contingency Tables. For a given r∈[t1, t], at time t, we have a contingency table: Tt1,r(t)∈R2×(K+1), where the first row corresponds to (uC t1,r(t), uM t1,r(t))and the second row corresponds to (uC r,t(t), uM r,t(t)). The first K columns represent the counts of correctly classified instances in each of the Kclusters, and the last column stores the counts of misclassified instances. When time advances to t+ 1, we update Tt1,r(t)as: Tt1,r(t+ 1) =(Tt1,r(t) + ∆ T, r < t + 1 Compute Contingency Table , r=t+ 1 (26) The term ∆Taccounts for the incremental contribution of the new instance ut+1. To compute ∆T, we first determine whether ut+1falls into the first or second window, depending onr. Next, we check if ut+1is correctly classified or not. If it is correctly classified, we identify its assigned cluster using Ct1,r(t+ 1) , and increment the corresponding cluster count in the table. If it is misclassified, we increment the misclassified count in the last column. This process involves only a single increment operation per new instance, making it highly efficient. Ifr=t+ 1, since the composition of uC t1,r(t+ 1) (and possibly uM t1,r(t+ 1) ) changes due to the arrival of ut+1, we must fully recompute the contingency table according toAlgorithm 2: Online Instance based PU-index Bucketing- based Drift Detector Require: Current t+ 1, last alarm t1, threshold σ,K,ut1,t, cached results. 1:P← ∅ 2:forr=t1→t+ 1do 3: Update uC t1,r(t+ 1), uM t1,r(t+ 1), uC r,t+1(t+ 1), uM r,t+1(t+ 1) using Eqs. (20)–(23) 4: Compute Ct1,r(t+ 1) using Eq. (25) 5: Compute Tt1,r(t+ 1) using Eq. (26) 6: Compute Chi-square p-value pr 7: P=P∪ {pr} 8:end for 9:ifP̸=∅andmin(P)< σ then 10: Raise drift alarm 11:end if uC t1,r(t+ 1),uM t1,r(t+ 1),uC r,t+1(t+ 1),uM r,t+1(t+ 1). After recomputation, Tt1,r(t+ 1) is cached. Time complexity analysis. By leveraging caching and incremental updates as defined in Eqs. (20)–(23), (25), and (26), our online instance-based PUDD algorithm achieves ef- ficient updates with minimal computational overhead. When previously computed results are available, there is no need for a full recomputation of clusters or contingency tables, thus enabling near-constant-time updates at most time steps. The pseudo code of the online instance-based PUDD algo- rithm is shown in Algorithm 2. Consider the computational cost at each relevant line (as annotated in the algorithm): •Updating uC t1,r(t+ 1) , uM t1,r(t+ 1) , uC r,t+1(t+ 1), uM r,t+1(t+ 1) using Eqs. (20)–(23) requires O(1)time (line 3). •Recomputing clusters (line 4) requires O(1)time if r̸= t+ 1due to cached reuse, and O(K2nlogn)otherwise, where nis the number of examples from t1tot+ 1. •Incrementally updating the contingency table (line 5) is O(K), as it involves placing a single new instance into an existing bin structure. •Computing the Chi-square statistic and deriving the p- value (line 6) is O(2K). In total, the time complexity is: O(K2nlogn+ (t−t1)(1 + 3 K)). Since the for-loop can be parallelized via multi-threading, the complexity is effectively reduced to: O(K2nlogn+ 3K), for the incremental component. This parallelization, com- bined with caching and incremental updates, enables efficient online PU-index-based drift detection with substantially re- duced overhead while maintaining strong detection capabili- ties. Page 14: Implementation details Our experiment consists of four parts: comparison with clas- sic drift detectors, comparison with SOTA methods, compari- son with baselines on the CIFAR-10-CD dataset, and ablation study. The experimental setup was established on a server run- ning Ubuntu 18.04, which was outfitted with an NVIDIA A100 GPU and possessed 200GB of RAM. The performance of drift detectors was evaluated across three classifiers: a Gaussian Naive Bayes classifier (GNB), a Hoeffding Tree Classifier (Very Fast Decision Tree, VFDT) (Hulten, Spencer, and Domingos 2001), and a Deep Neural Network (DNN) (LeCun, Bengio, and Hinton 2015). GNB was implemented using the widely-used scikit-learn package (Pedregosa et al. 2011), VFDT via the River package (Montiel et al. 2021), and the DNN, a three-layer neural network, through the Py- Torch package. The DNN consists of a multi-layer perceptron (MLP) followed by an output layer. For most datasets, the DNN comprises two hidden layers, each with 64 neurons and ReLU activation. The input dimension is dataset-dependent. For the airline dataset, we employ three hidden layers (512, 256, and 64 neurons respectively), all using ReLU activation. The input dimension for this specific case is 679. The final layer is a fully connected layer mapping from 64 neurons to the output dimension, which varies based on the datasets. All experiments have been repeated 100 times with differ- ent random seeds and average accuracy is reported in this paper. The bins of the histogram, i.e., k, built by Adaptive PU-index bucketing are automatically determined by the Ei- kMeans algorithm. We initialize it as 5. We first introduce the experiment of classic drift detectors. For each time step, classifiers receive a data chunk for prediction and then we feed the error rate into the drift detector for drift detection and adaptation. For a fair comparison, GNB and VFDT were adapted by retraining on the current chunk, while DNN adap- tation involved resetting the last layer. The evaluation was conducted using two systems: Incremental way (test-and- train) and Train Once Until Alarm way (training the classifier only upon new concept detection or initialization). Moreover, the Adam optimizer was employed with an initial learning rate of 0.01 for the training of DDN, and the training involved 100 epochs. The setting of the ablation study is the same as above. For comparison with SOTA methods, considering that all SOTA approaches, except MCDD, are based on ensemble learning, such as AMF and IWE using 10 classifiers, and others allowing up to 10,000 base classifiers, we ensured fairness by also using an ensemble approach for PUDD and MCDD. The classifier of our method comprises 5 DNNs as base classifiers. The final prediction is derived through soft voting based on the individual predictions of these classifiers. Additionally, we combine the prediction uncertainty from all 5 base classifiers for effective drift detection. We also use the Adam optimizer with a learning rate of 0.01 for this experiment but with an increased epoch number of 100. The adaptation approach of this experiment is resetting the classifier when a drift is detected. The increased training epoch number and the adaptation approach of the resetting model are aimed at enhancing the learning capacity and fine-Datasets Sea-N Sine Mixed Elec2 Airline Powersupply CIFAR-10-CD # of feature 3 2 4 8 679 4 32x32 # of class 2 2 2 2 2 2 10 # of dataset 100k 100k 100k 45k 58k 29k 50k chunk size 1000 1000 1000 1000 1000 1000 100 Table 4: Statistics of datasets. tuning the performance of the Deep Neural Networks within our ensemble framework. For the proposed CIFAR-10-CD dataset, which is an image dataset with concept drift, we utilized ResNet-18 (He et al. 2016) as the classifier model. The optimization was carried out using the Stochastic Gradient Descent (SGD) optimizer with a learning rate set to 0.01. We only evaluate incremen- tally for CIFAR-10-CD datasets due to their high learning difficulty and the training epoch is set as 5. Datasets introduction 1.Elec2(Harries 1999): A real-world dataset from the Aus- tralian electricity market with 45,312 instances (only the first 45,000 were used and split into 45 chunks). It in- cludes features like electricity demand and a binary class label for price direction. 2.Airline(Ikonomovska 2011): This real-world dataset in- cludes 581,0462 (we only use the first 58,000 instances) flight schedule records with binary labels for flight de- lays, containing concept drift due to changes in days, and times. The dataset is split into 58 chunks. The dataset is one-hot encoded and the dimension increases to 679 following (Tahmasbi et al. 2021). The dataset can be ob- tained through the official GitHub repository of the work. 3.PowerSupply(Dau et al. 2019): A real-world dataset of 29,928 (only the first 29,000 instances are used and split into 29 chunks) hourly power supply records, exhibiting concept drift potentially due to seasonal changes or differ- ences between weekdays and weekends. 4.SEA(Bifet et al. 2010): A synthetic dataset from MOA (Bifet et al. 2010) with 100,000 data points divided into 100 chunks, demonstrating concept drift by changing thresholds of classification function and including noise variants. We alter the threshold every tenth chunk. 5.SINE(Gama et al. 2004): Characterized by two attributes, this dataset involves abrupt concept drift with 100,000 samples (divided into 100 chunks), where labels are deter- mined by a sine function and altered every tenth chunk. 6.Mixed(Gama et al. 2004): Designed for studying abrupt concept drift, this dataset comprises 100,000 boolean and numeric examples. We alter its classification function every ten chunks to simulate concept drift. 7.CIFAR-10-CD: A variant of the CIFAR-10 image dataset, modified to simulate concept drift in user interests. The 50,000 images are divided into 100 chunks. The labels of interests change according to the Markov process. Page 15: Baselines introduction 1.DDM(Gama et al. 2004): This method monitors the on- line error rate of a learning algorithm and declares a new context when the error is higher than a predefined warning and drift level. 2.EDDM(Baena-Garcıa et al. 2006): EDDM detects concept drift, especially in slow and gradual changes, by measur- ing the distance between classification errors instead of the error rate. 3.HDDM-A/W(Frias-Blanco et al. 2014): HDDM-A fo- cuses on abrupt changes using a moving average, while HDDM-W detects gradual changes with weighted moving averages, both maintaining the efficiency of the original HDDM. 4.KSWIN(Raab, Heusinger, and Schleif 2020): Kolmogorov-Smirnov Windowing uses a sliding window and the Kolmogorov-Smirnov test to detect distribution changes in data streams. 5.PH(Sebasti ˜ao and Fernandes 2017): This method en- hances the Page Hinkley Test by using a data-dependent second-order intrinsic mode function for more robust change detection without the need for manual parame- ter tuning. 6.ADWIN(Bifet and Gavalda 2007): ADWIN dynamically manages variable-sized data windows for concept drift in data streams, automatically adjusting the window size based on detected changes. 7.IWE(Jiao et al. 2022): This method adapts to new con- cepts by incrementally adjusting the weights of historical classifiers, using a variable-size window for rapid adapta- tion to changing data patterns. 8.AMF(Mourtada, Ga ¨ıffas, and Scornet 2021): An online variant of the Random Forest algorithm, AMF uses Mon- drian Forests for real-time learning and adaptation. It prunes the decision trees to adjust to data pattern changes. 9.ADLTER(Wang et al. 2022): It introduces an Adaptive Iterations method for Gradient Boosting Decision Tree. The method dynamically adjusts the number of iterations in response to drift severity. It also retrains classifiers when necessary. 10.NS(Wang et al. 2021): This method uses a Loss Improve- ment Ratio to evaluate GBDT weak learners. It proposed both Naive and Statistical Pruning strategies to adjust the model by pruning less effective learners in response to concept drift. 11.MCDD (Wan, Liang, and Yoon 2024): A novel concept drift detection method that uses maximum concept dis- crepancy and contrastive learning of concept embeddings to adaptively identify various forms of concept drift in high-dimensional data streams. Observations of PUDD performance Observation 5: increment learning is generally better than training only once until alarm way from Table 1. This observation aligns with the PAC theory (Valiant 1984), which posits that the error of the classifier is inversely proportionalIncremental Training The other case Classifier ddm name sea10-I sea20-I sea10-O sea20-O DNNADWIN 87.89 78.15 84.85 75.04 DDM 88.04 78.12 83.88 74.30 EDDM 83.63 77.32 77.47 72.28 HDDM-A 87.90 77.81 85.58 75.74 HDDM-W 87.92 77.61 84.72 76.04 KSWIN 87.97 78.09 82.31 74.54 PH 87.79 77.92 85.92 75.85 PUDD-1 87.86 77.50 85.91 76.00 PUDD-3 87.80 77.33 86.02 76.35 PUDD-5 87.80 77.38 86.24 76.38 GNBADWIN 86.24 76.90 85.43 75.63 DDM 85.47 76.26 84.58 74.72 EDDM 86.26 76.67 85.78 75.91 HDDM-A 86.63 77.22 85.82 76.25 HDDM-W 85.70 77.44 85.05 76.70 KSWIN 84.34 76.19 82.84 74.73 PH 86.89 77.57 86.18 76.67 PUDD-1 87.24 78.08 86.53 76.89 PUDD-3 87.28 77.86 86.61 77.02 PUDD-5 87.18 77.68 86.55 76.67 VFDTADWIN 85.50 76.05 85.58 75.86 DDM 84.89 75.87 84.62 74.96 EDDM 86.13 76.59 85.97 76.48 HDDM-A 85.62 76.22 85.83 76.41 HDDM-W 85.34 76.44 85.15 76.73 KSWIN 84.53 75.86 82.64 74.66 PH 85.67 76.41 86.15 76.57 PUDD-1 86.35 77.02 86.30 76.85 PUDD-3 86.25 76.90 86.57 76.86 PUDD-5 86.19 76.83 86.19 76.29 Table 5: Comparative analysis against classic drift detectors across 2 synthetic datasets. The top 3 results are highlighted in bold and the top 1 results are in both bold and underlined. PUDD- xrepresents the threshold set as 10−xfor our method. to the number of training data. The reason is that this training approach allows classifiers to learn continuously and adapt to new data. This way, the classifiers produce prediction uncertainty that better reflects the current concept, leading to more accurate drift detection. The only exception is that DNN shows better performance on mixed datasets under the setting of trained only once until alarm way, e.g. when the threshold is set as 10−5, the test accuracy is 82.81% for incremental training vs 84.90% for the counterpart. The potential reason is the dataset contains both boolean and numeric datasets, which requires data preprocessing for DNN to conduct continual learning. However, our experiments aim to compare PUDD with baselines to verify its effectiveness, thus we did not specifically design data preprocessing for the dataset. Moreover, the overall experiment result confirms the advantage of incremental learning. Observation 6: our method demonstrates higher en- hanced performance with DNNs compared to classic ma- chine learning classifiers, e.g. GNB and VFDT . For exam- ple, under the incremental training setting, our approach with Page 16: ddm name sea10-I sea20-I AMF 83.70 73.41 IWE 84.73 74.33 NS 84.39 76.00 ADLTER 85.89 76.48 MCD-DD 87.22 77.25 PUDD-1 87.72 76.93 PUDD-3 87.67 77.22 PUDD-5 87.74 77.32 Table 6: Comparison with SOTA methods. The table shows that our methods PUDD in achieved top-1 in 2 out of 2 datasets, implying the effectiveness of PUDD compared with SOTA methods. DNNs surpasses the other two classifiers in 6 out of 8 datasets. In comparison, ADWIN combined with DNN only exceeds the performance of the other classifiers in 4 datasets. This observation suggests that the PU-index generated by DNNs provides more detailed insights than those from GNB and VFDT. Such granularity acts as a more informative indicator for drift detection, and our method, PUDD, capitalizes on this feature more effectively than the other baseline approaches. Critical difference diagrams of our experiments Figure 6: Critical difference diagrams of the rank of PUDD and classic drift detectors. The figure shows that the PUDD with threshold 10−3and10−5outperform all classic drift detectors and the result is statistically significant. Figure 7: Critical difference diagrams of the rank of PUDD and SOTA methods. The figure shows that the PUDD out- performs all SOTA methods and the result is statistically significant. Figure 8: Critical difference diagrams of the rank of PUDD (based on Adaptive PU-index bucketing) and EK (original Ei-kMeans partitioning). The suffix of each method indicates the logarithm of the threshold value. The figure shows that the improvement of the Adaptive PU-index Bucketing with threshold 10−3and10−5are statistically significant.

---