1 Introduction
Testing is a critical part of the quality-assurance process for models used in real-world applications, and it reduces the time spent on model development. Traditional testing takes a model and a test set as input, and outputs quality metrics such as accuracy, precision, or recall. A critical assumption in this process is that the test set is uniformly and randomly sampled from production traffic. If this assumption is violated, the test set results can no longer be used as an indicator of how the model will perform in the production environment.

Unfortunately, production data continually changes over time. A test set that is perfectly representative today may be hopelessly out of date next week if production data shifts. One solution to this problem is to continually refresh the test set by sampling new data from production; taken to the extreme, the entire test set could be resampled on a regular basis. However, labeling data is an expensive and laborious process. There is typically a strong desire to minimize the amount of labeling that occurs, and when labeling does occur, priority is often given to expanding model training data rather than test sets (settles2009active; bachman2017learning; konyushkova2017learning). As a result, test sets often drift out of date and the metrics they produce have little relevance to what is happening in production.

This paper addresses these problems by proposing a new formulation of testing that drastically reduces the labeling effort required to maintain a high-quality test set. It does so by breaking the assumption that the test set distribution needs to match that of the production data. In our approach, performance metrics are produced via a performance predictor which takes as input not only a model and test set, but also a batch of unlabeled production data.
By having access to both the test set and production data, a performance predictor can observe any distributional differences between them and use this information to compute more accurate predictive metrics. Our experimental results show that this technique can reduce the labeling effort needed to achieve a given test set error by 80%–100% over a range of datasets and drift scenarios.
The main contributions of this paper are as follows: (1) we demonstrate the effectiveness of performance prediction and test set resampling in approximating the model accuracy on an unlabeled dataset, (2) we evaluate several variants of test set update strategies, and (3) we shift the attention of labeling-effort-reduction research from the training stage of learning models to the testing stage, considering practical scenarios where retraining the model is expensive and possibly unnecessary. Results are presented on multiple datasets spanning a variety of classification tasks.
1.1 Problem Formulation
Our technique assumes a trained base model $h$, a batch of unlabeled data samples $U = \{x_1, \dots, x_n\}$, and a constant labeling cost per individual data point. Given a budget to label up to $B$ points from $U$, let $S \subseteq \{1, \dots, n\}$, $|S| \le B$, be an index subset identifying elements of $U$ that have been labeled, with $Y_S = \{y_i : i \in S\}$ the set of the corresponding labels. We seek to generate an accuracy estimate $\hat{A}(h, U, Y_S)$ using only the labels $Y_S$ such that the following quantity is minimized:

$$\mathbb{E}\left[\,\big|\hat{A}(h, U, Y_S) - A(h, U, Y)\big|\,\right] \qquad (1)$$

Above, $Y$ denotes a set of labels for the entire dataset and $\mathbb{E}$ denotes the expectation. While $A$ refers to the conventional accuracy calculation for a model $h$, $\hat{A}$ stands for any function generating an estimate of such a metric, in particular including the output of predictive models.

Further expanding the above scenario, suppose there is an infinite sequence of batches $U_1, U_2, \dots$. The joint distribution of samples and labels, $P(x, y)$, may change from batch to batch, and retraining is expensive. Suppose we maintain a working test set $T_t$ as a result of processing past batches $U_1, \dots, U_{t-1}$. For $U_t$, the following questions are relevant: is there an optimum strategy to (a) add (label) and (b) remove previously labeled points from the test set $T_t$, s.t. Eq. (1) is minimized on $U_t$?

While the above formulation asks for an optimum strategy, such a strategy may only exist under certain assumptions, e.g., for certain model types and data constraints. In general, the above problem is underdetermined, and we defer a theoretical analysis to future work. In this paper, we tackle a relaxed version of the problem, namely generating accuracy estimates that require fewer, albeit not minimal, labeled points compared to their conventional alternative. We show empirically that a better approximation of $A$ can be found by using these three methods: performance prediction, uncertainty prioritization, and resampling, while labeling only $|S| \le B$ samples to stay within budget.

1.2 Related Work
Model testing in machine learning (ML) generally faces the oracle problem: the ground truth for the model output is not available. This holds for both supervised and unsupervised learning (xie2018mettle). Several studies use adversarial samples to reveal a model's failure cases and retrain the model on those "adversarial" samples. To enhance this formulation, several prioritization methods have been suggested to find the "best" samples first (zhang2019noise; byun2019input). Instead of detecting failure cases, ML testing can also be formulated as an accuracy prediction problem (li2019boosting). The advantage of this approach is that it takes the operational context into account, unlike the adversarial-example-based methods. A more extensive survey of the machine learning testing workflow, its importance, and its components can be found in (zhang2019machine).

In this paper, we also consider testing as an accuracy (or performance) prediction problem. However, we tackle the problem considering the full model lifecycle. In the work of li2019boosting, a trained model and an unlabeled dataset (operational data) are leveraged to determine which samples to label from the operational context; those samples are then used as the test set to calculate the traditional accuracy estimate. Our method differs in that (1) it admits the existence of an "outdated" test set at the beginning of the process and tackles its continuing update, and (2) it performs the accuracy prediction on the (larger) unlabeled operational data, instead of computing the traditional estimate from labels.
2 Preliminaries
2.1 Sample Selection Bias Correction
In the presence of sample selection bias (SSB), the empirical distribution of a finite dataset differs from its underlying population distribution. Let $s \in \{0, 1\}$ denote a random variable indicating whether a particular instance is selected in the sample. If the probability of $s = 1$ depends on the feature or label variables, the selection is said to be biased. SSB is a widely studied problem (elkan2001foundations; zadrozny2004learning; zadrozny2003cost; fan2005improved; dudik2006correcting), as it contributes to discrepancies between training and testing data in machine learning. A special case of interest in our investigations is the so-called covariate shift, which assumes that the probability of $s = 1$ depends only on the features and not the label, i.e., $P(s = 1 \mid x, y) = P(s = 1 \mid x)$. zadrozny2004learning described an optimum bias correction method for learners in the presence of covariate shift via reweighting.

Let $P(x, y)$ be the true distribution, and let $Q(x, y)$ be a distribution producing a dataset with an inherent covariate shift (selection bias with respect to $P$). Using the selection variable $s$, this relationship is expressed as $Q(x, y) = P(x, y \mid s = 1)$. Recalling that under covariate shift it holds that $P(s = 1 \mid x, y) = P(s = 1 \mid x)$, the true distribution can be recovered from the biased distribution as follows:

$$P(x, y \mid s = 1) = \frac{P(s = 1 \mid x)\, P(x, y)}{P(s = 1)} \qquad (2a)$$

$$P(x, y) = \frac{P(s = 1)}{P(s = 1 \mid x)}\, P(x, y \mid s = 1) \qquad (2b)$$

Eq. (2b) gives the relationship between the true and biased distributions and identifies the corrective element: a weighting factor given by $\frac{P(s = 1)}{P(s = 1 \mid x)}$, which can be found for each instance of an unlabeled set. The quantities in the above equation need to be estimated from finite samples. An extensive study of the effects of estimation error on bias correction is given by cortes2008sample.
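As a numeric illustration of the corrective factor in Eq. (2b) (all numbers here are hypothetical):

```python
def bias_correction_weight(p_selected, p_selected_given_x):
    # Eq. (2b): w(x) = P(s=1) / P(s=1|x).
    # Instances from under-selected regions (small P(s=1|x)) receive
    # weights above 1; over-selected regions are down-weighted.
    return p_selected / p_selected_given_x

# Hypothetical numbers: 25% of the population was selected overall,
# but instances like x were selected half the time.
w_over = bias_correction_weight(0.25, 0.50)   # over-represented region
w_under = bias_correction_weight(0.25, 0.10)  # under-represented region
print(w_over, w_under)
```

Reweighting the biased sample by these factors recovers, in expectation, statistics under the true distribution $P$.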
2.2 Calibration
Generally speaking, the output of machine learning models can be used for diagnostic or prognostic purposes. While the only important quality for diagnostic (e.g., classification) purposes is the relative ranking of the outputs (e.g., which class attains the highest score), for prognostic purposes (e.g., accuracy estimation), the ability to output a well-calibrated quantity becomes essential. It is well known that machine learning models tend to produce more or less miscalibrated output probabilities. As a remedy, a variety of calibration techniques are available (Guo2017; Zadrozny2002; bella2010calibration). In our case, calibration is achieved by a binning mechanism as part of the performance prediction, described in Section 3.1.
3 Methods
Our overall approach to addressing the problems formulated in Section 1.1 is depicted in Figure 1. The goal is to generate an accuracy estimate that is as close as possible to the true accuracy on the entire production set. In practice, this value is unknown, as no ground truth (labels) is available. To tackle this, we investigate three techniques, separately as well as in combination: (1) performance prediction (described in Section 3.1), (2) resampling (Section 3.2), and (3) prioritization of label addition and removal (Section 3.3). The ability of the performance predictor to estimate the accuracy on a large amount of unlabeled data, while continually adjusting its model as new labels become available, plays an essential role. Making the working test set resemble (in feature distribution) the unlabeled production set as much as possible further improves the potential for an accurate estimate (see Figure 1).
3.1 Performance Prediction
Figure 2 illustrates the performance predictor's functionality: given a base model carrying out a classification task, the performance predictor acts as a meta-model observing the outputs (confidences) as well as the outcomes (i.e., correct vs. incorrect classifications) of the base model. In general, a performance predictor learns to predict the instance-wise probability of the base model succeeding at its task, and the base model may be considered either a white-box or a black-box with respect to extracting various meta-features (see Figure 1 of (chen2019AISTATS)). In this study, we consider a black-box base model and adopt a non-parametric meta-model based on binning. The training is given in Algorithm 1. Here, the confidence scores generated by the base model are assigned to equidistantly spaced bins. The sample mean and standard deviation of the corresponding outcomes, i.e., accuracy and spread, are then calculated in each bin. The totality of such bins represents the predictor. Algorithm 2 exercises the predictive step: given an instance of a confidence score, the predictor returns the mean and spread looked up in the corresponding bin. To obtain the cumulative (batch) estimate of accuracy, the instance-wise predictions are averaged over the batch.

The above performance predictor naturally combines two functions, namely meta-learning and calibration. In our initial scoping experiments, this calibration-based setup performed as well as more complex parametric meta-models followed by explicit calibration. We also experimented with other calibration techniques, for instance isotonic regression, which, in effect, achieves adaptive binning. Our experimental evidence showed no significant differences in calibration quality (Brier score) between these methods, implying a well-behaved bin utilization of the fixed binning mechanism. In light of that, we focus on the simpler of these variants to promote simplicity throughout the approach.
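A minimal sketch of the binned meta-model described by Algorithms 1 and 2; the class name, default bin count, and API are illustrative, not the paper's implementation:

```python
import math

class BinnedPerformancePredictor:
    """Non-parametric meta-model: equidistant confidence bins storing the
    mean and standard deviation of base-model outcomes (1 = correct)."""

    def __init__(self, n_bins=10):
        self.n_bins = n_bins
        self.mean = [0.0] * n_bins   # per-bin accuracy
        self.std = [0.0] * n_bins    # per-bin spread (uncertainty proxy)

    def _bin(self, confidence):
        # Equidistant bins over [0, 1]; the top edge folds into the last bin.
        return min(int(confidence * self.n_bins), self.n_bins - 1)

    def fit(self, confidences, outcomes):
        # Algorithm 1: bucket outcomes by confidence, store mean and std per bin.
        buckets = [[] for _ in range(self.n_bins)]
        for c, o in zip(confidences, outcomes):
            buckets[self._bin(c)].append(o)
        for b, vals in enumerate(buckets):
            if vals:
                m = sum(vals) / len(vals)
                self.mean[b] = m
                self.std[b] = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
        return self

    def predict(self, confidence):
        # Algorithm 2: look up expected accuracy and spread in the bin.
        b = self._bin(confidence)
        return self.mean[b], self.std[b]

    def batch_accuracy(self, confidences):
        # Cumulative (batch) estimate: average the instance-wise predictions.
        return sum(self.predict(c)[0] for c in confidences) / len(confidences)
```

Fitting uses the labeled test set (confidences plus correct/incorrect outcomes); `batch_accuracy` can then be evaluated on the unlabeled production batch, for which only confidences are needed.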
3.2 Test Set Resampling
A common source of discrepancy between current production data and a test set is covariate shift (CS). Under CS, changes in the joint distribution of features and labels, $P(x, y)$, are explained solely by changes in $P(x)$, i.e., $P(x, y) = P(y \mid x) P(x)$, with $P(y \mid x)$ unchanged. To counter any CS present in our setting, the SSB correction technique described in Section 2.1 is applied.

Due to sample selection bias, given a labeled test set $T$ and an unlabeled production set $U$, we cannot assume that $P_T(x) = P_U(x)$. To use Eq. (2b) to correct for the selection bias, $P(s = 1)$ and $P(s = 1 \mid x)$ need to be estimated (cortes2008sample). To calculate $P(s = 1 \mid x)$, we discretize the problem domain by using the classical technique of vector quantization (gray1984vector; soong1987report), which maps each data vector to a representative codeword. Each of our data vectors consists of features concatenated with the output class probabilities obtained from the base model's output (model confidence values). To construct codewords, we apply K-means clustering to the combined test and production sets, producing a set of $K$ cluster centroids (the codewords). For experiments with structured datasets, $K$ is chosen to be equal to the size of the data vector, and for image datasets, $K = 256$ is used. Given the mapping $c(x)$ from feature vectors to their respective centroids, the probability $P(s = 1 \mid c(x))$ can be estimated as $n_T(c(x)) / (n_T(c(x)) + n_U(c(x)))$, where $n_T(c)$ and $n_U(c)$ are the number of times codeword $c$ is encountered in the test set and the production set, respectively. Finally, the weights, used as resampling ratios here, are obtained as:

$$w_i = \frac{\hat{P}(s = 1)}{\hat{P}(s = 1 \mid c(x_i))} \qquad (3)$$

Weights falling below a certain threshold are reset to zero. This is equivalent to deleting the corresponding (over-represented) sample from the test set. Based on the above weights, the resampling is performed as an upsampling procedure as follows: all weights $w_i$ are divided by the smallest occurring positive weight, $w_{\min}$, and rounded to the closest integer to obtain the new (upsampled) count for the $i$-th sample. Note that the weight thresholding also controls the upsampled test set size.
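A sketch of the weight computation in Eq. (3), assuming codeword assignments $c(x)$ have already been produced upstream by K-means; estimating the marginal $P(s = 1)$ as $|T| / (|T| + |U|)$ is our assumption here, not spelled out in the text:

```python
from collections import Counter

def resampling_weights(test_codes, prod_codes, threshold=0.1):
    """Eq. (3): w_i = P(s=1) / P(s=1 | c(x_i)), with
    P(s=1 | c) estimated as n_T(c) / (n_T(c) + n_U(c))."""
    n_t, n_u = Counter(test_codes), Counter(prod_codes)
    # Assumption: estimate the marginal P(s=1) by the labeled fraction.
    p_select = len(test_codes) / (len(test_codes) + len(prod_codes))
    weights = []
    for c in test_codes:
        p_select_given_c = n_t[c] / (n_t[c] + n_u[c])
        w = p_select / p_select_given_c
        weights.append(w if w >= threshold else 0.0)  # zero out over-represented samples
    return weights

def upsample_counts(weights):
    # Divide by the smallest positive weight and round to the closest
    # integer to obtain the new (upsampled) per-sample counts.
    w_min = min(w for w in weights if w > 0)
    return [round(w / w_min) for w in weights]
```

A test-set sample whose codeword is rare in production gets a small weight (and is dropped below the threshold), while samples from regions dominated by production data are replicated.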
3.3 Prioritized Sample Addition/Removal
The performance predictor, trained according to Algorithm 1 on the labeled data (test set), has two outputs for each bin: the accuracy and its standard deviation. We adopt the deviation output as a proxy for the predictor's uncertainty. This prediction uncertainty lends itself to label prioritization: to add samples, we first label those with the highest uncertainty. Doing so is expected to improve the performance predictor's quality in the relevant bin (this is addressed empirically in Section 5). Similarly, to delete an element of the current test set, we prioritize samples with the lowest uncertainty values.
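The two prioritization rules can be sketched as follows (the function names are ours):

```python
def prioritize_additions(pool, uncertainties, batch_size):
    # Label the pool samples the predictor is least certain about first.
    order = sorted(range(len(pool)), key=lambda i: -uncertainties[i])
    return [pool[i] for i in order[:batch_size]]

def prioritize_deletions(test_set, uncertainties, batch_size):
    # Delete the most "certain" samples first; these sit in well-populated
    # bins and may be over-represented in the test set.
    order = sorted(range(len(test_set)), key=lambda i: uncertainties[i])
    return [test_set[i] for i in order[:batch_size]]
```

Here `uncertainties[i]` is the standard-deviation output of the performance predictor for sample `i`.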


Table 1: Dataset statistics, splitting conditions, and base models.

| Data Set | Splitting Condition | # Features | Test Size | Prod Size | Train Size | # Classes | Classifier |
|---|---|---|---|---|---|---|---|
| Fashion MNIST | Class | 784 | 1403 | 4210 | 27786 | 10 | LeNet |
| Credit Card Default | 1st Month Status = 0 | 23 | 590 | 1770 | 11682 | 2 | Random Forest |
| Bank Marketing | # of employees | 20 | 1033 | 3100 | 20460 | 2 | Random Forest |
| Bank Marketing | Contact = "Cell" | 20 | 553 | 1660 | 10956 | 2 | Random Forest |
4 Experimental Design
4.1 Datasets and Base Models
To study the methods described in Section 3 on a variety of tasks and data types, we adopted three datasets: (1) Bank Marketing (https://www.kaggle.com/janiobachmann/bankmarketingdataset), (2) Default of Credit Card Clients (https://www.kaggle.com/uciml/defaultofcreditcardclientsdataset), and (3) Fashion MNIST (https://www.kaggle.com/zalandoresearch/fashionmnist). Table 1 summarizes the dataset statistics, type of base model, and splitting condition (explained in Section 4.2). As base models (i.e., models performing the base task), we trained a LeNet model (lecun1998gradient) for the image classification task (Fashion MNIST), and random forest models (breiman2001random) for the others.
The LeNet model was trained for 50 epochs with a batch size of 128 and an SGD optimizer with learning rate 0.002 and momentum 0.5. The random forest models used 100 trees with a max depth of 2.
4.2 Biased Splitting
Traditional test sets collected during a one-time model validation process are ineffective if they differ significantly from incoming production data. To validate our methods with respect to distributional differences between the test and production sets, we employ the notion of a "biasing feature" to partition datasets into train, test, and production sets as follows: for a chosen biasing feature, a binary criterion is set up ("Splitting Condition" in Table 1), which is used to partition the data into bins $b_0$ and $b_1$. We then proceed to construct train, test, and production sets by sampling a range of different proportions from these two bins. Specifically, the train and test sets are both sampled with a proportion $p$ from bin $b_0$ and $1 - p$ from bin $b_1$, while the production set is given by the opposite: $1 - p$ from bin $b_0$ and $p$ from bin $b_1$. Figure 3 illustrates this splitting logic for one half of the symmetry. Each such split represents a practical scenario where the base model's training conditions differ from the operational conditions in the production set. Our experiments are carried out for each split independently.
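The splitting logic for one proportion $p$ can be sketched as follows (the helper and its sizes are illustrative; the paper's actual set sizes and proportion grid are given in Table 1 and Figure 3):

```python
import random

def biased_split(b0, b1, p, n_test, n_prod, seed=0):
    """b0/b1: items satisfying / violating the splitting condition.
    The test set draws a fraction p from b0 and 1-p from b1; the
    production set uses the mirrored proportions (1-p from b0, p from b1)."""
    rng = random.Random(seed)
    b0, b1 = list(b0), list(b1)
    rng.shuffle(b0)
    rng.shuffle(b1)
    k = int(p * n_test)            # test samples taken from b0
    test = b0[:k] + b1[:n_test - k]
    m = int(p * n_prod)            # production samples taken from b1
    prod = b0[k:k + n_prod - m] + b1[n_test - k:n_test - k + m]
    return test, prod
```

With a small $p$, the test set is dominated by bin $b_1$ while production is dominated by bin $b_0$, producing the intended train/production mismatch.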
4.3 Test Set Labeling and Reporting
After obtaining the biased splits, we carve out a uniformly random subset of the production set as a designated "pool" to be labeled (referred to as the pool set). New candidates from the pool set are drawn in batches, are "labeled," i.e., their labels are revealed, and are added to the test set. The complement of the pool set is fixed and serves as a representative of the operational domain whose accuracy we aim to approximate, as defined in Eq. (1). We continue to refer to this complement as the production set.

As the main experimental outcome, we report the absolute difference between the trained base model's actual accuracy, measured on the production set using all of its labels, and the output of the performance predictor, which uses only the current test set as a labeled source and the production set as an unlabeled source. A contrastive comparison is made to a traditional approach taking into account the test set only (the baseline). All experiments are repeated four times, and averages and their spreads are reported to account for noise due to the random selection involved in data partitioning.
4.3.1 Add-only
To mimic the practical case of continuously updating a test set using small samples of production data at a time (after manual labeling), we start with the initial biased test set and gradually expand it using samples from the pool set. This expansion is performed in 40 iterations (mini-batches), each iteration having a labeling budget of $b$ labels. Hence, with a sufficiently large pool set, the test set and the production distributions will converge. In the first variant of "Add-only," the mini-batches are selected at random. In the second variant, a mini-batch is formed by prioritizing samples with the highest uncertainty as generated by the performance predictor (see Section 3.3). Since we set $40b$ equal to the initial test set size, at the end of the 40 iterations our initial test set, which is biased by construction, still makes up half of all samples.
4.3.2 Add-delete
To allow for complete domain alignment between the test set and the production set at the end of the iterative process, we consider another updating variant that includes sample deletion. In this "Add-delete" variant, each iteration adds a mini-batch from the pool set to the test set and removes a mini-batch of equal size from the original test set. As above, we investigate two strategies for addition and deletion: random selection and prioritized selection. In the case of deletion, samples with the smallest performance prediction uncertainty are removed first (with the intuition that such samples may be over-represented in the test set). Since deletions proceed at the same rate of $b$ samples per iteration, at the end of the 40 iterations all of the samples in the initial test set will have been deleted, and we expect both the baseline system and the performance predictor to converge to the true accuracy of the trained model on the production set.
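One iteration covering both update variants can be sketched as follows, with `uncertainty` mapping a sample to the performance predictor's standard-deviation output; the function and its signature are illustrative:

```python
import random

def update_test_set(test_set, pool, uncertainty, b,
                    delete=True, prioritized=True, seed=0):
    """One iteration of "Add-only" (delete=False) or "Add-delete" (delete=True),
    with random or uncertainty-prioritized selection."""
    rng = random.Random(seed)
    candidates = list(pool)
    if prioritized:
        candidates.sort(key=uncertainty, reverse=True)  # most uncertain first
    else:
        rng.shuffle(candidates)
    added, remaining_pool = candidates[:b], candidates[b:]  # "label" b samples
    if delete:
        if prioritized:
            kept = sorted(test_set, key=uncertainty)  # least uncertain first
        else:
            kept = rng.sample(list(test_set), len(test_set))
        test_set = kept[b:]  # drop b samples from the front of the order
    return list(test_set) + added, remaining_pool
```

Running this for 40 iterations with `delete=True` gradually replaces the initial biased test set with labeled production samples.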
5 Results
Given the four biased dataset splits (as per Table 1), each exercising eight different bias proportions (see Figure 3), we conduct experiments for four labeling strategies: (1) add-only with random selection, (2) add-only with prioritization, (3) add-delete with random selection, and (4) add-delete with prioritization, yielding a total of 128 experiments. The performance predictor, trained using the (labeled and gradually changing) test set, makes instance-wise predictions on all production samples, with subsequent averaging to obtain the accuracy estimate $\hat{A}$ in Eq. (1). In addition to the performance prediction and prioritization, the resampling method described in Section 3.2 is optionally applied. The corresponding quantization codebook is calculated at the first iteration and kept fixed from there on. With resampling active, the test set is resampled at each iteration according to the weights calculated via Eq. (3). Outcomes of the various experiments are presented in charts showing the absolute difference between predicted and actual accuracy (y-axis) as a function of points labeled at a given iteration (x-axis). In these charts, each curve shows the average value of four repetitions of the same experiment (see Section 4.3), and the shaded bands reflect the standard deviation over the same repetitions.
Figure 4: Dataset: Bank Marketing, "# of employees"-based split. (a–d): 10%–90% biased split; (e–h): 30%–70%.
As a representative example, Figure 4 shows charts for the Bank Marketing dataset with test set bias proportions of 30%–70% (moderate split) and 10%–90% (extreme split), and the four labeling strategies (add-only, add-only prioritized, add-delete, add-delete prioritized). Each chart contains four curves corresponding to the proposed algorithms and the baseline, as follows: (1) performance predictor without resampling ("perf-pred"), (2) performance predictor with resampling ("perf-pred (resampled)"), (3) accuracy on the test set ("test set"), and (4) accuracy on a resampled test set ("test set (resampled)"). As can be seen in all charts, the two methods utilizing the performance predictor, namely "perf-pred (resampled)" and "perf-pred", consistently attain the lowest accuracy error as the number of labels grows. Resampling alone ("test set (resampled)") also brings a considerable improvement over the baseline (conventional test accuracy). Furthermore, all four charts show a clear trend of convergence between the actual and predicted accuracy, as expected. Among the four strategies, "Add-delete" and, in particular, "Add-delete with prioritization" dominate. Note that the baseline method in Figures 4(b), 4(d), 4(f), and 4(h) corresponds to a classical accuracy calculation, except that the test set updates follow the uncertainty-based prioritization generated by the performance predictor.


Table 2: Results by method (rows) and update strategy (columns). Cell entries are average AUC / row-rank order.

| Predict Method | add-delete (prioritized) | add-delete (random) | add-only (prioritized) | add-only (random) | Overall |
|---|---|---|---|---|---|
| perf-pred (resampled) | 83.3 / 1.7 | 79.3 / 2.0 | 93.3 / 1.8 | 90.8 / 2.0 | 86.7 / 1.8 |
| perf-pred | 96.1 / 2.2 | 93.8 / 1.9 | 113.4 / 2.1 | 112.6 / 1.9 | 104.0 / 2.0 |
| traditional testing (resampled) | 92.6 / 2.5 | 99.4 / 2.3 | 113.5 / 2.5 | 119.4 / 2.4 | 106.2 / 2.4 |
| traditional testing (baseline) | 173.0 / 3.7 | 167.9 / 3.7 | 208.6 / 3.7 | 219.4 / 3.8 | 192.2 / 3.7 |
| Overall | 111.2 | 110.1 | 132.2 | 135.6 | |



Table 3: Results for the Moderate and Extreme bias groups. Cell entries are average AUC / row-rank order.

| Predict Method | Moderate Subset | Extreme Subset |
|---|---|---|
| perf-pred (resampled) | 50.1 / 1.9 | 123.3 / 1.8 |
| perf-pred | 48.5 / 2.0 | 159.4 / 2.0 |
| traditional testing (resampled) | 62.3 / 2.4 | 150.1 / 2.4 |
| traditional testing (baseline) | 101.1 / 3.7 | 283.2 / 3.8 |

The results in Figure 4 also illustrate the savings in labeling effort: suppose we are given a tolerance band for the accuracy estimate of, say, 5% (horizontal line in the charts), representing the maximum acceptable discrepancy in accuracy. On the Bank Marketing data, we can conclude that to satisfy the 5% tolerance criterion we save approximately 150–650 labeling operations, depending on the specific strategy and bias ratio.

Another way of organizing the information shown in Figure 4 is to express the potential labeling effort saved as a function of the accuracy difference. This trade-off is shown in Figure 5 for the same example as in Figure 4(d). This chart tells a user what proportion of labeling operations each of the three techniques would save (y-axis) had the user's accuracy tolerance threshold been a certain value $a$ (x-axis). The relative saving is calculated with respect to the baseline ("test set") accuracy as $1 - N_m(a)/N_0(a)$, where $N_0(a)$ denotes the number of labels needed for the "test set" baseline to reach an accuracy error of $a$, and $N_m(a)$ is the same quantity for the contrasted technique $m \in$ {perf-pred, perf-pred (resampled), test set (resampled)}. Note that since the perf-pred methods achieve less than 6% prediction error without any labels added, the labeling effort saved is 100% for tolerances of 6% and above.

To allow for more comprehensive conclusions, we summarize each curve (experiment) using the area under the curve (AUC). This metric calculates the area under each curve on the coordinates exemplified in Figure 4, namely the labeling effort (x-axis) and accuracy error (y-axis), and is not to be confused with the AUC of a receiver operating characteristic curve. Since both axes correspond to a cost, small AUC values are desirable.
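The two summary metrics reduce to a few lines; a sketch, where the label counts $N_0(a)$ and $N_m(a)$ are read off the experiment curves:

```python
def labeling_effort_saved(n_baseline, n_technique):
    # Relative saving at a given tolerance a: 1 - N_m(a) / N_0(a).
    return 1.0 - n_technique / n_baseline

def curve_auc(labels_used, accuracy_errors):
    # Trapezoidal area under a (labeling effort, accuracy error) curve.
    # Both axes are costs, so a smaller AUC is better.
    pts = list(zip(labels_used, accuracy_errors))
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

For example, if the baseline needs 200 labels to reach a 5% error tolerance and a technique needs only 50, the saving is 75%.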
Results in terms of the AUC metric are summarized in Table 2. The values in this table are averages over all proportion splits ranging from 90%–10% to 10%–90%, showing the effectiveness of each method (rows) and update strategy (columns). Since the AUC magnitude varies across experiments due to the hardness of the individual tasks, we also compute rank-order statistics for each of the four methods (rows in Table 2) per experiment, i.e., the best (worst) performing method per instance receives rank 1 (4), and report the average rank along with the AUC. Averages over rows and columns are also shown. Statistical significance with respect to the best value in the corresponding column was determined using the Wilcoxon signed-rank test. Several conclusions can be drawn from Table 2: (1) The performance predictor with resampling produces the best AUC averages overall (AUC = 86.7), albeit the next alternative without resampling (AUC = 104.0) is not significantly different. (2) Resampling alone improves significantly over traditional testing, reducing the AUC from 192.2 to 106.2. (3) "Add-delete" with and without prioritization are a statistical tie and are significantly different from their "add-only" variants, indicating that data removal plays an important role in effective test set maintenance.
Figure 6 shows the overall labeling effort saved by the individual methods, similar to Figure 5 above, for all splits of the structured datasets and all splits of the FMNIST dataset. Since each dataset and split gives rise to a different range of achievable accuracies, a binning to three tolerance values, namely up to 3%, 6%, and 9%, was applied. The FMNIST dataset is plotted separately, as its underlying accuracy range differs significantly from the structured datasets. As seen in Figure 6, the savings depend on the choice of tolerance and are high in the case of structured datasets (100% at tolerances of 6% and above), and somewhat lower on FMNIST (80% saving at 6% tolerance). In the FMNIST case, it also appears that resampling plays a more important role. We believe this to be a consequence of the FMNIST splitting criterion (see Table 1) being directly tied to the target classes, thus producing an imbalance more amenable to bias correction.
Another aspect of interest is the degree of imbalance in the proportion splits. To study this aspect, we group the splits into a "Moderate" group (splits 30%–70%, 40%–60%, 60%–40%, 70%–30%) and an "Extreme" group (10%–90%, 20%–80%, 80%–20%, 90%–10%). Table 3 shows the AUC and rank averages over all datasets and update strategies with respect to the two groups; statistical significance was again determined via the Wilcoxon signed-rank test. While these results confirm the relative ranking of the methods observed in Table 2, they also suggest that, in the more extreme cases, the benefit of resampling gains importance, in particular in conjunction with the performance predictor (AUC = 123.3). The predictor without resampling receives a significantly larger AUC of 159.4, a very different result compared to the Moderate case. A second observation relates to the magnitude of the AUC values: the Extreme subset induces AUC values that are multiples of those observed in the Moderate group, reflecting the hardness of the underlying task in more extreme mismatch scenarios. It is reassuring to observe that the relative ranking of the individual methods remains stable across the two conditions.
6 Conclusions
Despite their ubiquitous use, test sets are a poor mechanism for evaluating model quality in real-world applications. Without constant monitoring and updating, test sets can drift from production traffic, making the resulting test results irrelevant and misleading. Our work directly addresses this problem by proposing a set of techniques (resampling, performance prediction, and prioritized addition/removal) which drastically reduce the effort required to achieve a given level of testing quality. On multiple datasets of different modalities, these techniques resulted in a labeling effort reduction ranging between 80% and 100%, depending on the dataset, for an error tolerance of 5%.
These results were achieved using a simple confidence-binning-based performance predictor. Even without performance prediction, simply resampling the test set to match production data and employing an "add and remove" strategy showed significant improvements over a traditional test set approach. We believe this result is significant because it offers strong evidence for practitioners to move away from the time-honored tradition of test sets, instead embracing performance predictors to estimate their model quality. It also encourages researchers to innovate further in this space, identifying new techniques for measuring model quality more accurately and with lower cost.