Kernel Based Detection of Mislabeled Training Examples

Hamed Valizadegan*        Pang-Ning Tan†

* Michigan State University. Email: [email protected]
† Michigan State University. Email: [email protected]

Abstract

The problem of identifying and eliminating mislabeled training examples has been widely studied, particularly in the pattern recognition and machine learning literature [1, 2, 3, 4, 5]. While some of these algorithms attempt to edit the mislabeled examples during model building, others apply noise filtering as a preprocessing step prior to executing their model building algorithms. In this study, we formulate mislabeled detection as an optimization problem and introduce a kernel-based approach for filtering the mislabeled examples. Experimental results using a variety of data sets from the UCI data repository demonstrate the effectiveness of our proposed method compared to existing nearest-neighbor and ensemble-based filtering schemes.

1 Introduction

The presence of noise, in the form of mislabeled training examples, often has an adverse effect on the performance of classifiers. As the amount of mislabeled examples increases, the classifiers become more susceptible to the model overfitting problem. This has led to considerable interest in developing techniques for identifying and eliminating mislabeled examples from training data.

According to Brodley and Friedl [1], mislabeling occurs for a variety of reasons, including the subjective nature of the labeling task, mistakes made during data entry, and the lack of information needed to determine the true label of a given example. Subjective mislabeling may happen, for instance, when experts are given the task of rating a particular observation according to their personal judgement. The assessments provided by some experts may disagree with the general consensus, which leads to mislabeling errors. Another potential cause of mislabeling is data entry mistakes, which occur when transferring information from paper to computerized forms with illegible or unclear handwriting. Finally, mislabeling errors may also arise when there is an insufficient amount of information to determine the true class label of a given example. For example, in the medical domain, a physician may not be able to make the right diagnosis unless certain expensive medical procedures have been performed on a patient.

The problem of identifying mislabeled training examples has been examined in several studies, with a variety of approaches developed for editing the training data to obtain better classifiers. Many of these approaches involve applying an individual classifier or an ensemble of classifiers to the training set and filtering the mislabeled examples based on their consistency with respect to the classifiers' outputs. The latter group of algorithms, which is the focus of this study, differs in terms of the strategy used to determine whether a training example is mislabeled.

Some algorithms employ local learning methods such as k-nearest neighbor (kNN) classification to measure the amount of inconsistency between the label of a training example and the labels of its surrounding neighbors. If there is strong evidence of discrepancy among the labels, the training example is tagged as mislabeled. One problem with this approach is that it does not propagate the mislabeling information to the detection of other training examples. Each training example is examined independently, without considering the decisions made for other examples. Another group of algorithms applies an ensemble of classifiers to the training examples and detects whether the class label assigned to each example is consistent with the output of these classifiers. The main problem with this approach is that the classifiers used to detect mislabeled examples are constructed from a training set that itself contains mislabeled examples. As a result, the induced model may be biased towards the mislabeled examples. The problem becomes even more pronounced when the level of noise in the training data is high.

The use of ensemble learning approaches for mislabeled detection has become increasingly popular in recent years [1, 5, 3]. The performance of such approaches relies strongly on two fundamental assumptions: (1) the errors committed by the base models are independent of each other, and (2) the error rates of the base models are less than 50%. Guaranteeing that both of these assumptions hold is not a trivial task. This is because there are two types of errors often associated with mislabeled detection. A Type 1 error corresponds to declaring a correctly labeled example as mislabeled, while a Type 2 error corresponds to declaring a mislabeled example as correctly labeled. As will be shown in our experiments, the Type 2 errors of the base models may exceed 50% for many data sets, especially when the noise level is sufficiently high.

Kernel methods are currently at the heart of many machine learning algorithms. These methods are based on the idea of projecting the data into a higher dimensional embedding, known as the Hilbert space, and searching for linear relations in the transformed space. The projection is performed implicitly and is specified by a kernel function that facilitates the efficient computation of inner products between any two given examples. These methods have shown great promise in solving a variety of clustering, classification, and dimensionality reduction problems.

In this paper, we present a novel kernel-based approach for detecting mislabeled training examples. Unlike other approaches, we directly formulate mislabeled detection as a constrained optimization problem and introduce a kernel-based strategy to find mislabeled examples. Experimental results indicate that our proposed method produces a higher detection rate and a lower false alarm rate than existing ensemble and nearest-neighbor methods when applied to many real-world data sets.

The remainder of the paper is organized as follows. Section 2 presents some preliminary background along with related work. Section 3 describes our proposed kernel-based mislabeled detection approach. Experimental results are reported in Section 4. Finally, we conclude with a summary of our contributions and directions for future research.

2 Background

Let D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} be a collection of n training examples, where each x_i is an instance of the input space R^d and y_i ∈ {-1, +1} denotes its corresponding class label. The objective of mislabeled detection is to identify the subset of training examples in D whose labels are incorrect. In the remainder of this section, we briefly review the related work in this area.

2.1 Related Work

Recent years have witnessed growing interest in developing effective methods for detecting mislabeled examples in the training data. Similar to our proposed approach, most of the methods are designed to deal with random noise rather than systematic mislabeling errors. The existing methods differ in terms of the strategies used to determine the true classification of the training examples.

Local learning methods [2, 6] assume that the class labels of mislabeled examples tend to disagree with the class labels of other examples in their surrounding neighborhood. Sanchez et al. [2] used several variations of the kNN approach, including depuration and nearest centroid neighborhood (NCN), to detect mislabeled examples. Depuration, which was initially introduced by Marques et al. in [6], is an iterative process that modifies examples whose class labels disagree with the class labels of most of their nearest neighbors and removes other suspicious examples whose class labels cannot be verified with high certainty. In NCN [7], the nearest neighbors are chosen based not only on their proximity, but also on whether they are symmetrically placed around the given example. One problem with local learning methods is that they do not propagate the mislabeling information to other examples in the data set. In other words, each training example is examined separately, without considering its effect on other examples.

Ensemble-based methods [8, 5, 3] assume that mislabeled examples often produce conflicting class labels when multiple, independent classifiers are applied. Algorithms that belong to this category vary in terms of how the different classifiers are constructed. Brodley and Friedl [1] used leave-one-out cross validation to generate an ensemble of classifiers. They also proposed two approaches to detect mislabeled examples: consensus filtering and majority voting. A consensus filter tags an example as mislabeled only if it is misclassified by all the classifiers in the ensemble. A less conservative method is to consider an example mislabeled if its label disagrees with the majority vote of the classifiers. Verbaeten and Assche [9] considered ensemble methods for mislabeled detection based on boosting and bagging. John [4] proposed an iterative method that removes examples by pruning a decision tree and then rebuilds a new tree from the reduced data set. Zhu et al. [5] introduced a method based on partitioning a large data set into smaller subsets and building a corresponding classifier for each subset. Jiang et al. [3] considered an ensemble of neural networks for mislabeled detection, while Venkataraman et al. [10] employed an ensemble approach using support vector machines trained on different feature subsets of the data. One problem with these ensemble-based methods is that the underlying models are built from a training set that contains mislabeled examples. Another problem is the requirement that each base classifier must be independent and have Type 1 and Type 2 error rates of less than 50%, which may not hold when the level of noise is sufficiently high.

In the past decade, kernel-based learning has found wide application in a variety of data mining and machine learning problems, including classification, clustering, and dimensionality reduction [11, 12, 13, 14, 15]. Through the use of a positive semi-definite kernel function, kernel-based learning exploits the inherent structure of the data and thus produces highly effective results. Nevertheless, to date, we are not aware of any work that uses kernel-based learning for mislabeled detection.

One challenging aspect of kernel-based learning is the difficulty of choosing an appropriate kernel function as well as its corresponding kernel parameter [16, 17, 14, 18]. Although many algorithms require the user to set the parameters manually, there have been several recent studies devoted to the automatic selection of kernel parameters.
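Before turning to kernel parameter selection, the consensus and majority-vote filtering rules described above can be made concrete with a short sketch. This is an illustrative stand-in, not the paper's implementation: `predictions` stands for the labels an ensemble assigns to a single training example, and `label` is that example's given (possibly noisy) label.

```python
def consensus_filter(label, predictions):
    """Tag as mislabeled only if *every* ensemble member disagrees with the label."""
    return all(p != label for p in predictions)

def majority_filter(label, predictions):
    """Tag as mislabeled if more than half of the ensemble members disagree."""
    disagree = sum(1 for p in predictions if p != label)
    return disagree > len(predictions) / 2

# Example: a +1-labeled example scored by an ensemble of five classifiers.
print(consensus_filter(+1, [-1, -1, -1, -1, -1]))  # True: all five disagree
print(majority_filter(+1, [-1, -1, -1, +1, +1]))   # True: 3 of 5 disagree
print(consensus_filter(+1, [-1, -1, -1, +1, +1]))  # False: not unanimous
```

The consensus rule is the more conservative of the two: it lowers the false alarm rate at the cost of missing many mislabeled examples, which is exactly the trade-off examined in Section 4.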
Shi and Ma- lik [17] recommended using a kernel width of 10% to 20% This quantity also ranges between [-1,1] and can be of the maximum distance between examples. Zelnik-Manor interpreted as follows: and Perona [16] proposed a self-tuning approach to compute the local kernel width of a data point xi based on its distance 1. When δi is close to -1, the class labels of the nearest to the kth nearest neighbor. Another promising strategy is to neighbors strongly disagree with yi . learn the kernel matrix directly from data. Examples of such 2. When δi is close to +1, the class labels of the nearest methods include kernel alignment [19, 20], semi-definitive neighbors strongly agree with yi . programming [21], spectral graph partitioning [14] and un- supervised learning of kernel matrix[15]. 3. When δi is close to zero, the nearest neighbors have equal representations from both classes. 3 Methodology Therefore, δi can be used as a measure for detecting misla- We begin this section with an introduction to the weighted beled examples. The smaller the value of δi , the more likely k-nearest neighbor classifier, which is the building block of xi is mislabeled. our proposed kernel-based mislabeled detection algorithm. We also show how WkNN can be extended for handling the 3.2 Kernel-based Detection of Mislabeled Samples mislabeled detection problem. Finally, in Section 3.2 we (KBDMS) discusses the limitations of using WkNN for mislabeled de- tection, leading to the development of our proposed kernel- The weighted kNN approach described in the previous sec- based learning algorithm called KBDMS. tion has several inherent limitations. First, according to Equation (3.1), it predicts the class of a given example based 3.1 Weighted-kNN approach (WkNN) on the assumption that other examples in its local neighbor- hood are correctly labeled. 
Second, as can be seen from A k nearest neighbor (kNN) classifier predicts the class label Equation (3.3), the decision whether a given example is mis- of an example based on the majority vote of class labels labeled has no impact on the analysis of other examples1 . In among its k nearest neighbors. One potential drawback with other words, mislabeled detection is performed locally with- this approach is that finding the appropriate value for k is out propagating the decision to other examples. This makes not a trivial task. Many algorithms employ the leave-one-out the detection process simple but not quite as effective. This is cross-validation approach to determine the right value for k, because the incorrect label of a previously known mislabeled but such an approach is not only time-consuming, they do example is still being used to detect mislabeling of other ex- not guarantee that the optimal k is chosen for the particular amples in its neighborhood. A more effective strategy is to data set. propagate information about the correctness of a class label Instead of using simple majority voting, the weighted to all the neighboring examples, but this is not a trivial task. kNN approach takes into account the similarity between One way to propagate the decision is to replace each yj examples. More specifically, the class label for the training on the right-hand side of Equations (3.1) and (3.3) with its example xi is determined as follows: corresponding predicted class, yˆj . However, since this af- P fects the prediction of other examples, the process must be j∈N (xi ) yj wij repeated for all the examples. Instead of iteratively com- (3.1) yˆi = P j∈N (xi ) wij puting yˆi in each round, we formulate mislabeled detection directly as a quadratic optimization problem. where W = [wij ] corresponds to the similarity matrix be- Let pi be the probability that a training example xi is tween all the training examples and N (xi ) is the neighbor- mislabeled. 
Therefore, the expected value for yi is: hood formed by the k closest training examples of xi . The predicted class yˆi ranges between -1 and +1. Note that, de- (3.3) E[yi ] = −pi yi + (1 − pi )yi = (1 − 2pi )yi , pending on the choice of similarity function, it may also be possible to extend the summation over all the training exam- which −yi is the actual class if xi is mislabeled. ples, instead of restricting it only to the k nearest neighbors. We may derive a similar quantity as Equation (3.3) To measure the extent of which the predicted class of which includes the probability that the training examples are xi departs from its actual class yi , the following margin-like mislabeled. However, we need to consider the following four quantity [22] can be defined: situations: P 1. Both xi and its neighbor xj are correctly labeled. j∈N (xi ) yj wij (3.2) δi = y i P j∈N (xi ) wij 1 Otherwise, the right hand side of Equation (3.3) should involve the δj s = yi yˆi or yˆj s of its neighbors. 2. xi is wrongly labeled but xj is correctly labeled. to the following transformation: (A + A′ )/2. The resulting symmetric matrix is positive semi-definite, which leads to a 3. xi is correctly labeled but not xj . convex objective function. Nevertheless, because this is a 4. Neither xi nor xj are correctly labeled. maximization problem, the solution to (3.6) lies along the boundary, i.e., ∀i : pi ∈ {0, 1}. Such a formulation yields This can be accomplished by replacing each yi with its a hard classification of the training examples, i.e., each xi is corresponding expected value. The quantity δi becomes: classified as either mislabeled or correctly labeled. P It is also worth noting that the objective function given j∈N (xi ) (1 − 2pj )yj wij in (3.6) is maximized by declaring all the examples from one (3.4) δi = (1 − 2pi )yi P j∈N (xi ) wij of the two classes as mislabeled. 
This is because, if all the training examples are from the same class, there is perfect The above formulation can be generalized by extending the consistency between the class label of an example and the neighborhood N (xi ) to include all the training examples. class labels of its surrounding neighborhood. Clearly this Note that, by replacing each yi with its expected value, is not our desired solution. To overcome this problem, a the quantity δi becomes a measure of consistency between regularization penalty term is introduced to avoid declaring the expected value of yi and the expected value of the too many examples as being mislabeled. Our objective predicted class, E(ˆ yi ). Our objective now is to learn the function is now modified as follows: probability vector, P = (p1 , p2 , ..., pn )′ such that δi is maximized for all n training examples. In other words, we (3.7) max (1 − 2P )′ A(1 − 2P ) − C||P ||1 , P ∈[0,1]n would like to solve the following optimization problem: X where C is a parameter that controls the tradeoff between (3.5) max n δi maximizing the consistency of the class labels with their P ∈[0,1] i P corresponding neighborhood examples (i.e., the first term) j (1 − 2pj )yj wij and minimizing the number of mislabeled examples (i.e., the X = max (1 − 2pi )yi P P ∈[0,1]n i j wij second term). X X (1 − 2pi )yi wij yj (1 − 2pj ) Rearranging the terms in (3.7), we obtain: = max n P P ∈[0,1] i j j wij (3.8) max n 4P ′ AP − (2eA + 2eA′ + Ce)P P ∈[0,1] To compute the solution for the preceding optimization problem, it is useful to rewrite the objective function into the where e is a row vector with all its elements being one. No- following matrix notation: tice that the objective function in (3.8) is in quadratic form, which can be solved efficiently using the Newton method de- max (1 − 2P )′ diag(Y )W D−1 diag(Y )(1 − 2P ) scribed in [23]. 
Specifically, because the constraints in this P ∈[0,1]n formulation involve only the lower bound and upper bound for P (it does not include other equality or inequality con- where W corresponds to an n × n similarity (kernel) matrix, straints), it can be solved very efficiently using the algorithm Y is an n × n diagonal matrix whose diagonal element presented in [23] for large-scale problems. Yii = yi , and D is an n × n diagonal matrix with diagonal In addition, because the formulation stated in (3.8) elements Xn depends only on the similarity between training examples, it Dii = wij , i = 1, 2, . . . , n. is essentially a kernel-based method, allowing the problem to j=1 be solved in a high dimensional space known as the Hilbert To simplify the expression, let space. A = diag(Y )W D−1 diag(Y ). 3.3 Model Parameters Therefore the objective function to be maximized can be Our kernel-based mislabeled detection method requires the written as follows: user to specify two parameters, the constant C and the ker- nel parameter used to compute the similarity matrix W . For (3.6) max n (1 − 2P )′ A(1 − 2P ) example, if we use an RBF kernel function, the latter param- P ∈[0,1] eter corresponds to the kernel width, σ. Note that both of It is important to realize that the matrix A in (3.6) is these parameters (C and kernel parameter) are also present in not symmetric. However, following the approach in [22], other kernel-based learning formulations, including support we may approximate it with a symmetric matrix according vector machines (SVM). In this section, we describe the roles of these parameters in our formulation and present some of As a result, there are two versions of the kernel-based the possible methodologies for estimating them. learning algorithm investigated in this study: KBDMS1, As previously noted, the parameter C controls the num- which uses the fixed kernel width approach described above, ber of mislabeled examples in KBDMS. 
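To make the formulation concrete, the following sketch solves the regularized objective of Eq. (3.7) with a simple projected gradient ascent over P ∈ [0,1]^n. This is an illustrative stand-in: the paper uses the box-constrained Newton solver of [23], and the function name `kbdms`, the RBF width, the learning rate, and the iteration count here are our own assumptions.

```python
import numpy as np

def kbdms(X, y, C=1.0, sigma=1.0, lr=0.01, iters=500):
    """Sketch of KBDMS: maximize (1-2P)'A(1-2P) - C*||P||_1 over P in [0,1]^n."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    W = np.exp(-d2 / sigma**2)                           # RBF similarity (kernel) matrix
    A = np.diag(y) @ (W / W.sum(axis=1)[:, None]) @ np.diag(y)  # diag(Y) D^{-1} W diag(Y)
    A = 0.5 * (A + A.T)                                  # symmetrize, as in the text
    p = np.full(len(y), 0.5)
    for _ in range(iters):
        grad = -4.0 * A @ (1.0 - 2.0 * p) - C            # gradient of the objective
        p = np.clip(p + lr * grad, 0.0, 1.0)             # project back onto [0,1]^n
    return p                                             # p_i ~ P(x_i is mislabeled)

# Toy check: two well-separated clusters with a single flipped label.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
y = np.array([1] * 10 + [-1] * 10)
y[0] = -1                                                # inject one mislabeled example
p = kbdms(X, y)
print(p.argmax())                                        # index of the flipped example
```

Note how the L1 penalty (controlled by C) keeps the degenerate "declare one whole class mislabeled" solution from winning: flipping an entire isolated cluster does not improve the consistency term but does pay the penalty.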
As previously noted, the parameter C controls the number of examples that KBDMS declares mislabeled. Setting C = 0 would result in declaring all the training examples from one of the two classes as mislabeled. This is not the desired solution because it assumes all the training examples should belong to the same class. On the other hand, choosing a very large value for C yields a solution in which all the training examples are assumed to be correctly labeled, i.e., ∀i: p_i = 0.

In support vector machines, the parameter C is typically chosen using model selection methods such as ten-fold cross-validation on the training data. Unfortunately, such a strategy is not directly applicable to mislabeled detection unless some of the mislabeled examples are known. However, it may be easier to identify a sample of correctly labeled examples and then artificially inject noise into that sample. We can then apply KBDMS to the data using different values of C, and select the value of C that gives the highest detection rate on the artificially injected mislabeled examples. For our experiments, we apply the same parameter value, C = 1, on all the data sets. As will be shown in Section 4, we still obtain very accurate results even without tuning the parameter C.

As mentioned in Section 2.1, there are many ways to choose the width parameter σ of an RBF kernel function. We apply the following two strategies in this paper:

Fixed Width: In this approach the kernel width is fixed at 10% of the variance of the distances between examples.

Variable Width: Unlike the previous strategy, this approach employs a local kernel width associated with each example x_i. As a result, the kernel width depends on the pair of examples whose similarity value is being computed. To understand the rationale behind this approach, let σ_i be the scaling parameter corresponding to example x_i. The scaled distance from x_i to x_j as measured by x_i is d(x_i, x_j)/σ_i, while the same scaled distance as measured by x_j is d(x_i, x_j)/σ_j. Therefore, the squared distance between x_i and x_j can be written as d(x_i, x_j)/σ_i × d(x_i, x_j)/σ_j = d²(x_i, x_j)/(σ_i σ_j). Zelnik-Manor and Perona [16] proposed a method for selecting the local scale σ_i based on the local statistics in the neighborhood of x_i. In our experiments, we use σ_i = d(x_i, x_K), the distance between x_i and its kth nearest neighbor (with k = 7 in all our experiments). With this approach, the similarity between x_i and x_j is given by:

w_{ij} = \exp\left[-\, d^2(x_i, x_j) / (\sigma_i \sigma_j)\right].

As a result, there are two versions of the kernel-based learning algorithm investigated in this study: KBDMS1, which uses the fixed kernel width approach described above, and KBDMS2, which uses the variable width approach.

4 Experimental Evaluation

This section presents our experimental setup and the results obtained when applying KBDMS to a variety of data sets. The objectives of our experiments are: (1) to compare the performance of KBDMS against other existing algorithms (Section 4.2), (2) to illustrate the effect of applying different noise levels (Section 4.3), (3) to illustrate the improvement in classification accuracy when the mislabeled examples are removed (Section 4.4), and (4) to illustrate the effect of Type 1 and Type 2 errors in ensemble methods on mislabeled detection (Section 4.5).

4.1 Experimental Setup

To evaluate the performance of our method, we used a variety of data sets from the UCI data repository [24]. Table 1 shows a summary of the data sets. We artificially inject noise into each data set by modifying the class labels of some randomly chosen training examples. Similar to previous work, we do not consider mislabeling due to systematic errors; that type of mislabeling is beyond the scope of this paper. Each of our experiments is repeated ten times and the average performance is reported.

Table 1: Description of the data sets.

              # of instances   # of attributes
Ionosphere    351              34
Vote          435              16
Flare         1066             10
Digits08      352              64
Digits38      357              64
Digits56      363              64
Digits17      361              64
Digits27      356              64
Covertype     2526             10

4.1.1 Algorithms

The following mislabeled detection algorithms are chosen for our experiments:

WkNN: This corresponds to the weighted kNN algorithm described in Section 3.1. We used the RBF kernel function as the underlying similarity measure and fixed the kernel width to 10% of the variance of the distances between samples. After applying the algorithm, WkNN tags a training example x_i as mislabeled if its corresponding margin δ_i (see Equation (3.2)) is negative.

KBDMS1: This corresponds to KBDMS with the fixed kernel width method described in Section 3.2, with C = 1 for all our experiments. (For the Digits38 data, the default width parameter in KBDMS1 and WkNN produces a kernel matrix with most of its elements equal to zero, so we used 20 percent of the variance of the distances between samples as the width parameter.)

KBDMS2: This corresponds to KBDMS with the variable width method described in Section 3.2. Again, C = 1 for all our experiments.

CCTree: CCTree stands for Consensus Cross-Validation Tree Filter [1]. It partitions the training examples into 10 folds and uses 9 of the folds to build a decision tree. The tree building step is then repeated 9 more times, each time leaving out a different partition. At the end of the process, an ensemble of 10 tree classifiers is obtained. The ensemble is then used to classify each of the training examples.
A given example is declared as mislabeled if its class label disagrees with the predictions made by all the classifiers in the ensemble.

MCTree: MCTree stands for Majority Cross-Validation Tree Filter [1]. It generates an ensemble of classifiers using the same approach as CCTree, except that it uses a majority vote to decide whether a given example is mislabeled. More specifically, if the class label of the example disagrees with more than half of the predictions made by the classifiers, the example is declared as mislabeled.

CBTree: CBTree stands for Consensus Bagging Tree Filter [9]. This approach is very similar to CCTree, except that it uses bootstrapping to create the ensemble of classifiers.

MBTree: MBTree stands for Majority Bagging Tree Filter [9]. This approach is very similar to MCTree, except that it uses bootstrapping to create the ensemble of classifiers.

4.1.2 Evaluation Metrics

Let D be a collection of training examples. Suppose M ⊆ D corresponds to the set of training examples that were wrongly labeled. Furthermore, let M' ⊆ D be the set of examples tagged by the algorithm as mislabeled. We evaluate the performance of a mislabeled detection algorithm in terms of its True Positive Rate (TPR) and False Positive Rate (FPR), which are defined as follows:

TPR = |M ∩ M'| / |M|
FPR = |M' − (M' ∩ M)| / |D − M|

Another possibility is to summarize TPR and FPR in a single metric using the well-known F1-measure, which is widely used in information retrieval.

4.2 Comparison among Mislabeled Detection Algorithms

Tables 2 and 3 show the performance comparison (in terms of TPR and FPR) between KBDMS1 and KBDMS2 and the other baseline methods when 10% and 30% of the training examples are mislabeled, respectively. Note that KBDMS2 outperforms the other competing methods on all the data sets. The performance improvement of KBDMS2 (in terms of both TPR and FPR) is observed on every data set except the Vote data, on which KBDMS2 has a high TPR but also a relatively high FPR. Comparing the results of the two tables, the improvement achieved by KBDMS2 becomes even more significant as the number of mislabeled samples increases.

In contrast, the performance of KBDMS1 is not as good as that of KBDMS2, which shows that using a local kernel width is more effective than using a global kernel width. This analysis shows that, like other kernel-based learning methods, the performance of the method is quite sensitive to the choice of kernel function and its corresponding parameters.

For most of the data sets, the TPR values for the ensemble methods are very poor. There are two interesting points worth noting regarding these methods. First, the lower TPR values of both CCTree and CBTree suggest that consensus filters have worse detection rates than majority vote filters due to their conservative decision making. Second, in many cases, the TPR of MBTree and MCTree is less than 50%, especially when the level of noise is high. This could explain the poor performance observed in the ensemble methods, as their Type 2 error (which is equivalent to 1 − TPR) is more than 50%. We will revisit this issue in Section 4.5.

4.3 Robustness to Noise Level

Our next experiment investigates the effect of the noise level on the performance of the various algorithms. To make the plots easier to visualize, we only show the results for KBDMS2, MBTree, and MCTree (knowing that KBDMS2 is better than KBDMS1 and WkNN, and that majority vote filters are better than consensus filters). We use the F1-measure to compare the performance of the algorithms.

Due to space considerations, we only show the results for four data sets; the results on the other data sets look very similar. For each data set, we vary the fraction of mislabeled examples from 5% up to 50%. Each experiment is repeated 10 times and we report the average values of the corresponding F1-measure. From Figure 1, it can be clearly seen that KBDMS2 is generally superior to the other two ensemble methods, except on the Vote data at noise levels below 20%. KBDMS2 also attains 100% TPR on the digits data sets when the level of noise is less than 35%. However, similar to the other approaches, its performance starts to degrade when the noise level increases.
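The TPR, FPR, and F1 quantities defined in Section 4.1.2 can be computed directly from the three sets; the sketch below is our own minimal illustration (the helper name `detection_metrics` and the toy sets are assumptions, not from the paper).

```python
def detection_metrics(D, M, M_hat):
    """TPR, FPR, and F1 for mislabeled detection.

    D: all training example indices; M: truly mislabeled; M_hat: flagged."""
    tpr = len(M & M_hat) / len(M)                 # detection rate (recall)
    fpr = len(M_hat - M) / len(D - M)             # false alarm rate
    precision = len(M & M_hat) / len(M_hat) if M_hat else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
    return tpr, fpr, f1

D = set(range(10))
M = {0, 1, 2}        # truly mislabeled examples
M_hat = {0, 1, 5}    # examples flagged by a detector
tpr, fpr, f1 = detection_metrics(D, M, M_hat)
print(tpr, round(fpr, 3), round(f1, 3))
```

Note that a Type 2 error rate of 1 − TPR, as used in Section 4.5, falls out of the same computation.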
Table 2: Performance comparison among the various algorithms (10% of examples randomly mislabeled). Each cell reports TPR, FPR.

             KBDMS1       KBDMS2         WkNN        CBTree       MBTree      CCTree         MCTree
Ionosphere   .74, .08     .88, .03       .77, .19    .006, 0      .45, .02    .02, .001      .24, .01
Vote         .78, .05     .83, .08       .87, .10    .15, 0       .77, .03    .38, .004      .76, .03
Flare        .70, .13     .83, .17       .80, .20    .56, .05     .72, .17    .59, .08       .70, .15
Digits08     .78, .02     .99, 0         .90, .10    0, 0         .44, .006   .02, 0         .37, .01
Digits38     .83, .02     .88, .005      .91, .10    .003, 0      .43, .005   .02, 0         .32, .007
Digits56     .80, .03     .99, 0         .90, .09    0, 0         .54, .007   .008, .0003    .35, .01
Digits17     .76, .03     1.0, 0         .89, .10    .003, 0      .53, .008   .04, 0         .24, .01
Digits27     .77, .02     .97, .002      .91, .10    .003, 0      .50, .01    .014, 0        .35, .007
Covertype    .986, .005   .9998, .0001   .998, .002  .02, 0       .35, .001   .007, 0        .26, .001

Table 3: Performance comparison among the various algorithms (30% of examples randomly mislabeled). Each cell reports TPR, FPR.

             KBDMS1       KBDMS2       WkNN        CBTree       MBTree      CCTree          MCTree
Ionosphere   .62, .11     .83, .05     .68, .35    .002, 0      .18, .03    .007, 0         .09, .007
Vote         .52, .10     .80, .10     .71, .25    .004, .001   .45, .06    .12, .004       .42, .06
Flare        .76, .45     .83, .17     .67, .32    .17, .02     .63, .20    .46, .06        .65, .16
Digits08     .56, .08     .97, 0       .67, .30    0, 0         .20, .013   0, 0            .05, .007
Digits38     .56, .07     .84, .005    .67, .28    0, 0         .17, .02    0, 0            .31, .05
Digits56     .55, .09     .98, 0       .70, .29    0, 0         .19, .01    0, 0            .07, .007
Digits17     .57, .08     1.0, 0       .69, .29    0, 0         .25, .015   0, 0            .08, .006
Digits27     .56, .08     .98, .004    .68, .27    0, 0         .18, .02    .006, .0003     .094, .01
Covertype    .863, .025   .96, .007    .847, .11   .0003, 0     .25, .02    .0004, .0001    .11, .01

It is also interesting to observe that MBTree usually performs better than MCTree, suggesting that bootstrapping might be more effective than the 10-fold data partitioning approach for creating ensemble models for mislabeled detection.

4.4 The Effect of Removing Mislabeled Examples

So far, we have shown that KBDMS2 has a high detection rate and a low false alarm rate even when the level of noise is high. The question is, how do the improvements in detection rate and false alarm rate translate into better classifiers? To investigate this issue, we split the original data into two parts: a training set and a test set. The training set contains 80% of the overall examples while the test set contains the remaining 20%. Mislabeling errors are then randomly introduced into the training set (i.e., the test set remains unchanged).

To illustrate the effect of removing mislabeled examples, we consider the following classifiers:

No Mislabeled: This corresponds to the baseline approach in which the training set remains unperturbed. We expect such an approach to produce classifiers with the highest accuracy.

KBDMS2: In this approach, we discard any examples that are found to be mislabeled by KBDMS2. We then train a new classifier using only the remaining training examples.

MBTree: This is analogous to the previous KBDMS2 approach, except that we use MBTree for mislabeled detection. Again, we choose MBTree because it performs better than MCTree.

No Filter: This is another baseline approach, in which we train a classifier on the training set without removing the mislabeled examples beforehand.

For this experiment, we again vary the fraction of mislabeled examples from 5% to 50% to illustrate the effect of the noise level. We repeated the experiment 10 times for each noise level and report the average accuracy over the 10 runs. Figure 2 shows the results of the experiments. It is clear that editing the training set by removing mislabeled examples is effective for both KBDMS2 and MBTree, but KBDMS2 produces better classifiers (even in the case of the Vote data set) due to its better performance at detecting mislabeled examples.
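The label-noise injection protocol used throughout these experiments (flip the labels of a randomly chosen fraction of the binary-labeled training examples, leaving the test set untouched) can be sketched as follows; the function name and seed handling are our own assumptions.

```python
import random

def inject_label_noise(y, noise_level, seed=0):
    """Flip the labels (in {-1, +1}) of a random fraction of examples.

    Returns the noisy labels and the set of flipped indices (ground truth M)."""
    rng = random.Random(seed)
    y_noisy = list(y)
    flipped = set(rng.sample(range(len(y)), k=int(noise_level * len(y))))
    for i in flipped:
        y_noisy[i] = -y_noisy[i]
    return y_noisy, flipped

y = [1] * 50 + [-1] * 50
y_noisy, flipped = inject_label_noise(y, noise_level=0.30)
print(len(flipped))  # 30 examples have their labels flipped
```

Returning the flipped index set as ground truth is what makes the TPR/FPR evaluation of Section 4.1.2 possible on real data sets whose true noise is unknown.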
Figure 1: F-measure comparison of different algorithms for data sets with different noise levels. [Plots omitted: F-measure vs. noise level (0.05 to 0.5) for KBDMS2, MBTree, and MCTree on (a) Digits08, (b) Digits17, (c) Vote, (d) Ionosphere.]

4.5 Analysis of Ensemble Filters

In our previous experiments, we were quite surprised to observe the poor performance of the ensemble methods, especially in terms of their TPR. We conjectured that this is because the Type II errors of the classifiers might be greater than 50%. Assuming a null hypothesis in which an example is correctly labeled, a Type I error corresponds to declaring a correctly labeled example as mislabeled (i.e., rejecting the null hypothesis even though it is true). Meanwhile, a Type II error corresponds to declaring a mislabeled example as correctly labeled (i.e., accepting the null hypothesis even though it is false).

To verify our conjecture, we created 10 different training sets using the bootstrap approach. We then built 10 different classifiers and computed their Type I and Type II error rates (in terms of detecting mislabeled examples). To capture their variability, we use a boxplot to show the distribution of errors committed on the different training sets. Each box in the boxplot summarizes the following descriptive statistics: the median, the upper and lower quartiles, and the minimum and maximum values. We show the boxplots for both Type I and Type II errors on four different data sets in Figure 3. For each data set, we again vary the level of noise from 5% to 50%. Notice that the Type II error exceeds 50% most of the time, especially when the level of noise is high, even though the Type I error is almost always less than 50%.

This observation is consistent with the results obtained in the previous sections. While the ensemble-based filters have moderately low false alarm rates, their TPRs are very poor. This explains the poor performance of the ensemble methods on the given data sets.

Figure 2: Accuracy comparison of different algorithms for data sets with different noise levels. [Plots omitted: test accuracy vs. noise level (0.05 to 0.5) for No Mislabeled, KBDMS2, MBTree, and No Filter on (a) Digits08, (b) Digits17, (c) Vote, (d) Ionosphere.]

5 Conclusion

Mislabeling of training examples is a pervasive problem underlying many real-world data sets. This paper presents a novel kernel-based learning algorithm for detecting mislabeled examples. Unlike other existing methods, which have the drawback of building models from incorrectly labeled data or of not propagating their results to other examples, we address these problems by formalizing mislabeled detection as an optimization problem and developing an iterative approach for solving it. We show that our proposed algorithm can effectively detect mislabeled examples and is very robust even when the level of noise is very high. We also demonstrated that the poor performance of the ensemble methods is due to their large Type II errors, which violate the assumption made by most ensemble methods (that the errors should be less than 50%).

For future work, we plan to extend our methodology to multi-class problems. One possibility is to use the one-versus-one or one-versus-all methods often associated with multi-class learning. Another possible research direction is to incorporate the multi-class information directly into the objective function. Finally, we would like to expand the analysis to mislabeling resulting from systematic errors.

6 Acknowledgement

We thank the anonymous reviewers for their valuable comments about the paper.

References

[1] C. E. Brodley and M. A. Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131-167, 1999.
[2] J. S. Sánchez, R. Barandela, A. I. Marqués, R. Alejo, and J. Badenas. Analysis of new techniques to obtain quality training sets. Pattern Recognition Letters, 24:1015-1022, 2003.
[3] Y. Jiang and Z.-H. Zhou. Editing training data for kNN classifiers with neural network ensemble. In Lecture Notes in Computer Science 3173, pages 356-361, 2004.
[4] G. H. John. Robust decision trees: Removing outliers from databases. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 174-179, 1995.
[5] X. Zhu, X. Wu, and Q. Chen. Eliminating class noise in large datasets. In Proceedings of the International Conference on Machine Learning (ICML 2003), 2003.
[6] R. Barandela and E. Gasca. Decontamination of training samples for supervised pattern recognition methods. In Advances in Pattern Recognition, Lecture Notes in Computer Science 1876, pages 621-630, 2000.
[7] B. B. Chaudhuri. A new definition of neighborhood of a point in multi-dimensional space. Pattern Recognition Letters, 17:11-17, 1996.
[8] C. E. Brodley and M. A. Friedl. Improving automated land cover mapping by identifying and eliminating mislabeled observations from training data. In Geoscience and Remote Sensing Symposium, pages 1379-1381, 1996.
[9] S. Verbaeten and A. Van Assche. Ensemble methods for noise elimination in classification problems. In Proceedings of the 4th International Workshop on Multiple Classifier Systems, 2003.
[10] S. Venkataraman, D. Metaxas, D. Fradkin, and C. Kulikowski. Distinguishing mislabeled data from correctly labeled data in classifier design. In 16th IEEE International Conference on Tools with Artificial Intelligence, 2004.
[11] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12:181-201, 2001.
[12] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.
[13] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. J. Smola, and K.-R. Müller. Invariant feature extraction and classification in kernel spaces. In Advances in Neural Information Processing Systems 12, pages 526-532, 2000.
[14] F. R. Bach and M. I. Jordan. Learning spectral clustering. In Advances in Neural Information Processing Systems 16, 2004.
[15] H. Valizadegan and R. Jin. Generalized maximum margin clustering and unsupervised kernel learning. In Advances in Neural Information Processing Systems 19, 2006.
[16] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In Advances in Neural Information Processing Systems 17, pages 1601-1608, 2005.
[17] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
[18] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 2001.
[19] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. S. Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems 14, pages 367-373, 2001.
[20] X. Zhu, J. Kandola, Z. Ghahramani, and J. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. In Advances in Neural Information Processing Systems 17, pages 1641-1648, 2005.
[21] G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.
[22] T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
[23] T. F. Coleman and Y. Li. A reflective Newton method for minimizing a quadratic function subject to bounds on some of the variables. SIAM Journal on Optimization, 6(4):1040-1058, 1996.
[24] UCI Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Figure 3: Type I and Type II errors of the base classifiers for the ensemble method. [Boxplots omitted: Type II and Type I errors vs. noise level (0.05 to 0.5) for (a, b) Digits08, (c, d) Digits17, (e, f) Vote, (g, h) Ionosphere.]
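The bootstrap analysis summarized in Figure 3 can be sketched as follows. This is an illustrative reconstruction under our own assumptions: the helper name type_errors is ours, and the randomly generated flags stand in for the predictions of a classifier trained on each bootstrap sample, since the paper's actual base learners are not reproduced here.

```python
import numpy as np

def type_errors(flagged, truly_mislabeled):
    """Error rates under the null hypothesis 'this example is correctly labeled'.

    Type I : correctly labeled examples wrongly flagged as mislabeled.
    Type II: mislabeled examples left unflagged (so TPR = 1 - Type II).
    """
    flagged = np.asarray(flagged, dtype=bool)
    bad = np.asarray(truly_mislabeled, dtype=bool)
    type1 = (flagged & ~bad).sum() / max(int((~bad).sum()), 1)
    type2 = (~flagged & bad).sum() / max(int(bad.sum()), 1)
    return float(type1), float(type2)

rng = np.random.default_rng(1)
bad = np.zeros(100, dtype=bool)
bad[:30] = True                            # 30% injected label noise

errors = []
for _ in range(10):                        # one pass per bootstrap classifier
    # stand-in for the mislabeled flags produced by a classifier trained
    # on a bootstrap sample; a real filter would supply these predictions
    flagged = rng.random(100) < 0.2
    errors.append(type_errors(flagged, bad))

type1s = [e[0] for e in errors]
type2s = [e[1] for e in errors]
# the boxplots in Figure 3 summarize the spread of these 10 pairs
print(np.median(type1s), np.median(type2s))
```

Because TPR equals one minus the Type II error, a base classifier whose Type II error exceeds 50% necessarily misses most mislabeled examples, which is exactly the failure mode the boxplots reveal for the ensemble filters.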