Neural Comput & Applic (2007) 16:527–539
DOI 10.1007/s00521-007-0110-1

ICONIP2006

Classification consistency analysis for bootstrapping gene selection

Shaoning Pang · Ilkka Havukkala · Yingjie Hu · Nikola Kasabov

Received: 22 November 2006 / Accepted: 6 March 2007 / Published online: 30 March 2007
© Springer-Verlag London Limited 2007

Abstract  Consistency modelling for gene selection is a new topic emerging from recent cancer bioinformatics research. The result of operations such as classification, clustering, or gene selection on a training set is often found to be very different from the result of the same operations on a testing set, presenting a serious consistency problem. In practice, the inconsistency of microarray datasets prevents many typical gene selection methods from working properly for cancer diagnosis and prognosis. In an attempt to deal with this problem, this paper proposes a new concept of classification consistency and applies it to the microarray gene selection problem using a bootstrapping approach, with encouraging results.

S. Pang (✉) · I. Havukkala · Y. Hu · N. Kasabov
Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, Private Bag 92006, Auckland 1020, New Zealand
e-mail: [email protected]

1 Introduction

The advent of microarray technology has made it possible to monitor the expression levels of thousands of genes simultaneously, which can help clinical decision making in complex disease diagnosis and prognosis, especially for cancer classification and for predicting the clinical outcomes in response to cancer treatment. However, often only a small proportion of the genes contribute to classification, and the rest of the genes are considered noise. Gene selection is used to eliminate the influence of such noise genes, and to find the informative genes related to the disease.

1.1 Review of gene selection methods

Selecting informative genes, a critical step in cancer classification, has been implemented using a diversity of techniques and algorithms. Simple gene selection methods come from statistics, such as the t-statistic, Fisher's linear discriminant criterion and principal component analysis (PCA) [1–4]. Statistical methods select genes by evaluating and ranking their contribution or redundancy to classification [5], and are able to filter out informative genes very quickly. Margin-based filter methods have also been introduced recently [6]. However, the performance of these methods is not satisfactory when they are applied to datasets with large numbers of genes and small numbers of samples.

More sophisticated algorithms are also available, such as the noise sampling method [7], Bayesian models [8, 9], significance analysis of microarrays (SAM) [10], artificial neural networks [11], and the neural fuzzy ensemble method [12]. These methods define a loss function, such as classification error, to evaluate the goodness of a candidate subset. Most are claimed to be capable of extracting a set of highly relevant genes [13]; however, their computational cost is much higher than that of the simpler statistical methods.

A bootstrapping approach can also be used. It selects genes iteratively, and can use a diversity of criteria simultaneously. For example, Huerta et al. [14] proposed a GA/SVM gene selection method that achieved a very high classification accuracy (99.41%) on the Colon Cancer data [15]. Li et al. [16] introduced a GA/KNN gene selection method that is capable of finding a set of informative genes, and the selected genes were highly repeatable. Wahde and Szallasi [17] used an evolutionary algorithm based on a gene relevance ranking, and surveyed such methods [18]. The main drawbacks of the bootstrapping methods are the difficulty of developing a suitable post-selection fitness function and of determining the stopping criterion.

1.2 Motivation of consistency for gene selection

A microarray can contain more than ten thousand genes, but only a few samples involving different types of disease. Gene selection, similar to feature selection in traditional pattern recognition, selects a set of genes/features to achieve better patient disease diagnosis (i.e. classification of disease). Biologists are also interested in finding the genes informative for the disease for further disease research.

For a disease microarray dataset, we do not know initially which genes are truly differentially expressed for the disease. All gene selection methods seek a statistic to find a set of genes such that an expected loss of information is minimized. Most previous methods work by directly estimating a 'class-separability' criterion (i.e. rank of contribution to classification, or loss of classification) for better gene selection. In a different vein, reproducibility is addressed by Mukherjee [19] as the number of common genes obtained from the statistic over a pair of subsets randomly drawn from a microarray dataset under the same distribution.

Class-separability criteria approximate the 'ground truth' as the class-separation status of the training set (one part of a whole dataset). However, this whole dataset is normally just a subset of a complete dataset of the disease (a dataset that includes all possible microarray distributions of the disease). This leads to bad reproducibility, i.e. the classification system works well on the dataset that it was built on, but fails on future data. Reproducibility criteria take advantage of certain properties of microarray data; thus they do not approximate the 'ground truth', but indirectly minimize the expected loss under the true data-generating distribution.

However, it is not clear to what extent highly differentially expressed genes selected under the common-gene reproducibility criterion are correlated with substantially good performance on the classification of microarray data. In other words, an erroneous cancer classification may also occur with a set of genes selected under the criterion of common-gene reproducibility.

Consistency in terms of classification performance is addressed in this paper to derive a gene selection model with both good class-separability and good reproducibility. A bootstrapping consistency method was developed by us with the purpose of identifying a set of informative genes that achieves replicably good results in microarray data analysis.

The rest of the paper is organized as follows. Section 2 gives the definition of classification consistency. Section 3 derives the novel bootstrapping gene selection method based on classification consistency. Section 4 describes cancer diagnosis experiments on six well-known benchmark cancer microarray datasets and one proteomics dataset. Finally, we present conclusions and directions for further research in Sect. 5.

2 Consistency concepts for gene selection

Given a microarray dataset D pertaining to a bioinformatics classification task, consisting of n samples with m genes, we define Da and Db as two subsets of D obtained by random subsampling; these two subsets serve as training and testing data, respectively:

  D = Da ∪ Db and Da ∩ Db = ∅    (1)

Provided an operation F over D, such as a classification or a clustering, and a gene selection function fs for selecting a subset of genes/features over training data Da to achieve better disease-diagnosis/classification on testing data Db, the fundamental consistency concept C on gene selection fs can be modelled on a pair of subsets (Da and Db) drawn from the whole microarray dataset D under the same distribution:

  C(F, fs, D) = |Pa − Pb|    (2)

where Pa and Pb are the outcomes of function F on Da and Db, respectively. The outcome of operation F on a subset Di can be formulated as

  Pi = F(fs(Di), Di), i = a, b    (3)

where i is the index of a subset; because the consistency in Eq. (2) represents a comparison between Pa and Pb, i here represents a or b, indicating the subset for training or testing, respectively.

Note that F can basically be any data processing model, such as a common-gene computation, a clustering function, a feature extraction function, or a classification function, etc. F determines the feature space on which the consistency is based.

2.1 Common-gene consistency

Mukherjee et al. [20] set F as a common-gene computation; their approach is as follows. Suppose fs is a ranking function for gene selection, generating two lists of sorted genes from the two datasets. Let the top-ranked genes in each case be selected and denoted by Sa and Sb. Then, the consistency in terms of common genes, Cg, is defined as

  Cg(fs, Da, Db) = |Sa ∩ Sb|    (4)

Consistency Cg in Eq. (4) depends on the ranking function, the data and the number of selected genes. A greater Cg value represents a more consistent gene selection.

2.2 Classification consistency

When F is assigned as a classification function, the above consistency C in Eq. (2) is called a classification consistency Cc, where Eq. (3) can be implemented as

  Pa = F(fs(D), Da, Db) and Pb = F(fs(D), Db, Da)    (5)

Substituting Eq. (5) into Eq. (2), we have

  Cc(F, fs, D) = |F(fs(D), Da, Db) − F(fs(D), Db, Da)|    (6)

where fs(D) specifies D as the dataset for gene selection. Da in the first term of Eq. (5) is assigned for classifier training, and Db is for testing. The second term of Eq. (5) specifies reversed training and testing positions for Da and Db, respectively. Note that a smaller Cc value here represents a more consistent gene selection. Figure 1 illustrates the procedure of computing Eq. (5): first, the performance Pa is computed by one classification on training subset Da; then Pb is obtained by another classification on testing subset Db.

Fig. 1 Procedure of computing consistency (Form 1)

Alternatively, Eq. (7) gives another form of the classification consistency definition, obtained by switching the training and testing sets of Eq. (5):

  Cc(F, fs, D) = |F(fs(D), Da, Da) − F(fs(D), Da, Db)|    (7)

Fig. 2 Procedure of computing consistency (Form 2)

Figure 2 shows the procedure of computing Eq. (7). Here, the classifier is trained on Db, and then the performance is computed on the other subset Da. The important difference from the procedure in Fig. 1 is that the testing and training subsets are switched. Ideally, when doing the analysis, one should use both procedures, to check which dataset gives better consistency when used for training. This safeguards against the case where the training dataset has some kind of bias resulting in suboptimal results, i.e. when the training dataset is not a truly random sample of all the data.

Fig. 3 Comparison between a biased (a) and a totally unbiased (b) verification scheme, where Dtrn and Dtst are the training and testing sets, and Dtrns and Dtsts are the training and testing sets with genes selected, respectively. In case (a) (the biased verification scheme), the testing set is used twice, in the gene selection and classifier training procedures, which creates a bias error in the final classification results. In case (b) (the totally unbiased scheme), the testing set is independent of the gene selection and classifier training procedures

3 Gene selection based on classification consistency

Unlike common-gene based consistency [20], classification consistency needs a testing classification function F to estimate the contribution of the selected genes, so that gene selection seeks an optimal fs* with the following consistency Cc minimized:

  fs* = arg min_{fs ∈ F} Cc(F, S, D)    (8)

where S is the set of currently selected genes, F refers to a family of gene selection functions, and Cc(·) represents a classification consistency computation that has F and fs as classification function and gene selection function, respectively.

In practice, the evaluation of consistency is ultimately a multi-objective optimization problem, because an improvement of consistency might be coupled with a deterioration of performance. This means that even if the consistency of one set of genes is better than that of another set, the classification performance P on microarray data may not be as good as we expect. In other words, it might be a case of consistently bad classification (with low classification accuracy). Therefore, a ratio of consistency to performance, R, is used for the purpose of optimizing these two variables simultaneously:

  R = Cc / (wP)    (9)

where w is a pre-defined weight (the relative importance of Cc and P) for adjusting the ratio in an experiment, and P is the performance evaluation on dataset D. In this sense, Eq. (8) can be rewritten as

  fs* = arg min_{fs ∈ F} R(F, S, D)    (10)

where S is the set of currently selected genes.
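The consistency measures of Eqs. (4), (6) and (9) can be sketched in a few lines of Python. The nearest-centroid classifier and the synthetic data below are illustrative stand-ins (the paper itself uses KNN or SVM), and all function names are ours, not the authors':

```python
import random
from statistics import mean

def common_gene_consistency(Sa, Sb):
    """Eq. (4): Cg = |Sa ∩ Sb|, the number of genes two rankings share."""
    return len(set(Sa) & set(Sb))

def fit_centroids(X, y):
    """Per-class mean vectors; an illustrative stand-in for the paper's KNN/SVM."""
    return {c: [mean(col) for col in zip(*[x for x, t in zip(X, y) if t == c])]
            for c in set(y)}

def accuracy(cents, X, y):
    """Fraction of samples assigned to the class with the nearest centroid."""
    def predict(x):
        return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(cents[c], x)))
    return mean(1.0 if predict(x) == t else 0.0 for x, t in zip(X, y))

def classification_consistency(Da, Db):
    """Eq. (6): Cc = |F(train=Da, test=Db) - F(train=Db, test=Da)|.
    Returns Cc together with the mean performance P of the two runs."""
    (Xa, ya), (Xb, yb) = Da, Db
    p_ab = accuracy(fit_centroids(Xa, ya), Xb, yb)   # train on Da, test on Db
    p_ba = accuracy(fit_centroids(Xb, yb), Xa, ya)   # train on Db, test on Da
    return abs(p_ab - p_ba), (p_ab + p_ba) / 2.0

def ratio(cc, perf, w=5.0):
    """Eq. (9): R = Cc / (w * P); w weights consistency against performance."""
    return cc / (w * perf)
```

A smaller Cc (and hence a smaller R) means the two subsets behave more alike, matching the "smaller is more consistent" reading of Eq. (6) above.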
Function Cc(·) in Eq. (8) is replaced by the desired ratio R to ensure a good balance between consistency and classification performance.

3.1 Bootstrapping gene-selection algorithm based on classification consistency

The algorithm can be summarized in the following steps:

1. Split all genes of dataset D into N segments based on their mean value.
2. Randomly select one gene from each of the N segments. The initial candidate gene set contains N genes and is denoted by S.
3. Apply the operation function F (i.e. the classifier) to the data containing the genes listed in S, and compute the consistency Cc by Eq. (6) or Eq. (7).
4. Perform the gene selection function fs on S to get a new generation of genes S′, and recompute the consistency C′c.
5. If C′c < Cc, then set Cc = C′c and S = S′.
6. Repeat Steps 3–5 until Cc becomes smaller than a predefined threshold value.
7. Output the finally selected genes.

The optimized gene selection is obtained through generations of optimization of consistency and classification performance. In each generation, Da and Db are resampled B times, depending on the sample size: for example, if the sample size of the dataset is larger than 30, B is set to 50, otherwise 30. Consequently, C is the mean value of the consistency scores over the B rounds of computation. Algorithm 1 presents the bootstrapping consistency method for gene selection in pseudo-code.

In our example we use the above bootstrapping consistency gene selection method and Eq. (6) for consistency evaluation. All genes of a given microarray dataset (the search space) are first segmented into N segments, with N set to 20. For each fold of the given dataset, the dataset is initially partitioned into training and testing sets, on which the bootstrapping runs generation optimization until R becomes less than a predefined threshold ξ, and the selected informative genes are output. There are two setting choices for the resampling count B: depending on the size of the dataset, B is set to 50 for datasets with more than 30 samples, and to 30 for datasets with smaller sample sizes. The weight w in Eq. (9) is set to 5.0, indicating that consistency is made more important than performance in the optimization. ξ is set to 0.1.

4 Cancer diagnosis experiments

4.1 Datasets

The proposed concept for gene selection is applied to six well-known benchmark cancer microarray datasets and one proteomics dataset. Table 1 summarizes the seven datasets used for gene selection in the experiments.

Table 1 Cancer data sets used for testing the algorithm. Columns for training and validation data show the total number of patients; the numbers in brackets are the ratios of the patients in the two classes

Cancer           Class 1 vs. Class 2        Genes    Train data    Test data   Ref.
Lymphoma         DLBCL vs. FL               7,129    (58/19)77     –           [21]
Leukaemia        ALL vs. AML                7,129    (27/11)38     34          [22]
CNS cancer       Survivor vs. Failure       7,129    (21/39)60     –           [23]
Colon cancer     Normal vs. Tumour          2,000    (22/40)62     –           [15]
Ovarian cancer   Cancer vs. Normal          15,154   (91/162)253   –           [24]
Breast cancer    Relapse vs. Non-Relapse    24,482   (34/44)78     19          [25]
Lung cancer      MPM vs. ADCA               12,533   (16/16)32     149         [26]

4.2 Experimental setup

As suggested in the literature on estimating generalization error [27, 28], a fivefold cross-validation scheme is used in all the experiments, except for those datasets which originally had separate training and testing sets. For each cross-validation, the totally unbiased verification scheme shown in Fig. 3b is used, where both gene selection and classification work only on the training set, so that no testing information is included in any part of the cancer diagnosis modelling.
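A minimal sketch of the generation loop of Sect. 3.1, run entirely on the training fold as the unbiased scheme of Fig. 3b requires. The consistency evaluation is abstracted behind a caller-supplied score function (in the paper, Eq. (6) or the ratio of Eq. (9), averaged over B resamplings), the gene-selection step fs is reduced to a random one-gene swap, and every name here is illustrative rather than the authors' actual implementation:

```python
import random
from statistics import mean

def segment_genes(gene_means, n_segments):
    """Step 1: split gene indices into N segments by mean expression value."""
    order = sorted(range(len(gene_means)), key=lambda g: gene_means[g])
    size = max(1, len(order) // n_segments)
    return [order[i:i + size] for i in range(0, len(order), size)][:n_segments]

def bootstrap_select(gene_means, score_fn, n_segments=20, B=30,
                     threshold=0.1, max_rounds=50, rng=random):
    """Steps 2-7: evolve the candidate gene set S, keeping a new generation
    only when its consistency score (averaged over B resamplings of the
    training fold) improves, until the score drops below the threshold."""
    segments = segment_genes(gene_means, n_segments)
    S = [rng.choice(seg) for seg in segments]            # step 2: one gene per segment
    score = mean(score_fn(S) for _ in range(B))          # step 3: averaged consistency
    for _ in range(max_rounds):
        if score < threshold:                            # step 6: stopping criterion
            break
        S_new = list(S)
        i = rng.randrange(len(S_new))
        S_new[i] = rng.choice(segments[i])               # step 4: perturb one gene
        score_new = mean(score_fn(S_new) for _ in range(B))
        if score_new < score:                            # step 5: keep the better set
            S, score = S_new, score_new
    return S, score                                      # step 7
```

Here score_fn would wrap R from Eq. (9), computed from Cc and P on resampled halves of the training fold only, so the held-out test fold never influences which genes survive.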
For consistency evaluation, the dataset is randomly partitioned into two subsets: one subset contains one-third of all samples, and the other holds the remaining two-thirds. Using a classifier such as KNN or SVM, two classification accuracies can be computed on the two subsets, respectively; the absolute difference between these two accuracies is defined as the consistency (C) in terms of classification performance (refer to Eqs. 5 and 6). After several hundred iterations, the mean value of the computed consistencies is taken as the final result.

4.3 Results and discussion

Experiments are presented in this section to verify the classification consistency concept and the bootstrapping gene selection method. The experiments use seven benchmark datasets, six cancer microarray datasets and one proteomics dataset, and compare our results with the experimental results on these datasets reported in the original publications, in terms of cancer diagnosis prediction accuracy (refer to the cited papers in Table 1).

4.3.1 Lymphoma data

Table 2 shows the bootstrapping classification results for the Lymphoma data, and Fig. 4 illustrates the optimizing procedure in fivefold cross-validation, where consistency and classification accuracies are recorded at every optimizing step.

Table 2 The classification validation results on Lymphoma data. Note that fivefold cross-validation is used for calculating classification accuracy. TP true positive, TN true negative, FP false positive, FN false negative

Lymphoma   Number of selected genes   TP   TN   FP   FN   Classification accuracy (%)
Fold1      36                         8    10   0    1    94.74
Fold2      25                         12   6    0    1    94.74
Fold3      34                         11   7    1    0    94.74
Fold4      36                         10   9    0    0    100
Fold5      32                         10   8    0    1    94.74
Overall accuracy: 95.84%

Fig. 4 The optimization results on Lymphoma data, where the horizontal axis represents the optimizing rounds and the vertical axis shows the consistency (C), the classification performance (P) and the ratio (R) of consistency and performance calculated in the optimizing process. Note that the accuracy P is the training classification accuracy obtained in the classifier optimizing process

As shown in Table 2, the overall classification accuracy on the testing set of the Lymphoma dataset is fairly high (greater than 95%). The number of selected informative genes is around 30, and the final calculated classification accuracy is stable (94.74–100%). Moreover, the confusion matrix results (TP, TN, FP and FN) show that the proposed method is very effective on the Lymphoma dataset in terms of both classification accuracy (TP and TN) and misclassification rate (FP and FN).

Figure 4 presents the optimizing procedure of the bootstrapping consistency gene selection method. The optimized consistency is seen to decrease to below 0.1, while the training classification accuracy increases to above 90%. This shows that the proposed method is capable of improving consistency simultaneously with classification performance. Note that a smaller consistency value indicates a better consistency characteristic of the data.

4.3.2 Leukaemia data

Table 3 and Fig. 5 present the classification and consistency results obtained by the described bootstrapping consistency method on Leukaemia data. Table 3 shows that the achieved classification accuracy on the testing set is about 95%, when 35 genes are used for constructing the final optimized classifier. In Fig. 5, after 15 rounds of optimization based on the improvement of the ratio R of consistency and classification performance (refer to Eq. (9)), the classification accuracy on the training set improves to 1 and the consistency value is reduced to 0, indicating that the maximum possible consistency is obtained.

Table 3 The classification validation result on Leukaemia data. An independent test dataset was used for validation

Dataset     Number of selected genes   TP   TN   FP   FN   Classification accuracy
Leukaemia   35                         12   20   0    2    94.12%

Fig. 5 The results of optimization on Leukaemia data

4.3.3 CNS cancer data

Table 4 and Fig. 6 present the experimental results obtained by the described bootstrapping consistency method on CNS cancer data. Table 4 shows that the classification results on the 5 folds of the CNS Cancer dataset have high variance: the highest accuracy is 83.33%, while the lowest is only 41.67%. The overall accuracy is only 65%, which is not acceptable for solving the real clinical problem of disease diagnosis. The confusion matrix clearly shows that one misclassification rate (FN) is high: the number of false negatives obtained on fold 2 and fold 3 is 5 individuals, larger than the corresponding number of true negatives.

Table 4 The classification validation results of the bootstrapping consistency method on CNS Cancer data

CNS Cancer   Number of selected genes   TP   TN   FP   FN   Classification accuracy (%)
Fold1        44                         9    1    2    0    83.33
Fold2        56                         4    3    0    5    58.33
Fold3        43                         3    2    2    5    41.67
Fold4        44                         7    2    3    0    75.00
Fold5        44                         6    2    4    0    66.67
Overall accuracy: 65.00

Figure 6 shows that the initial consistency value of the CNS cancer dataset is quite high (around 0.4) and cannot be decreased in the optimizing process as much as in the previous datasets. The classification performance on the training sets of folds 1–4 rises from approximately 60 to 80%, while the consistency is decreased from 0.4 to 0.2. Although the accuracy on the fold 5 data is significantly improved, from 40 to 80%, the best consistency is still greater than 0.2, which means the consistency is not satisfactory, probably due to inherent problems with the dataset. This situation results in the bad overall classification accuracy (65.00%).

4.3.4 Colon data

Table 5 and Fig. 7 show the experimental results obtained by the bootstrapping consistency method on Colon cancer data. As presented in Table 5, the highest classification accuracy (91.67%) is obtained on fold 1 and fold 4 in the classifier optimizing process, while the lowest (66.67%) appears on fold 3. The difference between these computed classification accuracies is large, which shows that the Colon Cancer dataset has a relatively high variability of consistency. The final number of selected informative genes is on average 23, and the overall classification accuracy is about 84%.

Figure 7 shows that both the consistency and the performance improve significantly. For example, in fold 1, the classification performance rises approximately 10% (from 80 to 90%), coupled with an improvement of consistency from 0.2 to 0.1. The improvement of classification performance obtained on the five folds differs considerably, though: the performance on folds 3–5 is improved much more than that on folds 1–2. Meanwhile, the numbers of optimizing rounds also differ: the classifier is optimized within 25 rounds in the cases of folds 3–5, but in folds 1–2 in fewer than 20 rounds.
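The per-fold accuracies quoted in Tables 2–8 follow directly from the confusion-matrix counts. A quick sanity check (fold values taken from Table 4; the helper names are ours):

```python
def fold_accuracy(tp, tn, fp, fn):
    """Classification accuracy from confusion-matrix counts: (TP+TN)/total, in %."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)

def overall_accuracy(folds):
    """Mean accuracy across cross-validation folds, each fold as (TP, TN, FP, FN)."""
    accs = [fold_accuracy(*f) for f in folds]
    return sum(accs) / len(accs)

# (TP, TN, FP, FN) per fold, as reported for the CNS Cancer data in Table 4
cns_folds = [(9, 1, 2, 0), (4, 3, 0, 5), (3, 2, 2, 5), (7, 2, 3, 0), (6, 2, 4, 0)]
```

Evaluating `round(overall_accuracy(cns_folds), 2)` reproduces the 65.00% overall figure quoted for the CNS dataset, and `fold_accuracy(9, 1, 2, 0)` gives fold 1's 83.33%.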
Table 6 shows the classification results based on 123 534 Neural Comput & Applic (2007) 16:527–539 Fig. 6 The results of iterative Fold1 - Classification result on CNS cancer data Fold2 - Classification result on CNS cancer data bootstrapping optimization on CNS cancer data 1 1 0.8 0.8 value value 0.6 0.6 C: Consistency C: Consistency P: Performance P: Performance R: ratio = P/(C) R: ratio = P/(C) 0.4 0.4 0.2 0.2 0 0 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 18 Optimizing rounds Optimizing rounds Fold3 - Classification result on CNS cancer data Fold4 - Classification result on CNS cancer data 1 1 C: Consistency P: Performance 0.8 0.8 R: ratio = P/(C) value value 0.6 0.6 C: Consistency P: Performance 0.4 R: ratio = P/(C) 0.4 0.2 0.2 0 0 0 2 4 6 8 10 12 14 16 18 20 22 0 5 10 15 20 25 Optimizing rounds Optimizing rounds Fold5 - Classification result on CNS cancer data 1 0.8 value 0.6 C: Consistency 0.4 P: Performance R: ratio = P/(C) 0.2 0 0 5 10 15 20 25 30 Optimizing rounds the informative genes selected. The proposed method produces an overall accuracy of 98.80%. The difference between the highest accuracy (100%) and the lowest Table 5 The classification validation results of iterative bootstrap- accuracy (98%) is only 2%. Moreover, the confusion ma- ping method on Colon Cancer data trix shows both the classification accuracy rate and mis- classification rate are very good, e.g. there are no Colon Number of TP TN FP FN Classification Cancer selected genes accuracy (%) misclassified samples in the cases of fold 4 and fold 5. Figure 8 shows that both the classification performance Fold1 22 4 7 0 1 91.67 and consistency is stable during the process of classifier Fold2 17 4 6 2 0 83.33 optimization. It turns out that the ovarian dataset has a Fold3 21 2 6 1 3 66.67 good and little varying consistency characteristic, which Fold4 29 5 6 1 0 91.67 results in successful classification results on all cross-val- Fold5 28 1 11 0 2 85.71 idation sets. 
Consequently, the improvement of consistency Overall accuracy: 83.81% is less than 0.05 in all 5 folds. 123 Neural Comput & Applic (2007) 16:527–539 535 Fig. 7 The results of iterative Fold1 - Classification result on Colon data Fold2 - Classification result on Colon data bootstrapping optimization on colon cancer data 1 1 0.8 0.8 0.6 0.6 value value C: Consistency 0.4 C: Consistency 0.4 P: Performance P: Performance R: ratio = P/(C) R: ratio = P/(C) 0.2 0.2 0 0 0 5 10 15 0 2 4 6 8 10 12 14 16 Optimizing rounds Optimizing rounds Fold3 - Classification result on Colon data Fold4 - Classification result on Colon data 1 1 0.8 0.8 0.6 0.6 value value 0.4 C: Consistency 0.4 P: Performance C: Consistency R: ratio = P/(C) P: Performance R: ratio = P/(C) 0.2 0.2 0 0 0 5 10 15 20 25 0 5 10 15 20 25 30 Optimizing rounds Optimizing rounds Fold5 - Classification result on Colon data 1 0.8 0.6 value 0.4 C: Consistency P: Performance R: ratio = P/(C) 0.2 0 0 5 10 15 20 25 30 Optimizing rounds 4.3.6 Breast cancer data procedure is 80%, when the final optimized consistency (approximately 0.2) is achieved after nine iterations. Table 7 and Fig. 9 show the experimental results obtained by the bootstrapping consistency method on Breast Cancer 4.3.7 Lung cancer data dataset. Table 7 shows that the low classification accuracy on the testing set is related to the high inconsistency Table 8 and Fig. 10 show the results obtained by the characteristic of this breast cancer dataset. The classifica- bootstrapping consistency method on Lung Cancer data. As tion accuracy obtained by iterative bootstrapping method shown in Table 8, the experimental result of Lung Cancer with 50 selected informative genes is only 63.16%, which data reaches a satisfactory level in which the classification is not very useful for identifying disease patterns in real accuracy on testing set is 91.28% with 34 selected genes clinical practice. identified by our method. 
Figure 9 presents the relatively poor consistency and classification accuracy obtained by the iterative bootstrapping method in the optimizing process, together with the best classification performance on the training data during gene selection.

As shown in Fig. 10, the classifier is optimized in only 9 rounds. Unlike the relatively poor consistency and classification performance on the Breast Cancer dataset, there was no difficulty in the optimizing process here, due to the already good initial consistency characteristic of the Lung Cancer dataset. It can be seen that the initial classification accuracy is greater than 90%, and the consistency calculated in the first round is about 0.1, so that it takes only 9 optimizing rounds to achieve a high classification accuracy coupled with a good consistency in the training process.

4.4 Classification accuracy summary: consistency method versus publication

For clarity, the classification accuracies obtained by the presented bootstrapping consistency gene selection method are summarized and compared with the literature-reported results in Table 9. The consistency method outperforms the published methods on four datasets, and the classification result on colon data is very close to the reported accuracy. However, the classification accuracies on two datasets (CNS, Breast) are much lower than the published ones. Many published classification results are based on a biased validation scheme as in Fig. 2, which makes the experiments unreplicable and too optimistic. However, the experimental results obtained by the consistency method can be easily reproduced, because of the totally unbiased validation scheme of Fig. 3b, as applied in this study. These results suggest that a reproducible prognosis is possible for only five of the seven benchmark datasets used.

Fig. 8 The results of iterative bootstrapping optimization on ovarian data. Each panel (one per fold) plots consistency C, performance P and the ratio R = P/C against optimizing rounds.

Table 6 The classification validation results of the iterative bootstrapping method on Ovarian Cancer data

Ovarian cancer | Number of selected genes | TP | TN | FP | FN | Classification accuracy (%)
Fold1 | 18 | 25 | 24 | 0 | 1 | 98.00
Fold2 | 28 | 31 | 18 | 1 | 0 | 98.00
Fold3 | 24 | 33 | 16 | 1 | 0 | 98.00
Fold4 | 24 | 34 | 16 | 0 | 0 | 100
Fold5 | 34 | 38 | 15 | 0 | 0 | 100
Overall accuracy | | | | | | 98.80

Fig. 9 The optimizing results of the iterative bootstrapping method on breast cancer data, plotting consistency C, performance P and the ratio R = P/C against optimizing rounds.

Fig. 10 The results of iterative bootstrapping optimization on Lung cancer data, plotting consistency C, performance P and the ratio R = P/C against optimizing rounds.

Table 7 The classification validation results of the iterative bootstrapping method on Breast Cancer data. An independent test dataset was used for validation

Dataset | Number of selected genes | TP | TN | FP | FN | Classification accuracy
Breast cancer | 50 | 5 | 7 | 5 | 2 | 63.16%

Table 8 The classification results of the iterative bootstrapping method on Lung Cancer data. An independent test dataset was used for validation

Dataset | Number of selected genes | TP | TN | FP | FN | Classification accuracy
Lung cancer | 34 | 121 | 15 | 0 | 13 | 91.28%

From the perspective of generalization error, it should be pointed out that the experimental results can be seen as totally unbiased, because the data for validation is independent and never used in the training process; i.e. before the final informative genes are selected, the test data is isolated and has no correlation with these genes. Therefore, the selected informative genes are entirely fair to any given data for validation. Such a mechanism of gene selection might result in poor performance on certain microarray datasets, owing to the special characteristics of those datasets. This makes the reported good results of some published papers on these datasets suspect, as also discussed by [29]. Recently, many papers have reported on the development of guidelines and procedures for more reliable microarray profiling [30–32], reviewed existing methods [33, 34] and suggested improvements in meta-analysis [35]. However, none of these works has tackled explicitly the problem of consistency in the gene selection step, as investigated by us.

5 Summary

The results with the bootstrapping consistency gene selection method described in this paper have demonstrated that the consistency concept can be used for gene selection to solve the reproducibility problem in microarray data analysis. The main contribution of the consistency method is that it ensures the reliability and generalizability of microarray data analysis experiments, and improves the disease classification performance as well.
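The quantities tracked in Figs. 8, 9 and 10 (consistency C, classification performance P and the ratio R = P/C) can be estimated by bootstrap resampling. The sketch below is illustrative only: the paper's actual measures are defined by its Eqs. (6) and (7), which are not reproduced in this section, so the spread of out-of-bag bootstrap accuracies stands in here as a hypothetical consistency proxy, and `train_1nn` is a toy classifier.

```python
import random

def bootstrap_consistency(samples, labels, train_fn, n_rounds=20, seed=0):
    """Estimate performance P and a consistency proxy C by bootstrapping.

    train_fn(train_x, train_y) must return a predict(x) -> label function.
    NOTE: this discrepancy-based C is an illustrative stand-in for the
    paper's Eqs. (6) and (7), which are not reproduced here.
    """
    rng = random.Random(seed)
    n = len(samples)
    accs = []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]        # bootstrap resample
        oob = [i for i in range(n) if i not in set(idx)]  # out-of-bag test set
        if not oob:
            continue
        predict = train_fn([samples[i] for i in idx], [labels[i] for i in idx])
        acc = sum(predict(samples[i]) == labels[i] for i in oob) / len(oob)
        accs.append(acc)
    p = sum(accs) / len(accs)                             # performance P
    c = sum(abs(a - p) for a in accs) / len(accs)         # consistency proxy C
    return p, c, (p / c if c else float("inf"))           # ratio R = P/C

# Toy usage: a 1-nearest-neighbour "classifier" on 1-D points.
def train_1nn(xs, ys):
    return lambda x: ys[min(range(len(xs)), key=lambda i: abs(xs[i] - x))]

data = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3]
labs = [0, 0, 0, 1, 1, 1]
P, C, R = bootstrap_consistency(data, labs, train_1nn)
```

On this toy data the two classes are well separated, so P is high and the consistency proxy C is near zero; with real microarray data, the trade-off between P and C over optimizing rounds is what Figs. 8, 9 and 10 track.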
In addition, because the method does not need previous knowledge about the given microarray data, it can be used as an effective tool in the diagnosis of unknown diseases.

The consistency concept was investigated on six benchmark microarray datasets and one proteomic dataset. The experimental results show that the different microarray datasets have different consistency characteristics, and that better consistency can lead to an unbiased and reproducible outcome with good disease prediction accuracy.

Table 9 Classification accuracy comparison: consistency method results vs. known results from the literature

Dataset | Average classification accuracy: Consistency method (%) | Publication (%)
Lymphoma | 95.84 | 72.50
Leukaemia | 94.12 | 85.00
CNS cancer | 65.00 | 83.00
Colon cancer | 83.81 | 87.00
Ovarian cancer | 98.80 | 97.00
Breast cancer | 63.16 | 94.00
Lung cancer | 91.28 | 90.00

The recommended protocol for using our method is as follows:

1. Use Eq. (6) with your training/test sets.
2. Run your classification algorithm of choice.
3. Use Eq. (7) with your training/test sets.
4. Run your classification algorithm of choice again with the same settings.
5. Choose the results with the test/training set which gives better consistency in step 2 or 4.
6. Run the better model with the total data or with new future datasets.

We believe our implementation of classification consistency using iterative bootstrapping can provide a small set of informative genes which perform consistently across different data subsets. Compared with traditional gene selection methods that do not use a consistency measurement, the bootstrapping consistency method can thus provide more accurate classification results. More importantly, the results demonstrate that gene selection with the consistency measurement is able to enhance the reproducibility and consistency of microarray- and proteomics-based diagnosis decision systems. This is important when the classification models are used to analyze new future datasets.

Acknowledgments The research presented in the paper was partially funded by the New Zealand Foundation for Research, Science and Technology under the grant NERF/AUTX02-01.

References

1. Ding C, Peng H (2003) Minimum redundancy feature selection for gene expression data. In: Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB 2003), Stanford
2. Furey T, Cristianini N et al (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16(10):906–914
3. Jaeger J, Sengupta R et al (2003) Improved gene selection for classification of microarrays. In: Pacific Symposium on Biocomputing
4. Tusher V, Tibshirani R et al (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98(9):5116–5121
5. Zhang C, Lu X, Zhang X (2006) Significance of gene ranking for classification of microarray samples. IEEE/ACM Trans Comput Biol Bioinform 3(3):312–320
6. Duch W, Biesiada J (2006) Margin based feature selection filters for microarray gene expression data. Int J Inform Technol Intell Comput 1:9–33
7. Draghici S, Kulaeva O et al (2003) Noise sampling method: an ANOVA approach allowing robust selection of differentially regulated genes measured by DNA microarrays. Bioinformatics 19(11):1348–1359
8. Efron B, Tibshirani R et al (2001) Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 96:1151–1160
9. Lee KE, Sha N et al (2003) Gene selection: a Bayesian variable selection approach. Bioinformatics 19(1):90–97
10. Tibshirani RJ (2006) A simple method for assessing sample sizes in microarray experiments. BMC Bioinform 7:106
11. Kauai H, Kasabov N, Middlemiss M et al (2003) A generic connectionist-based method for on-line feature selection and modelling with a case study of gene expression data analysis. In: Proceedings of the First Asia-Pacific Bioinformatics Conference (APBC 2003), Conferences in Research and Practice in Information Technology Series, vol 19, Adelaide, Australia
12. Wang Z, Palade V, Xu Y (2006) Neuro-fuzzy ensemble approach for microarray cancer gene expression data analysis. In: Proceedings of the 2006 International Symposium on Evolving Fuzzy Systems, pp 241–246
13. Wolf L, Shashua A et al (2004) Selecting relevant genes with a spectral approach. CBCL Paper No. 238, Massachusetts Institute of Technology, Cambridge
14. Huerta EB, Duval B et al (2006) A hybrid GA/SVM approach for gene selection and classification of microarray data. Lect Notes Comput Sci 3907:34–44
15. Alon U, Barkai N et al (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 96(12):6745–6750
16. Li L, Weinberg CR et al (2001) Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 17(12):1131–1142
17. Wahde M, Szallasi Z (2006) Improving the prediction of the clinical outcome of breast cancer using evolutionary algorithms. Soft Comput 10(4):338–345
18. Wahde M, Szallasi Z (2006) A survey of methods for classification of gene expression data using evolutionary algorithms. Expert Rev Mol Diagn 6(1):101–110
19. Mukherjee S, Roberts SJ (2004) Probabilistic consistency analysis for gene selection. In: CSB, Stanford
20. Mukherjee S, Roberts SJ et al (2005) Data-adaptive test statistics for microarray data. Bioinformatics 21(Suppl 2):ii108–ii114
21. Shipp MA, Ross KN et al (2002) Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 8(1):68–74
22. Golub TR (2004) Toward a functional taxonomy of cancer. Cancer Cell 6(2):107–108
23. Pomeroy S, Tamayo P et al (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415(6870):436–442
24. Petricoin EF, Ardekani AM et al (2002) Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359:572–577
25. Van 't Veer LJ et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415:530–536
26. Gordon GJ, Jensen R et al (2002) Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res 62:4963–4967
27. Breiman L, Spector P (1992) Submodel selection and evaluation in regression: the X-random case. Int Stat Rev 60:291–319
28. Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Montreal
29. Ransohoff DF (2005) Bias as a threat to the validity of cancer molecular marker research. Nat Rev Cancer 5(2):142–149
30. Staal FJT, Cario G et al (2006) Consensus guidelines for microarray gene expression analyses in leukemia from three European leukemia networks. Leukemia 20:1385–1392
31. Allison DB, Cui X et al (2006) Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 7:55–65
32. Kawasaki ES (2006) The end of the microarray Tower of Babel: will universal standards lead the way? J Biomol Tech 17:200–206
33. Pham TD, Wells C et al (2006) Analysis of microarray gene expression data. Curr Bioinform 1:37–53
34. Asyali MH, Colak D et al (2006) Gene expression profile classification: a review. Curr Bioinform 1:55–73
35. Sauerbrei W, Hollander N et al (2006) Evidence-based assessment and application of prognostic markers: the long way from single studies to meta-analysis. Commun Stat Theory Methods 35:1333–1342
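The six-step protocol recommended above can be made concrete in code. The following is a hypothetical sketch only: `eq6` and `eq7` are simple class-proportion measures standing in for the paper's Eqs. (6) and (7), which are not reproduced in this section, and the majority-vote `classify` is a toy; any real classifier would be plugged in with the same settings for both runs.

```python
# Hypothetical rendering of the six-step protocol. Lower values of the
# stand-in measures eq6/eq7 mean better consistency in this sketch.

def proportions(labels):
    return {c: labels.count(c) / len(labels) for c in set(labels)}

def eq6(train_y, test_y):
    """Stand-in for Eq. (6): total class-proportion gap between the sets."""
    p, q = proportions(train_y), proportions(test_y)
    return sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))

def eq7(train_y, test_y):
    """Stand-in for Eq. (7): worst single-class proportion gap."""
    p, q = proportions(train_y), proportions(test_y)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))

def run_protocol(train_y, test_y, classify):
    c6 = eq6(train_y, test_y)       # step 1: apply Eq. (6)
    model6 = classify(train_y)      # step 2: run the classifier
    c7 = eq7(train_y, test_y)       # step 3: apply Eq. (7)
    model7 = classify(train_y)      # step 4: run again, same settings
    # step 5: keep the run whose split showed better consistency
    chosen, c = (model6, c6) if c6 <= c7 else (model7, c7)
    # step 6 is up to the caller: apply `chosen` to the total data
    return chosen, c

# Toy usage: a majority-class "classifier" on labels alone.
majority = lambda ys: max(set(ys), key=ys.count)
model, c = run_protocol([0, 0, 1, 1], [0, 1], majority)
```

Here the toy train and test sets have identical class proportions, so both stand-in measures report perfect consistency and the first run is kept.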