Accepted Manuscript

Feature selection based on brain storm optimization for data classification

Farhad Pourpanah, Yuhui Shi, Chee Peng Lim, Qi Hao, Choo Jun Tan

PII: S1568-4946(19)30229-7
DOI: https://doi.org/10.1016/j.asoc.2019.04.037
Reference: ASOC 5467
To appear in: Applied Soft Computing Journal
Received date: 21 March 2018
Revised date: 8 April 2019
Accepted date: 22 April 2019

Please cite this article as: F. Pourpanah, Y. Shi, C.P. Lim et al., Feature selection based on brain storm optimization for data classification, Applied Soft Computing Journal (2019), https://doi.org/10.1016/j.asoc.2019.04.037

Highlights (for review)
• A hybrid model (FAM-BSO) for feature selection and data classification is proposed.
• Fuzzy ARTMAP (FAM) is first used for incremental learning of data.
• Brain Storm Optimization (BSO) is then used for feature selection.
• The outcome indicates that FAM-BSO is able to produce promising results.
Feature Selection Based on Brain Storm Optimization for Data Classification

Farhad Pourpanah (a,*), Yuhui Shi (b), Chee Peng Lim (c), Qi Hao (b), Choo Jun Tan (d)

a College of Mathematics and Statistics, Shenzhen University, China
b School of Computer Science and Engineering, Southern University of Science and Technology, China
c Institute for Intelligent Systems Research and Innovation, Deakin University, Australia
d School of Science and Technology, Wawasan Open University, Malaysia

Abstract

Brain storm optimization (BSO) is a new and effective swarm intelligence method inspired by the human brainstorming process. This paper presents a novel BSO-based feature selection technique for data classification. Specifically, the Fuzzy ARTMAP (FAM) model, which is employed as an incremental learning neural network, is combined with BSO, which acts as a feature selection method, to produce the hybrid FAM-BSO model for feature selection and optimization. Firstly, FAM is used to create a number of prototype nodes incrementally. Then, BSO is used to search for and select an optimal subset of features that is able to produce high accuracy with the minimum number of features. Ten benchmark problems and a real-world case study are employed to evaluate the performance of FAM-BSO. The results are quantified statistically using the bootstrap method with 95% confidence intervals. The outcome indicates that FAM-BSO is able to produce promising results as compared with those from the original FAM and other feature selection methods, including particle swarm optimization, the genetic algorithm, genetic programming, and ant colony optimization.

Keywords: Feature selection, brain storm optimization, fuzzy ARTMAP, data classification

1. Introduction

Feature selection identifies an optimal subset of features from data samples for classification problems.
Feature selection methods aim to improve the predictive accuracy and/or reduce the computational time of classification algorithms. They are important in data mining and pattern recognition problems. However, selecting an optimal subset of features is a challenging optimization task, and is time consuming due to the large search space and complex relationships among the features [1].

* Corresponding author
Email addresses: [email protected] (Farhad Pourpanah), [email protected] (Yuhui Shi), [email protected] (Chee Peng Lim), [email protected] (Qi Hao), [email protected] (Choo Jun Tan)

Preprint submitted to Applied Soft Computing, April 7, 2019

Feature selection techniques consist of two parts: (i) a search technique to find the optimal subset of features, and (ii) a classifier or learning algorithm to evaluate the effectiveness of the selected subset of features. In general, feature selection techniques can be divided into three categories: filter-based, wrapper-based and embedded-based methods [2]. Filter-based methods mainly focus on the properties of data samples without considering the underlying learning scheme [3]. In contrast, wrapper-based methods employ a classifier or a learning algorithm to evaluate the effectiveness of various feature subsets, and adopt a search technique to find the optimal one. Unlike wrapper methods, embedded methods consider feature selection within the training process, in order to reduce the computational time for re-classifying different feature subsets. Regularization methods, e.g., the least absolute shrinkage and selection operator (LASSO) [4] and Elastic Net [5], are popular embedded methods. Since wrapper-based methods consider both the feature subset and the classifier, they are generally more effective than filter-based methods. Nevertheless, wrapper methods are more complex, because a classifier is required to re-learn each selected feature subset [6]. Many wrapper-based feature selection methods have been proposed to perform the search process and find the optimal subset of features, such as sequential forward selection (SFS) and sequential backward selection (SBS) [7]. While both methods can produce promising results, they suffer from several problems such as nesting effects [3] and computational complexity [1].
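For reference, the greedy SFS search mentioned above can be sketched as follows. This is an illustrative sketch only; the `evaluate` callback, which scores a candidate feature subset (e.g., by cross-validated classifier accuracy), is our own placeholder and is not part of the paper:

```python
def sequential_forward_selection(features, evaluate):
    """Greedy SFS: repeatedly add the single feature that most improves
    `evaluate(subset)` and stop when no addition helps. This nesting of
    earlier choices is exactly the drawback discussed in the text."""
    selected, best_score = [], evaluate([])
    remaining = list(features)
    while remaining:
        # score every one-feature extension of the current subset
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, f = max(scored)
        if score <= best_score:        # no single addition improves the score
            break
        selected.append(f)             # once added, a feature is never revisited
        remaining.remove(f)
        best_score = score
    return selected, best_score
```

A hypothetical subset scorer (rewarding features 0 and 2, penalizing subset size) shows the greedy behaviour: the search adds the useful features one at a time and then stops.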
To overcome these problems, population-based evolutionary computation (EC) algorithms, such as the genetic algorithm (GA) [8, 9], ant colony optimization (ACO) [10], genetic programming (GP) [11] and particle swarm optimization (PSO) [12], have been widely used. These algorithms have shown promising results in solving single-, multi- and many-objective optimization problems due to their capability of conducting global search. Many EC methods have been adopted for feature selection. Each method has its own advantages and disadvantages. As an example, the learning process of PSO, which is based on the Euclidean distance [13], limits its adaptability. Nevertheless, PSO is less complex, in terms of both run-time and memory requirements, and has the capability of fast convergence as compared with the GA and GP models [14, 15]. PSO also has a simple structure, which requires only a small number of parameters to adjust. These promising properties make PSO a useful approach, and it has been applied in many fields, including feature selection. Comparatively, brain storm optimization (BSO) [16] is a new swarm intelligence algorithm inspired by the human brainstorming process. To the best of our knowledge, the effectiveness of BSO in feature selection problems is yet to be investigated, which is the key focus of this paper. On the other hand, fuzzy ARTMAP (FAM) [17] is a supervised neural network that merges the capability of Adaptive Resonance Theory (ART) [18] in solving the stability-plasticity dilemma with the capability of fuzzy set theory in handling vague and imprecise human linguistic information. FAM is an incremental learning model that operates by comparing the similarity degree between the input samples and its existing prototype nodes against a threshold.
If the similarity measure is not satisfied, FAM is able to incrementally add a new prototype node to its structure to encode the current learning sample, without forgetting or corrupting previously learned samples. It has been used with the GA to solve data classification problems [19, 20, 21]. In this study, we propose a hybrid FAM-BSO model for feature selection. Firstly, FAM is used to learn the input samples incrementally. Then, BSO is adopted to search for and select an optimal feature subset. The proposed FAM-BSO model exploits the concept of the "open prototype" to keep the most important input features while maximizing classification accuracy. Ten benchmark problems are employed to evaluate the effectiveness of FAM-BSO. The results are quantified statistically using the bootstrap method [22] with 95% confidence intervals. In addition, a real-world human motion recognition problem is used to demonstrate the applicability of FAM-BSO. Building upon our previously proposed fuzzy-based feature selection and rule extraction model reported in [21], the main contribution of this study is to hybridize a nature-inspired optimization algorithm, BSO, with a neural-fuzzy algorithm, FAM, to undertake feature selection and classification problems. Unlike the GA and PSO models, which use the best solution and the global best solution, respectively, to update the individual solutions, the brainstorming process of BSO uses all possible solutions to generate new ones. As compared with the GA and PSO models, this updating mechanism helps BSO escape from local optima. While PSO has a fast convergence rate, it is not suitable for multi-modal optimization problems [23]. Nevertheless, FAM-BSO requires longer execution durations owing to the procedure of clustering its solutions (as shown in Section 6.2).
In summary, the main contributions of this paper are two-fold:
• a hybrid FAM-BSO model that is able to maximize classification accuracy and minimize the number of features;
• a comprehensive evaluation of FAM-BSO in feature selection and data classification using benchmark and real-world problems, and a performance comparison pertaining to the feature selection capability of BSO (which constitutes a new application of BSO) with other EC-based algorithms.

The rest of this paper is organized as follows. Section 2 presents a review of both traditional and EC-based feature selection methods. Section 3 explains the structures of the BSO and Adaptive Resonance Theory (ART) models. The details of the proposed FAM-BSO model are described in Section 4. Section 5 presents the computational complexity of FAM-BSO. Section 6 provides the experimental results and discussion. Finally, conclusions and suggestions for further research are given in Section 7.

2. Related Work

This section reviews both traditional and EC-based feature selection methods. Two well-known traditional feature selection methods are SBS [24] and SFS [25]. These methods sequentially add or remove features until no improvement in classification accuracy is observed. Once certain features are removed or selected, they cannot be updated in future steps, which is the main drawback of the SBS and SFS methods. To alleviate this problem, the sequential forward floating selection (SFFS) and sequential backward floating selection (SBFS) algorithms were devised [26], which use a floating search strategy with SFS and SBS, respectively. Both algorithms produce reasonably good results as compared with those from branch and bound methods. Since the traditional methods usually evaluate all possible solutions and then select the best feature subset, they are computationally expensive, especially when the feature dimension is high.
A feature selection method based on a feature association map to tackle both supervised and unsupervised problems was proposed in [27]. The concept of graph theory is used to select significant features. While the proposed method produces acceptable results, it is not able to select the optimal feature subset by employing the maximum independent set or minimum vertex cover. To alleviate the limitations of traditional methods, EC-based algorithms have been applied to tackle feature selection problems. The GA was the first EC-based technique used for feature selection [8], and it outperforms sequential search. Later, in [9], single-objective and two-objective GA-based techniques were proposed for both feature selection and rule extraction. In [28], local search operations are embedded into the GA to devise a hybrid GA (HGA) to tackle feature selection problems. The results show that HGA outperforms the conventional GA. In [29], a novel feature selection method using clustering and ACO was proposed to tackle data classification tasks. It partitions all features into several subsets, and ACO is used to find the optimal one. The proposed method produces better results as compared with those from both filter and wrapper methods. In [30], a hybrid model consisting of ACO and artificial neural networks for feature selection was developed. The resulting hybrid model is applied to medical data sets, and promising results are reported. In [31], an ACO-based feature selection method was proposed to tackle text-related classification tasks. It shows superior results in tackling the Reuters-21578 data set as compared with those of the GA and information gain. GPmtfs [32] is an online feature selection technique based on GP. It constructs a population of classifiers with randomly selected feature subsets. To solve c-class classification problems, GPmtfs equips a classifier with c trees. GPmtfs produces good results as compared with those of SFS and SBS.
In [33], a two-stage GP method for feature selection that does not suffer from the over-fitting problem was developed. The proposed method converges faster and achieves better classification accuracy rates as compared with those of conventional GP. Recently, NiSuFs [34], a niching-based GP feature selection technique, was proposed to select an optimal feature subset. Since NiSuFs uses a surrogate model, it only requires 10% of the conventional GP training time to find the optimal feature subset. The PSO algorithm was first used in [35] to tackle feature selection problems, where a binary PSO algorithm was combined with a feed-forward neural network to construct a parsimonious quantitative structure-activity relationship (QSAR) model. Since then, many PSO-based models have been successfully developed to solve feature selection problems, owing to the simple structure, fast convergence, and cost-effectiveness (in terms of both memory and run time) of PSO. As an example, a feature selection method based on PSO and rough sets proposed in [36] is able to achieve better results for rough set-based feature selection as compared with the conventional GA. A hybrid PSO-SVM (Support Vector Machine) model was introduced in [37] to improve classification accuracy while choosing a small feature subset. PSO is used to simultaneously optimize the SVM parameters and select the best feature subset. In [38], a PSO-based feature selection method, which employs a fuzzy-based fitness function, was introduced. It is able to achieve similar results as compared with those of the conventional GA. Many EC-based techniques have been combined with fuzzy systems. In [39], a generalized knowledge acquisition method combined with a swarm intelligence model, called KASIA-G, was proposed. KASIA-G reduces the number of experiments by allowing the canonical algorithm to use knowledge bases with different sizes.
The outcome indicates that KASIA-G is able to produce similar results to those of the KASIA and Pittsburgh methods in learning fuzzy rules from meta-schedulers pertaining to grid computing. In [40], two EC-based optimization techniques, i.e., the PSO and simulated annealing (SA) algorithms, were used to find optimal Takagi-Sugeno-Kang (TSK) fuzzy models for anti-lock braking systems. In [41], PSO, SA and gravitational search (GS) techniques were employed to tune the parameters of the TSK fuzzy controller, which uses linear matrix inequalities as constraints in the optimization problems. In [42], an adaptive feature selection method based on discrete PSO was developed to solve binary classification problems. The developed method depicts promising results as compared with those of scatter search (SS) and tabu search (TS). In [43], new initialization strategies and updating procedures were proposed for PSO to solve feature selection problems. The developed technique outperforms standard PSO and other traditional feature selection methods. Later, two PSO-based hybrid filter-wrapper feature selection techniques, i.e., FastPSO and RapidPSO [44], were proposed. Both methods reduce the complexity of PSO-based models, while achieving similar or better accuracy rates than when all features are used for classification.

3. The Structure and Dynamics of the BSO and ART Models

In this section, we firstly explain the BSO model. Then, the structure of ART, specifically, the Fuzzy ART and FAM models, is described in detail.

3.1. Brain Storm Optimization

The brainstorming process has been widely used for problems that cannot be solved by one individual. To solve such problems, individuals from different backgrounds are gathered to brainstorm. The main aim of brainstorming is to generate as many different ideas (solutions) as possible, so that good solutions to a specific problem can be obtained from the generated ideas [16, 45].
BSO [16] is a new population-based swarm intelligence method inspired by the human brainstorming process. BSO generates n random possible solutions, and evaluates them based on a fitness function. According to [45], BSO contains three steps: clustering individuals, disrupting the cluster centers, and creating new solutions. BSO clusters the n individuals into m clusters using the k-means clustering algorithm. Similar solutions are clustered together during each generation. A new solution is generated with a probability of P, and it replaces a randomly selected cluster center, through a disrupting cluster center operation. Finally, BSO generates a new individual using one cluster, or by combining two clusters. To generate a new individual, BSO randomly selects one or two cluster(s), with a probability of P-one. Then, BSO forms a selected individual based on one or two cluster(s)/center(s), as follows:

    X_selected^i = X^i,                                 one cluster
    X_selected^i = rand × X1^i + (1 − rand) × X2^i,     two clusters        (1)

where X1^i and X2^i are the i-th dimension of the individuals from the two selected clusters, and rand is a random value between 0 and 1. BSO updates the selected individual as follows:

    X_new^i = X_selected^i + ξ × random(0, 1)                               (2)

where random(0, 1) is a Gaussian random value with zero mean and unit variance, and ξ is the adjusting factor, i.e.,

    ξ = logsig((0.5 × mi − ci) / k) × rand()                                (3)

where mi and ci are the maximum number of iterations and the current iteration, respectively; logsig() indicates the logarithmic sigmoid function, rand() is a random value between 0 and 1, and k is a changing rate for the slope of the logsig() function. The detailed steps of BSO are summarized in Figure 1.
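The generation and update rules of Eqs. (1)-(3), embedded in the overall BSO loop, can be sketched as follows. This is a minimal illustration under our own assumptions: rank-based grouping stands in for k-means, the function names are ours, and the best-so-far solution is tracked explicitly:

```python
import math
import random

def bso_minimize(fitness, dim, n=30, m=3, max_iter=40,
                 p5a=0.2, p6b=0.8, p6biii=0.4, seed=1):
    """Sketch of the BSO loop: cluster, disrupt a center, generate new
    individuals via Eqs. (1)-(3), and greedily keep improvements."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(n)]
    fit = [fitness(x) for x in pop]
    best_i = min(range(n), key=lambda i: fit[i])
    best_x, best_f = list(pop[best_i]), fit[best_i]
    for it in range(1, max_iter + 1):
        order = sorted(range(n), key=lambda i: fit[i])
        clusters = [order[c::m] for c in range(m)]   # crude grouping; cl[0] is the center
        if rng.random() < p5a:                       # disrupt one randomly chosen center
            j = rng.choice(clusters)[0]
            pop[j] = [rng.random() for _ in range(dim)]
            fit[j] = fitness(pop[j])
        # Eq. (3): logsig step-size schedule shrinking over the iterations
        xi = (1.0 / (1.0 + math.exp(-(0.5 * max_iter - it) / 20.0))) * rng.random()
        for i in range(n):
            if rng.random() < p6b:                   # one cluster
                cl = rng.choice(clusters)
                base = pop[cl[0]] if rng.random() < p6biii else pop[rng.choice(cl)]
                x_sel = list(base)
            else:                                    # two clusters: blend centers, Eq. (1)
                c1, c2 = rng.sample(range(m), 2)
                r = rng.random()
                x_sel = [r * a + (1 - r) * b
                         for a, b in zip(pop[clusters[c1][0]], pop[clusters[c2][0]])]
            new = [v + xi * rng.gauss(0.0, 1.0) for v in x_sel]   # Eq. (2)
            f = fitness(new)
            if f < fit[i]:                           # keep the better solution
                pop[i], fit[i] = new, f
                if f < best_f:
                    best_x, best_f = list(new), f
    return best_x, best_f
```

Minimizing a simple sphere function, e.g. `bso_minimize(lambda v: sum(t * t for t in v), 2)`, exercises all three steps of the loop.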
Figure 1: The BSO procedure (adopted from [16]). [Flow chart: generate n random solutions; cluster them into m groups; rank the solutions in each cluster and select the cluster centers; randomly disrupt a cluster center; generate and evaluate new solutions from one or two cluster(s); replace solutions that are improved; repeat until termination.]

Firstly, BSO generates n random solutions and evaluates them. After clustering the solutions into m groups (using the k-means clustering algorithm), they are ranked based on their fitness values, and the best solution in each cluster is set as the cluster center. Next, based on probability p5a, BSO randomly disrupts a cluster center. Then, each solution is updated based on one or two cluster center(s) or solution(s), which are randomly selected based on the probabilities P6biii and P6c, respectively. Finally, if the new solution performs better than the current solution, replacement takes place.

3.2. Adaptive Resonance Theory

Adaptive Resonance Theory (ART) [18] is known as a theory of human cognitive information processing, which has led to the development of many online neural network models, such as ART1 [46], ART2 [47] and Fuzzy ART [48] for solving unsupervised problems, and ARTMAP [49] and Fuzzy ARTMAP (FAM) for solving supervised problems, just to name a few. These models can stably learn to categorize input patterns that are presented in an arbitrary sequence. Fuzzy ART is the most prominent member of the unsupervised ART architectures, with the capability of clustering both binary and analog input patterns.
It has been used to develop a hierarchical network, known as FAM, to map M-dimensional input vectors into N-dimensional output vectors. The dynamics of Fuzzy ART and FAM are explained in detail in Appendices A and B, respectively. The step-by-step learning phase of FAM is summarized in Algorithm 1.

Algorithm 1 The learning phase of FAM
Input: Parameters of FAM and training samples.
Output: Parameters of the trained FAM.
 1: for each training sample (a1, ..., am) do
 2:   Apply complement coding (Eq. (A.1)).
 3:   Calculate the choice function for all prototype nodes (W^a) (Eq. (A.2)).
 4:   Select the winning node using Eq. (A.4) (J-th node in W^a).
 5:   Perform the vigilance test (Eq. (A.5)) for the winning node.
 6:   if the vigilance test is not satisfied then
 7:     Deactivate the winning node (T_J = 0).
 8:     if all prototype nodes in W^a are deactivated then
 9:       Add a new node.
10:     else
11:       Go to Step 3.
12:     end if
13:   end if
14:   Repeat Steps 2-13 for the target vector of the current input and select the winning prototype node (K-th).
15:   Perform the map-field vigilance test (Eq. (B.1)).
16:   if the condition in Eq. (B.1) is not satisfied then
17:     Perform match tracking (Eq. (B.3)).
18:     Deactivate the winning node (J-th) in W^a.
19:     Go to Step 3.
20:   else
21:     Update the winning node using Eq. (B.4).
22:   end if
23: end for

4. The FAM-BSO Model

The proposed FAM-BSO model contains two main stages: the learning stage and the feature selection stage. FAM is used in the first stage to learn the training samples, and BSO is employed in the second stage to extract the best feature subset, with the aim of increasing classification accuracy and reducing model complexity. The details are explained as follows.

4.1. Open prototype nodes

Once FAM learning (as explained in Section 3.2) is completed, all the created prototype nodes in the F2a layer are employed to generate the "open" prototypes, which enable FAM to include the "don't care" antecedent [9].
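The core Fuzzy ART operations underlying Algorithm 1 (complement coding, choice function, vigilance test and fast learning) can be sketched as follows. This is a simplified single-channel illustration with our own function names; it omits the map-field and match-tracking logic of the full FAM, and the parameter symbols (rho, alpha, beta) follow the usual Fuzzy ART conventions rather than reproducing the paper's appendix equations verbatim:

```python
def complement_code(a):
    """Complement coding in the style of Eq. (A.1): a -> [a, 1 - a]."""
    return [float(x) for x in a] + [1.0 - float(x) for x in a]

def fuzzy_art_learn(samples, rho=0.75, alpha=0.001, beta=1.0):
    """Incrementally create/update prototype nodes (simplified category layer).

    rho: vigilance threshold; alpha: choice parameter; beta: learning rate."""
    W = []                                             # prototype weight vectors
    for a in samples:
        I = complement_code(a)
        # choice function for every prototype node: |I ^ w| / (alpha + |w|)
        T = [sum(min(i, w) for i, w in zip(I, wj)) / (alpha + sum(wj)) for wj in W]
        active = set(range(len(W)))
        while active:
            J = max(active, key=lambda j: T[j])        # winning node
            match = sum(min(i, w) for i, w in zip(I, W[J])) / sum(I)
            if match >= rho:                           # vigilance test passed
                # fast learning: w <- beta * (I ^ w) + (1 - beta) * w
                W[J] = [beta * min(i, w) + (1 - beta) * w for i, w in zip(I, W[J])]
                break
            active.discard(J)                          # deactivate, try the next node
        else:
            W.append(list(I))                          # all deactivated: add a new node
    return W
```

With a high vigilance, two dissimilar samples each receive their own prototype node, illustrating the incremental node creation described in the text.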
In FAM, the "don't care" feature can be satisfied by setting the low and high vectors of the corresponding dimension to 0 and 1, respectively. An example of the "open" prototype nodes for an input pattern with three features is shown in Table 1. The number of all possible "open" prototype nodes, excluding the original prototype and the one with all "don't care" antecedents, is (2^d − 2), where d is the dimension of the input pattern.

Table 1: Original and possible open prototype nodes for an input pattern with three features

Prototype node            Feature 1    Feature 2    Feature 3
Original prototype node   A            B            C
Open prototype node 1     A            Don't care   C
Open prototype node 2     Don't care   B            Don't care
Open prototype node 3     Don't care   Don't care   C
Open prototype node 4     A            B            Don't care
Open prototype node 5     A            Don't care   Don't care
Open prototype node 6     Don't care   B            C

4.2. Adaptation of BSO for feature selection

In BSO, each solution, S, contains p × d elements, as follows:

    S = {D_1^1, D_2^1, ..., D_d^1, D_1^2, D_2^2, ..., D_d^2, ..., D_1^p, ..., D_d^p}    (4)

where p is the number of prototype nodes in the F2a layer and d indicates the dimension of each prototype node. Then, D_d^p is considered as follows:

    D_d^p = "don't care" feature,   if D_d^p < θ
    D_d^p = original feature,       if D_d^p ≥ θ                                        (5)

where 0 < θ < 1.

The BSO fitness function is formulated to reduce the classification error, as follows:

    Fitness = 1 − NCP / TNS                                                             (6)

where NCP indicates the number of correctly classified samples and TNS represents the total number of samples. The aim is to minimize the error rate. The step-by-step measurement of the fitness value of a single solution, S, is summarized in Algorithm 2.

Algorithm 2 The procedure for measuring the fitness value of a single solution
Input: Parameters of the trained FAM, validation samples, and a solution S generated by BSO.
Output: Fitness value.
 1: Create the "open" prototypes.
 2: Initialize the "don't care" features (Eqs. (4) and (5)) by setting the low and high vectors of the corresponding dimension to 0 and 1, respectively.
 3: for each validation sample do
 4:   Apply complement coding (Eq. (A.1)).
 5:   Calculate the choice value for all prototype nodes (Eq. (A.2)).
 6:   Select the winning node using Eq. (A.4) (J-th node in W^a).
 7:   Perform the vigilance test (Eq. (A.5)) for the winning node.
 8:   if the condition in Eq. (A.5) is satisfied then
 9:     Select the prototype node as the winning node (J-th).
10:   else
11:     Predict the current validation sample as an unknown target.
12:   end if
13:   Update the performance indicator.
14: end for
15: Compute the fitness value using Eq. (6).

4.3. Summary of the proposed FAM-BSO model

Figure 2 shows the flow chart of FAM-BSO. The data set is firstly split into three subsets, i.e., the learning, validation and test sets. After FAM is trained using the learning set, n random solutions are generated based on Eq. (4). Next, for each solution, "open" prototypes are created using Eq. (5), and the fitness value is measured using the validation set. To determine the fitness value, FAM-BSO computes the choice function of each validation sample using Eq. (A.2), and then conducts the vigilance test for the winning prototype node using Eq. (A.5). After that, all solutions are clustered into m groups using k-means clustering. The cluster centers are disrupted before updating the individuals using Eqs. (1), (2) and (3). After that, "open" prototypes are created and the fitness value of each updated solution is measured. If the new solution performs better than the current solution, replacement takes place. Finally, if the termination condition is satisfied, the best feature subset is used to evaluate performance on the test set. The time complexity of FAM-BSO is analysed in Appendix C.

Figure 2: The proposed FAM-BSO. [Flow chart: split data into learning, validation and test subsets; train FAM using the learning samples; generate n random solutions and evaluate them using the validation samples and Algorithm 2; cluster the solutions into m groups using k-means clustering; disrupt a cluster center; update the individual solutions; repeat until termination; evaluate the selected features using the test samples and Algorithm 2.]

Firstly, FAM-BSO uses FAM to learn based on the training samples. Next, it generates n random solutions and evaluates them using the validation samples and Algorithm 2. After clustering the solutions into m groups, they are ranked in each group based on their fitness values, and the best solution in each group is selected as the cluster center. Then, FAM-BSO randomly disrupts a cluster center. Each solution is updated based on one or two cluster center(s) or solution(s), which is/are randomly selected. This procedure is repeated until the termination condition is satisfied. Finally, the selected feature subset is used to form the test samples for evaluation with Algorithm 2.

5. Experimental Studies

To evaluate the effectiveness of FAM-BSO, ten benchmark data sets from the UCI machine learning repository [50] and a real-world case study, i.e., human motion detection [51], are used. The details of the UCI data sets are listed in Table 2. The selected data sets contain different feature characteristics, in terms of the numbers of features and samples, which are used to assess the feature selection and classification capabilities of FAM-BSO. The PID samples overlap each other, which makes classification difficult. Ionosphere has fewer overlapping samples. Musk version 1 and Hill-valley are high-dimensional problems. Sonar, German, Wine and WBCD pose moderately imbalanced samples. Vehicle and Zoo are multi-class data sets. Zoo has fewer samples, but is highly imbalanced. In addition, these data sets have been employed for performance comparison with other EC-based feature selection methods reported in the literature.
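The solution encoding of Eqs. (4)-(5) and the error-rate fitness of Eq. (6), which Algorithm 2 evaluates on the validation set, can be sketched as follows. This is illustrative only; `predictions` stands in for the output of the FAM recall pass of Algorithm 2, and the function names are ours:

```python
def decode_solution(S, p, d, theta=0.5):
    """Eq. (5): entries of a BSO solution below theta mark the 'don't care'
    features of each of the p prototype nodes (S holds p * d values)."""
    masks = []
    for node in range(p):
        row = S[node * d:(node + 1) * d]
        masks.append([v >= theta for v in row])    # True = feature kept
    return masks

def fitness(predictions, targets):
    """Eq. (6): 1 - NCP / TNS, i.e., the classification error rate to minimize."""
    ncp = sum(1 for y_hat, y in zip(predictions, targets) if y_hat == y)
    return 1.0 - ncp / len(targets)
```

For a solution of two prototype nodes with three dimensions each, `decode_solution` yields one keep/drop mask per node, and a lower `fitness` value indicates a better feature subset.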
Table 2: Details of the UCI data sets

Data set                Input features   Data samples   Classes
Musk Version 1 (MK1)    166              476            2
Sonar                   60               208            2
Wine                    13               178            3
German                  24               1000           2
WBCD                    30               569            2
Vehicle                 18               846            4
Ionosphere              34               351            2
PID                     8                756            2
Zoo                     17               101            7
Hill-valley             100              606            2

In addition to accuracy and complexity (the numbers of prototype nodes and selected features), four performance indicators, i.e., the t-test, F-measure, sensitivity (recall), and specificity, are used for performance comparison (Appendix D). Note that sensitivity and specificity are applicable to two-class problems; therefore, they are excluded for the multi-class data sets (i.e., Wine, Vehicle and Zoo). The experimental parameters are listed in Table 3, and are adopted from [16]. All experiments are conducted using MATLAB 2017a with an Intel 4GHz CPU and 8GB memory.

Table 3: Parameters of FAM-BSO (parameters of BSO are adopted from [16])

n        100     mi    1000         αa     0.001
P5a      0.2     σ     1            αb     0.001
P6b      0.8     m     {3, 5, 7}    ρab    1
P6biii   0.4     µ     0            ρb     1
P6c      0.5     βa    1            ρa     [0.2, 0.9]
k        20      βb    1            θ      [0.3, 0.9]

5.1. Experiments with the UCI data sets

In this section, firstly, θ is varied from 0.3 to 0.9, in order to find an optimal setting that yields high accuracy rates with few features. The results are compared with those of the original FAM and other state-of-the-art models reported in the literature. Then, different ρa settings ranging from 0.2 to 0.9 are used in order to compare FAM-BSO with other EC-based feature selection algorithms, i.e., FAM-GA and FAM-PSO. A total of 70% and 30% of the data samples are used for training (i.e., 60% for learning, and 10% for validation) and testing, respectively. The 10-fold cross-validation method is adopted in the experiment only for the learning phase. In order to quantify the results statistically, each fold is repeated 30 times, and the bootstrap method [22] is used to compute the 95% confidence intervals. All data samples are normalized between 0 and 1.
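The 95% confidence intervals over the 30 repetitions can be obtained with a bootstrap procedure sketched as follows. This is a plain percentile-bootstrap illustration; the paper follows [22], whose exact variant we do not reproduce here:

```python
import random

def bootstrap_ci(scores, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap: resample the accuracy scores with replacement,
    compute the mean of each resample, and take percentile bounds."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        resample = [rng.choice(scores) for _ in scores]
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int((1 - level) / 2 * n_boot)]         # 2.5th percentile
    hi = means[int((1 + level) / 2 * n_boot) - 1]     # 97.5th percentile
    return sum(scores) / len(scores), lo, hi
```

Applied to 30 repeated accuracy scores, it returns the mean together with the lower and upper bounds reported in the result tables.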
Table 4 shows the accuracy rates with the 95% confidence intervals of the original FAM and of FAM-BSO with ρa = 0.8 and different θ settings from 0.3 to 0.9. In general, the original FAM and FAM-BSO depict similar performance statistically, as there are overlaps between the 95% confidence intervals of their accuracy rates. Nevertheless, FAM-BSO uses fewer features, as compared with the original FAM, as shown in Table 5. FAM-BSO with θ = 0.5 outperforms the original FAM and FAM-BSO with other θ settings on 4 (i.e., Sonar, Wine, German and Vehicle) out of 10 data sets. In addition, Table 5 shows the average number of features selected for each prototype node by FAM-BSO and the original FAM (note that the original FAM uses all features). As expected, the number of selected features reduces with increasing θ values. A performance comparison between FAM-BSO and other methods reported in the literature is conducted. In the experiments, θ and ρa are set to 0.5 and 0.8, respectively. Specifically, FAM-BSO is compared with three PSO-based models reported in [43] and [52], the feature association map (U-FAM) [27], GP-based feature selection (GPmtfs) [32], and ACO-based feature selection (ACO-ER) [1]. To have a fair comparison, the same procedure used in [43, 52, 27, 32, 1] is adopted, i.e., 70% and 30% of the data samples are used for training and testing, respectively. Table 6 shows the accuracy rates of FAM-BSO, PSO (4-1) [43], PSO (3-1) [43], BPSO-2Stage [52], U-FAM [27], and ACO-ER [1]. As shown in Table 6, FAM-BSO achieves high accuracy rates in the Sonar, Wine, German, WBCD, PID, and Zoo data sets. While FAM-BSO does not outperform the other methods for the MK1 and Ionosphere data sets, it produces similar results to those of BPSO-2Stage and ACO-ER, respectively. In addition, Table 7 shows the average numbers of selected features of FAM-BSO, PSO (4-1) [43], PSO (3-1) [43], U-FAM [27], and ACO-ER [1].
FAM-BSO with θ = 0.5 does not produce the smallest number of features as compared with those of the other models; it selects approximately 50% of all original features.

We have developed FAM-GA and FAM-PSO for performance comparison with FAM-BSO. Three different numbers of clusters, i.e., 3, 5, and 7, are implemented in BSO. For all methods, different ρa values ranging from 0.2 to 0.9 are used for FAM, while the maximum iteration and particle/population size are set to 1000 and 100, respectively. For FAM-PSO and FAM-BSO, θ is set to 0.6. The GA parameters are set as follows: mutation probability = 0.1, crossover probability = 0.9, and 20 prototypes are replaced in each population. The PSO parameters are set to c1 = c2 = 1.49.

Table 4: The accuracy rates (%) for the UCI data sets with 95% confidence intervals ("Mean", "Upper" and "Lower" indicate the mean accuracy, and the upper and lower bounds of the 95% confidence intervals, respectively).

Data                   FAM      FAM-BSO (θ)
                                0.9      0.8      0.7      0.6      0.5      0.4      0.3
MK1          Upper     88.99    83.02    83.52    83.60    86.11    86.53    88.61    87.03
             Mean      85.72    80.84    82.65    82.23    83.48    84.42    86.67    86.12
             Lower     82.02    78.60    81.80    79.99    81.38    82.09    85.01    84.95
Sonar        Upper     86.50    80.78    84.44    87.87    88.91    90.47    87.01    89.70
             Mean      84.99    77.73    81.93    86.01    86.77    87.97    86.45    86.07
             Lower     83.11    74.72    79.81    83.81    85.37    82.83    85.07    83.17
Wine         Upper     97.14    75.32    85.14    88.25    97.99    98.88    98.43    98.43
             Mean      95.17    69.87    81.67    84.29    95.45    97.15    96.58    96.51
             Lower     92.22    47.71    78.19    74.14    92.06    95.88    94.81    94.66
German       Upper     84.46    72.51    70.21    74.97    81.55    84.90    84.46    85.75
             Mean      79.19    64.76    67.17    71.35    76.97    82.92    81.55    82.69
             Lower     64.21    58.30    58.79    69.44    73.55    80.76    78.60    78.61
WBCD         Upper     97.18    87.59    92.41    95.08    96.13    98.05    98.03    97.87
             Mean      96.01    86.77    91.13    94.54    95.49    96.49    96.65    96.34
             Lower     94.96    85.49    87.72    94.24    94.67    95.34    95.75    94.97
Vehicle      Upper     79.59    58.88    64.23    71.06    77.69    86.31    80.93    81.59
             Mean      78.06    51.26    55.75    66.79    74.87    81.78    79.32    79.68
             Lower     75.41    45.37    52.09    61.09    71.68    79.36    76.71    77.01
Ionosphere   Upper     92.42    87.31    84.71    88.51    89.84    93.43    92.78    94.22
             Mean      91.68    82.67    82.45    87.21    88.88    91.93    91.69    92.18
             Lower     90.91    76.26    80.59    85.75    85.12    91.04    90.70    90.46
PID          Upper     82.17    67.96    66.45    67.75    74.69    77.54    86.68    87.12
             Mean      80.34    62.95    62.05    65.48    73.86    74.26    78.84    79.10
             Lower     79.18    50.68    55.55    62.41    72.64    71.56    75.65    76.16
Zoo          Upper     98.35    83.76    82.06    92.92    97.51    98.42    98.33    99.35
             Mean      95.40    71.33    76.57    87.96    95.85    95.71    94.79    96.22
             Lower     93.35    61.33    73.22    82.50    94.33    93.11    93.02    92.76
Hill-valley  Upper     61.31    56.93    60.02    60.87    59.17    57.06    59.83    60.02
             Mean      58.82    53.31    56.21    59.01    57.62    56.25    55.84    58.63
             Lower     53.54    46.76    53.70    57.11    56.24    55.51    51.89    53.50

Table 5: The mean number of selected features for each prototype node

Data set     FAM    FAM-BSO (θ)
                    0.9      0.8      0.7      0.6      0.5      0.4      0.3
MK1          166    17.63    33.18    49.83    66.63    83.34    99.81    116.38
Sonar        60     8.16     13.80    19.50    24.62    29.85    35.18    40.89
Wine         13     2.04     3.11     4.29     5.26     6.41     7.92     8.98
German       24     3.97     6.78     8.08     10.82    12.03    15.74    17.63
WBCD         30     3.36     7.23     9.92     12.11    14.84    17.88    20.81
Vehicle      18     3.06     4.61     5.80     7.43     9.03     10.61    12.09
Ionosphere   34     5.05     7.83     10.87    13.89    17.05    20.43    23.34
PID          8      1.35     1.94     2.62     3.32     4.01     4.65     5.36
Zoo          17     2.84     4.15     5.43     6.94     8.42     9.97     11.61
Hill-valley  100    15.04    24.76    32.16    39.94    48.34    56.77    67.47

Table 6: The average and best (in parentheses) accuracy rates (%) for the UCI data sets ("-" indicates the data sets were not used for evaluation with the corresponding method, as published in the literature).

Data set     FAM-BSO         PSO(3-1) [43]   PSO(4-1) [43]   BPSO-2Stage [52]   U-FAM [27]   ACO-ER [1]   GPmtfs [32]
MK1          84.42 (88.19)   85.51 (90.21)   84.62 (90.21)   85.70 (89.51)      -            79.53        -
Sonar        87.97 (90.74)   78.03 (85.71)   77.40 (87.30)   -                  72.00        -            86.26
Wine         97.15 (100.0)   96.62 (98.77)   96.15 (98.88)   96.94 (100.0)      89.00        -            94.82
German       82.92 (84.93)   69.13 (71.67)   69.27 (74.33)   68.93 (73.67)      -            70.17        -
WBCD         96.49 (98.83)   93.02 (94.15)   93.78 (94.74)   92.98 (92.98)      93.00        -            96.31
Vehicle      81.78 (88.65)   85.26 (87.99)   85.06 (86.61)   84.47 (85.04)      53.00        79.53        78.45
Ionosphere   91.93 (94.15)   86.50 (94.29)   87.45 (92.38)   89.52 (93.33)      -            92.12        -
PID          74.26 (79.69)   -               -               -                  74.00        -            -
Zoo          95.71 (98.74)   95.62 (97.14)   95.64 (97.14)   -                  -            -            -
Hill-valley  56.25 (57.64)   57.78 (60.71)   57.68 (61.26)   57.61 (60.44)      -            54.13        -

Table 7: The mean number of selected features ("-" indicates the data sets were not used for evaluation with the corresponding method, as published in the literature).

Data set     FAM-BSO   PSO(3-1) [43]   PSO(4-1) [43]   BPSO-2Stage [52]   U-FAM [27]   ACO-ER [1]   GPmtfs [32]
MK1          83.34     107.14          83.54           80.72              -            83.03        -
Sonar        29.85     31.76           11.3            -                  36.00        -            9.45
Wine         6.41      9.44            8.1             5.1                11.01        -            4.08
German       12.03     16.82           13.68           8.62               -            10.76        -
WBCD         14.84     19.06           8.12            6.68               17.01        -            6.72
Vehicle      9.03      10.58           9.54            7.3                9.00         9.86         5.37
Ionosphere   17.05     18.38           3.26            8.9                -            12           -
PID          4.01      -               -               -                  7.00         -            -
Zoo          8.42      9.96            7.98            -                  -            -            -
Hill-valley  48.34     60.65           13.16           37.1               -            47.63        -

Figure 3 shows the accuracy rates of the original FAM, FAM-GA, FAM-PSO and FAM-BSO with different numbers of clusters. For all except the Hill-valley and Sonar data sets, FAM-BSO with m=3, 5 and 7 performs better than or similarly to the other methods.
However, for the Hill-valley data set, FAM-BSO with m=3 and 5, and for the Sonar data set, FAM-BSO with m=7, produce inferior results as compared with those from other methods. In addition, the numbers of prototype nodes created in FAM are shown in Figure 4. Note that all models, i.e., FAM-GA, FAM-PSO and FAM-BSO, have the same numbers of prototype nodes, since FAM is used as the underlying learning model. The number of prototype nodes increases as ρa is varied from 0.2 to 0.9, as shown in Figure 4.

Table 8 shows the numbers of features selected by FAM-GA, FAM-PSO and FAM-BSO. For all models, ρa is set to 0.7, and θ in both FAM-PSO and FAM-BSO is set to 0.6. FAM-BSO is able to reduce the model complexity considerably by selecting fewer or similar numbers of features as compared with those of FAM-GA and FAM-PSO. Moreover, the t-test scores, F-measure scores, sensitivity and specificity rates are presented in Appendix E.

5.2. A real-world case study

In this section, FAM-BSO is evaluated with a real-world case study, i.e., human motion detection. A total of 390 raw waveform data samples from different subjects (humans) have been collected using the built-in 3-axis accelerometer of smartphones. Three smartphones are located in three pockets of each subject: the shirt pocket, belt pocket, and front pocket [53], [54]. Each subject performs three types of activities: walking, running, and climbing. The raw waveform data are recorded in the time domain.
Figure 3: The accuracy rates of FAM-GA, FAM-PSO and FAM-BSO with different ρa settings from 0.2 to 0.9. Except for the Hill-valley and Sonar data sets, FAM-BSO produces better or similar results as compared with those of FAM-GA and FAM-PSO. (Panels: MK1, Sonar, Wine, German, WBCD, Vehicle, Ionosphere, PID, Zoo, Hill-valley; curves: FAM-GA, FAM-PSO, FAM-BSO (m=3), FAM-BSO (m=5), FAM-BSO (m=7), FAM.)

Figure 4: The number of prototype nodes created by FAM with different ρa settings. The number of nodes increases as ρa increases from 0.2 to 0.9. (Panels: MK1, Sonar, Wine, German, WBCD, Vehicle, Ionosphere, PID, Zoo, Hill-valley.)

Table 8: The mean number of selected features by FAM-GA, FAM-PSO and FAM-BSO with m=3, 5 and 7 (ρa = 0.7, and for both PSO and BSO θ is set to 0.6).

Data set     FAM-GA   FAM-PSO   FAM-BSO
                                m=3      m=5      m=7
MK1          83.06    56.22     55.6     55.93    62.27
Sonar        30.14    18.66     26.03    23.22    27.21
Wine         6.61     5.29      5.54     4.96     5.23
German       13.12    20.11     16.68    13.42    13.81
WBCD         19.93    19.08     14.82    14.25    14.19
Vehicle      9.92     13.58     11.09    10.75    10.64
Ionosphere   17.9     11.99     12.98    16.16    13.24
PID          4.46     6.72      4.84     4.8      4.39
Zoo          8.56     9.01      8.1      8.23     7.48
Hill-valley  50.12    62.1      41.12    42.56    48.46

After pre-processing, nine statistical features, i.e., mean (m), root mean square (RMS), standard deviation (SD), skewness, kurtosis, crest factor (CF), latitude factor (LF), shape factor (SF) and impulse factor (IF), are extracted and normalized between 0 and 1.
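The text does not spell out the formulas for these statistics, so the following per-axis sketch uses common vibration-analysis definitions; the crest, latitude, shape, and impulse factor expressions in particular are assumptions based on standard conventions, not the paper's exact implementation:

```python
import math

def waveform_features(x):
    """Nine time-domain statistics for one accelerometer axis (illustrative)."""
    n = len(x)
    mean = sum(x) / n
    rms = math.sqrt(sum(v * v for v in x) / n)
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    skew = sum((v - mean) ** 3 for v in x) / (n * sd ** 3)
    kurt = sum((v - mean) ** 4 for v in x) / (n * sd ** 4)
    peak = max(abs(v) for v in x)
    abs_mean = sum(abs(v) for v in x) / n
    crest = peak / rms                        # CF: peak over RMS
    shape = rms / abs_mean                    # SF: RMS over mean absolute value
    impulse = peak / abs_mean                 # IF: peak over mean absolute value
    latitude = peak / (sum(math.sqrt(abs(v)) for v in x) / n) ** 2  # LF (margin factor)
    return [mean, rms, sd, skew, kurt, crest, latitude, shape, impulse]

# Toy samples from one axis; repeating this per axis yields 9 x 3 = 27 features.
x = [0.1, -0.4, 0.9, -0.2, 0.5, -0.7, 0.3, -0.1]
feats = waveform_features(x)
```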
Therefore, each data sample consists of 27 features (9 statistical features × 3 axes). The extracted features are fed into FAM-BSO to recognize the respective subject's motion, i.e., walking, climbing, or running. The 10-fold cross-validation is repeated 10 times, whereby 90% of the data samples are used for training (80% for learning and 10% for feature selection) and the remaining 10% are used for test. To evaluate the robustness of FAM-BSO, noise is injected by randomly altering 10% of the class labels pertaining to the learning samples. The ρa value of all methods, i.e., FAM, FAM-GA, FAM-PSO and FAM-BSO, is set to 0.7, and θ for both FAM-PSO and FAM-BSO is set to 0.6.

A design of experiments (DOE) method, i.e., the central composite design (CCD) [55], is used to analyze the effects of different settings of the number of clusters (m) and the probability Pone (P6b) on the FAM-BSO performance. In this experiment, each parameter is set to three levels, i.e., low, medium and high, as shown in Table 9. Figure 5 shows the five possible combinations of the parameters. The average fitness values with standard deviations of all five experimental configurations are shown in Table 10. Experiment 5 (FAM-BSO with m=7 and Pone = 0) achieves the lowest fitness value for both noisy and noise-free data samples. This indicates that the individuals generated using two clusters (i.e., Pone = 0) perform the best, demonstrating the capability of the brainstorming process of BSO.

Table 9: The parameters and levels of the central composite design.

Parameter                  Low (-1)   Medium (0)   High (+1)
Number of clusters (m)     3          5            7
Probability Pone (P6b)     0          0.5          1

Figure 5: A central composite design for two parameters, which consists of five points.

Table 10: Effects of the number of clusters and the probability (Pone) on the average fitness value (standard deviation) of FAM-BSO.

Experiment                       Noise-free (std. dev.)   Noisy (std. dev.)
Experiment 1 (m=3, Pone=0)       0.0154 (1.4e-3)          0.0616 (10.7e-3)
Experiment 2 (m=5, Pone=0.5)     0.0246 (8.8e-3)          0.0669 (13.5e-3)
Experiment 3 (m=7, Pone=1)       0.0112 (5.7e-3)          0.0613 (8.8e-3)
Experiment 4 (m=3, Pone=1)       0.0236 (17.9e-3)         0.0658 (9.5e-3)
Experiment 5 (m=7, Pone=0)       0.0098 (2.4e-3)          0.0574 (6.8e-3)

We developed the Extreme Learning Machine (ELM)-BSO model for further performance comparison with FAM-BSO. Tables 11 and 12 show the accuracy rates and the numbers of selected features for noise-free and noisy data samples. As expected, the accuracy rates of all methods decrease when noise is injected into the learning samples. FAM-BSO with m=3 and m=5 produces the best accuracy rates for the noise-free and noisy data sets, respectively. However, ELM-BSO with different numbers of clusters manages to use fewer features (Table 12).

Table 11: The mean accuracy rates with standard deviations of FAM, ELM, FAM-GA, FAM-PSO, ELM-BSO (m=3, 5, 7) and FAM-BSO (m=3, 5, 7) for the human motion recognition task (for FAM, ρa = 0.7, while for both BSO and PSO θ is set to 0.6).

Noise level   FAM        ELM        FAM-GA     FAM-PSO    ELM-BSO                          FAM-BSO
                                                          m=3        m=5        m=7        m=3        m=5        m=7
0%            95.97±1.6  83.82±3.9  96.83±1.3  96.42±1.5  94.87±3.6  94.26±3.9  94.18±2.4  97.04±1.1  96.76±1.5  96.35±1.6
10%           89.18±2.1  76.59±6.6  93.63±1.6  92.30±1.9  91.82±3.7  92.42±2.7  92.05±2.5  93.60±1.4  93.90±1.5  93.70±1.8

Table 12: The mean number of selected features by FAM, ELM, FAM-GA, FAM-PSO, ELM-BSO (m=3, 5, 7) and FAM-BSO (m=3, 5, 7) for the human motion recognition task (for FAM, ρa = 0.7, while for both PSO and BSO θ is set to 0.6).

Noise level   FAM & ELM   FAM-GA   FAM-PSO   ELM-BSO             FAM-BSO
                                             m=3    m=5    m=7   m=3    m=5    m=7
0%            27          12.8     17.2      12.6   11.9   12.1  13.6   15.8   15.7
10%           27          17.1     18.8      14.9   15.2   14.7  17.6   18.8   17.2

Table 13 shows the pairwise t-test results for performance comparisons between FAM-BSO (m=3, 5, 7) and FAM-GA, FAM-PSO and ELM-BSO (m=3, 5, 7). In terms of accuracy (Table 13 (a)), FAM-BSO performs similarly to or better than FAM-GA, except for m=7 with noise-free data. FAM-BSO performs better than FAM-PSO and ELM-BSO on both noise-free and noisy data. In terms of complexity (Table 13 (b)), FAM-BSO with m=3 for noise-free data, and with m=3 and 7 for noisy data, performs similarly to FAM-GA. FAM-BSO (m=3, 5, 7) selects fewer features as compared with FAM-PSO, while ELM-BSO uses fewer features as compared with FAM-BSO.

Table 13: The t-test results for pairwise comparison of FAM-BSO versus other methods for the human motion recognition task, in terms of (a) accuracy and (b) number of selected features (for FAM, ρa = 0.7; for both PSO and BSO θ is set to 0.6; symbol "+"/"-" indicates that FAM-BSO performs significantly better/worse than the indicated method, while symbol "=" means both achieve a similar performance).

(a) Accuracy
Noise level   m    FAM-GA   FAM-PSO   ELM-BSO
                                      m=3   m=5   m=7
0%            7    -        =         +     +     +
              5    =        +         +     +     +
              3    +        +         +     +     +
10%           7    =        +         +     +     +
              5    =        +         +     +     +
              3    =        +         +     +     +

(b) Number of selected features
Noise level   m    FAM-GA   FAM-PSO   ELM-BSO
                                      m=3   m=5   m=7
0%            7    -        +         -     -     -
              5    -        +         -     -     -
              3    =        +         -     -     -
10%           7    =        +         -     -     -
              5    -        =         -     -     -
              3    =        +         -     -     -

Table 14 shows the execution times of the various methods. As expected, both ELM-BSO and FAM-BSO require longer execution times as compared with those of FAM-GA and FAM-PSO, and the times increase with more clusters and with noise. The long execution time is mainly due to the clustering of solutions in BSO. Although an individual ELM requires a shorter execution time than FAM, ELM-BSO requires a longer computational time than FAM-BSO. This is mainly because ELM must be re-trained for every subset of features, while FAM only needs to use its open prototype nodes to compute the fitness value of the selected feature subset.

Table 14: Computational times (seconds) with standard deviations of FAM, ELM, FAM-GA, FAM-PSO, ELM-BSO (m=3, 5, 7) and FAM-BSO (m=3, 5, 7) for the human motion recognition task (for FAM, ρa = 0.7, while for both PSO and BSO θ is set to 0.6).

Noise level   FAM          ELM          FAM-GA      FAM-PSO     ELM-BSO                                   FAM-BSO
                                                                m=3           m=5           m=7           m=3          m=5          m=7
0%            0.42±0.016   0.35±0.012   51.0±3.60   63.6±4.32   591.2±14.14   608.5±10.81   597.3±16.60   201.6±6.73   219.0±9.42   226.8±7.43
10%           0.46±0.014   0.36±0.016   56.4±2.86   79.2±3.92   652.1±14.75   696.3±17.60   654.7±16.75   212.4±7.32   224.4±6.08   234.0±6.46

In addition, we reduced the number of iterations of BSO (for both ELM-BSO and FAM-BSO) to 350, which requires an execution time similar to those of FAM-GA and FAM-PSO. Table 15 shows the accuracy rates and numbers of selected features of FAM-BSO (m=3, 5, 7) in this setting. In terms of accuracy, as can be seen in Table 16 (a) (the pairwise t-test results), FAM-BSO performs similarly to FAM-GA, except for m=3 with noisy data, while it outperforms FAM-PSO and ELM-BSO. In terms of complexity (number of features), FAM-BSO performs worse than FAM-GA for noise-free data, but similarly for noisy data. FAM-BSO generally performs better than FAM-PSO and ELM-BSO on both noise-free and noisy data (Table 16 (b)).

Finally, we used Principal Component Analysis (PCA) to rank the features and remove those with a variance score lower than 0.05. Then, the remaining features were used to evaluate the performance of FAM-BSO and the other methods. After feature ranking with PCA, a total of 15 and 13 features remained for the noise-free and noisy data sets, respectively. Table 17 shows the accuracy rates of the various methods. FAM-BSO outperforms the other methods for both the noise-free and noisy data sets. Overall, the accuracy of all methods is reduced for both noise-free and noisy samples as compared with the results obtained with all features (Table 11).
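The exact PCA-based scoring is not detailed in the text; as a simplified stand-in that captures the same idea, features whose variance on the [0, 1]-normalized data falls below the 0.05 threshold can be filtered as follows (the function and variable names are hypothetical):

```python
def low_variance_filter(samples, threshold=0.05):
    """Keep feature indices whose variance is at least `threshold`.

    Simplified stand-in for the paper's PCA-based ranking (illustrative only).
    `samples` is a list of equal-length feature vectors scaled to [0, 1].
    """
    n = len(samples)
    d = len(samples[0])
    kept = []
    for j in range(d):
        col = [s[j] for s in samples]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        if var >= threshold:
            kept.append(j)
    return kept

data = [
    [0.10, 0.50, 0.51],
    [0.90, 0.52, 0.49],
    [0.20, 0.49, 0.50],
    [0.80, 0.51, 0.50],
]
kept = low_variance_filter(data)  # the nearly constant second and third features are dropped
```

A pruning step of this kind plays the role of the variance-based filtering that leaves 15 and 13 of the 27 motion features in the experiments above.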
From Table 18, all methods select approximately 50% of the features as compared with the settings that use all features (Table 12).

Table 15: The accuracy rates and the mean number of selected features of FAM-BSO (m=3, 5, 7) for the human motion recognition task (for FAM, ρa = 0.7, and the number of iterations is 350).

Noise level   Accuracy                               Number of selected features
              m=3         m=5         m=7            m=3    m=5    m=7
0%            96.75±1.6   96.64±1.9   96.78±1.4      14.2   17.6   16.4
10%           93.16±1.7   93.52±1.5   93.48±1.6      17.8   15.2   19.8

Table 16: The t-test results for pairwise comparison of FAM-BSO versus other methods for the human motion recognition task, in terms of (a) accuracy and (b) number of selected features (for FAM, ρa = 0.7; for both PSO and BSO θ is set to 0.6; symbol "+"/"-" indicates that FAM-BSO performs significantly better/worse than the indicated method, while symbol "=" means both achieve a similar performance).

(a) Accuracy
Noise level   m    FAM-GA   FAM-PSO   ELM-BSO
                                      m=3   m=5   m=7
0%            7    =        +         +     +     +
              5    =        +         +     +     +
              3    =        +         +     +     +
10%           7    =        +         +     +     +
              5    =        +         +     +     +
              3    -        +         +     +     +

(b) Number of selected features
Noise level   m    FAM-GA   FAM-PSO   ELM-BSO
                                      m=3   m=5   m=7
0%            7    =        +         +     +     +
              5    =        +         +     +     +
              3    =        +         +     +     +
10%           7    =        +         +     +     +
              5    =        +         +     +     +
              3    -        +         +     +     +

5.3. Discussion

In general, FAM-BSO is able to produce promising results, in terms of both accuracy and the number of selected features, as compared with the original FAM and other feature selection methods. Nevertheless, several improvements are possible. One possible improvement relates to generating and updating new solutions in BSO. In this research, the solutions (individuals) are generated based on a normal distribution and are updated using a Gaussian random number. In this aspect, binary encodings similar to those of the GA can be used. In addition, BSO uses the distance-based k-means clustering algorithm to cluster the solutions. A long execution time is required to measure the distance between each sample and the cluster centers, and this becomes worse when the feature dimension is high. As such, other clustering algorithms could be investigated in order to accelerate the BSO computation.

Table 17: The mean accuracy rates with standard deviations of FAM, ELM, FAM-GA, FAM-PSO, ELM-BSO (m=3, 5, 7) and FAM-BSO (m=3, 5, 7) after removing low-variance features using PCA for the human motion recognition task (for FAM, ρa = 0.7, while for both BSO and PSO θ is set to 0.6).

Noise level   FAM         ELM         FAM-GA      FAM-PSO     ELM-BSO                           FAM-BSO
                                                              m=3        m=5        m=7         m=3        m=5        m=7
0%            93.21±2.7   81.27±2.4   94.66±1.8   94.93±2.2   91.16±1.8  90.54±1.3  93.83±2.7   95.12±2.7  95.62±2.1  94.63±2.8
10%           82.45±3.2   77.34±3.8   85.37±2.7   85.42±2.2   83.22±2.5  82.42±2.7  83.65±2.5   87.13±3.5  86.84±3.8  87.57±2.3

Table 18: The mean number of selected features by FAM, ELM, FAM-GA, FAM-PSO, ELM-BSO (m=3, 5, 7) and FAM-BSO (m=3, 5, 7) after removing low-variance features using PCA for the human motion recognition task (for FAM, ρa = 0.7, while for both PSO and BSO θ is set to 0.6).

Noise level   FAM & ELM   FAM-GA   FAM-PSO   ELM-BSO           FAM-BSO
                                             m=3   m=5   m=7   m=3   m=5   m=7
0%            15          6.1      8.6       5.8   5.9   6.3   6.1   6.5   6.6
10%           13          5.4      5.9       6.2   5.1   4.9   5.9   6.1   7.1

6. Summary

In this paper, a new FAM-BSO model, which is an evolutionary computation-based feature selection method for data classification, has been proposed. Firstly, FAM is used as the underlying model to learn the training samples. Then, the BSO model is adopted to search for and select the best feature subset that maximizes the classification accuracy while minimizing the number of features. The effectiveness of FAM-BSO has been evaluated using ten benchmark data sets and a real-world case study, i.e., human motion detection. The results of FAM-BSO were compared with those of the original FAM and other state-of-the-art methods. Overall, FAM-BSO is able to produce good results, which are similar to, if not better than, those of other methods reported in the literature.
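As noted in the discussion, BSO generates a new individual by perturbing a selected individual or cluster centre with Gaussian noise whose magnitude shrinks over the iterations. A minimal sketch of that update step follows; the log-sigmoid step-size schedule and parameter names are assumptions taken from the standard BSO formulation in [16], not this paper's exact code:

```python
import math
import random

def bso_step_size(t, max_iter, k=20):
    """Step-size schedule from the standard BSO formulation: a log-sigmoid
    that shrinks the Gaussian perturbation as the iteration count t grows."""
    return 1.0 / (1.0 + math.exp(-(0.5 * max_iter - t) / k)) * random.random()

def generate_individual(selected, t, max_iter, mu=0.0, sigma=1.0):
    """Perturb a selected individual (or cluster centre) dimension-wise
    with Gaussian noise, as described in the discussion above."""
    xi = bso_step_size(t, max_iter)
    return [v + xi * random.gauss(mu, sigma) for v in selected]

random.seed(0)
centre = [0.4, 0.7, 0.1]          # hypothetical cluster centre (feature weights)
child = generate_individual(centre, t=10, max_iter=1000)
```

Early in the run the perturbation is large (exploration); late in the run the log-sigmoid drives it towards zero (exploitation).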
However, FAM-BSO still needs a more detailed analysis, so that its robustness can be improved. Our future work is focused on improving FAM-BSO with better methods for generating and updating new solutions, as well as for clustering the solutions. In addition, the possibility of using Variational Auto-Encoders to design a generative model of data for prediction, which can be combined with BSO, will be investigated. The real-world application of FAM-BSO in other domains will also be studied.

Acknowledgment

This work is partially supported by the National Natural Science Foundation of China (Grant Nos. 61773197, 61772344, 61811530324, 61732011 and 61761136008), the Nanshan District Science and Technology Innovation Bureau (Grant No. LHTD20170007), the Science and Technology Innovation Committee of Shenzhen City (Grant Nos. CKFW2016041415372174, GJHZ20170314114424152), and the Natural Science Foundation of Shenzhen University (Grant Nos. 827-000140, 827-000230, and 2017060).

References

[1] E. Hancer, B. Xue, M. Zhang, D. Karaboga, B. Akay, Pareto front feature selection based on artificial bee colony optimization, Information Sciences 422 (2018) 462–479.

[2] A. K. Das, S. Sengupta, S. Bhattacharyya, A group incremental feature selection for classification using rough set theory based genetic algorithm, Applied Soft Computing 65 (2018) 400–411.

[3] B. Ma, Y. Xia, A tribe competition-based genetic algorithm for feature selection in pattern classification, Applied Soft Computing 58 (2017) 328–338.

[4] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B (Methodological) (1996) 267–288.

[5] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2) (2005) 301–320.

[6] Y. Zhang, D. Gong, Y. Hu, W.
Zhang, Feature selection algorithm based on bare bones particle swarm optimization, Neurocomputing 148 (2015) 150–157.

[7] H. Liu, L. Yu, Toward integrating feature selection algorithms for classification and clustering, IEEE Transactions on Knowledge and Data Engineering 17 (4) (2005) 491–502.

[8] W. Siedlecki, J. Sklansky, A note on genetic algorithms for large-scale feature selection, Pattern Recognition Letters 10 (5) (1989) 335–347.

[9] H. Ishibuchi, T. Murata, I. Türkşen, Single-objective and two-objective genetic algorithms for selecting linguistic rules for pattern classification problems, Fuzzy Sets and Systems 89 (2) (1997) 135–150.

[10] M. H. Aghdam, N. Ghasem-Aghaee, M. E. Basiri, Text feature selection using ant colony optimization, Expert Systems with Applications 36 (3) (2009) 6843–6853.

[11] D. P. Muni, N. R. Pal, J. Das, Genetic programming for simultaneous feature selection and classifier design, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 36 (1) (2006) 106–117.

[12] B. Xue, M. Zhang, W. N. Browne, Particle swarm optimization for feature selection in classification: A multi-objective approach, IEEE Transactions on Cybernetics 43 (6) (2013) 1656–1671.

[13] F. Hafiz, A. Swain, N. Patel, C. Naik, A two-dimensional (2-D) learning framework for particle swarm based feature selection, Pattern Recognition 76 (2018) 416–433.

[14] J. F. Chang, A performance comparison between genetic algorithms and particle swarm optimization applied in constructing equity portfolios, International Journal of Innovative Computing, Information and Control 5 (12) (2009) 5069–5079.

[15] P. Moradi, M. Gholampour, A hybrid particle swarm optimization for feature subset selection by integrating a novel local search strategy, Applied Soft Computing 43 (2016) 117–130.

[16] Y. Shi, Brain Storm Optimization Algorithm, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 303–309.

[17] G. A. Carpenter, S. Grossberg, N. Markuzon, J. H.
Reynolds, D. B. Rosen, Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps, IEEE Transactions on Neural Networks 3 (5) (1992) 698–713.

[18] S. Grossberg, Adaptive pattern classification and universal recoding: II. Feedback, expectation, olfaction, illusions, Biological Cybernetics 23 (4) (1976) 187–202.

[19] A. Al-Daraiseh, M. Georgiopoulos, G. Anagnostopoulos, A. S. Wu, M. Mollaghasemi, GFAM: A genetic algorithm optimization of fuzzy ARTMAP, in: IEEE International Conference on Fuzzy Systems, 2006, pp. 315–322.

[20] Y. Zhang, H. Ji, W. Zhang, TPPFAM: Use of threshold and posterior probability for category reduction in fuzzy ARTMAP, Neurocomputing 124 (2014) 63–71.

[21] F. Pourpanah, C. P. Lim, J. M. Saleh, A hybrid model of fuzzy ARTMAP and genetic algorithm for data classification and rule extraction, Expert Systems with Applications 49 (2016) 74–85.

[22] B. Efron, Bootstrap methods: another look at the jackknife, in: Breakthroughs in Statistics, Springer, 1992, pp. 569–593.

[23] S. Cheng, J. Chen, X. Lei, Y. Shi, Locating multiple optima via brain storm optimization algorithms, IEEE Access 6 (2018) 17039–17049.

[24] T. Marill, D. Green, On the effectiveness of receptors in recognition systems, IEEE Transactions on Information Theory 9 (1) (1963) 11–17.

[25] K. J. Wang, K. H. Chen, M. A. Angelia, An improved artificial immune recognition system with the opposite sign test for feature selection, Knowledge-Based Systems 71 (2014) 126–145.

[26] P. Pudil, J. Novovičová, J. Kittler, Floating search methods in feature selection, Pattern Recognition Letters 15 (11) (1994) 1119–1125.

[27] A. K. Das, S. Goswami, A. Chakrabarti, B. Chakraborty, A new hybrid feature selection approach using feature association map for supervised and unsupervised classification, Expert Systems with Applications 88 (2017) 81–94.

[28] I. S. Oh, J. S. Lee, B. R.
Moon, Hybrid genetic algorithms for feature selection, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (11) (2004) 1424–1437.

[29] P. Moradi, M. Rostami, Integration of graph clustering with ant colony optimization for feature selection, Knowledge-Based Systems 84 (2015) 144–161.

[30] R. K. Sivagaminathan, S. Ramakrishnan, A hybrid approach for feature subset selection using neural networks and ant colony optimization, Expert Systems with Applications 33 (1) (2007) 49–60.

[31] M. H. Aghdam, N. Ghasem-Aghaee, M. E. Basiri, Text feature selection using ant colony optimization, Expert Systems with Applications 36 (3) (2009) 6843–6853.

[32] D. P. Muni, N. R. Pal, J. Das, Genetic programming for simultaneous feature selection and classifier design, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 36 (1) (2006) 106–117.

[33] R. A. Davis, A. J. Charlton, S. Oehlschlager, J. C. Wilson, Novel feature selection method for genetic programming using metabolomic 1H NMR data, Chemometrics and Intelligent Laboratory Systems 81 (1) (2006) 50–59.

[34] Y. Mei, S. Nguyen, B. Xue, M. Zhang, An efficient feature selection algorithm for evolving job shop scheduling rules with genetic programming, IEEE Transactions on Emerging Topics in Computational Intelligence 1 (5) (2017) 339–353.

[35] D. K. Agrafiotis, W. Cedeno, Feature selection for structure-activity correlation using binary particle swarms, Journal of Medicinal Chemistry 45 (5) (2002) 1098–1107.

[36] X. Wang, J. Yang, X. Teng, W. Xia, R. Jensen, Feature selection based on rough sets and particle swarm optimization, Pattern Recognition Letters 28 (4) (2007) 459–471.

[37] C. L. Huang, J. F. Dun, A distributed PSO–SVM hybrid system with feature selection and parameter optimization, Applied Soft Computing 8 (4) (2008) 1381–1391.

[38] B.
Chakraborty, Feature subset selection by particle swarm optimization with fuzzy fitness function, in: IEEE International Conference on Intelligent System and Knowledge Engineering (ISKE), Vol. 1, 2008, pp. 1038–1042.

[39] R. P. Prado, J. E. Muñoz Expósito, S. García Galán, Flexible fuzzy rule bases evolution with swarm intelligence for meta-scheduling in grid computing, Computing and Informatics 33 (4) (2015) 810–830.

[40] R. E. Precup, M. C. Sabau, E. M. Petriu, Nature-inspired optimal tuning of input membership functions of Takagi-Sugeno-Kang fuzzy models for anti-lock braking systems, Applied Soft Computing 27 (2015) 575–589.

[41] S. Vrkalovic, T. A. Teban, I.-D. Borlea, Stable Takagi-Sugeno fuzzy control designed by optimization, Int. J. Artif. Intell. 15 (2) (2017) 17–29.

[42] A. Unler, A. Murat, A discrete particle swarm optimization method for feature selection in binary classification problems, European Journal of Operational Research 206 (3) (2010) 528–539.

[43] B. Xue, M. Zhang, W. N. Browne, Particle swarm optimisation for feature selection in classification: Novel initialisation and updating mechanisms, Applied Soft Computing 18 (2014) 261–276.

[44] T. Butler-Yeoman, B. Xue, M. Zhang, Particle swarm optimisation for feature selection: A hybrid filter-wrapper approach, in: IEEE Congress on Evolutionary Computation (CEC), 2015, pp. 2428–2435.

[45] Y. Shi, An optimization algorithm based on brainstorming process, Emerging Research on Swarm Intelligence and Algorithm Optimization (2015) 1–35.

[46] G. A. Carpenter, S. Grossberg, A massively parallel architecture for a self-organizing neural pattern recognition machine, Computer Vision, Graphics, and Image Processing 37 (1) (1987) 54–115.

[47] G. A. Carpenter, S. Grossberg, ART 2: Self-organization of stable category recognition codes for analog input patterns, Applied Optics 26 (23) (1987) 4919–4930.

[48] G. A. Carpenter, S. Grossberg, D. B.
Rosen, Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system, Neural Networks 4 (6) (1991) 759–771.

[49] G. A. Carpenter, S. Grossberg, J. H. Reynolds, ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network, Neural Networks 4 (5) (1991) 565–588.

[50] M. Lichman, UCI machine learning repository (2013). URL http://archive.ics.uci.edu/ml

[51] C. J. Tan, C. P. Lim, Y. N. Cheah, A multi-objective evolutionary algorithm-based ensemble optimizer for feature selection and classification with neural network models, Neurocomputing 125 (2014) 217–228.

[52] B. Xue, M. Zhang, W. N. Browne, New fitness functions in binary particle swarm optimisation for feature selection, in: IEEE Congress on Evolutionary Computation (CEC), 2012, pp. 1–8.

[53] F. Pourpanah, C. P. Lim, Q. Hao, A reinforced fuzzy ARTMAP model for data classification, International Journal of Machine Learning and Cybernetics (2018) 1–13.

[54] F. Pourpanah, B. Zhang, R. Ma, Q. Hao, Non-intrusive human motion recognition using distributed sparse sensors and the genetic algorithm based neural network, in: 2018 IEEE SENSORS, 2018, pp. 1–4.

[55] C. H. Wang, T. W. Lin, Improved particle swarm optimization to minimize periodic preventive maintenance cost for series-parallel systems, Expert Systems with Applications 38 (7) (2011) 8963–8969.

[56] L. A. Zadeh, Fuzzy sets, in: Fuzzy Sets, Fuzzy Logic, and Fuzzy Systems: Selected Papers by Lotfi A. Zadeh, World Scientific, 1996, pp. 394–432.

[57] T. H. Cormen, Introduction to Algorithms, MIT Press, 2009.

[58] T. Burwick, F. Joublin, Optimal algorithmic complexity of fuzzy ART, Neural Processing Letters 7 (1) (1998) 37–41.

[59] F. Pourpanah, C. J. Tan, C. P. Lim, J. Mohamad-Saleh, A Q-learning-based multi-agent system for data classification, Applied Soft Computing 52 (2017) 519–531.

[60] V. R. Patel, R. G.
Mehta, Impact of outlier removal and normalization approach in modified k-means clustering algorithm, International Journal of Computer Science Issues (IJCSI) 8 (5) (2011) 331–336.

[61] J. L. Melsa, D. L. Cohn, Decision and Estimation Theory, McGraw-Hill Series in Electrical Engineering: Communications and Information Theory, McGraw-Hill, 1978.

Appendix A. Fuzzy ART

As shown in Figure A.1, the Fuzzy ART network includes three layers: the pre-processing layer f0, the input layer f1, and the recognition layer f2. The pre-processing layer performs complement coding, in order to avoid the category proliferation problem, which transforms the M-dimensional input sample a ∈ [0, 1]^M into a 2M-dimensional input sample, as follows:

A = (a1, ..., aM, 1 − a1, ..., 1 − aM)    (A.1)

Figure A.1: The structure of Fuzzy ART. The f0 layer propagates the complement-coded input pattern to the input layer (f1). Then, the similarity level between the input pattern and each existing prototype pattern in the f2 layer is measured using Eqs. (A.2)-(A.5). If none of the existing prototype patterns in the f2 layer is able to satisfy the conditions (Eqs. (A.2)-(A.5)), a new node is added in the f2 layer to encode the current input pattern.

The input layer propagates the complement-coded pattern into the recognition layer. The recognition layer is a dynamic layer containing prototype nodes, whereby a new prototype node can be added whenever necessary. During learning, Fuzzy ART receives the complement-coded input sample (A). To determine the similarity level between the current input A and the j-th prototype node in f2, the choice function is used, as follows:

Tj = |A ∧ Wj| / (α + |Wj|)    (A.2)

where α > 0 is the choice parameter, Wj ≡ (wj,1, ..., wj,2M) is the weight vector of the j-th prototype node in f2, and ∧ indicates the fuzzy AND operator [56]:

(u ∧ v)i ≡ min(ui, vi)    (A.3)

where ui and vi are the i-th dimensions of A and Wj, respectively.
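The complement coding and choice function defined above (Eqs. (A.1)-(A.3)) can be sketched as follows; the sample and weight values are hypothetical:

```python
def complement_code(a):
    """Eq. (A.1): map an M-dimensional sample in [0, 1] to 2M dimensions."""
    return a + [1.0 - v for v in a]

def choice(A, W, alpha=0.001):
    """Eq. (A.2): choice function T_j = |A ^ W_j| / (alpha + |W_j|),
    where ^ is the element-wise minimum and |.| the L1 norm."""
    fuzzy_and = [min(u, v) for u, v in zip(A, W)]   # Eq. (A.3)
    return sum(fuzzy_and) / (alpha + sum(W))

A = complement_code([0.2, 0.8])   # hypothetical input sample (M = 2)
W = [0.3, 0.7, 0.6, 0.3]          # hypothetical prototype weight vector
T = choice(A, W)
```

Note that complement coding makes |A| = M for every input, which is what keeps the categories from proliferating.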
The prototype node with the highest choice value is selected as the winning node, denoted as node J:

T_J = max{T_j : j = 1, 2, ..., N}    (A.4)

If node J satisfies the vigilance criterion, resonance is said to occur:

|A ∧ W_J| / |A| > ρ    (A.5)

where ρ is the vigilance parameter of Fuzzy ART. However, if the condition in Eq. (A.5) is not satisfied, a mismatch occurs, whereby the selected node J is deactivated, and a new search cycle is triggered in Fuzzy ART to select a new winning node. This search cycle repeats until one of the existing prototype nodes satisfies the condition in Eq. (A.5), or a new prototype node is created in f2 to encode the current input sample. Learning then takes place to update the weight vector of node J (W_J) in f2:

W_J^(new) = β(A ∧ W_J^(old)) + (1 − β) W_J^(old)    (A.6)

where β is the learning rate of Fuzzy ART.

Appendix B. Fuzzy ARTMAP

Figure B.1 shows the structure of FAM. It uses two Fuzzy ART models, i.e., ART_a and ART_b, to form a mapping (f^ab) from the input samples to their corresponding outputs. ART_a and ART_b receive the complement-coded input sample (A) and its corresponding target class (B), respectively. When the winning nodes in both ART_a and ART_b have been selected, the map-field vigilance test is applied:

|y^b ∧ W_J^ab| / |y^b| > ρ_ab    (B.1)

where W_J^ab is the weight vector from f2^a to f^ab, ρ_ab is the map-field vigilance parameter, and y^b indicates the output vector of f2^b, which is defined as:

y_k^b = { 1, if k = K; 0, otherwise }    (B.2)

Figure B.1: The structure of FAM. It consists of two Fuzzy ART models, i.e., ART_a and ART_b, which are connected through a map-field. ART_a and ART_b receive the complement-coded input sample and its corresponding label, respectively. Once the winning nodes in ART_a (J) and ART_b (K) are found, the map-field vigilance test (Eq.
(B.1)) is used to map the J-th winning node (with respect to the input sample) in W^a to the K-th winning node (with respect to the target label) in W^b. If the condition in Eq. (B.1) is not satisfied, a new search cycle is triggered to learn the current input sample.

Here, K denotes the winning ART_b prototype node. If the condition in Eq. (B.1) fails, the current J-th winning node in f2^a yields an incorrect predicted class in ART_b. To correct this erroneous prediction, a match-tracking procedure is triggered to update the vigilance parameter of ART_a:

ρ_a = |A ∧ W_J^a| / |A| + δ    (B.3)

where δ > 0. Then, a new search cycle with the updated ρ_a setting ensues. This process ends once the map-field vigilance test is satisfied. Learning then takes place, in which the J-th node in f2^a is updated to:

W_J^a(new) = β_a(A ∧ W_J^a(old)) + (1 − β_a) W_J^a(old)    (B.4)

where β_a is the learning rate of ART_a.

Appendix C. Complexity Analysis of FAM-BSO

In this section, the time complexity of FAM-BSO is analyzed using big-O notation [57]. The analysis is divided into two parts: FAM and BSO. As stated in Section 3.2, FAM consists of two Fuzzy ART models and a map-field. Let M and N denote the number of input features and the number of prototype nodes, respectively. As reported in [58] and [59], the worst-case time complexity of the Fuzzy ART model is N² + MN, which is O(N²) when M → ∞ and N → ∞. In the map-field, however, the worst-case scenario occurs when the match-tracking process is triggered for all prototype nodes in ART_a. Let M_a (M_b) and N_a (N_b) be the number of features and the number of nodes in ART_a (ART_b), respectively. In the worst case, FAM requires a total of (N_a² + M_a·N_a) + (N_b² + M_b·N_b) + N_a(N_a² + M_a·N_a) operations for ART_a, ART_b, and the map-field, respectively, which is O(N_a³) when M_a → ∞, N_a → ∞, M_b → ∞ and N_b → ∞.
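The map-field worst case arises when match tracking resets every ART_a prototype in turn. The search cycle of Eqs. (A.2)-(A.5) with the match tracking of Eq. (B.3) can be sketched as follows; this is an illustrative simplification with hypothetical names, in which the map field is reduced to a per-prototype label array:

```python
import numpy as np

def fuzzy_and(u, v):
    # Fuzzy AND operator (Eq. A.3): element-wise minimum.
    return np.minimum(u, v)

def complement_code(a):
    # Complement coding (Eq. A.1).
    return np.concatenate([a, 1.0 - a])

def art_search(A, W, labels, target, rho=0.7, alpha=0.001, delta=0.001):
    # Sketch of the ART_a search cycle with match tracking.
    rho_a = rho
    active = set(range(len(W)))
    while active:
        # Choice function (Eq. A.2) over the still-active prototypes.
        T = {j: fuzzy_and(A, W[j]).sum() / (alpha + W[j].sum()) for j in active}
        J = max(T, key=T.get)                        # winning node (Eq. A.4)
        match = fuzzy_and(A, W[J]).sum() / A.sum()   # vigilance ratio (Eq. A.5)
        if match <= rho_a:
            active.discard(J)                        # mismatch: deactivate node J
            continue
        if labels[J] == target:                      # stand-in for Eq. (B.1)
            return J                                 # resonance: learn via Eq. (B.4)
        rho_a = match + delta                        # match tracking (Eq. B.3)
        active.discard(J)
    return None  # no node resonates: a new ART_a prototype would be created

W = [complement_code(np.array([0.2, 0.3])),   # prototype of class 0
     complement_code(np.array([0.8, 0.7]))]   # prototype of class 1
labels = [0, 1]
A = complement_code(np.array([0.25, 0.3]))
print(art_search(A, W, labels, target=0))  # resonates with node 0
print(art_search(A, W, labels, target=1))  # match tracking exhausts all nodes -> None
```

In the worst case the `while` loop runs once per prototype and each pass recomputes all choice values, which is the N_a × (N_a² + M_a·N_a) map-field term above.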
In BSO, in addition to maintaining n solutions/individuals with P dimensions each, the procedures of clustering the population with the k-means algorithm (with m = 3, 5, and 7 clusters), disrupting a cluster center, and updating individuals are performed. This procedure is shown in Figure 2. According to [60], the time complexity of k-means clustering is O(n·k·i·d), where n, k, i and d are the number of samples, number of clusters, number of iterations, and number of features, respectively. This can be simplified because k and d are constants, and the number of iterations, i, in the worst case equals the number of samples, n. Therefore, the worst-case time complexity of k-means clustering is O(n²). The worst-case time complexity of disrupting a cluster center is O(P). The updating operator adds a uniform random value to each dimension of the selected solutions (from a single cluster or two combined clusters); its worst-case time complexity is also O(P). The dominant step in computing the fitness value is evaluating the choice value for all prototype nodes, which is O(N_a). As such, the worst-case time complexity of BSO is O(n²) + O(P) + O(P) + O(N_a), which is O(n²) when all variables extend to infinity. Therefore, the worst-case time complexity of FAM-BSO is obtained by combining the time complexities of FAM, O(N_a³), and BSO, O(n²), which is O(N_a³) when all variables extend to infinity.

Appendix D. Performance Indicators

The t-test is a pairwise statistical test that can be used to compare two models:

T-value = (mean_1 − mean_2) / (S_P · √(2/n))    (D.1)

where mean_1 and mean_2 represent the mean accuracy rates of models 1 and 2 over n samples, respectively, and S_P is defined as:

S_P = √((var_1² + var_2²) / 2)    (D.2)

where var_1 and var_2 indicate the variances of models 1 and 2, respectively.

The sensitivity and specificity [61] are the ratios of correctly predicted positive and negative samples to the total numbers of positive and negative samples, respectively.
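The indicators above can be sketched directly from their definitions. This is a minimal illustration with names of our choosing, taking Eq. (D.2) as printed (variances combined as var_1² + var_2²):

```python
import math

def t_value(mean1, mean2, var1, var2, n):
    # Eqs. (D.1)-(D.2): pairwise t-score between two models over n samples.
    sp = math.sqrt((var1**2 + var2**2) / 2)
    return (mean1 - mean2) / (sp * math.sqrt(2 / n))

def sensitivity(tp, fn):
    # Correctly predicted positives over all positive samples.
    return tp / (tp + fn)

def specificity(tn, fp):
    # Correctly predicted negatives over all negative samples.
    return tn / (tn + fp)

print(t_value(0.85, 0.80, 0.02, 0.03, 30))  # positive when model 1 is better
print(sensitivity(90, 10))                   # 0.9
print(specificity(80, 20))                   # 0.8
```

A large positive (or negative) T-value indicates that model 1 significantly outperforms (or underperforms) model 2.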
The F-measure is computed as:

F-measure = 2 × (precision × recall) / (precision + recall)    (D.3)

where precision is computed as:

precision = true positives / (true positives + false positives)    (D.4)

and recall is equivalent to the sensitivity defined above.

Appendix E. T-test, F-measure, Sensitivity and Specificity Analysis for UCI Data Sets

Table E.1 compares the mean and median of the accuracy and the number of selected features obtained by FAM-BSO for the UCI data sets, based on 10-fold cross-validation, with the averages of individual folds computed from 30 repeated runs. As can be seen, there is no significant difference between the mean and median results. The largest difference between the mean and median accuracy is lower than 2.4%, while that of the number of selected features is lower than 1.

Table E.1: Comparison between the mean and median of the accuracy and number of selected features of FAM-BSO for the UCI data sets (ρ_a = 0.8, θ = 0.6 and m = 5)

                         Accuracy                      # of selected features
Data set        Mean     Median   Difference      Mean     Median   Difference
MK1             83.48    84.02    -0.54           66.63    66.6      0.03
Sonar           86.77    85.71     1.06           24.62    24.85    -0.23
Wine            95.45    97.40    -1.95            5.26     5.26     0
German          76.97    75.33     1.64           10.82    10.06     0.76
WBCD            95.49    95.60    -0.11           12.11    12.98    -0.87
Vehicle         74.87    77.23    -2.36            7.43     7.41     0.02
Ionosphere      88.88    86.79     2.09           13.89    13.83     0.06
PID             73.86    74.5     -0.64            3.32     3.30     0.02
Zoo             95.85    95.09     0.76            6.94     7.04    -0.1
Hill-valley     57.62    56.89     0.73           39.94    40.91    -0.97

Table E.2 shows the t-test scores for the pairwise comparisons between FAM-BSO with different numbers of clusters and FAM, FAM-GA and FAM-PSO.
An analysis of the results is as follows:

• FAM-BSO versus FAM: Out of 30 comparative results (10 data sets, each with cluster numbers of 3, 5, and 7), FAM-BSO outperforms FAM 14 times, ties with FAM 8 times, and is inferior to FAM 8 times;
• FAM-BSO versus FAM-GA: Out of 30 comparative results, FAM-BSO outperforms FAM-GA 25 times, ties with FAM-GA 3 times, and is inferior to FAM-GA 2 times;
• FAM-BSO versus FAM-PSO: Out of 30 comparative results, FAM-BSO outperforms FAM-PSO 24 times, ties with FAM-PSO 3 times, and is inferior to FAM-PSO 3 times.

Based on the above analysis, FAM-BSO is able to outperform FAM, FAM-GA, and FAM-PSO with statistically significant differences for the majority of the evaluations.

Table E.2: The t-test results for the pairwise comparison of FAM-BSO versus other methods for the UCI data sets (for FAM, ρ_a = 0.7; for both PSO and BSO, θ is set to 0.6). Symbol "+"/"-" indicates that FAM-BSO performs significantly better/worse than the indicated method, while symbol "=" means both achieve similar performance.

Data set     m   FAM  FAM-GA  FAM-PSO     Data set      m   FAM  FAM-GA  FAM-PSO
MK1          7   +    +       +           Vehicle       7   +    +       +
             5   +    +       +                         5   +    +       +
             3   +    +       +                         3   +    +       +
Sonar        7   -    -       =           Ionosphere    7   +    +       +
             5   =    +       +                         5   -    +       +
             3   +    +       +                         3   +    +       +
Wine         7   -    +       +           PID           7   +    +       +
             5   +    +       +                         5   +    +       +
             3   =    +       +                         3   =    +       +
German       7   =    +       +           Zoo           7   +    +       +
             5   -    +       -                         5   +    +       +
             3   +    +       +                         3   -    -       +
WBCD         7   +    +       +           Hill-valley   7   -    +       -
             5   +    +       +                         5   -    +       -
             3   +    +       +                         3   -    +       +

The F-measure scores of FAM-GA, FAM-PSO and FAM-BSO with different numbers of clusters are shown in Table E.3. Note that for all models ρ_a is set to 0.7, and θ for both FAM-PSO and FAM-BSO is set to 0.6. FAM-BSO with different numbers of clusters produces high F-measure scores for all data sets (Table E.3).

The sensitivity and specificity of FAM-GA, FAM-PSO and FAM-BSO with different numbers of clusters for the UCI data sets are shown in Tables E.4 and E.5, respectively.
As can be seen, FAM-BSO with different numbers of clusters achieves good sensitivity rates. In particular, FAM-BSO with m = 3 produces the highest sensitivity rate for three out of seven data sets, i.e., Sonar, German and Hill-valley, and the highest specificity rate for three out of seven data sets, i.e., MK1, German and Ionosphere. Overall, all three models (i.e., FAM-GA, FAM-PSO and FAM-BSO) are able to achieve balanced sensitivity and specificity rates for the MK1, WBCD, Ionosphere and Hill-valley data sets.

Table E.3: The F-measure scores for the UCI data sets (ρ_a = 0.7, and for both PSO and BSO θ is set to 0.6)

                                           FAM-BSO
Data set      FAM-GA   FAM-PSO   m=3       m=5       m=7
MK1           0.8882   0.8932    0.8569    0.9036    0.9016
Sonar         0.8264   0.7933    0.8561    0.8332    0.7979
German        0.8415   0.8794    0.8976    0.8720    0.8876
WBCD          0.9486   0.9579    0.9675    0.9665    0.9751
Ionosphere    0.9056   0.9186    0.9333    0.9257    0.9366
PID           0.7847   0.8222    0.8247    0.8325    0.8411
Hill-valley   0.5774   0.6044    0.6401    0.5085    0.4836

Table E.4: The sensitivity rates for the UCI data sets (ρ_a = 0.7, and for both PSO and BSO θ is set to 0.6)

                                           FAM-BSO
Data set      FAM-GA   FAM-PSO   m=3       m=5       m=7
MK1           0.8649   0.8838    0.8665    0.8934    0.8777
Sonar         0.7702   0.7240    0.8133    0.7532    0.7306
German        0.8351   0.8786    0.9008    0.8658    0.8850
WBCD          0.9457   0.9606    0.9541    0.9662    0.9601
Ionosphere    0.9490   0.9608    0.9537    0.9567    0.9650
PID           0.7814   0.8358    0.8506    0.8548    0.8752
Hill-valley   0.5661   0.6186    0.6842    0.4808    0.4814

Table E.5: The specificity rates for the UCI data sets (ρ_a = 0.7, and for both PSO and BSO θ is set to 0.6)

                                           FAM-BSO
Data set      FAM-GA   FAM-PSO   m=3       m=5       m=7
MK1           0.8946   0.8773    0.9111    0.8919    0.9109
Sonar         0.9148   0.9081    0.9211    0.9502    0.9087
German        0.6509   0.7211    0.7516    0.7200    0.7458
WBCD          0.9678   0.9718    0.9886    0.9792    0.9944
Ionosphere    0.7262   0.7554    0.8324    0.7946    0.8218
PID           0.6110   0.6346    0.6072    0.6320    0.6109
Hill-valley   0.6181   0.6054    0.5609    0.6030    0.5825