Applied Soft Computing Journal 80 (2019) 761–775

Feature selection based on brain storm optimization for data classification

Farhad Pourpanah (a,*), Yuhui Shi (b), Chee Peng Lim (c), Qi Hao (b), Choo Jun Tan (d)

(a) College of Mathematics and Statistics, Shenzhen University, China
(b) School of Computer Science and Engineering, Southern University of Science and Technology, China
(c) Institute for Intelligent Systems Research and Innovation, Deakin University, Australia
(d) School of Science and Technology, Wawasan Open University, Malaysia
(*) Corresponding author.

highlights

• A hybrid model (FAM-BSO) for feature selection and data classification is proposed.
• Fuzzy ARTMAP (FAM) is first used for incremental learning of the data.
• Brain Storm Optimization (BSO) is then used for feature selection.
• The outcome indicates that FAM-BSO is able to produce promising results.

Article history: Received 21 March 2018; Received in revised form 8 April 2019; Accepted 22 April 2019; Available online 8 May 2019.

Keywords: Feature selection; Brain storm optimization; Fuzzy ARTMAP; Data classification

abstract

Brain storm optimization (BSO) is a new and effective swarm intelligence method inspired by the human brainstorming process. This paper presents a novel BSO-based feature selection technique for data classification. Specifically, the Fuzzy ARTMAP (FAM) model, which is employed as an incremental learning neural network, is combined with BSO, which acts as a feature selection method, to produce the hybrid FAM-BSO model for feature selection and optimization. Firstly, FAM is used to create a number of prototype nodes incrementally. Then, BSO is used to search for and select an optimal subset of features that is able to produce high accuracy with the minimum number of features. Ten benchmark problems and a real-world case study are employed to evaluate the performance of FAM-BSO. The results are quantified statistically using the bootstrap method with 95% confidence intervals. The outcome indicates that FAM-BSO is able to produce promising results as compared with those from the original FAM and other feature selection methods, including particle swarm optimization, the genetic algorithm, genetic programming, and ant colony optimization.

© 2019 Elsevier B.V. All rights reserved.

1. Introduction

Feature selection identifies an optimal subset of features from data samples for classification problems. Feature selection methods aim to improve the predictive accuracy and/or reduce the computational time of classification algorithms. They are important in data mining and pattern recognition problems. However, selecting an optimal subset of features is a challenging optimization task, and is time consuming owing to the large search space and the complex relationships among the features [1].

Feature selection techniques consist of two parts: (i) a search technique to find the optimal subset of features, and (ii) a classifier or learning algorithm to evaluate the effectiveness of the selected subset of features. In general, feature selection techniques can be divided into three categories: filter-based, wrapper-based and embedded methods [2]. Filter-based methods mainly focus on the properties of the data samples, without considering the underlying learning scheme [3].
In contrast, wrapper-based methods employ a classifier or a learning algorithm to evaluate the effectiveness of various feature subsets, and adopt a search technique to find the optimal one. Unlike wrapper methods, embedded methods consider feature selection as part of the training process, in order to reduce the computational time spent re-classifying different feature subsets. Regularization methods, e.g., the least absolute shrinkage and selection operator (LASSO) [4] and the Elastic Net [5], are popular embedded methods. Since wrapper-based methods consider both the feature subset and the classifier, they are generally more effective than filter-based methods. Nevertheless, wrapper methods are more complex, because a classifier is required to re-learn each selected feature subset [6].

Many wrapper-based feature selection methods have been proposed to perform the search process and find the optimal subset of features, such as sequential forward selection (SFS) and sequential backward selection (SBS) [7]. While both methods can produce promising results, they suffer from several problems, such as nesting effects [3] and computational complexity [1]. To overcome these problems, population-based evolutionary computation (EC) algorithms, such as the genetic algorithm (GA) [8,9], ant colony optimization (ACO) [10], genetic programming (GP) [11] and particle swarm optimization (PSO) [12], have been widely used. These algorithms have shown promising results in solving single-, multi- and many-objective optimization problems, owing to their capability of conducting a global search.

Many EC methods have been adopted for feature selection, each with its own advantages and disadvantages. As an example, the learning process of PSO, which is based on the Euclidean distance [13], limits its adaptability. Nevertheless, PSO is less complex, in both run-time and memory requirements, and converges faster than the GA and GP models [14,15]. PSO also has a simple structure with few parameters to adjust. These properties make PSO a useful approach, and it has been applied in many fields, including feature selection. Comparatively, brain storm optimization (BSO) [16] is a new swarm intelligence algorithm inspired by the human brainstorming process.
To the best of our knowledge, the effectiveness of BSO in feature selection problems is yet to be investigated, which is the key focus of this paper.

On the other hand, fuzzy ARTMAP (FAM) [17] is a supervised neural network that merges the capability of Adaptive Resonance Theory (ART) [18] in solving the stability–plasticity dilemma with the capability of fuzzy set theory in handling vague and imprecise human linguistic information. FAM is an incremental learning model that operates by comparing the similarity degree between its existing prototype nodes and the input samples against a threshold. If the similarity measure is not satisfied, FAM incrementally adds a new prototype node to its structure to encode the current learning sample, without forgetting or corrupting previously learned samples. It has been used with the GA to solve data classification problems [19–21].

In this study, we propose a hybrid FAM-BSO model for feature selection. Firstly, FAM is used to learn the input samples incrementally. Then, BSO is adopted to search for and select an optimal feature subset. The proposed FAM-BSO model exploits the concept of the "open prototype" to keep the most important input features while maximizing classification accuracy. Ten benchmark problems are employed to evaluate the effectiveness of FAM-BSO. The results are quantified statistically using the bootstrap method [22] with 95% confidence intervals. In addition, a real-world human motion recognition problem is used to demonstrate the applicability of FAM-BSO.

Building upon our previously proposed fuzzy-based feature selection and rule extraction model reported in [21], the main contribution of this study is to hybridize a nature-inspired optimization algorithm, BSO, with a neural-fuzzy algorithm, FAM, to undertake feature selection and classification problems. Unlike the GA and PSO models, which use the best solution and the global best solution, respectively, to update the individual solutions, the brainstorming process of BSO uses all possible solutions to generate new ones. As compared with the GA and PSO models, this updating mechanism helps BSO to escape from local optima. While PSO has a fast convergence rate, it is not suitable for multimodal optimization problems [23]. Nevertheless, FAM-BSO requires longer execution durations owing to the procedure of clustering its solutions (as shown in Section 5.2). In summary, the main contributions of this paper are two-fold:

• a hybrid FAM-BSO model that is able to maximize classification accuracy and minimize the number of features;
• a comprehensive evaluation of FAM-BSO in feature selection and data classification using benchmark and real-world problems, and a performance comparison pertaining to the feature selection capability of BSO (which constitutes a new application of BSO) with other EC-based algorithms.

The rest of this paper is organized as follows. Section 2 presents a review of both traditional and EC-based feature selection methods. Section 3 explains the structures of the BSO and Adaptive Resonance Theory (ART) models. The details of the proposed FAM-BSO model are described in Section 4. Section 5 provides the experimental results and discussion. Finally, conclusions and suggestions for further research are given in Section 6.

2. Related work

This section reviews both traditional and EC-based feature selection methods. Two well-known traditional feature selection methods are SBS [24] and SFS [25]. These methods sequentially add or remove features until no improvement in classification accuracy is observed. Once certain features are removed or selected, they cannot be updated in future steps, which is the main drawback of the SBS and SFS methods. To alleviate this problem, the sequential forward floating selection (SFFS) and sequential backward floating selection (SBFS) algorithms were devised [26], which combine a floating search strategy with SFS and SBS, respectively. Both algorithms produce reasonably good results as compared with those from branch-and-bound methods.

Since the traditional methods usually evaluate all possible solutions and then select the best feature subset, they are computationally expensive, especially when the feature dimension is high. A feature selection method based on a feature association map, able to tackle both supervised and unsupervised problems, was proposed in [27]. Graph-theoretic concepts are used to select significant features. While the method produces acceptable results, it is not able to select the optimal feature subset by employing the maximum independent set or minimum vertex cover.

To alleviate the limitations of traditional methods, EC-based algorithms have been applied to tackle feature selection problems. The GA was the first EC-based technique used for feature selection [8], and it outperforms sequential search. Later, in [9], single-objective and two-objective GA-based techniques were proposed for both feature selection and rule extraction. In [28], local search operations are embedded into the GA to devise a hybrid GA (HGA) for feature selection problems. The results show that HGA outperforms the conventional GA.

In [29], a novel feature selection method using clustering and ACO was proposed to tackle data classification tasks. It partitions all features into several subsets, and ACO is used to find the optimal one. The method produces better results as compared with those from both filter and wrapper methods. In [30], a hybrid model consisting of ACO and artificial neural networks for feature selection was developed. The resulting hybrid model is applied to medical data sets, and promising results are reported. In [31], an ACO-based feature selection method was proposed to tackle text-related classification tasks. It shows superior results in tackling the Reuters-21578 data set as compared with those of the GA and information gain.

GPmtfs [32] is an online feature selection technique based on GP. It constructs a population of classifiers with randomly selected feature subsets. To solve c-class classification problems, GPmtfs equips a classifier with c trees. GPmtfs produces good results as compared with those of SFS and SBS. In [33], a two-stage GP method for feature selection that does not suffer from the overfitting problem was developed. It converges faster and achieves better classification accuracy rates as compared with those of conventional GP. Recently, NiSuFs [34], a niching-based GP feature selection technique, was proposed to select an optimal feature subset. Since NiSuFs uses a surrogate model, it requires only 10% of the conventional GP training time to find the optimal feature subset.

The PSO algorithm was first used in [35] to tackle feature selection problems, where a binary PSO algorithm was combined with a feed-forward neural network to construct a parsimonious quantitative structure–activity relationship (QSAR) model. Since then, many PSO-based models have been successfully developed to solve feature selection problems, owing to PSO's simple structure, fast convergence, and cost-effectiveness (in terms of both memory and run time).
As an example, a feature selection method based on PSO and rough sets proposed in [36] is able to achieve better results for rough set-based feature selection as compared with the conventional GA. A hybrid PSO-SVM (support vector machine) model was introduced in [37] to improve classification accuracy while choosing a small feature subset; PSO is used to simultaneously optimize the SVM parameters and select the best feature subset. In [38], a PSO-based feature selection method that employs a fuzzy-based fitness function was introduced. It achieves results similar to those of the conventional GA.

Many EC-based techniques have been combined with fuzzy systems. In [39], a generalized knowledge acquisition method combined with a swarm intelligence model, called KASIA-G, was proposed. KASIA-G reduces the number of experiments by allowing the canonical algorithm to use knowledge bases of different sizes. The outcome indicates that KASIA-G produces results similar to those of the KASIA and Pittsburgh methods in learning fuzzy rules from meta-schedulers pertaining to grid computing. In [40], two EC-based optimization techniques, i.e., PSO and simulated annealing (SA), were used to find optimal Takagi–Sugeno–Kang (TSK) fuzzy models for anti-lock braking systems. In [41], PSO, SA and gravitational search (GS) techniques were employed to tune the parameters of a TSK fuzzy controller, which uses linear matrix inequalities as constraints in the optimization problems.

In [42], an adaptive feature selection method based on discrete PSO was developed to solve binary classification problems. It depicts promising results as compared with those of scatter search (SS) and tabu search (TS). In [43], new initialization strategies and updating procedures were proposed for PSO to solve feature selection problems. The developed technique outperforms standard PSO and other traditional feature selection methods. Later, two PSO-based hybrid filter-wrapper feature selection techniques, i.e., FastPSO and RapidPSO [44], were proposed. Both methods reduce the complexity of PSO-based models, while achieving accuracy rates similar to or better than those obtained when all features are used for classification.

3. The structure and dynamics of the BSO and ART models

In this section, we first explain the BSO model. Then, the structure of ART, specifically the Fuzzy ART and FAM models, is described in detail.

3.1. Brain storm optimization

The brainstorming process has been widely used for problems that cannot be solved by one individual. To tackle such problems, individuals from different backgrounds are gathered to brainstorm. The main aim of brainstorming is to generate as many different ideas (solutions) as possible, so that good solutions to a specific problem can be obtained from the generated ideas [16,45].

BSO [16] is a new population-based swarm intelligence method inspired by the human brainstorming process. BSO generates n random candidate solutions and evaluates them with a fitness function. According to [45], BSO contains three steps: clustering the individuals, disrupting the cluster centers, and creating new solutions. BSO clusters the n individuals into m clusters using the k-means clustering algorithm, so that similar solutions are grouped together during each generation. With a probability P, a new solution is generated and replaces a randomly selected cluster center, through the disrupting-cluster-center operation. Finally, BSO generates a new individual using one cluster, or by combining two clusters. To generate a new individual, BSO randomly selects one or two cluster(s) with a probability P_one, and then randomly selects one individual based on one or two cluster center(s), as follows:

    X_selected^i = X^i,                                  one cluster
                 = rand × X_1^i + (1 − rand) × X_2^i,    two clusters    (1)

where X^i is the ith dimension of the individual chosen from a single cluster, X_1^i and X_2^i are the ith dimensions of the individuals selected from the two clusters, and rand is a random value between 0 and 1. BSO then updates the selected individual as follows:

    X_new^i = X_selected^i + ξ × random(0, 1)    (2)

where random(0, 1) is a Gaussian random value with zero mean and unit variance, and ξ is an adjusting factor, i.e.,

    ξ = logsig((0.5 × m_i − c_i) / k) × rand    (3)

where m_i and c_i are the maximum number of iterations and the current iteration, respectively; logsig() indicates the logarithmic sigmoid function; rand is a random value between 0 and 1; and k is a changing rate that controls the slope of the logsig() function. The detailed steps of BSO are summarized in Fig. 1.

Fig. 1. The BSO procedure (adopted from [16]). Firstly, BSO generates n random solutions and evaluates them. After clustering the solutions into m groups (using the k-means clustering algorithm), they are ranked based on their fitness values, and the best solution in each cluster is set as the cluster center. Next, based on probability P_5a, it randomly disrupts a cluster center. Then, each solution is updated based on one/two cluster center(s) or solution(s), which are randomly selected based on the probabilities P_6biii and P_6c, respectively. Finally, if the new solution performs better than the current solution, replacement takes place.
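To make the update step concrete, the following is a minimal Python sketch of Eqs. (1)–(3); it is an illustration under the assumptions stated in the comments (all function and variable names are ours, not from the authors' code), not a definitive implementation.

import numpy as np

def logsig(x):
    # Logarithmic sigmoid, 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + np.exp(-x))

def generate_new_individual(clusters, max_iter, cur_iter, k=20, p_one=0.5, rng=None):
    # clusters: list of 2-D arrays, each holding the members of one cluster.
    # p_one: probability of building the new individual from a single cluster.
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < p_one:                      # one cluster
        c = clusters[rng.integers(len(clusters))]
        x_selected = c[rng.integers(len(c))].copy()
    else:                                         # combine two clusters (Eq. (1))
        c1, c2 = rng.choice(len(clusters), size=2, replace=False)
        x1 = clusters[c1][rng.integers(len(clusters[c1]))]
        x2 = clusters[c2][rng.integers(len(clusters[c2]))]
        r = rng.random()
        x_selected = r * x1 + (1.0 - r) * x2
    # Step-size factor (Eq. (3)): large in early iterations (exploration),
    # small near the end (exploitation).
    xi = logsig((0.5 * max_iter - cur_iter) / k) * rng.random()
    # Gaussian perturbation of every dimension (Eq. (2)).
    return x_selected + xi * rng.normal(0.0, 1.0, size=x_selected.shape)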
3.2. Adaptive resonance theory

Adaptive Resonance Theory (ART) [18] is a theory of human cognitive information processing that has led to many online neural network models, such as ART1 [46], ART2 [47] and Fuzzy ART [48] for solving unsupervised problems, and ARTMAP [49] and Fuzzy ARTMAP (FAM) for solving supervised problems, to name a few. These models can stably learn to categorize input patterns that are presented in an arbitrary sequence.

Fuzzy ART is the most prominent member of the unsupervised ART architectures, with the capability of clustering both binary and analog input patterns. It has been used to develop a hierarchical network, known as FAM, that maps M-dimensional input vectors to N-dimensional output vectors. The dynamics of Fuzzy ART and FAM are explained in detail in Appendices A and B, respectively. The step-by-step learning phase of FAM is summarized in Algorithm 1.

Algorithm 1 The learning phase of FAM
Input: Parameters of FAM and training samples.
Output: Parameters of the trained FAM.
 1: for each training sample (a1, ..., am) do
 2:   Apply complement coding (Eq. (A.1)).
 3:   Calculate the choice function for all prototype nodes in W^a (Eq. (A.2)).
 4:   Select the winning node using Eq. (A.4) (the Jth node in W^a).
 5:   Perform the vigilance test (Eq. (A.5)) for the winning node.
 6:   if the vigilance test is not satisfied then
 7:     Deactivate the winning node (T_J = 0).
 8:     if all prototype nodes in W^a are deactivated then
 9:       Add a new node.
10:     else
11:       Go to Step 3.
12:     end if
13:   end if
14:   Repeat Steps 2–13 for the target vector of the current input and select the winning prototype node (the Kth node) in W^b.
15:   Perform the map-field vigilance test (Eq. (B.1)).
16:   if the condition in Eq. (B.1) is not satisfied then
17:     Perform match tracking (Eq. (B.3)).
18:     Deactivate the winning node (Jth) in W^a.
19:     Go to Step 3.
20:   else
21:     Update the winning node using Eq. (B.4).
22:   end if
23: end for
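As a companion to Algorithm 1, the following is a compact Python sketch of the FAM learning loop. It is a simplification under an explicit assumption: a single ART module with a class label attached to each prototype, so the full ART_b/map-field machinery of Appendix B collapses into a label check plus match tracking. All names are illustrative.

import numpy as np

def complement_code(a):                     # Eq. (A.1)
    return np.concatenate([a, 1.0 - a])

def fam_learn(samples, labels, rho=0.7, alpha=0.001, beta=1.0, eps=1e-4):
    W, cls = [], []                         # prototype weights and their classes
    for a, y in zip(samples, labels):
        A = complement_code(a)
        rho_t = rho                         # match tracking may raise this copy
        active = list(range(len(W)))
        while True:
            if not active:                  # no candidate left: add a new node
                W.append(A.copy()); cls.append(y); break
            T = [np.minimum(A, W[j]).sum() / (alpha + W[j].sum()) for j in active]
            J = active[int(np.argmax(T))]   # winner by choice value (Eqs. (A.2)-(A.4))
            match = np.minimum(A, W[J]).sum() / A.sum()
            if match < rho_t:               # vigilance test failed (Eq. (A.5))
                active.remove(J); continue
            if cls[J] != y:                 # wrong class: match tracking (Eq. (B.3))
                rho_t = match + eps; active.remove(J); continue
            # resonance: update the winner (Eq. (B.4))
            W[J] = beta * np.minimum(A, W[J]) + (1 - beta) * W[J]
            break
    return W, cls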
4. The FAM-BSO model

The proposed FAM-BSO model contains two main stages: a learning stage and a feature selection stage. FAM is used in the first stage to learn the training samples, and BSO is employed in the second stage to extract the best feature subset, with the aim of increasing classification accuracy and reducing model complexity. The details are explained as follows.

4.1. Open prototype node

Once FAM learning (as explained in Section 3.2) is completed, all the prototype nodes created in the F_2^a layer are employed to generate the "open" prototypes, which enable FAM to include the "don't care" antecedent [9]. In FAM, a "don't care" feature is obtained by setting the low and high vectors of the corresponding dimension to 0 and 1, respectively. Examples of "open" prototype nodes for an input pattern with three features are shown in Table 1. The number of all possible "open" prototype nodes, excluding the one with all "don't care" antecedents, is 2^d − 2, where d is the dimension of the input pattern.

Table 1
Original and possible open prototype nodes for an input pattern with three features.

Prototype node            Feature 1    Feature 2    Feature 3
Original prototype node   A            B            C
Open prototype node 1     A            Don't care   C
Open prototype node 2     Don't care   B            Don't care
Open prototype node 3     Don't care   Don't care   C
Open prototype node 4     A            B            Don't care
Open prototype node 5     A            Don't care   Don't care
Open prototype node 6     Don't care   B            C

4.2. Adaptation of BSO for feature selection

In BSO, each solution, S, contains p × d elements, as follows:

    S = {D_1^1, D_2^1, ..., D_d^1, D_1^2, D_2^2, ..., D_d^2, ..., D_1^p, ..., D_d^p}    (4)

where p is the number of prototype nodes in the F_2^a layer and d indicates the dimension of each prototype node. Each element D_d^p is then interpreted as follows:

    D_d^p = don't care feature,    if D_d^p < θ
          = other (kept) feature,  if D_d^p ≥ θ    (5)

where 0 < θ < 1.
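The decoding of a BSO solution into "open" prototypes (Eqs. (4)–(5)) can be sketched as follows. The sketch assumes the complement-coding convention described above: each trained weight vector stores (low, 1 − high) per dimension, so a "don't care" dimension has zeros in both halves (low = 0, high = 1). Names are illustrative.

import numpy as np

def open_prototypes(solution, W, theta=0.5):
    # solution: array of shape (p, d), drawn from [0, 1] by BSO (Eq. (4))
    # W: array of shape (p, 2*d) of trained FAM prototype weights
    # Returns a copy of W in which masked dimensions become "don't care".
    p, d = solution.shape
    W_open = W.copy()
    mask = solution < theta                 # Eq. (5): below theta => don't care
    for j in range(p):
        W_open[j, :d][mask[j]] = 0.0        # low bound -> 0
        W_open[j, d:][mask[j]] = 0.0        # complement of high bound -> 0 (high = 1)
    return W_open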
The BSO fitness function is formulated to reduce the classification error, as follows:

    Fitness = 1 − NCP / TNS    (6)

where NCP indicates the number of correctly classified samples and TNS denotes the total number of samples. The aim is to minimize the error rate. The step-by-step measurement of the fitness value of a single solution, S, is summarized in Algorithm 2.

Algorithm 2 The procedure for measuring the fitness value of a single solution
Input: Parameters of the trained FAM, validation samples, and a solution S generated by BSO.
Output: Fitness value.
 1: Create the "open" prototypes.
 2: Initialize the "don't care" features (Eqs. (4) and (5)) by setting the low and high vectors of the corresponding dimensions to 0 and 1, respectively.
 3: for each validation sample do
 4:   Apply complement coding (Eq. (A.1)).
 5:   Calculate the choice value for all of the prototype nodes (Eq. (A.2)).
 6:   Select the winning node using Eq. (A.4) (the Jth node in W^a).
 7:   Perform the vigilance test (Eq. (A.5)) for the winning node.
 8:   if the condition in Eq. (A.5) is satisfied then
 9:     Select the prototype node as the winning node (Jth).
10:   else
11:     Predict the current validation sample as an Unknown target.
12:   end if
13:   Update the performance indicator.
14: end for
15: Compute the fitness value using Eq. (6).
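In code, the fitness of Algorithm 2 reduces to an error rate over the validation set. In this sketch, predict() is a hypothetical helper (not from the paper) that returns the class of the winning open prototype, or None when the vigilance test fails (the "Unknown" case, which counts as an error).

def fitness(validation_samples, validation_labels, predict):
    ncp = sum(1 for x, y in zip(validation_samples, validation_labels)
              if predict(x) == y)           # number of correctly classified samples
    tns = len(validation_labels)            # total number of samples
    return 1.0 - ncp / tns                  # Eq. (6): minimize the error rate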
4.3. Summary of the proposed FAM-BSO model

Fig. 2 shows the flow chart of FAM-BSO. The data set is first split into three subsets: learning, validation and test sets. After FAM is trained using the learning set, n random solutions are generated based on Eq. (4). Next, for each solution, "open" prototypes are created using Eq. (5), and the fitness value is measured on the validation set. To determine the fitness value, FAM-BSO computes the choice function of each validation sample using Eq. (A.2), and then conducts the vigilance test for the winning prototype node using Eq. (A.5). After that, all solutions are clustered into m groups using k-means clustering. A cluster center is disrupted before the individuals are updated using Eqs. (1), (2) and (3). "Open" prototypes are then created and the fitness value of each updated solution is measured. If the new solution performs better than the current solution, replacement takes place. Finally, once the termination condition is satisfied, the best feature subset is used to evaluate performance on the test set. The time complexity of FAM-BSO is analyzed in Appendix C.

Fig. 2. The proposed FAM-BSO. Firstly, FAM-BSO uses FAM to learn from the training samples. Next, it generates n random solutions and evaluates them using the validation samples and Algorithm 2. After clustering the solutions into m groups, they are ranked in each group based on their fitness values, and the best solution in each group is selected as the cluster center. Then, it randomly disrupts a cluster center. Each solution is updated based on one/two cluster center(s) or solution(s), which is/are randomly selected. This procedure is repeated until the termination condition is satisfied. Finally, the selected feature subset is used to form the test samples for evaluation with Algorithm 2.

5. Experimental studies

To evaluate the effectiveness of FAM-BSO, ten benchmark data sets from the UCI machine learning repository [50] and a real-world case study, i.e., human motion detection [51], are used. The details of the UCI data sets are listed in Table 2. The selected data sets cover different characteristics, in terms of the numbers of features and samples, that are used to assess the feature selection and classification capabilities of FAM-BSO. The PID samples overlap each other, which makes classification difficult, while Ionosphere has fewer overlapping samples. Musk version 1 and Hill-valley are high-dimensional problems. Sonar, German, Wine and WBCD contain moderately imbalanced samples. Vehicle and Zoo are multi-class data sets; Zoo has few samples, but is highly imbalanced. In addition, these data sets have been employed for performance comparison with other EC-based feature selection methods reported in the literature.

Table 2
Details of the UCI data sets.

Data set               Input features   Data samples   Classes
Musk Version 1 (MK1)   166              476            2
Sonar                  60               208            2
Wine                   13               178            3
German                 24               1000           2
WBCD                   30               569            2
Vehicle                18               846            4
Ionosphere             34               351            2
PID                    8                756            2
Zoo                    17               101            7
Hill-valley            100              606            2

In addition to accuracy and complexity (the numbers of prototype nodes and selected features), four performance indicators, i.e., the t-test, F-measure, sensitivity (recall) and specificity, are used for performance comparison (Appendix D). Note that sensitivity and specificity are applicable to two-class problems; they are therefore excluded for the multi-class data sets (i.e., Wine, Vehicle and Zoo).

The experimental parameters are listed in Table 3, and are adopted from [16]. All experiments are conducted using MATLAB 2017a with an Intel 4 GHz CPU and 8 GB memory.

Table 3
Parameters of FAM-BSO (parameters of BSO are adopted from [16]).

n = 100         m_i = 1000      α_a = 0.001
P_5a = 0.2      σ = 1           α_b = 0.001
P_6b = 0.8      m = {3, 5, 7}   ρ_ab = 1
P_6biii = 0.4   µ = 0           ρ_b = 1
P_6c = 0.5      β_a = 1         ρ_a = [0.2, 0.9]
k = 20          β_b = 1         θ = [0.3, 0.9]

5.1. Experiments with the UCI data sets

In this section, θ is first varied from 0.3 to 0.9, in order to find a setting that yields high accuracy rates with few features. The results are compared with those of the original FAM and other state-of-the-art models reported in the literature. Then, different ρ_a settings ranging from 0.2 to 0.9 are used in order to compare FAM-BSO with other EC-based feature selection algorithms, i.e., FAM-GA and FAM-PSO.

A total of 70% and 30% of the data samples are used for training (i.e., 60% for learning and 10% for validation) and testing, respectively. The 10-fold cross-validation method is adopted in the experiment only for the learning phase. To quantify the results statistically, each fold is repeated 30 times, and the bootstrap method [22] is used to compute the 95% confidence intervals. All data samples are normalized between 0 and 1.
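A percentile bootstrap of the kind used here can be sketched as follows; the number of resamples (2000) is our assumption, as the paper does not state it.

import numpy as np

def bootstrap_ci(scores, n_boot=2000, level=0.95, rng=np.random.default_rng(0)):
    # Resample the repeated-run scores with replacement and take the
    # percentile interval of the resampled means.
    scores = np.asarray(scores)
    means = [scores[rng.integers(0, len(scores), len(scores))].mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return scores.mean(), lo, hi            # mean with lower/upper 95% bounds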
Table 4 shows the accuracy rates with the 95% confidence intervals of the original FAM and of FAM-BSO with ρ_a = 0.8 and different θ settings from 0.3 to 0.9. In general, the original FAM and FAM-BSO depict statistically similar performances, as there are overlaps between the 95% confidence intervals of their accuracy rates. Nevertheless, FAM-BSO uses fewer features than the original FAM, as shown in Table 5. FAM-BSO with θ = 0.5 outperforms the original FAM and FAM-BSO with other θ settings on 4 out of 10 data sets (i.e., Sonar, Wine, German and Vehicle).

Table 4
The accuracy rate (%) for the UCI data sets with 95% confidence intervals ("Mean", "Upper" and "Lower" indicate the mean accuracy, and the upper and lower bounds of the 95% confidence intervals, respectively).

Data set     Bound   FAM     FAM-BSO: θ=0.9  0.8    0.7    0.6    0.5    0.4    0.3
MK1          Upper   88.99   83.02  83.52  83.60  86.11  86.53  88.61  87.03
             Mean    85.72   80.84  82.65  82.23  83.48  84.42  86.67  86.12
             Lower   82.02   78.60  81.80  79.99  81.38  82.09  85.01  84.95
Sonar        Upper   86.50   80.78  84.44  87.87  88.91  90.47  87.01  89.70
             Mean    84.99   77.73  81.93  86.01  86.77  87.97  86.45  86.07
             Lower   83.11   74.72  79.81  83.81  85.37  82.83  85.07  83.17
Wine         Upper   97.14   75.32  85.14  88.25  97.99  98.88  98.43  98.43
             Mean    95.17   69.87  81.67  84.29  95.45  97.15  96.58  96.51
             Lower   92.22   47.71  78.19  74.14  92.06  95.88  94.81  94.66
German       Upper   84.46   72.51  70.21  74.97  81.55  84.90  84.46  85.75
             Mean    79.19   64.76  67.17  71.35  76.97  82.92  81.55  82.69
             Lower   64.21   58.30  58.79  69.44  73.55  80.76  78.60  78.61
WBCD         Upper   97.18   87.59  92.41  95.08  96.13  98.05  98.03  97.87
             Mean    96.01   86.77  91.13  94.54  95.49  96.49  96.65  96.34
             Lower   94.96   85.49  87.72  94.24  94.67  95.34  95.75  94.97
Vehicle      Upper   79.59   58.88  64.23  71.06  77.69  86.31  80.93  81.59
             Mean    78.06   51.26  55.75  66.79  74.87  81.78  79.32  79.68
             Lower   75.41   45.37  52.09  61.09  71.68  79.36  76.71  77.01
Ionosphere   Upper   92.42   87.31  84.71  88.51  89.84  93.43  92.78  94.22
             Mean    91.68   82.67  82.45  87.21  88.88  91.93  91.69  92.18
             Lower   90.91   76.26  80.59  85.75  85.12  91.04  90.70  90.46
PID          Upper   82.17   67.96  66.45  67.75  74.69  77.54  86.68  87.12
             Mean    80.34   62.95  62.05  65.48  73.86  74.26  78.84  79.10
             Lower   79.18   50.68  55.55  62.41  72.64  71.56  75.65  76.16
Zoo          Upper   98.35   83.76  82.06  92.92  97.51  98.42  98.33  99.35
             Mean    95.40   71.33  76.57  87.96  95.85  95.71  94.79  96.22
             Lower   93.35   61.33  73.22  82.50  94.33  93.11  93.02  92.76
Hill-valley  Upper   61.31   56.93  60.02  60.87  59.17  57.06  59.83  60.02
             Mean    58.82   53.31  56.21  59.01  57.62  56.25  55.84  58.63
             Lower   53.54   46.76  53.70  57.11  56.24  55.51  51.89  53.50

In addition, Table 5 shows the average number of features selected for each prototype node by FAM-BSO and by the original FAM (note that the original FAM uses all features). As expected, the number of selected features decreases as θ increases.

Table 5
The mean number of selected features for each prototype node.

Data set     FAM    FAM-BSO: θ=0.9  0.8    0.7    0.6    0.5    0.4    0.3
MK1          166    17.63  33.18  49.83  66.63  83.34  99.81  116.38
Sonar        60     8.16   13.80  19.50  24.62  29.85  35.18  40.89
Wine         13     2.04   3.11   4.29   5.26   6.41   7.92   8.98
German       24     3.97   6.78   8.08   10.82  12.03  15.74  17.63
WBCD         30     3.36   7.23   9.92   12.11  14.84  17.88  20.81
Vehicle      18     3.06   4.61   5.80   7.43   9.03   10.61  12.09
Ionosphere   34     5.05   7.83   10.87  13.89  17.05  20.43  23.34
PID          8      1.35   1.94   2.62   3.32   4.01   4.65   5.36
Zoo          17     2.84   4.15   5.43   6.94   8.42   9.97   11.61
Hill-valley  100    15.04  24.76  32.16  39.94  48.34  56.77  67.47

A performance comparison between FAM-BSO and other methods reported in the literature is conducted next. In these experiments, θ and ρ_a are set to 0.5 and 0.8, respectively. Specifically, FAM-BSO is compared with three PSO-based models reported in [43] and [52], the feature association map (U-FAM) [27], GP-based feature selection (GPmtfs) [32], and ACO-based feature selection (ACO-ER) [1]. For a fair comparison, the same procedure as in [1,27,32,43,52] is adopted, i.e., 70% and 30% of the data samples are used for training and testing, respectively.
Table 6 shows the accuracy rates of FAM-BSO, PSO (4-1) [43], PSO (3-1) [43], BPSO-2Stage [52], U-FAM [27], and ACO-ER [1]. As shown in Table 6, FAM-BSO achieves high accuracy rates on the Sonar, Wine, German, WBCD, PID and Zoo data sets. While FAM-BSO does not outperform the other methods on the MK1 and Ionosphere data sets, it produces results similar to those of BPSO-2Stage and ACO-ER, respectively. In addition, Table 7 shows the average numbers of features selected by FAM-BSO, PSO (4-1) [43], PSO (3-1) [43], U-FAM [27], and ACO-ER [1]. FAM-BSO with θ = 0.5 does not produce the smallest number of features compared with the other models; it selects approximately 50% of all original features.

Table 6
The average and best (in parentheses) accuracy rates (%) for the UCI data sets ("–" indicates the data sets were not used for evaluation with the corresponding method, as published in the literature).

Data set     FAM-BSO        PSO(3-1) [43]   PSO(4-1) [43]   BPSO-2Stage [52]   U-FAM [27]   ACO-ER [1]   GPmtfs [32]
MK1          84.42 (88.19)  85.51 (90.21)   84.62 (90.21)   85.70 (89.51)      –            79.53        –
Sonar        87.97 (90.74)  78.03 (85.71)   77.40 (87.30)   –                  72.00        –            86.26
Wine         97.15 (100.0)  96.62 (98.77)   96.15 (98.88)   96.94 (100.0)      89.00        –            94.82
German       82.92 (84.93)  69.13 (71.67)   69.27 (74.33)   68.93 (73.67)      –            70.17        –
WBCD         96.49 (98.83)  93.02 (94.15)   93.78 (94.74)   92.98 (92.98)      93.00        –            96.31
Vehicle      81.78 (88.65)  85.26 (87.99)   85.06 (86.61)   84.47 (85.04)      53.00        79.53        78.45
Ionosphere   91.93 (94.15)  86.50 (94.29)   87.45 (92.38)   89.52 (93.33)      –            92.12        –
PID          74.26 (79.69)  –               –               –                  74.00        –            –
Zoo          95.71 (98.74)  95.62 (97.14)   95.64 (97.14)   –                  –            –            –
Hill-valley  56.25 (57.64)  57.78 (60.71)   57.68 (61.26)   57.61 (60.44)      –            54.13        –

Table 7
The mean number of selected features ("–" indicates the data sets were not used for evaluation with the corresponding method, as published in the literature).

Data set     FAM-BSO   PSO(3-1) [43]   PSO(4-1) [43]   BPSO-2Stage [52]   U-FAM [27]   ACO-ER [1]   GPmtfs [32]
MK1          83.34     107.14          83.54           80.72              –            83.03        –
Sonar        29.85     31.76           11.3            –                  36.00        –            9.45
Wine         6.41      9.44            8.1             5.1                11.01        –            4.08
German       12.03     16.82           13.68           8.62               –            10.76        –
WBCD         14.84     19.06           8.12            6.68               17.01        –            6.72
Vehicle      9.03      10.58           9.54            7.3                9.00         9.86         5.37
Ionosphere   17.05     18.38           3.26            8.9                –            12           –
PID          4.01      –               –               –                  7.00         –            –
Zoo          8.42      9.96            7.98            –                  –            –            –
Hill-valley  48.34     60.65           13.16           37.1               –            47.63        –

We have developed FAM-GA and FAM-PSO for performance comparison with FAM-BSO. Three different numbers of clusters, i.e., 3, 5 and 7, are implemented in BSO. For all methods, different ρ_a values ranging from 0.2 to 0.9 are used for FAM, while the maximum number of iterations and the particle/population size are set to 1000 and 100, respectively. For FAM-PSO and FAM-BSO, θ is set to 0.6. The GA parameters are: mutation probability = 0.1, crossover probability = 0.9, and 20 prototypes replaced in each population. The PSO parameters are set to c1 = c2 = 1.49.

Fig. 3 shows the accuracy rates of the original FAM, FAM-GA, FAM-PSO and FAM-BSO with different numbers of clusters. For all except the Hill-valley and Sonar data sets, FAM-BSO with m = 3, 5 and 7 performs better than or similarly to the other methods. However, for the Hill-valley data set, FAM-BSO with m = 3 and 5, and for the Sonar data set, FAM-BSO with m = 7, produce inferior results as compared with the other methods. In addition, the prototype nodes created by FAM are shown in Fig. 4. Note that all models, i.e., FAM-GA, FAM-PSO and FAM-BSO, have the same numbers of prototype nodes, since FAM is used as the underlying learning model. The number of prototype nodes increases as ρ_a is varied from 0.2 to 0.9, as shown in Fig. 4.

Fig. 3. The accuracy rates of FAM-GA, FAM-PSO and FAM-BSO with different ρ_a settings from 0.2 to 0.9. Except for the Hill-valley and Sonar data sets, FAM-BSO produces better or similar results as compared with those of FAM-GA and FAM-PSO.

Fig. 4. The number of prototype nodes created by FAM with different ρ_a settings. The number of nodes increases as ρ_a increases from 0.2 to 0.9.

Table 8 indicates the features selected by FAM-GA, FAM-PSO and FAM-BSO. For all models, ρ_a is set to 0.7, and θ in both FAM-PSO and FAM-BSO is set to 0.6. FAM-BSO is able to reduce the model complexity considerably by selecting fewer or similar numbers of features as compared with FAM-GA and FAM-PSO. Moreover, the t-test scores, F-measure scores, and sensitivity and specificity rates are presented in Appendix E.

Table 8
The mean number of selected features by FAM-GA, FAM-PSO and FAM-BSO with m = 3, 5 and 7 (ρ_a = 0.7, and for both PSO and BSO θ is set to 0.6).

Data set     FAM-GA   FAM-PSO   FAM-BSO: m=3   m=5    m=7
MK1          83.06    56.22     55.6    55.93  62.27
Sonar        30.14    18.66     26.03   23.22  27.21
Wine         6.61     5.29      5.54    4.96   5.23
German       13.12    20.11     16.68   13.42  13.81
WBCD         19.93    19.08     14.82   14.25  14.19
Vehicle      9.92     13.58     11.09   10.75  10.64
Ionosphere   17.9     11.99     12.98   16.16  13.24
PID          4.46     6.72      4.84    4.8    4.39
Zoo          8.56     9.01      8.1     8.23   7.48
Hill-valley  50.12    62.1      41.12   42.56  48.46

5.2. A real-world case study

In this section, FAM-BSO is evaluated with a real-world case study, i.e., human motion detection. A total of 390 raw waveform records from different subjects (humans) were collected using the built-in 3-axis accelerometers of smartphones. Three smartphones were located in three pockets of each subject: the shirt pocket, the belt pocket, and the front trouser pocket [53,54]. Each subject performed three types of activities: walking, running and climbing. The raw waveform data are recorded in the time domain. After pre-processing, nine statistical features, i.e., mean (m), root mean square (RMS), standard deviation (SD), skewness, kurtosis, crest factor (CF), latitude factor (LF), shape factor (SF) and impulse factor (IF), are extracted and normalized between 0 and 1. Therefore, each data sample consists of 27 features (9 statistical features × 3 axes). The extracted features are fed into FAM-BSO to recognize the respective subject's motion, i.e., walking, climbing or running.
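The per-axis feature extraction can be sketched as below. The exact formulas used in [53,54] are not reproduced in this paper, so the definitions here (in particular for the crest, latitude/margin, shape and impulse factors) are common textbook choices and should be treated as assumptions.

import numpy as np
from scipy.stats import skew, kurtosis

def axis_features(x):
    # Nine statistical features for one accelerometer axis.
    m = x.mean()
    rms = np.sqrt(np.mean(x ** 2))
    sd = x.std()
    peak = np.max(np.abs(x))
    abs_mean = np.mean(np.abs(x))
    return np.array([
        m, rms, sd, skew(x), kurtosis(x),
        peak / rms,                                 # crest factor
        peak / np.mean(np.sqrt(np.abs(x))) ** 2,    # latitude (margin) factor
        rms / abs_mean,                             # shape factor
        peak / abs_mean,                            # impulse factor
    ])

def motion_features(xyz):
    # xyz: raw time-domain signal of shape (T, 3); returns the 27 features.
    # (Min-max normalization to [0, 1] is then applied across the data set.)
    return np.concatenate([axis_features(xyz[:, i]) for i in range(3)])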
The 10-fold cross-validation was repeated 10 times; 90% of the data samples are used for training (80% for learning and 10% for feature selection), and the remaining 10% are used for testing. To evaluate the robustness of FAM-BSO, noise is injected by randomly altering 10% of the class labels pertaining to the learning samples. The ρ_a value of all methods, i.e., FAM, FAM-GA, FAM-PSO and FAM-BSO, is set to 0.7, and θ for both FAM-PSO and FAM-BSO is set to 0.6.
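The label-noise injection can be sketched as follows. The paper only says the labels are "randomly altered"; this sketch assumes each corrupted label is flipped to a different, randomly chosen class.

import numpy as np

def inject_label_noise(labels, ratio=0.1, rng=np.random.default_rng(0)):
    labels = labels.copy()
    classes = np.unique(labels)
    idx = rng.choice(len(labels), size=int(ratio * len(labels)), replace=False)
    for i in idx:
        wrong = classes[classes != labels[i]]
        labels[i] = rng.choice(wrong)       # flip to any other class (assumption)
    return labels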
FAM-BSO. In term of complexity (Table 13 (b)), FAM-BSO with m = 3 for Table 14 shows the execution time of various methods. As noise-free data, and m = 3 and 7 for noisy data performed similar expected, both ELM-BSO and FAM-BSO require longer execution to FAM-GA. FAM-BSO (m = 3, 5, 7) selected fewer numbers time as compared with those of FAM-GA and FAM-PSO, which 770 F. Pourpanah, Y. Shi, C.P. Lim et al. / Applied Soft Computing Journal 80 (2019) 761–775 Table 11 The mean accuracy rates with standard deviations of FAM, ELM, FAM-GA, FAM-PSO, ELMM-BSO (m = 3, 5, 7) and FAM-BSO (m = 3, 5, 7) for the human motion recognition task (for FAM, ρa = 0.7, while for both BSO and PSO θ is set to 0.6). Noise level FAM ELM FAM-GA FAM-PSO ELM-BSO FAM-BSO m=3 m=5 m=7 m=3 m=5 m=7 0% 95.97 83.82 96.83 96.42 94.87 94.26 94.18 97.04 96.76 96.35 ±1.6 ±3.9 ±1.3 ±1.5 ±3.6 ±3.9 ±2.4 ±1.1 ±1.5 ± 1.6 10% 89.18 76.59 93.63 92.30 91.82 92.42 92.05 93.60 93.90 93.70 ±2.1 ±6.6 ±1.6 ±1.9 ±3.7 ±2.7 ±2.5 ±1.4 ±1.5 ± 1.8 Table 12 The mean number of selected features by FAM, ELM, FAM-GA, FAM-PSO, ELMM-BSO (m = 3, 5, 7) and FAM-BSO (m = 3, 5, 7) for human motion recognition task ( for FAM, ρa = 0.7, while for both PSO and BSO θ is set to 0.6). Noise level FAM & ELM FAM-GA FAM-PSO ELM-BSO FAM-BSO m=3 m=5 m=7 m=3 m=5 m=7 0% 27 12.8 17.2 12.6 11.9 12.1 13.6 15.8 15.7 10% 27 17.1 18.8 14.9 15.2 14.7 17.6 18.8 17.2 Table 13 increase in line with more clusters and noise. This long execution The t-test results for pairwise comparison of FAM-BSO versus other methods for time is mainly due to clustering of the solutions in BSO. Although the human motion recognition task in terms of (a) accuracy, and (b) number individual ELM requires a shorter execution time as compared of selected features (For FAM, ρa = 0.7; for both PSO and BSO θ is set to 0.6. Symbol ‘‘+’’/‘‘−’’ indicates that FAM-BSO performs significantly better/worse with that of FAM, ELM-BSO requires a longer computational time than the indicated method, while symbol ‘‘=’’ means both achieves a similar as compared with that of FAM-BSO. This is mainly because of performance. the requirement of re-training ELM for every subset of features, (a) while FAM only needs to use open prototype nodes to compute Level of noise m FAM-GA FAM-PSO ELM-BSO the fitness value of the selected subset of features. m=3 m=5 m=7 In addition, we reduced the number of iterations of BSO (for both ELM-BSO and FAM-BSO) to 350, which required a similar ex- 7 − = + + + 0% 5 = + + + + ecution time as that of FAM-GA and FAM-PSO. Table 15 shows the 3 + + + + + accuracy rates and number of selected features of FAM-BSO (m = 7 = + + + + 3, 5, 7). In term of accuracy, as can be seen in Table 16(a) (the 10% 5 = + + + + pairwise t-test results), FAM-BSO performed similar to FAM-GA, 3 = + + + + except m = 3 for noisy data, while outperformed FAM-PSO and (b) ELM-BSO. In term of complexity (number of features), FAM-BSO performed worse than FAM-GA for noise-free data, but similar in Level of noise m FAM-GA FAM-PSO ELM-BSO noisy data. FAM-BSO generally performed better than FAM-PSO m=3 m=5 m=7 and ELM-BSO in both noise-free and noisy data (Table 16(b)). 7 − + − − − Finally, we used the Principal Component Analysis (PCA) to 0% 5 − + − − − rank the features, and remove those features with a variance 3 = + − − − score lower than 0.05. Then, the remaining features were used 7 = + − − − to evaluate the performance of FAM-BSO and other methods. 
After feature ranking with PCA, a total of 15 and 13 features remained for the noise-free and noisy data sets, respectively. Table 17 shows the accuracy rates of the various methods. FAM-BSO outperformed the other methods for both noise-free and noisy data sets. Overall, the accuracy of all methods was reduced for both noise-free and noisy samples as compared with the all-feature setting (Table 11). From Table 18, all methods selected approximately 50% of the features as compared with the all-feature setting (Table 12).

Table 17
The mean accuracy rates with standard deviations of FAM, ELM, FAM-GA, FAM-PSO, ELM-BSO (m = 3, 5, 7) and FAM-BSO (m = 3, 5, 7) after removing low-variance features using PCA for the human motion recognition task (for FAM, ρ_a = 0.7, while for both BSO and PSO θ is set to 0.6).

Noise   FAM         ELM         FAM-GA      FAM-PSO     ELM-BSO: m=3  m=5        m=7        FAM-BSO: m=3  m=5        m=7
0%      93.21±2.7   81.27±2.4   94.66±1.8   94.93±2.2   91.16±1.8   90.54±1.3  93.83±2.7  95.12±2.7   95.62±2.1  94.63±2.8
10%     82.45±3.2   77.34±3.8   85.37±2.7   85.42±2.2   83.22±2.5   82.42±2.7  83.65±2.5  87.13±3.5   86.84±3.8  87.57±2.3

Table 18
The mean number of selected features by FAM, ELM, FAM-GA, FAM-PSO, ELM-BSO (m = 3, 5, 7) and FAM-BSO (m = 3, 5, 7) after removing low-variance features using PCA for the human motion recognition task (for FAM, ρ_a = 0.7, while for both PSO and BSO θ is set to 0.6).

Noise   FAM & ELM   FAM-GA   FAM-PSO   ELM-BSO: m=3  m=5   m=7   FAM-BSO: m=3  m=5   m=7
0%      15          6.1      8.6       5.8   5.9   6.3   6.1   6.5   6.6
10%     13          5.4      5.9       6.2   5.1   4.9   5.9   6.1   7.1

5.3. Discussion

In general, FAM-BSO is able to produce promising results, in terms of both accuracy and the number of selected features, as compared with the original FAM and other feature selection methods. Nevertheless, several improvements are possible. One possible improvement concerns the generation and updating of new solutions in BSO. In this research, the solutions (individuals) are generated based on a normal distribution and updated using Gaussian random numbers; binary encodings similar to those of the GA could be used instead. In addition, BSO uses the distance-based k-means clustering algorithm to cluster the solutions, which requires a long execution time to measure the distance between each sample and the cluster centers. This becomes worse when the feature dimension is high. As such, other clustering algorithms could be investigated, in order to accelerate the BSO computational time.
6. Summary

In this paper, a new FAM-BSO model, which is an evolutionary-based feature selection method for data classification, has been proposed. Firstly, FAM is used as the underlying model to learn the training samples. Then, the BSO model is adopted to search for and select the best feature subset, maximizing the classification accuracy while minimizing the number of features. The effectiveness of FAM-BSO has been evaluated using ten benchmark data sets and a real-world case study, i.e., human motion detection. The results of FAM-BSO were compared with those of the original FAM and other state-of-the-art methods. Overall, FAM-BSO is able to produce good results, which are similar to, if not better than, those of other methods reported in the literature.

However, FAM-BSO still requires a more detailed analysis so that its robustness can be improved. Our future work will focus on improving FAM-BSO with better methods for generating and updating new solutions, as well as for clustering the solutions. In addition, the possibility of using variational auto-encoders to design a generative model of the data for prediction, which can be combined with BSO, will be investigated. Real-world applications of FAM-BSO in other domains will also be studied.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (Grant Nos. 61773197, 61772344, 61811530324, 61732011 and 61761136008), the Nanshan District Science and Technology Innovation Bureau (Grant No. LHTD20170007), the Science and Technology Innovation Committee of Shenzhen City (Grant Nos. CKFW2016041415372174 and GJHZ20170314114424152), and the Natural Science Foundation of Shenzhen University (Grant Nos. 827-000140, 827-000230, and 2017060).

Declaration of competing interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have an impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2019.04.037.
Appendix A. Fuzzy ART

As shown in Fig. A.1, the Fuzzy ART network includes three layers: a pre-processing layer (F0), an input layer (F1) and a recognition layer (F2). The pre-processing layer performs complement coding, in order to avoid the category proliferation problem, by transforming an M-dimensional input sample, a ∈ [0, 1]^M, into a 2M-dimensional input sample, as follows:

    A = (a_1, ..., a_M, 1 − a_1, ..., 1 − a_M)    (A.1)

The input layer propagates the complement-coded pattern to the recognition layer. The recognition layer is a dynamic layer containing prototype nodes, whereby new prototype nodes can be added whenever necessary.

During learning, Fuzzy ART receives the complement-coded input sample (A). To determine the similarity level between the current input A and the jth prototype node in F2, the choice function is used, as follows:

    T_j = |A ∧ W_j| / (α + |W_j|)    (A.2)

where α > 0 is the choice parameter, W_j ≡ (w_{j,1}, ..., w_{j,2M}) is the weight vector of the jth prototype node in F2, and ∧ indicates the fuzzy AND operator [56]:

    (u ∧ v)_i ≡ min(u_i, v_i)    (A.3)

where u_i and v_i are the ith dimensions of A and W_j, respectively.

Fig. A.1. The structure of Fuzzy ART. The F0 layer propagates the complement-coded input pattern to the input layer (F1). Then, the similarity level between the input pattern and each existing prototype pattern in the F2 layer is measured using Eqs. (A.2)–(A.5). If none of the existing prototype patterns in the F2 layer is able to satisfy the conditions (Eqs. (A.2)–(A.5)), a new node is added in the F2 layer to encode the current input pattern.
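As a small numeric illustration of Eqs. (A.1)–(A.4), assume M = 2 features and two existing prototypes (the numbers below are made up for the example):

import numpy as np

a = np.array([0.3, 0.8])                     # input sample, a in [0, 1]^M
A = np.concatenate([a, 1.0 - a])             # Eq. (A.1): A = (a, 1 - a)
W = np.array([[0.2, 0.7, 0.6, 0.1],          # two prototype weight vectors
              [0.4, 0.3, 0.5, 0.2]])
alpha = 0.001
# Eqs. (A.2)-(A.3): fuzzy AND, then normalize by the prototype size.
T = np.minimum(A, W).sum(axis=1) / (alpha + W.sum(axis=1))
print(T)   # choice value of every prototype; the largest wins (Eq. (A.4))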
worst-case scenario is when the match-tracking process happens The prototype node with the highest choice score is chosen as the for all prototype nodes in ARTa . Let Ma (Mb ) and Na (Nb ) be winning node, denoted as node J, as follows: the number of features and number of nodes in ARTa (ARTb ), TJ = max(Tj : j = 1, 2, . . . , N) (A.4) respectively. In the worst-case scenario, FAM requires a total of (Na2 + Ma Na ) + (Nb2 + Mb Nb ) + Na (Na2 + Ma Na ) time-complexity If node J satisfies the vigilance criterion, resonance is said to for ARTa , ARTb , and the map-field, respectively, which is O(N 3 ) occur. Ma → ∞, Na → ∞, Mb → ∞ and Nb → ∞. |A ∧ WJ | In BSO, in addition to n solutions/individuals with P dimen- >ρ (A.5) sions, the procedure related to clustering using k-mean algorithm |A| F. Pourpanah, Y. Shi, C.P. Lim et al. / Applied Soft Computing Journal 80 (2019) 761–775 773 Fig. B.1. The structure of FAM. It consists of two fuzzy ART models, i.e., ARTa and ARTb , which are connected through a map-field. ARTa and ARTb receive the complement-coded input sample and its corresponding label, respectively. Once the wining nodes in ARTa (J) and ARTb (K ) are found, the map-field vigilance test (Eq. (B.1)) is used to map the Jth wining node (with respect to the input sample) in W a to the K th winning node (with respect to the target label) in W b . If the condition in Eq. (B.1) is not satisfied, a new search cycle is triggered to learn current input sample. (where m=3, 5 and 7), disrupting the cluster center, and updating Table E.1 individuals are used. This procedure is shown in Fig. 2. According Comparison between mean and median of the accuracy and number of selected features of FAM-BSO for UCI data set (ρa = 0.8, θ = 0.6 and m = 5) to [60], the time complexity of k-mean clustering is O(n ∗ k ∗ i ∗ d) Data set Accuracy # of selected features where n, k, i and d are the number of samples, number of clusters, number of iterations, and number of features, respectively, which Mean Median Difference Mean Median Difference can be simplified because k and d are constant numbers, and the MK1 83.48 84.02 −0.54 66.63 66.6 0.03 number of iterations, i, in the worst-case scenario equals to the Sonar 86.77 85.71 1.06 24.62 24.85 −0.23 number of samples, n. Therefore, the maximum time complexity Wine 95.45 97.40 −1.95 5.26 5.26 0 German 76.97 75.33 1.64 10.82 10.06 0.76 of k-mean clustering is O(n2 ). The worst-case time complexity of WBCD 95.49 95.60 −0.11 12.11 12.98 −0.87 disrupting the cluster center is O(P). The updating operator adds a Vehicle 74.87 77.23 −2.36 7.43 7.41 0.02 uniform random value to each dimension of the selected solutions Ionosphere 88.88 86.79 2.09 3.89 13.83 0.06 (single cluster or two combined clusters). The worst-case time PID 73.86 74.5 −0.64 3.32 3.30 0.02 complexity for the updating process is O(P). The maximum step Zoo 95.85 95.09 0.76 6.94 7.04 −0.1 to compute fitness value is measuring the choice value for all Hill-valley 57.62 56.89 0.73 39.94 40.91 −0.97 prototype nodes, which is O(Na ). As such, the worst-case time complexity for BSO is O(n2 ) + O(n) + O(P) + O(Na ), which is O(n2 ) when all variables extend to infinity. Therefore, the worst-case where precision is computed using: time complexity of FAM-BSO can be determined by combining the time complexity of FAM O(Na3 ) and BSO O(n2 ), which is O(Na3 ) true positiv e precision = (D.4) when all variables extend to infinity. true positiv e + false positiv e Appendix D. Performance indicators Appendix E. 
Appendix C. Complexity analysis of FAM-BSO

In this section, the time complexity of FAM-BSO is analyzed using big-O notation [57]. The analysis is divided into two parts: FAM and BSO. As stated in Section 3.2, FAM consists of two Fuzzy ART models and a map-field. Let M and N denote the number of input features and the number of prototype nodes, respectively. As reported in [58] and [59], the worst-case time complexity of a Fuzzy ART model is N² + MN, which is O(N²) when M → ∞ and N → ∞. In the map-field, however, the worst-case scenario is when the match-tracking process happens for all prototype nodes in ARTa. Let M_a (M_b) and N_a (N_b) be the number of features and the number of nodes in ARTa (ARTb), respectively. In the worst case, FAM requires a total of (N_a² + M_a N_a) + (N_b² + M_b N_b) + N_a (N_a² + M_a N_a) operations for ARTa, ARTb, and the map-field, respectively, which is O(N³) when M_a → ∞, N_a → ∞, M_b → ∞ and N_b → ∞.

In BSO, in addition to n solutions/individuals with P dimensions, the procedures of clustering using the k-means algorithm (where m = 3, 5 and 7), disrupting the cluster center, and updating individuals are involved. This procedure is shown in Fig. 2. According to [60], the time complexity of k-means clustering is O(n · k · i · d), where n, k, i and d are the number of samples, number of clusters, number of iterations, and number of features, respectively. This can be simplified because k and d are constants, and the number of iterations, i, in the worst case equals the number of samples, n. Therefore, the maximum time complexity of k-means clustering is O(n²). The worst-case time complexity of disrupting the cluster center is O(P). The updating operator adds a uniform random value to each dimension of the selected solutions (from a single cluster or two combined clusters); its worst-case time complexity is also O(P). The costliest step in computing the fitness value is measuring the choice value for all prototype nodes, which is O(N_a). As such, the worst-case time complexity of BSO is O(n²) + O(n) + O(P) + O(N_a), which is O(n²) when all variables extend to infinity. Therefore, the worst-case time complexity of FAM-BSO is obtained by combining that of FAM, O(N_a³), and that of BSO, O(n²), which is O(N_a³) when all variables extend to infinity.

Appendix D. Performance indicators

The t-test is a pairwise statistical test which can be used to compare two models, as follows:

  T-value = (mean_1 − mean_2) / (S_P √(2/n))                     (D.1)

where mean_1 and mean_2 represent the mean accuracy rates of models 1 and 2 over n samples, respectively, and S_P is defined as follows:

  S_P = √((var_1² + var_2²) / 2)                                 (D.2)

where var_1 and var_2 indicate the variances of models 1 and 2, respectively.

The sensitivity and specificity [61] are the ratios of correctly predicted positive and negative samples over the total number of positive and negative samples, respectively. The F-measure is computed using:

  F-measure = 2 · (precision × recall) / (precision + recall)    (D.3)

where recall is equivalent to the sensitivity, and precision is computed using:

  precision = true positive / (true positive + false positive)   (D.4)
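As a worked illustration of Eqs. (D.1) and (D.2), the sketch below computes the pairwise t-score exactly as the formulas are printed; the function name and argument names are illustrative. Note that Eq. (D.2) squares the variance terms as printed, whereas a conventional pooled estimate would average the variances directly.

```python
import math

def t_value(mean_1, mean_2, var_1, var_2, n):
    """Pairwise t-score of Eqs. (D.1)-(D.2) for two models compared over
    n paired results (e.g., accuracy rates across cross-validation folds)."""
    s_p = math.sqrt((var_1 ** 2 + var_2 ** 2) / 2.0)       # Eq. (D.2)
    return (mean_1 - mean_2) / (s_p * math.sqrt(2.0 / n))  # Eq. (D.1)
```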
Appendix E. T-test, F-measure, sensitivity and specificity analysis for UCI data sets

Table E.1 compares the mean and median of the accuracy and the number of selected features of FAM-BSO for the UCI data sets, based on 10-fold cross-validation, with the average of each fold computed from 30 repeated runs. As can be seen, there is no significant difference between the mean and median results: the largest difference between the mean and median accuracy is lower than 2.4%, while that of the number of selected features is lower than 1.

Table E.1
Comparison between the mean and median of the accuracy and the number of selected features of FAM-BSO for the UCI data sets (ρ_a = 0.8, θ = 0.6 and m = 5).

Data set       Accuracy                         # of selected features
               Mean     Median   Difference     Mean     Median   Difference
MK1            83.48    84.02    −0.54          66.63    66.60     0.03
Sonar          86.77    85.71     1.06          24.62    24.85    −0.23
Wine           95.45    97.40    −1.95           5.26     5.26     0
German         76.97    75.33     1.64          10.82    10.06     0.76
WBCD           95.49    95.60    −0.11          12.11    12.98    −0.87
Vehicle        74.87    77.23    −2.36           7.43     7.41     0.02
Ionosphere     88.88    86.79     2.09          13.89    13.83     0.06
PID            73.86    74.50    −0.64           3.32     3.30     0.02
Zoo            95.85    95.09     0.76           6.94     7.04    −0.10
Hill-valley    57.62    56.89     0.73          39.94    40.91    −0.97

Table E.2 shows the t-test scores for pairwise comparisons between FAM-BSO with different numbers of clusters and FAM, FAM-GA and FAM-PSO. An analysis of the results is as follows:

• FAM-BSO versus FAM: Out of 30 comparative results (10 data sets, each with cluster numbers of 3, 5, and 7), FAM-BSO outperforms FAM 14 times, ties with FAM 8 times, and is inferior to FAM 8 times;
• FAM-BSO versus FAM-GA: Out of 30 comparative results, FAM-BSO outperforms FAM-GA 25 times, ties with FAM-GA 3 times, and is inferior to FAM-GA 2 times;
• FAM-BSO versus FAM-PSO: Out of 30 comparative results, FAM-BSO outperforms FAM-PSO 24 times, ties with FAM-PSO 3 times, and is inferior to FAM-PSO 3 times.

Based on the above analysis, FAM-BSO is able to outperform FAM, FAM-GA, and FAM-PSO with statistically significant differences for the majority of the evaluations.

Table E.2
The t-test results for pairwise comparison of FAM-BSO versus other methods for the UCI data sets (for FAM, ρ_a = 0.7; for both PSO and BSO, θ is set to 0.6). The symbol "+"/"−" indicates that FAM-BSO performs significantly better/worse than the indicated method, while "=" means both achieve a similar performance.

Data set      m   FAM   FAM-GA   FAM-PSO
MK1           7    +      +        +
              5    +      +        +
              3    +      +        +
Sonar         7    −      −        =
              5    =      +        +
              3    +      +        +
Wine          7    −      +        +
              5    +      +        +
              3    =      +        +
German        7    =      +        +
              5    −      +        −
              3    +      +        +
WBCD          7    +      +        +
              5    +      +        +
              3    +      +        +
Vehicle       7    +      +        +
              5    +      +        +
              3    +      +        +
Ionosphere    7    +      +        +
              5    −      +        +
              3    +      +        +
PID           7    +      +        +
              5    +      +        +
              3    =      +        +
Zoo           7    +      +        +
              5    +      +        +
              3    −      −        +
Hill-valley   7    −      +        −
              5    −      +        −
              3    −      +        +

The F-measure scores of FAM-GA, FAM-PSO and FAM-BSO with different numbers of clusters are shown in Table E.3. Note that for all models ρ_a is set to 0.7, and θ for both FAM-PSO and FAM-BSO is set to 0.6. FAM-BSO with different numbers of clusters produces high F-measure scores for all data sets (Table E.3).

Table E.3
The F-measure scores for the UCI data sets (ρ_a = 0.7, and for both PSO and BSO θ is set to 0.6).

Data set      FAM-GA    FAM-PSO   FAM-BSO
                                  m=3       m=5       m=7
MK1           0.8882    0.8932    0.8569    0.9036    0.9016
Sonar         0.8264    0.7933    0.8561    0.8332    0.7979
German        0.8415    0.8794    0.8976    0.8720    0.8876
WBCD          0.9486    0.9579    0.9675    0.9665    0.9751
Ionosphere    0.9056    0.9186    0.9333    0.9257    0.9366
PID           0.7847    0.8222    0.8247    0.8325    0.8411
Hill-valley   0.5774    0.6044    0.6401    0.5085    0.4836

The sensitivity and specificity rates of FAM-GA, FAM-PSO and FAM-BSO with different numbers of clusters for the UCI machine learning repository data sets are shown in Tables E.4 and E.5, respectively. As can be seen, FAM-BSO with different numbers of clusters achieves good sensitivity rates. In particular, FAM-BSO with m=3 produces the highest sensitivity rate for three out of seven data sets, i.e., Sonar, German and Hill-valley, and the highest specificity rate for three out of seven data sets, i.e., MK1, German and Ionosphere. Overall, all three models (i.e., FAM-GA, FAM-PSO and FAM-BSO) achieve balanced sensitivity and specificity rates for the MK1, WBCD, Ionosphere and Hill-valley data sets.

Table E.4
The sensitivity rates for the UCI data sets (ρ_a = 0.7, and for both PSO and BSO θ is set to 0.6).

Data set      FAM-GA    FAM-PSO   FAM-BSO
                                  m=3       m=5       m=7
MK1           0.8649    0.8838    0.8665    0.8934    0.8777
Sonar         0.7702    0.7240    0.8133    0.7532    0.7306
German        0.8351    0.8786    0.9008    0.8658    0.8850
WBCD          0.9457    0.9606    0.9541    0.9662    0.9601
Ionosphere    0.9490    0.9608    0.9537    0.9567    0.9650
PID           0.7814    0.8358    0.8506    0.8548    0.8752
Hill-valley   0.5661    0.6186    0.6842    0.4808    0.4814

Table E.5
The specificity rates for the UCI data sets (ρ_a = 0.7, and for both PSO and BSO θ is set to 0.6).

Data set      FAM-GA    FAM-PSO   FAM-BSO
                                  m=3       m=5       m=7
MK1           0.8946    0.8773    0.9111    0.8919    0.9109
Sonar         0.9148    0.9081    0.9211    0.9502    0.9087
German        0.6509    0.7211    0.7516    0.7200    0.7458
WBCD          0.9678    0.9718    0.9886    0.9792    0.9944
Ionosphere    0.7262    0.7554    0.8324    0.7946    0.8218
PID           0.6110    0.6346    0.6072    0.6320    0.6109
Hill-valley   0.6181    0.6054    0.5609    0.6030    0.5825
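For reference, the scores reported in Tables E.3–E.5 follow directly from confusion-matrix counts via Eqs. (D.3) and (D.4). The sketch below, with an illustrative function name, assumes non-zero counts and omits the per-fold averaging used in the tables.

```python
def classification_rates(tp, fp, tn, fn):
    """F-measure, sensitivity and specificity from confusion-matrix counts."""
    precision = tp / (tp + fp)      # Eq. (D.4)
    sensitivity = tp / (tp + fn)    # recall: correctly predicted positives
    specificity = tn / (tn + fp)    # correctly predicted negatives
    f_measure = 2.0 * precision * sensitivity / (precision + sensitivity)  # Eq. (D.3)
    return f_measure, sensitivity, specificity
```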
References

[1] E. Hancer, B. Xue, M. Zhang, D. Karaboga, B. Akay, Pareto front feature selection based on artificial bee colony optimization, Inform. Sci. 422 (2018) 462–479.
[2] A.K. Das, S. Sengupta, S. Bhattacharyya, A group incremental feature selection for classification using rough set theory based genetic algorithm, Appl. Soft Comput. 65 (2018) 400–411.
[3] B. Ma, Y. Xia, A tribe competition-based genetic algorithm for feature selection in pattern classification, Appl. Soft Comput. 58 (2017) 328–338.
[4] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B Stat. Methodol. (1996) 267–288.
[5] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2) (2005) 301–320.
[6] Y. Zhang, D. Gong, Y. Hu, W. Zhang, Feature selection algorithm based on bare bones particle swarm optimization, Neurocomputing 148 (2015) 150–157.
[7] H. Liu, L. Yu, Toward integrating feature selection algorithms for classification and clustering, IEEE Trans. Knowl. Data Eng. 17 (4) (2005) 491–502.
[8] W. Siedlecki, J. Sklansky, A note on genetic algorithms for large-scale feature selection, Pattern Recognit. Lett. 10 (5) (1989) 335–347.
[9] H. Ishibuchi, T. Murata, I. Türkşen, Single-objective and two-objective genetic algorithms for selecting linguistic rules for pattern classification problems, Fuzzy Sets and Systems 89 (2) (1997) 135–150.
[10] M.H. Aghdam, N. Ghasem-Aghaee, M.E. Basiri, Text feature selection using ant colony optimization, Expert Syst. Appl. 36 (3) (2009) 6843–6853.
[11] D.P. Muni, N.R. Pal, J. Das, Genetic programming for simultaneous feature selection and classifier design, IEEE Trans. Syst. Man Cybern. B 36 (1) (2006) 106–117.
[12] B. Xue, M. Zhang, W.N. Browne, Particle swarm optimization for feature selection in classification: A multi-objective approach, IEEE Trans. Cybern. 43 (6) (2013) 1656–1671.
[13] F. Hafiz, A. Swain, N. Patel, C. Naik, A two-dimensional (2-D) learning framework for particle swarm based feature selection, Pattern Recognit. 76 (2018) 416–433.
[14] J.F. Chang, A performance comparison between genetic algorithms and particle swarm optimization applied in constructing equity portfolios, Int. J. Innovative Comput. Inf. Control 5 (12) (2009) 5069–5079.
[15] P. Moradi, M. Gholampour, A hybrid particle swarm optimization for feature subset selection by integrating a novel local search strategy, Appl. Soft Comput. 43 (2016) 117–130.
[16] Y. Shi, Brain storm optimization algorithm, in: Advances in Swarm Intelligence: Second International Conference, ICSI 2011, Chongqing, China, June 12-15, 2011, Proceedings, Part I, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 303–309.
[17] G.A. Carpenter, S. Grossberg, N. Markuzon, J.H. Reynolds, D.B. Rosen, Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps, IEEE Trans. Neural Netw. 3 (5) (1992) 698–713.
[18] S. Grossberg, Adaptive pattern classification and universal recoding: II. Feedback, expectation, olfaction, illusions, Biol. Cybern. 23 (4) (1976) 187–202.
[19] A. Al-Daraiseh, M. Georgiopoulos, G. Anagnostopoulos, A.S. Wu, M. Mollaghasemi, GFAM: a genetic algorithm optimization of fuzzy ARTMAP, in: IEEE International Conference on Fuzzy Systems, 2006, pp. 315–322.
[20] Y. Zhang, H. Ji, W. Zhang, TPPFAM: Use of threshold and posterior probability for category reduction in fuzzy ARTMAP, Neurocomputing 124 (2014) 63–71.
[21] F. Pourpanah, C.P. Lim, J.M. Saleh, A hybrid model of fuzzy ARTMAP and genetic algorithm for data classification and rule extraction, Expert Syst. Appl. 49 (2016) 74–85.
[22] B. Efron, Bootstrap methods: another look at the jackknife, in: Breakthroughs in Statistics, Springer, 1992, pp. 569–593.
[23] S. Cheng, J. Chen, X. Lei, Y. Shi, Locating multiple optima via brain storm optimization algorithms, IEEE Access 6 (2018) 17039–17049.
[24] T. Marill, D. Green, On the effectiveness of receptors in recognition systems, IEEE Trans. Inf. Theory 9 (1) (1963) 11–17.
[25] K.J. Wang, K.H. Chen, M.A. Angelia, An improved artificial immune recognition system with the opposite sign test for feature selection, Knowl.-Based Syst. 71 (2014) 126–145.
[26] P. Pudil, J. Novovičová, J. Kittler, Floating search methods in feature selection, Pattern Recognit. Lett. 15 (11) (1994) 1119–1125.
[27] A.K. Das, S. Goswami, A. Chakrabarti, B. Chakraborty, A new hybrid feature selection approach using feature association map for supervised and unsupervised classification, Expert Syst. Appl. 88 (2017) 81–94.
[28] I.S. Oh, J.S. Lee, B.R. Moon, Hybrid genetic algorithms for feature selection, IEEE Trans. Pattern Anal. Mach. Intell. 26 (11) (2004) 1424–1437.
[29] P. Moradi, M. Rostami, Integration of graph clustering with ant colony optimization for feature selection, Knowl.-Based Syst. 84 (2015) 144–161.
[30] R.K. Sivagaminathan, S. Ramakrishnan, A hybrid approach for feature subset selection using neural networks and ant colony optimization, Expert Syst. Appl. 33 (1) (2007) 49–60.
[31] M.H. Aghdam, N. Ghasem-Aghaee, M.E. Basiri, Text feature selection using ant colony optimization, Expert Syst. Appl. 36 (3) (2009) 6843–6853.
[32] D.P. Muni, N.R. Pal, J. Das, Genetic programming for simultaneous feature selection and classifier design, IEEE Trans. Syst. Man Cybern. B 36 (1) (2006) 106–117.
[33] R.A. Davis, A.J. Charlton, S. Oehlschlager, J.C. Wilson, Novel feature selection method for genetic programming using metabolomic 1H NMR data, Chemometr. Intell. Lab. Syst. 81 (1) (2006) 50–59.
[34] Y. Mei, S. Nguyen, B. Xue, M. Zhang, An efficient feature selection algorithm for evolving job shop scheduling rules with genetic programming, IEEE Trans. Emerg. Top. Comput. Intell. 1 (5) (2017) 339–353.
[35] D.K. Agrafiotis, W. Cedeno, Feature selection for structure activity correlation using binary particle swarms, J. Med. Chem. 45 (5) (2002) 1098–1107.
[36] X. Wang, J. Yang, X. Teng, W. Xia, R. Jensen, Feature selection based on rough sets and particle swarm optimization, Pattern Recognit. Lett. 28 (4) (2007) 459–471.
[37] C.L. Huang, J.F. Dun, A distributed PSO–SVM hybrid system with feature selection and parameter optimization, Appl. Soft Comput. 8 (4) (2008) 1381–1391.
[38] B. Chakraborty, Feature subset selection by particle swarm optimization with fuzzy fitness function, in: IEEE International Conference on Intelligent System and Knowledge Engineering (ISKE), vol. 1, 2008, pp. 1038–1042.
[39] R.P. Prado, J.E. Muñoz Expósito, S. García Galán, Flexible fuzzy rule bases evolution with swarm intelligence for meta-scheduling in grid computing, Comput. Inform. 33 (4) (2015) 810–830.
[40] R.E. Precup, M.C. Sabau, E.M. Petriu, Nature-inspired optimal tuning of input membership functions of Takagi-Sugeno-Kang fuzzy models for anti-lock braking systems, Appl. Soft Comput. 27 (2015) 575–589.
[41] S. Vrkalovic, T.A. Teban, I.-D. Borlea, Stable Takagi-Sugeno fuzzy control designed by optimization, Int. J. Artif. Intell. 15 (2) (2017) 17–29.
[42] A. Unler, A. Murat, A discrete particle swarm optimization method for feature selection in binary classification problems, European J. Oper. Res. 206 (3) (2010) 528–539.
[43] B. Xue, M. Zhang, W.N. Browne, Particle swarm optimisation for feature selection in classification: novel initialisation and updating mechanisms, Appl. Soft Comput. 18 (2014) 261–276.
[44] T. Butler-Yeoman, B. Xue, M. Zhang, Particle swarm optimisation for feature selection: A hybrid filter-wrapper approach, in: IEEE Congress on Evolutionary Computation (CEC), 2015, pp. 2428–2435.
[45] Y. Shi, An optimization algorithm based on brainstorming process, Emerg. Res. Swarm Intell. Algorithm Opt. (2015) 1–35.
[46] G.A. Carpenter, S. Grossberg, A massively parallel architecture for a self-organizing neural pattern recognition machine, Comput. Vis. Graph. Image Process. 37 (1) (1987) 54–115.
[47] G.A. Carpenter, S. Grossberg, ART 2: Self-organization of stable category recognition codes for analog input patterns, Appl. Opt. 26 (23) (1987) 4919–4930.
[48] G.A. Carpenter, S. Grossberg, D.B. Rosen, Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system, Neural Netw. 4 (6) (1991) 759–771.
[49] G.A. Carpenter, S. Grossberg, J.H. Reynolds, ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network, Neural Netw. 4 (5) (1991) 565–588.
[50] M. Lichman, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2013, URL http://archive.ics.uci.edu/ml.
[51] C.J. Tan, C.P. Lim, Y.N. Cheah, A multi-objective evolutionary algorithm-based ensemble optimizer for feature selection and classification with neural network models, Neurocomputing 125 (2014) 217–228.
[52] B. Xue, M. Zhang, W.N. Browne, New fitness functions in binary particle swarm optimisation for feature selection, in: IEEE Congress on Evolutionary Computation (CEC), 2012, pp. 1–8.
[53] F. Pourpanah, C.P. Lim, Q. Hao, A reinforced fuzzy ARTMAP model for data classification, Int. J. Mach. Learn. Cybern. (2018) 1–13.
[54] F. Pourpanah, B. Zhang, R. Ma, Q. Hao, Non-intrusive human motion recognition using distributed sparse sensors and the genetic algorithm based neural network, in: 2018 IEEE SENSORS, 2018, pp. 1–4.
[55] C.H. Wang, T.W. Lin, Improved particle swarm optimization to minimize periodic preventive maintenance cost for series-parallel systems, Expert Syst. Appl. 38 (7) (2011) 8963–8969.
[56] L.A. Zadeh, Fuzzy sets, in: Fuzzy Sets, Fuzzy Logic, and Fuzzy Systems: Selected Papers by Lotfi A. Zadeh, World Scientific, 1996, pp. 394–432.
[57] T.H. Cormen, Introduction to Algorithms, MIT Press, 2009.
[58] T. Burwick, F. Joublin, Optimal algorithmic complexity of fuzzy ART, Neural Process. Lett. 7 (1) (1998) 37–41.
[59] F. Pourpanah, C.J. Tan, C.P. Lim, J. Mohamad-Saleh, A Q-learning-based multi-agent system for data classification, Appl. Soft Comput. 52 (2017) 519–531.
[60] V.R. Patel, R.G. Mehta, Impact of outlier removal and normalization approach in modified k-means clustering algorithm, Int. J. Comput. Sci. Issues (IJCSI) 8 (5) (2011) 331–336.
[61] J.L. Melsa, D.L. Cohn, Decision and Estimation Theory, in: McGraw-Hill Series in Electrical Engineering: Communications and Information Theory, McGraw-Hill, 1978.