Expert Systems with Applications 42 (2015) 1325–1339
Contents lists available at ScienceDirect: Expert Systems with Applications
journal homepage: www.elsevier.com/locate/eswa

Developing an approach to evaluate stocks by forecasting effective features with data mining methods

Sasan Barak a,*, Mohammad Modarres b
a Young Researchers Club, Ardebil Branch, Islamic Azad University, Ardebil, Iran
b Department of Industrial Engineering, Sharif University of Technology, Tehran, Iran
* Corresponding author at: No. 15, Shahriar 2 Alley, Danesh Street, Ardebil, Iran. Tel.: +98 9356546404; fax: +98 4517723386. E-mail address: [email protected] (S. Barak).
http://dx.doi.org/10.1016/j.eswa.2014.09.026
0957-4174/© 2014 Elsevier Ltd. All rights reserved.

Article history: Available online 23 September 2014

Keywords: Stock market; Data mining; Classification algorithm; Feature selection; Function-based clustering method

Abstract: In this research, a novel approach is developed to predict stock returns and risks. In this three-stage method, all possible features that can affect stock risk and return are first identified through a comprehensive investigation. In the next stage, risk and return are predicted by applying data mining techniques to the given features. Finally, we develop a hybrid algorithm on the basis of filter and function-based clustering; the important features for risk and return prediction are selected, and risk and return are then re-predicted. The results show that the proposed hybrid model is a proper tool for effective feature selection, and that the selected features are good indicators for the prediction of risk and return. To illustrate the approach, as well as to train and test it, we apply it to Tehran Stock Exchange (TSE) data from 2002 to 2011.
© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

One of the most important concerns of market practitioners is future information about the companies that offer stocks. A reliable prediction of a company's financial status allows the investor to invest more confidently and gain more profit (Huang, 2012). One can refer to different studies on share gains and return prediction, for example, time series stock price prediction models (Araújo & Ferreira, 2013), buy–hold–sell prediction models (Wu, Yu, & Chang, 2014; Zhang, Hu, Xie, Zhang, et al., 2014), index prediction models with ANFIS (Svalina, Galzina, Lujić, & Šimunović, 2013) or MARS and SVR (Kao, Chiu, Lu, & Chang, 2013), and profit gaining (Ng, Liang, Li, Yeung, & Chan, 2014). However, unlike return, risk has rarely been considered for prediction, even though investors usually balance their return against an acceptable level of risk; clearly, both risk and return are important factors in financial decision making (Barak, Abessi, & Modarres, 2013; Tsai, Lin, Yen, & Chen, 2011). Without risk evaluation, the portfolio efficient frontier does not make sense. Thus, this paper implements the forecasting of both risk and return of stocks, which has a tremendous effect on price setting. Moreover, up–down prediction of stock movement, as in (Patel, Shah, Thakkar, & Kotecha, 2014; Yu, Chen, & Zhang, 2014; Zhang, Hu, Xie, Wang, et al., 2014), cannot give a precise view of a stock's future and of investors' gains, whereas classifying the amounts of risk and return into different categories, as our method does, gives more specific and clearer knowledge. Therefore, in this study, the simultaneous prediction of risk and return classes with different classification algorithms is investigated.

To predict risk and return variables accurately, the effective factors need to be identified. In fact, one of the key issues in stock prediction design lies in how to select representative features for prediction (Zhang, Hu, Xie, Wang, et al., 2014). Most studies in this area focus on technical features, financial ratios, or macroeconomic indicators. For example, Tsai and Hsiao (2010) studied 8 financial ratios and 16 macroeconomic indicators as the main features to predict stock return with back propagation in the Taiwan stock market. Cheng, Chen, and Lin (2010) conducted a comprehensive study on macroeconomic and technical features and examined 8 financial ratios and 10 macroeconomic indicators to investigate their effect on return variation in the Taiwan stock market. By applying a probabilistic back propagation algorithm, rough sets, and the C4.5 tree, they achieved 76% accuracy. de Oliveira, Nobre, and Zárate (2013) used 15 technical indicators and 11 fundamental indexes to predict the stock movement of Petrobras with artificial neural networks and obtained 87.50% accuracy for direct prediction. Tsai et al. (2011) considered 19 financial ratios and 11 macroeconomic indicators in the Taiwan stock market by combining a logistic regression algorithm, MLP back propagation, and the CART tree to investigate their effect (negative or positive) on the stock return, and achieved 66.67% accuracy based on bagging and voting algorithms. In the majority of studies, as mentioned, the focus is mostly on financial ratios, macroeconomic indicators, and technical indicators chosen on the basis of experts' ideas. In contrast, this paper presents a systematic and efficient methodology for comprehensively searching the potential representative features of the stock market in three categories (financial ratios, profit and loss reports, and stock pricing models), rather than arbitrarily choosing likely effective features.

Furthermore, many studies have claimed and verified that feature selection (FS) is the key process in stock prediction modeling (Tsai & Hsiao, 2010). Zhang, Hu, Xie, Wang, et al. (2014) use a causal feature selection (CFS) algorithm to find effective features in the Shanghai stock exchange; the idea of their model is causality-based feature selection. They assert that CFS represents direct influences between various stock features, while correlation-based algorithms cannot distinguish direct influences from indirect ones. Wu et al. (2014) use textual and technical features to improve the prediction accuracy of the stock market; they use an SVR algorithm and a trend segmentation method to forecast trends and generate trading signals, respectively, with stepwise regression analysis as their feature selection algorithm. Although there is a variety of studies in the area of feature selection, almost all of them use a single feature selection model.

In this research, a novel hybrid feature selection algorithm based on filter methods and function-based clustering is applied to select the important features. What makes our proposed approach different from previous ones is that we combine 9 different feature selection algorithms with a function-based clustering algorithm. Our hybrid model thus enjoys the power of correlation-based algorithms such as Chi-square and One-R, in addition to the power of classification-error-based, interval-based, and information-based algorithms such as SVM, Relief-f, and the Gini index/gain ratio, respectively. The effectiveness of the model is illustrated by predicting both the risk and the return of stocks and then analyzing the results with and without the hybrid feature selection algorithm.

To sum up, in the first stage of the paper, a complete list of features likely to affect stock risk and return is identified. After developing an appropriate database, in the second stage different classification algorithms are used to predict risk and return; we also scrutinize their results on our database from a feature-oriented viewpoint. Finally, in the third stage, the novel hybrid feature selection algorithm based on filter and function-based clustering is applied to select the important features that affect the prediction of risk and return.

The contributions of the paper are summarized as follows:
- A comprehensive and systematic study to identify the features likely to be effective in risk and return prediction.
- Prediction of stock risk as well as return with different classification methods.
- Design of a hybrid feature selection algorithm on the basis of filter and function-based clustering.
- A feature-oriented analysis of each algorithm; the results indicate the factors that cause the strengths and weaknesses of each algorithm, and the nature of each feature is characterized according to its contribution to the predictions.

The rest of the article is organized as follows. In Section 2, the proposed three-stage model is presented. In Section 3, to illustrate the approach, we implement it on real data from the Tehran Stock Exchange (TSE); the results are analyzed, and predictions with and without the important effective features are compared. Section 4 presents a discussion of real return and risk prediction with the important features. Finally, conclusions and future research directions are provided in Section 5.

2. Proposed model

Our proposed algorithm, which consists of three stages, is shown in Fig. 1. In the first stage, a database is developed and the data are pre-processed. In the next stage, non-systematic risk as well as real return is predicted with classification algorithms. In the third stage, a hybrid feature selection algorithm is presented, and risk and return are re-predicted based on the selected features.

2.1. First stage: developing the financial database

In this stage we utilize the concepts and techniques of input features, response variables, and pre-processing models.

2.1.1. Input features

First we analyze and gather important features from the companies' financial ratios and profit and loss reports, as well as from stock pricing models (Table 1).

Financial ratios: to have a complete list of effective features, we gather 4 general groups of financial ratios as part of the input variables of the companies' database. The importance of these features is discussed in many studies (Bauer, Guenster, & Otten, 2004; Bernstein & Wild, 1999; Carnes & College, 2006; Huang, 2012; Omran & Ragab, 2004; Sadka & Sadka, 2009; Soliman, 2008); see also the financial ratios part of Table 1.

Stock pricing models: we review different stock pricing models (the capital asset pricing model (CAPM), Gordon, Walter, Campbell–Shiller, and Fama–French) and obtain other important factors that affect the risk and return prediction of stocks; see Table 2 (Kaplan & Ruback, 1995; Brealey, Myers, & Allen, 2007; Fama & French, 1993, 2012; Gordon, 1982; Hjalmarsson, 2010; Lee, Tzeng, Guan, Chien, & Huang, 2009; Lewellen, 2004; Mukherji, Dhatt, & Kim, 1997).

Company's profit and loss reports: further factors are extracted from the companies' profit and loss reports. All input variables of the financial model are listed in Table 1.

2.1.2. Response variables

The most important response variables in our model are real return and non-systematic risk, as follows:

R = [(1 + r1/100)(1 + r2/100) ... (1 + rn/100)]^(1/n)   (5)

where r1, ..., rn are the real returns of periods 1, ..., n.

Non-systematic risk is defined as the standard deviation of the stock return:

sigma = sqrt( (1/(n − 1)) * sum_{i=1}^{n} (r_i − E(r))^2 )   (6)

2.1.3. Data pre-processing

Data preparation is an important, and time-consuming, part of the data mining process. It consists of the following steps.

Removing highly correlated features: features whose correlation, on the basis of the Pearson test, is higher than a predefined percentage are removed.
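As a numerical illustration of the two response variables defined in Section 2.1.2, the sketch below (plain Python with hypothetical period returns; not part of the original study) computes the compound real return of Eq. (5) and the non-systematic risk of Eq. (6):

```python
import math

def real_return(returns_pct):
    """Eq. (5): the n-th root of the product of (1 + r_i/100)
    over the n periods (geometric compounding)."""
    prod = 1.0
    for r in returns_pct:
        prod *= 1.0 + r / 100.0
    return prod ** (1.0 / len(returns_pct))

def non_systematic_risk(returns_pct):
    """Eq. (6): sample standard deviation of the period returns."""
    n = len(returns_pct)
    mean = sum(returns_pct) / n
    return math.sqrt(sum((r - mean) ** 2 for r in returns_pct) / (n - 1))

periods = [4.0, -2.5, 7.1, 1.3]     # hypothetical quarterly returns (%)
R = real_return(periods)
sigma = non_systematic_risk(periods)
```

For instance, two consecutive 10% periods give a compound factor of 1.1 per period; the risk value is the usual n−1 (Bessel-corrected) standard deviation.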
Fig. 1. Conceptual design of the proposed model.

Table 1. Financial input features.

Financial ratios
- Liquidity ratios: current ratio, quick ratio, current assets ratio, net working capital ratio, liquidity ratios
- Activity ratios: average payment period, current assets turnover, fixed asset turnover, total asset turnover
- Capitalization ratios: equity ratio, debt coverage ratio, debt to total assets ratio, debt to equity ratio, long-term debt to equity ratio, current debt to equity ratio
- Profitability ratios: percentage of net profit to sales, percentage of operating profit to sales, percentage of gross profit to sales, percentage of net profit to gross profit, return on assets after tax (ROA), return on equity after tax (ROE), working capital return percentage, fixed assets return percentage, assessment of loan usefulness

Stock pricing models
- Capital asset pricing model: rf = return ratio without risk, b = stock beta coefficient (systematic risk), rm = expected return from the market
- Gordon model: EPS, DPS, EPS prediction, EPS cover, prediction difference percentage of EPS with the real amount, EPS growth ratio compared to the previous fiscal year
- Campbell–Shiller model: P/E, P/S
- Walter model: stock cumulative profit
- Fama–French model: company's capital (investment), stock book value, stock market value

Company's profit and loss reports
- Total predicted income (last income prediction in the current fiscal year), total income growth % (total real income/(total real income − total predicted income)), predicted profit margins (last profit ratio/company's income in the current fiscal year), profit margin growth rate (real profit margin/(real profit margin − predicted profit margin)), and efficiency (percent of daily trading volume/company's daily value in the preceding period)

Missing data: defective records caused by incomplete information or by a company's negligence in reporting are deleted from the database. Some Decision Tree algorithms and K-Nearest neighbor techniques do not require the missing data to be replaced.

Finding outlier data: to find outlier data in the database, we use the distance-based approach, which is based on data intervals (Knorr & Ng, 1999), and the density approach (Breunig et al., 2000), in which a parameter named the Local Outlier Factor (LOF) is assigned to each sample based on its K-Nearest neighbor density; samples with a high LOF are flagged as outlier points. Other models, such as Glassman–Host and kernel methods, are derived from the methods introduced above and do not help this approach, which also employs a k-means clustering algorithm and a deviation method (Hong & Wu, 2011).

Table 2. Stock pricing models, their formulas, and the features they add to the database.

- CAPM: R = rf + b(rm − rf)   (1). This model explains the connection between expected return and risk and is used for pricing risky assets (Kaplan & Ruback, 1995); R: expected return, rf: rate of return without risk, b: systematic risk, rm: market expected return. On the basis of this model, rm, rf, and b are added to the database; a brief description of this formula can be found in Appendix A.
- Gordon model: P = DPS/(k − g)   (2), where g is the stock profit growth and k is the shareholders' expected return ratio. Gordon suggested this model, which uses the investment of retained earnings for stock pricing (Gordon, 1982), and it is used in different capital market discussions (Lee et al., 2009). From this model the two important factors EPS and DPS are obtained; in addition, four other features derived from EPS are inserted into the database: the EPS prediction of companies in the fiscal year, EPS coverage, the prediction difference percentage of EPS with the real amount, and the EPS growth ratio compared to the previous fiscal year (Hjalmarsson, 2010).
- Walter model: P = (DPS + (EPS − DPS) r/k)/k   (3), where P: stock market price of each stock, r: internal rate of return, EPS − DPS: cumulative profit per share, k: cost of capital rate (Brealey et al., 2007). According to this model, cumulative profit per share is inserted into the database as a criterion.
- Campbell–Shiller model: P/E, P/S ratios. This model calculates the average stock P/E using market data (Brealey et al., 2007). The literature clarifies that the P/E parameter is very important for analyzing and predicting the stock price, and it is inserted into the database (Hjalmarsson, 2010; Lewellen, 2004). In addition, the P/S ratio, the stock price divided by per-share sales, is also inserted into the database (Mukherji et al., 1997).
- Fama–French model: ri = a + bM(rM) + bsize(Size) + bB/M(B/M)   (4). Fama and French offer beta, size, and book-to-market value in a multivariate regression built with the help of the CAPM to study the factors affecting portfolio returns (Fama & French, 1993, 2012). The first part of the model is similar to the Sharpe model; the second part reflects company size, a factor indicating the company's capital; and the third part indicates the book value to market value. Using this model, the company's capital, stock book value, and stock market value are inserted into the database as 3 important factors.

2.2. Second stage: risk and return prediction with classification methods

Generally, researchers and scholars have been seeking ever more scientific models, ranging from Portfolio Theory by Markowitz in 1952 and the Sharpe asset pricing model in 1964 to Fama–French in 1992. However, these models cannot by themselves evaluate price, risk, and return well. Bartholdy and Peare (2005) compared the CAPM and the Fama–French model; while it appears that the latter can better explain the return deviation and give better evidence, on real data neither of them explains return well. Cao, Leggio, and Schniederjans (2005) concluded that neural networks are much more powerful than the Fama–French model in stock return prediction. Dastgir and Afshari (2004) compared the Walter, Gordon, and present-value-of-future-cash-flow stock pricing models on the Tehran Stock Exchange and observed that real prices and the prices obtained by the models were not equal.
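Since the pricing formulas of Table 2 drive the feature construction, a minimal sketch may help; the snippet below (plain Python, with illustrative parameter values that are our own assumptions rather than values from the paper) evaluates the CAPM expected return (Eq. (1)), the Gordon price (Eq. (2)), and the Walter price (Eq. (3)):

```python
def capm_return(rf, beta, rm):
    # Eq. (1): risk-free rate plus beta times the market risk premium
    return rf + beta * (rm - rf)

def gordon_price(dps, k, g):
    # Eq. (2): P = DPS / (k - g); only meaningful when k > g
    if k <= g:
        raise ValueError("Gordon model requires k > g")
    return dps / (k - g)

def walter_price(eps, dps, r, k):
    # Eq. (3): P = (DPS + (EPS - DPS) * r / k) / k
    return (dps + (eps - dps) * r / k) / k

expected = capm_return(rf=0.05, beta=1.2, rm=0.11)   # ~0.122
price = gordon_price(dps=2.0, k=0.12, g=0.04)        # 25.0
```

The guard in `gordon_price` reflects the model's standing assumption that the required return exceeds the growth rate; outside that region the formula has no economic meaning.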
As these studies show, the traditional methods cannot necessarily estimate properly. Thus, it is necessary to apply methods that can capture the complexity of the data. Researchers have used different approaches, such as neural networks and statistical methods; among the reported results, those obtained by machine learning and data mining algorithms are prominent (Patel et al., 2014).

Ou and Wang (2009) and Lai, Fan, Huang, and Chang (2009) concluded that Decision Tree methods have outstanding performance in stock return prediction. Moreover, the rules obtained from rule-based and tree algorithms are valuable in themselves, since these rules guide investors in buying and in selecting a portfolio. In contrast, the output of methods applied in this area that do not use rules for prediction (such as SVM and NN) is not appropriate for practitioners; the Decision Tree structure is more comprehensible, transparent, and rational. On this basis, our study focuses on tree and rule-based algorithms in order to be more useful to investors and analysts. Levin and Zahavi (2001) concluded that the data correlation problem is more transparent in tree algorithms than in statistical algorithms, and that it can be solved by pruning algorithms. Chang (2011) compared CART, back propagation, and a CART-back propagation hybrid for stock price prediction based on fundamental data and concluded that back propagation and the Decision Tree perform better than the hybrid method.

In this study, risk and return are predicted with different classification methods on the basis of the given features and database, and the methods are compared. This stage is actually performed twice: the first time, prediction uses all features; the second time, only the best features selected by the hybrid feature selection algorithm are used. A comprehensive comparison between these two predictions is also made; in other words, we compare the accuracy of risk and return forecasts with and without feature selection for the different classification methods and explain the effect of feature selection on them. The classification algorithms are shown in Fig. 1.

2.2.1. Testing strategy

In order to obtain robust predictions, we perform 10-fold cross-validation on the predictors (Duda, Hart, & Stork, 2001); this method has been proved to be statistically good enough for evaluating the performance of a predictive model (Mitchell, 1997). In 10-fold cross-validation, the training set is divided into 10 equal subsets. Nine of the 10 subsets are used to train the classifier, and the tenth subset is used as the test set. The procedure is repeated 10 times, with a different subset serving as the test set each time, and the best result is chosen.

To evaluate the predictors reliably, we consider not only prediction accuracy but also sensitivity and specificity. The accuracy of a predictor on a given test set is the percentage of test set tuples that are correctly predicted. Prediction accuracy for five classes can be measured from the confusion matrix shown in Table 3 by formula (7):

Accuracy = (a1 + b2 + c3 + d4 + e5) / sum_{i=1}^{5} (ai + bi + ci + di + ei)   (7)

Sensitivity is the proportion of positive tuples that are correctly identified, while specificity is the proportion of negative tuples that are correctly identified (Han & Kamber, 2006).

Table 3. Confusion matrix for five classes (rows: actual class; columns: predicted class, ordered very low, low, normal, high, very high).

Very low:  a1 b1 c1 d1 e1
Low:       a2 b2 c2 d2 e2
Normal:    a3 b3 c3 d3 e3
High:      a4 b4 c4 d4 e4
Very high: a5 b5 c5 d5 e5

2.3. Third stage: hybrid feature selection

Under the special conditions of the stock exchange, we occasionally face many attributes, some of which carry no useful information and merely complicate the problem. For this reason, feature selection is a crucial step and is highly recommended (Huang, 2012; Huang, Yang, & Chuang, 2008; Tsai & Hsiao, 2010). In this section, to investigate the features that have the greatest effect on risk and return, and to better analyze the algorithms' results, a novel two-level feature selection method is established. It should be noted that feature selection is doubly important in capital market problems, because we encounter many features that are either useless or of low information value, and dealing with such features wastes time without any gain. Feature selection methods are generally divided into three categories: (1) filter methods, (2) wrapper methods, and (3) hybrid methods (Chen & Cheng, 2012).

Fig. 2. Hybrid feature selection.

2.3.1. Filter methods

According to Witten and Frank (2011), 7 algorithms are defined as filter methods: Chi-square (Kononenko, 1994), Info Gain (Dumais, Platt, Heckerman, & Sahami, 1998), Gain Ratio (Duda et al., 2001), Relief-f (Kononenko, 1994), Consistency (Liu & Setiono, 1996), One-R (Holte, 1993), and CFS (Hall, 1998). In addition, Symmetrical Uncertainty and the SVM algorithm are used for weighting the features (Chen & Cheng, 2012). In this section, to compare the importance of each feature under the mentioned methods, a comprehensive analysis was conducted on the features, and the resulting feature weightings are presented.

2.3.2. Function-based clustering method

After obtaining the feature weights from the different filter-based algorithms, we have n attributes, each with m weights, and we need a model that determines, among these weighted attributes, the cluster of important features. In this section we apply the function-based clustering method of Li (2006a). This model is based on hierarchical divisive clustering, which begins with one cluster containing all objects (X, an n × m matrix).

For the objects x1, ..., xn, we denote the vector of group membership of the objects as z = (z1, ..., zn)^T, where z is in Z, the space of sign vectors defined as

Z = { z = (z1, ..., zn)^T | zi = ±1 }   (8)

All objects associated with an entry of +1 in z are classified into one group, whereas the others, with an entry of −1, are classified into the other group. Then, using the multivariate analysis of variance model

xi = mu + zi * gamma + ei,  i = 1, 2, ..., n   (9)

where the error vectors ei are assumed to be normally distributed with zero mean and a common covariance matrix V, i.e. N(0, V), and ei and ej (i ≠ j) are assumed to be independent, the clustering problem is formulated by maximum likelihood as a least squares optimization problem:

min_{a, b, z in Z}  (z − a·1 − Xb)^T (z − a·1 − Xb)   (10)

The unknown vector of cluster memberships and the coefficients of the linear clustering function are estimated simultaneously. The computation of the clustering-function-based method is converted to that of sign analysis (Li, 2006b), and by solving this problem two clusters are obtained.
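To make the two-level idea concrete, the sketch below is a deliberately simplified stand-in (plain Python, made-up filter weights): each feature carries one weight per filter method, and the features are split into two groups by thresholding the mean weight. The paper's actual method instead solves the sign-analysis least squares problem of Eq. (10); this toy split only illustrates the shape of the input and output.

```python
def split_features(weights):
    """weights: {feature: [w_filter1, w_filter2, ...]}.
    Returns (important, unimportant) via a crude two-group split of
    the mean filter weight; a stand-in for function-based clustering."""
    means = {f: sum(ws) / len(ws) for f, ws in weights.items()}
    lo, hi = min(means.values()), max(means.values())
    threshold = (lo + hi) / 2.0          # midpoint as 2-cluster boundary
    important = sorted(f for f, m in means.items() if m >= threshold)
    rest = sorted(f for f, m in means.items() if m < threshold)
    return important, rest

weights = {                 # hypothetical weights from three filter methods
    "ROE": [0.9, 0.8, 0.85],
    "EPS": [0.7, 0.9, 0.8],
    "quick_ratio": [0.1, 0.2, 0.15],
}
imp, rest = split_features(weights)      # imp -> ["EPS", "ROE"]
```

As in the paper's method, the group with the larger internal spread could then be split again until a dispersion-based stopping criterion is met.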
In this approach we use a hybrid model based on the combination of filter methods and the function-based clustering method to extract a set of efficient features (see Fig. 2). Next, the group with the larger within-group dispersion matrix is further divided into two dissimilar subgroups, and the process continues until some stopping criterion is satisfied; most stopping criteria are based on within-group and/or between-group dispersion matrices. In this way we exploit the advantages of the different filter methods and feed their attribute weightings into the function-based clustering method to reach a more accurate decision on the effective features.

3. Experimental results and analysis

In this study, a database including 44 input features and 2 target variables was gathered from TSE data from 2003 to 2012. The resulting database has 1963 records for 400 companies.

According to a group of experts, 5 intervals were introduced for the real return: very high (above 9.3), high (4 to 9.3), average (1.14 to 4), low (−1.3 to 1.14), and very low (below −1.3). Risk is likewise classified into 3 intervals: high (above 15.5), average (6.3 to 15.5), and low (below 6.3). According to the literature, return prediction is usually framed as negative versus positive return (Tsai et al., 2011; Wang & Chan, 2006) or as a negative versus positive return trend (Enke & Thawornwong, 2005); see also (Patel et al., 2014; Yu et al., 2014; Zhang, Hu, Xie, Wang, et al., 2014). For more accuracy, we increased the number of prediction intervals. These intervals give investors more information, with which they can balance the share price against the expected future return; information limited to a company's profitability or loss alone does not help them much. Besides this, risk has rarely been considered in the stock prediction literature, yet only by knowing the amount of risk can we conclude whether a proposed return range is optimal. Previous studies in the field have focused on return prediction alone, while return and risk together define the portfolio efficient frontier, which investors can use to select the optimal portfolio. The process is illustrated in Fig. 3.

Fig. 3. Experimental results process.

3.1. Data pre-processing

Removing highly correlated features: features with correlations higher than 0.95 on the basis of the Pearson test were removed. Accordingly, the features gross profit to sales percentage, assessment of loan usefulness, stock cumulative profit, fixed assets return percentage, debt to equity ratio, and current debt to equity ratio were removed, since their correlations exceeded the 95% threshold relative to the other features. The correlations of the remaining features with real return and with risk are shown in Tables 4 and 5, respectively.

Finding outlier and missing data: to find outlier data, we first used the distance-based approach; analyzing the remote records showed that some belong to very large, governmental companies, which are not applicable to our study and are in fact not outliers, while the other remote records were deleted. Using the density approach, 12 records were flagged as outlier points, of which 7 belonged to large companies and therefore remained in the database; the others were omitted, mostly because they did not provide accurate information. With the clustering approach, we identified samples that fell into none of the clusters; as a result, 6 samples were flagged as outliers, but after analyzing the input features of these companies no suspect case was found and no company was omitted. Finally, using the deviation-based techniques, 5 records were flagged as outlier points.
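The screening steps above can be sketched as follows (plain Python with toy feature columns; the 0.95 cut-off mirrors the one used in this section, while the LOF density step would analogously assign a neighbour-density score per record):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.95):
    """features: {name: [values]}. Keep the first of any pair whose
    absolute Pearson correlation exceeds the threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept

data = {                               # toy feature columns
    "current_ratio": [1.0, 2.0, 3.0, 4.0],
    "quick_ratio":   [1.1, 2.1, 3.1, 4.1],   # nearly identical -> dropped
    "roe":           [0.3, 0.1, 0.4, 0.2],
}
kept = drop_correlated(data)           # ["current_ratio", "roe"]
```

Which member of a correlated pair survives depends on column order; in practice one would keep the feature deemed more interpretable, as the paper does when it names the removed ratios.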
Table 4. Correlation between real return and the other features.

Debt to total assets: 0.0023; Net profit to sales: 0.017; Operating profit to sales: 0.0488; Net profit to gross profit: 0.0131; ROA: 0.1494; ROE: 0.0113; Current assets turnover: 0.0005; Fixed asset turnover: 0.0019; Current ratio: 0.0304; Quick ratio: 0.0103
Return from market: 0.0218; Net working capital: 0.0251; Average payment period: 0.0113; Current assets ratio: 0.0128; Working capital return: 0.0021; Total asset turnover: 0.0937; Equity ratio: 0.0977; Predicted profit margins: 0.0066; Long-term debt to equity ratio: 0.0011; Debt coverage ratio: 0.0246
Efficiency: 0.0135; DPS: 0.1168; EPS: 0.1402; Capital: 0.0059; EPS prediction: 0.1211; EPS difference with real EPS: 0.0037; Return ratio without risk: 0.0384; EPS growth: 0.0194; Total predicted income: 0.0547
Total income growth: 0.0231; Profit margin growth rate: 0.0135; EPS cover: 0.0171; Liquidity ratios: 0.0378; P/E: 0.0058; Stock book value: 0.0377; P/S: 0.0109; Stock market value: 0.0651; Beta coefficient: 0.0153

Table 5. Correlation between risk and the other features.

Debt to total assets: 0.0203; Net profit to sales: 0.0301; Operating profit to sales: 0.0444; Net profit to gross profit: 0.0148; ROA: 0.0307; ROE: 0.0176; Current assets turnover: 0.0055; Fixed asset turnover: 0.0062; Current ratio: 0.05; Quick ratio: 0.0015
Return from market: 0.0414; Net working capital: 0.0056; Average payment period: 0.0099; Current assets ratio: 0.0057; Working capital return: 0.0021; Total asset turnover: 0.0195; Equity ratio: 0.0112; Predicted profit margins: 0.0296; Long-term debt to equity ratio: 0.0212; Debt coverage ratio: 0.0126
Efficiency: 0.0135; DPS: 0.0099; EPS: 0.0222; Capital: 0.0317; EPS prediction: 0.02251; EPS difference with real EPS: 0.0156; Return ratio without risk: 0.0935; EPS growth: 0.0159; Total predicted income: 0.0247
Total income growth: 0.0369; Profit margin growth rate: 0.0031; EPS cover: 0.0313; Liquidity ratios: 0.0238; P/E: 0.0114; Stock book value: 0.0089; P/S: 0.0149; Stock market value: 0.0251; Beta coefficient: 0.0283

Table 6. Comparison of algorithms for the real return variable.

Algorithm | Accuracy | Sensitivity | Specificity | Number of rules | Tree size | Number of leaves
LAD Tree a (Hall et al., 2009) | 78.00 | 77.15 | 75.29 | – | 31 | 15
Cart Decision Tree b | 76.50 | 74.27 | 74.3 | – | 13 | 7
DTNB rule c (Hall & Frank, 2008) | 76.00 | 75.08 | 73.55 | 998 | – | –
Decision table d | 75.50 | 75.14 | 72.44 | 56 | – | –
Rep Tree e | 75.00 | 74.09 | 71.64 | – | 33 | 17
RIDOR rules (Witten & Frank, 2011) | 75.00 | 75.3 | 73.07 | 208 | – | –
J Rip Rule (Witten & Frank, 2011) | 74.90 | 73.18 | 74.02 | 9 | – | –
BF Tree f (Shi, 2007) | 74.50 | 78.49 | 74.28 | – | 9 | 5
Part Rule g (Frank & Witten, 1998) | 72.60 | 67.84 | 69.15 | 104 | – | –
J48 Graft | 71.50 | 69.68 | 66.38 | – | 1619 | 810
NB Tree (Witten & Frank, 2011) | 71.00 | 70.2 | 69.5 | – | 97 | 49
LMT Tree (Witten & Frank, 2011) | 70.50 | – | – | – | 30 | 20
Neural Net (RBF) | 70.00 | 67.3 | 66.4 | – | – | –
Neural Net (MLP h) | 69.00 | 66.3 | 67.1 | – | – | –
Auto MLP i | 70.00 | 67.01 | 66.02 | – | – | –
Rule Induction j | 68.50 | 69.27 | 66.76 | 56 | – | –
FT Tree | 68.50 | 66.3 | 64.8 | – | 45 | 28
J48 Tree | 67.39 | 66.2 | 65.6 | – | 303 | 152
ID3 Numerical | 61.50 | 58.42 | 57.47 | – | 1905 | 979
Bayes | 60.00 | 58 | 62.28 | – | – | –

Notes on Table 6:
a. LAD Tree uses an AD Tree with boosting for prediction and cross-validation to select the training data;
prediction and cross validation to select training data and class label decisions are done on the basis of this algorithm most votes. b The used tree’s split has been done by Gini index algorithm and the Pruning is done based on cost – complexity after constructing the tree. c At first, these models determined the important variables by using Naive Bays algorithm (18 attribute achieve) and then offer classification prediction rules are provided by Decision Tree. d Algorithm first uses the Forward election algorithm to determine the input variables. After 375 implementing the algorithm, ROE, Net working capital, EPS prediction and return are known as effective features and then based on best first (BF) algorithm the model is constructed. e This algorithm is based on information gain and its Pruning is based on prediction error minimization. f BF Tree uses binary pruning to construct a tree based on selecting the first important feature as nodes point. Based on this, the model has found the feature that best predicts the output variable among other input variables. Then a binary tree based on this variable is constructed. Best–First (BF) Decision Tree just use return variable to prediction real return. (Just this input is considered in this method’s tree). g This method has used C4.5 algorithm in every implantation and has used the best of them as a new rule in the model rules. h Feed forward multi layout perceptron based on Levenberg–Marquardt algorithm with 12 neuron in hidden layer. i Number of neurons and hidden layers are optimized. j This algorithm gains the rules based on the information gain and first gained rule is: if return without risk 6 11.967 and return without risk > 12.164 and EPS > 51.447 and EPS coverage percent > 164.500 then low. This algorithm, gains the rule based on the information gain and based on decrease amount in model accuracy while constructing the rules we prune them. 
The model keeps growing until there is no further variable to add or the error exceeds 0.5.

Fig. 4. Real return prediction – CART Decision Tree.

clarified that the input features of these records pertained to the previous fiscal year, and the records were then removed. Also, among the 1963 records, 12 records were deleted because of missing values caused by a lack of information.

3.2. Comparison of algorithms

Table 6 shows the accuracy of the decision tree, rule-based, and neural network algorithms for real return prediction. As is clear from Table 6, the LAD Tree algorithm achieves the highest accuracy in return prediction. Other algorithms, such as SVM and K-nearest neighbor, had accuracies close to 60%; because of this low accuracy we did not use them in the analysis. The low accuracy of SVM may stem from its high sensitivity to missing data: it is essentially an algorithm for two-class outputs, while we are dealing with multi-class data and many missing values.

Investigating the results shows that, in general, denser trees achieve better accuracy than big ones, as the cases of ID3 Numerical, J48 Graph, and J48 Tree make clear. This points to the importance of pruning after tree construction: because these models have no pruning stage, or their pruning algorithm is not efficient, their accuracy is not acceptable.

For real return, trees with a size below 33 reach an accuracy above 70%. Despite the medium accuracy of these trees compared to larger ones, they have higher accuracy on the test data.

The DTNB rule algorithm has the highest test accuracy among the "If-Then Rules" algorithms. On average, the "If-Then Rules" algorithms are more accurate than the trees, although the best prediction overall is obtained by the LAD Tree. Some of the tree algorithms are shown in Figs. 4–7 (the returns in all figure nodes are return ratios without considering risk).

Risk is predicted similarly, as shown in Table 7. In general, larger trees (for example, above 300 nodes) and smaller trees show no prominent results compared to medium-sized trees. Among the "If-Then Rules" algorithms the highest test accuracy again comes from DTNB, as in real return prediction. For risk prediction, too, the "If-Then Rules" algorithms are more accurate than the trees, but the best prediction is obtained by the LAD Tree. For risk, the neural network results are less accurate than for return prediction. Fig. 8 depicts the LAD Tree for risk prediction.

3.3. Prediction after hybrid feature selection

In this section we develop a hybrid feature selection method to evaluate each feature. In the first stage, a comprehensive analysis of the features is conducted with the filter methods mentioned above in order to assign suitable weights to the features (all features, even highly correlated ones, are considered).

Fig. 5. Real return prediction – LAD Tree.

Depending on the evaluation function, we applied the following filter methods:

- Based on an interval method: Relief-F.
- Based on information: Gini index, Information Gain Ratio, Information Gain, and Symmetrical Uncertainty.
- Based on correlation: Chi-square and OneR.
- Based on consistency: the consistency method.
- Based on classification errors: the SVM method.

The CFS method did not give any acceptable result because of dependence and was eliminated in the pre-processing step.

We applied the RapidMiner and Weka software to implement the algorithms (Hofmann & Klinkenberg, 2013); algorithms taken from Weka are labeled accordingly (Weka IG, Weka Chi-2, and so on). The resulting weights for the features of the risk parameter and the real return parameter are shown in Tables 8 and 9, respectively.

Fig. 6. Best-First decision tree for real return:
  Return < -13.10281: very low (365.0/62.0)
  Return >= -13.10281
  | Return < 10.81819: low (392.0/96.0)
  | Return >= 10.81819
  | | Return < 53.97851: normal (326.0/126.0)
  | | Return >= 53.97851
  | | | Return < 136.66997: high (184.0/63.0)
  | | | Return >= 136.66997: very high (72.0/27.0)

After obtaining the feature weights from the different filter-based algorithms, we have 13 columns (m) holding the weights of 44 attributes (n). To obtain accurate clusters, we pre-process this data set: the seventh column (Weka IG) of Tables 8 and 9 is set aside because of its correlation with the other columns, so the grouping is done for 12 columns and 44 features.

We then use the clustering-function-based method to cluster the attributes and pick out the most important features; usually the first and second clusters are the effective ones, and we choose them (Li, 2006b).

For real return, the first clustering step separates return and market return from the other attributes; in other words, they are more important, with higher weights, than the other attributes. In the second step, 6 of the 43 remaining features are separated, so finally eight features in the first and second clusters are considered important. Similarly, for the risk parameter, the first step separates three features (return, beta coefficient, and efficiency) from the 44 features.

Fig. 7. Real return prediction – REP Tree.
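The two-stage procedure described above (filter weights first, then clustering of the weight vectors) can be sketched as follows. This is a simplified stand-in rather than the paper's implementation: only two filter criteria are used instead of twelve, and a plain 1-D k-means replaces the clustering-function-based method of Li (2006a); all function names are illustrative.

```python
import numpy as np

def filter_weights(X, y):
    # Two illustrative filter criteria per feature: absolute Pearson
    # correlation with the target and a signal-to-noise ratio between
    # the high- and low-target halves. The paper combines 12 filter
    # columns (Chi-2, IG Ratio, Relief-F, ...); these two are stand-ins.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0)
                                * np.linalg.norm(yc) + 1e-12)
    hi, lo = X[y > np.median(y)], X[y <= np.median(y)]
    s2n = np.abs(hi.mean(axis=0) - lo.mean(axis=0)) \
          / (hi.std(axis=0) + lo.std(axis=0) + 1e-12)
    return np.column_stack([corr, s2n])      # shape: (n_features, n_filters)

def select_top_clusters(weights, k=3, keep=2, iters=100):
    # Cluster features by their mean filter weight (1-D k-means) and keep
    # the `keep` clusters with the highest centers, mimicking "the first
    # and second clusters are the effective ones".
    w = weights.mean(axis=1)
    centers = np.linspace(w.min(), w.max(), k)
    labels = np.zeros(len(w), dtype=int)
    for _ in range(iters):
        labels = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = w[labels == j].mean()
    top = np.argsort(centers)[::-1][:keep]
    return np.where(np.isin(labels, top))[0]

# Synthetic example: features 0 and 1 drive the target, so they should
# separate into the top clusters, much as return and market return do
# in the first clustering step for real return.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.normal(size=300)
selected = select_top_clusters(filter_weights(X, y))
```

On such data the informative features receive clearly higher mean weights, so the two top clusters isolate them while the noise features fall into the bottom cluster.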
In the second step, 12 of the 42 remaining features are separated, so finally 15 of the 44 features are selected as important. The results for the risk and real return parameters are presented in Table 10.

Table 7
Algorithm comparison for the risk variable.

  Algorithm (a)       Accuracy  Sensitivity  Specificity  Rules  Tree size  Leaves
  LAD Tree               78.24      69.62       80.24        –       31        20
  DTNB Rule              77.41      69.44       78.51       426       –         –
  Decision Table         76.57      65.38       81.32       297       –         –
  BF Tree (b)            76.15      67.41       74.71        –      109        55
  JRip Rule              74.90      73.7        74.3          9       –         –
  J48 Graph              73.64      64.68       71.66        –      721       361
  PART Rule              73.64      65.36       71.64        55       –         –
  REP Tree               72.80      67.13       69.39        –       77        39
  Rule Induction         71.55      64.45       70.25        59       –         –
  J48 Tree               71.55      70.9        72.5         –      313       157
  FT Tree                67.78      65.6        64.8         –       63        32
  NB Tree                66.95      66.3        64.7         –        7         4
  Neural Net (MLP)       59.00      61.2        59.22        –        –         –
  ID3 Numerical          57.00      55.1        54.2         –      553       403
  Bayes                  55.65      57.3        50.2         –        –         –

(a) Some algorithms used for real return prediction have low accuracy in risk prediction; we do not report their results.
(b) The beta coefficient is the first leaf.

As can be seen, more features were selected for the risk variable than for real return. The classification results with the selected features are shown in parentheses in Table 11, where Deviation = accuracy based on the selected features − accuracy based on all features.

Fig. 8. Risk prediction – LAD Tree.

Table 8
Weighting for the risk parameter.
  Columns: Chi-2 | IG Ratio | Info Gain | R-f | SVM | Consistency | Weka IG | Weka Chi-2 | Weka con | Weka IGR | Weka R-f | Gini index | Weka OneR

  Return                             1     0.69  0.61  1     0.024 0.87  0.513 0.437 0.513 0.63  0.9   0.725 0.759
  Beta                               0.188 0     0.062 0.397 0.079 0.163 0.097 0.073 0.097 0.132 0.519 0.061 0.124
  Efficiency                         0.091 0.707 0.076 0.071 0.083 0.099 0.056 0.044 0.056 0.112 0.093 0.093 0.127
  Market return                      0.058 0.67  0.072 0.019 0.023 0.058 0.106 0.092 0.106 0.157 0.021 0.064 0.117
  EPS prediction %                   0.008 0.615 0.099 0     0.076 0.005 0.05  0.039 0.05  0.17  0.002 0.112 0.111
  Long-term debt to equity           0.032 0.487 0.076 0.011 0.029 0.039 0.085 0.078 0.085 0.136 0.015 0.077 0.118
  Total income growth %              0.021 0.707 0.055 0.005 0.045 0.016 0.03  0.025 0.03  0.16  0.002 0.075 0.095
  EPS growth %                       0.025 0.707 0.046 0.005 0.061 0.016 0.026 0.021 0.026 0.136 0.002 0.061 0.095
  ROE                                0.058 0.328 0.074 0.054 0.07  0.073 0.039 0.031 0.039 0.135 0.046 0.092 0.089
  DPS                                0.046 0.328 0.078 0.047 0.025 0.063 0.041 0.033 0.041 0.15  0.045 0.089 0.123
  Debt to total assets ratio         0.07  0.338 0.057 0.035 0.065 0.086 0.03  0.024 0.03  0.141 0.034 0.057 0.1
  Profit margin growth rate          0.01  0.615 0.036 0.005 0.084 0.008 0.021 0.017 0.021 0.136 0.009 0.045 0.055
  EPS                                0.046 0.421 0.024 0.063 0.088 0.045 0.016 0.013 0.016 0.056 0.131 0.033 0.093
  P/E                                0.027 0.328 0.064 0.013 0.108 0.041 0.032 0.025 0.032 0.102 0.002 0.075 0.127
  Predicted profit margin            0.053 0.319 0.067 0.071 0.023 0.07  0.036 0.028 0.036 0.11  0.037 0.063 0.047
  Stock market value                 0.029 0.615 0.023 0.013 0.043 0.029 0.016 0.012 0.016 0.065 0.011 0.027 0.049
  ROA                                0.033 0.422 0.038 0.01  0.038 0.046 0.023 0.019 0.023 0.152 0     0.056 0.049
  Current debt to equity ratio       0.039 0.381 0.046 0.003 0.043 0.049 0.026 0.018 0.026 0.121 0.002 0.039 0.092
  Debt to equity                     0.036 0.419 0.047 0     0.012 0.05  0.041 0.039 0.041 0.114 0.005 0.026 0.043
  Book value                         0.015 0.615 0.013 0.004 0.131 0.016 0     0     0     0     0.001 0.021 0.048
  Assess the loan usefulness         0.039 0.421 0.026 0.01  0.031 0.052 0.018 0.016 0.018 0.122 0.006 0.018 0.075
  Equity ratio                       0.018 0.615 0.003 0.001 0.099 0.019 0     0     0     0     0     0.011 0.04
  Gross profit to sale               0.021 0.107 0.055 0.005 0.034 0.016 0.03  0.025 0.03  0.134 0.002 0.074 0.059
  Current ratio                      0.016 0.615 0.008 0.002 0.073 0.018 0     0     0     0     0     0.013 0.032
  Operating profit to sale           0.017 0.615 0     0.001 0.09  0.02  0     0     0     0     0.004 0.002 0.023
  P/S                                0.014 0.615 0.01  0.002 0.06  0.015 0     0     0     0     0     0.018 0.037
  Fixed asset turnover               0.032 0.421 0.006 0.027 0.059 0.024 0     0     0     0     0.128 0.011 0.061
  Fixed assets return                0.016 0.615 0.008 0.002 0.058 0.018 0     0     0     0     0     0.013 0.032
  Debt coverage ratio                0.005 0.366 0.041 0.002 0.045 0.008 0.024 0.019 0.024 0.073 0     0.052 0.088
  Liquidity ratio                    0.015 0.421 0.028 0.011 0.018 0.022 0.033 0.027 0.033 0.083 0.006 0.025 0
  Net profit to sale                 0.023 0.377 0.021 0.004 0     0.04  0.015 0.012 0.015 0.046 0.001 0.03  0.068
  Working capital return percentage  0.002 0.319 0.027 0.012 0.094 0.003 0.017 0.015 0.017 0.064 0.001 0.02  0.049
  EPS deviation                      0     0.338 0.026 0.006 0.033 0     0.017 0.014 0.017 0.06  0     0.038 0.067
  Capital                            0.034 0.371 0.016 0.01  0.04  0.036 0     0     0     0     0.025 0.022 0.052
  Current assets ratio               0.034 0.371 0.016 0.009 0.024 0.037 0     0     0     0     0.025 0.022 0.041
  Total predicted income             0.009 0.347 0.024 0.003 0.026 0.015 0.016 0.013 0.016 0.051 0     0.034 0.019
  Net profit to gross profit         0.016 0.338 0.009 0.03  0.068 0.019 0     0     0     0     0.027 0.015 0.013
  EPS coverage percent               0.008 0.328 0.016 0.014 0.061 0.016 0     0     0     0     0.003 0.024 0.063
  Net working capital                0.015 0.358 0.009 0.019 0.011 0.025 0     0     0     0     0.001 0.013 0.064
  Average payment                    0.018 0.319 0.011 0.034 0.024 0.034 0     0     0     0     0.013 0.007 0.053
  Total asset turnover               0.015 0.358 0.008 0.019 0.011 0.025 0     0     0     0     0.001 0.013 0.054
  Quick ratio                        0.027 0.328 0.009 0.001 0.043 0.038 0     0     0     0     0.01  0.015 0.027
  Stock cumulative profit            0.012 0.338 0.004 0.005 0.016 0.018 0     0     0     0     0.001 0.008 0.033
  Current assets turnover            0.012 0     0     0.01  0.073 0.019 0     0     0     0     0.007 0     0.051

Table 9
Weighting for the real return parameter.

  Columns: Chi-2 | IG Ratio | Info Gain | R-f | SVM | Consistency | Weka IG | Weka Chi-2 | Weka con | Weka IGR | Weka R-f | Gini index | Weka OneR

  Return                             1     0.69  0.61  1     0.024 0.87  0.513 0.437 0.513 0.63  0.9   0.725 0.759
  Market return                      0.188 0     0.062 0.397 0.079 0.163 0.097 0.073 0.097 0.132 0.519 0.061 0.124
  ROA                                0.091 0.707 0.076 0.071 0.083 0.099 0.056 0.044 0.056 0.112 0.093 0.093 0.127
  Beta                               0.058 0.67  0.072 0.019 0.023 0.058 0.106 0.092 0.106 0.157 0.021 0.064 0.117
  ROE                                0.008 0.615 0.099 0     0.076 0.005 0.05  0.039 0.05  0.17  0.002 0.112 0.111
  EPS growth %                       0.032 0.487 0.076 0.011 0.029 0.039 0.085 0.078 0.085 0.136 0.015 0.077 0.118
  Profit margin growth rate          0.021 0.707 0.055 0.005 0.045 0.016 0.03  0.025 0.03  0.16  0.002 0.075 0.095
  Operating profit to sale           0.025 0.707 0.046 0.005 0.061 0.016 0.026 0.021 0.026 0.136 0.002 0.061 0.095
  EPS                                0.058 0.328 0.074 0.054 0.07  0.073 0.039 0.031 0.039 0.135 0.046 0.092 0.089
  DPS                                0.046 0.328 0.078 0.047 0.025 0.063 0.041 0.033 0.041 0.15  0.045 0.089 0.123
  EPS prediction %                   0.07  0.338 0.057 0.035 0.065 0.086 0.03  0.024 0.03  0.141 0.034 0.057 0.1
  Predicted profit margin            0.01  0.615 0.036 0.005 0.084 0.008 0.021 0.017 0.021 0.136 0.009 0.045 0.055
  Net profit to sale                 0.046 0.421 0.024 0.063 0.088 0.045 0.016 0.013 0.016 0.056 0.131 0.033 0.093
  EPS deviation                      0.027 0.328 0.064 0.013 0.108 0.041 0.032 0.025 0.032 0.102 0.002 0.075 0.127
  Efficiency                         0.053 0.319 0.067 0.071 0.023 0.07  0.036 0.028 0.036 0.11  0.037 0.063 0.047
  P/S                                0.029 0.615 0.023 0.013 0.043 0.029 0.016 0.012 0.016 0.065 0.011 0.027 0.049
  Net profit to gross profit         0.033 0.422 0.038 0.01  0.038 0.046 0.023 0.019 0.023 0.152 0     0.056 0.049
  Quick ratio                        0.039 0.381 0.046 0.003 0.043 0.049 0.026 0.018 0.026 0.121 0.002 0.039 0.092
  Equity ratio                       0.036 0.419 0.047 0     0.012 0.05  0.041 0.039 0.041 0.114 0.005 0.026 0.043
  Stock market value                 0.015 0.615 0.013 0.004 0.131 0.016 0     0     0     0     0.001 0.021 0.048
  Book value                         0.039 0.421 0.026 0.01  0.031 0.052 0.018 0.016 0.018 0.122 0.006 0.018 0.075
  Long-term debt to equity           0.018 0.615 0.003 0.001 0.099 0.019 0     0     0     0     0     0.011 0.04
  Gross profit to sale               0.021 0.107 0.055 0.005 0.034 0.016 0.03  0.025 0.03  0.134 0.002 0.074 0.059
  Debt to equity ratio               0.016 0.615 0.008 0.002 0.073 0.018 0     0     0     0     0     0.013 0.032
  Debt coverage ratio                0.017 0.615 0     0.001 0.09  0.02  0     0     0     0     0.004 0.002 0.023
  Current debt to equity             0.014 0.615 0.01  0.002 0.06  0.015 0     0     0     0     0     0.018 0.037
  Net working capital                0.032 0.421 0.006 0.027 0.059 0.024 0     0     0     0     0.128 0.011 0.061
  Assess the loan usefulness         0.016 0.615 0.008 0.002 0.058 0.018 0     0     0     0     0     0.013 0.032
  Stock cumulative profit            0.005 0.366 0.041 0.002 0.045 0.008 0.024 0.019 0.024 0.073 0     0.052 0.088
  P/E                                0.015 0.421 0.028 0.011 0.018 0.022 0.033 0.027 0.033 0.083 0.006 0.025 0
  Liquidity ratio                    0     0.615 0.007 0.007 0.025 0     0     0     0     0     0.001 0.012 0.02
  Working capital return percentage  0.023 0.377 0.021 0.004 0     0.04  0.015 0.012 0.015 0.046 0.001 0.03  0.068
  Total income growth %              0.002 0.319 0.027 0.012 0.094 0.003 0.017 0.015 0.017 0.064 0.001 0.02  0.049
  Current ratio                      0     0.338 0.026 0.006 0.033 0     0.017 0.014 0.017 0.06  0     0.038 0.067
  Current assets ratio               0.034 0.371 0.016 0.01  0.04  0.036 0     0     0     0     0.025 0.022 0.052
  Debt to total assets ratio         0.034 0.371 0.016 0.009 0.024 0.037 0     0     0     0     0.025 0.022 0.041
  Current assets turnover            0.009 0.347 0.024 0.003 0.026 0.015 0.016 0.013 0.016 0.051 0     0.034 0.019
  Total asset turnover               0.016 0.338 0.009 0.03  0.068 0.019 0     0     0     0     0.027 0.015 0.013
  EPS coverage percent               0.008 0.328 0.016 0.014 0.061 0.016 0     0     0     0     0.003 0.024 0.063
  Fixed assets turnover              0.015 0.358 0.009 0.019 0.011 0.025 0     0     0     0     0.001 0.013 0.064
  Total predicted income             0.018 0.319 0.011 0.034 0.024 0.034 0     0     0     0     0.013 0.007 0.053
  Fixed assets return                0.015 0.358 0.008 0.019 0.011 0.025 0     0     0     0     0.001 0.013 0.054
  Average payment                    0.012 0.338 0.004 0.005 0.016 0.018 0     0     0     0     0.001 0.008 0.033
  Capital                            0.012 0     0     0.01  0.073 0.019 0     0     0     0     0.007 0     0.051

Table 10
Selected features for the risk and real return parameters.

  Risk parameter (first and second clusters of the function-based clustering): return, beta coefficient, efficiency, market return, EPS prediction, EPS growth percent, DPS, P/E, EPS, equity ratio, stock book value, debt to total assets ratio, predicted profit margin, P/S, total income growth.
  Real return parameter (first and second clusters of the function-based clustering): return, market return, beta coefficient, return on assets (ROA), EPS growth percent, EPS, predicted profit margin, EPS coverage percent.

If the deviation is positive, using the selected important features improves the prediction results, and vice versa. As the results in Table 11 show, with this hybrid method we can obtain better predictions from some methods with fewer features.

4. Discussion

4.1. The real return results in prediction with selected features

When the proposed hybrid model selects all the features that were effective in the first prediction, the denser trees, such as BF Tree, LAD Tree, and FT Tree, give better accuracy.

Using the selected features affects the forecasting accuracy of each algorithm differently. Some trees with a large structure, such as J48 Graph and J48 Tree, get lower accuracy, while others, such as ID3 Numerical, get higher accuracy. The higher accuracy of most algorithms is due to the fact that the hybrid feature selection model acts as a pruning step and reduces the overtraining error. The Bayes-based algorithms obtained weak predictions for both real return and risk, so the DTNB and NB Tree outputs for both targets achieved lower accuracy. The results show that rule-based algorithms with an average number of rules, such as PART Rule, Decision Table, and Rule Induction, obtain better results, while the accuracy of algorithms with few rules, such as JRip Rule, decreased.
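The accuracy, sensitivity, and specificity reported in Tables 6 and 7, and the deviation reported in Table 11, are simple counting measures over the predicted classes. A minimal sketch (the function names are illustrative, not from the paper):

```python
def accuracy(y_true, y_pred):
    # Fraction of records whose predicted class matches the real class.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def class_metrics(y_true, y_pred, cls):
    # Per-class sensitivity and specificity; the tables report these
    # averaged over the return/risk classes (very low ... very high).
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    tn = sum(t != cls and p != cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec

def deviation(acc_selected, acc_all):
    # Table 11's measure: positive means the selected features helped.
    return acc_selected - acc_all
```

For example, the LAD Tree's risk deviation is deviation(0.8024, 0.7824), i.e. the 2% improvement reported in Table 11.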
Otherwise, accuracy may drop, as it does for the CART and REP trees. The accuracy of the neural network increased for each output, because it avoided getting stuck in local optima.

Table 11
Algorithm deviations (accuracy with the selected features in parentheses).

  Algorithm            Risk accuracy deviation   Return accuracy deviation
  LAD Tree             2% (80.24%)               1% (79%)
  CART Decision Tree   1.5% (66.5%)              4.5% (72%)
  DTNB Rule            0.9% (76.51%)             1% (75%)
  Decision Table       1.2% (75.37%)             0.07% (76.20%)
  BF Tree              2% (74.15%)               1.5% (76%)
  JRip Rule            0% (74.90%)               1.2% (73.7%)
  J48 Graph            1.83% (71.81%)            2% (69.50%)
  PART Rule            1.91% (75.55%)            2% (74.6%)
  REP Tree             0.77% (73.52%)            2% (73%)
  Rule Induction       0.95% (72.50%)            1.5% (70%)
  J48 Tree             1.5% (70.05%)             0.50% (66.89%)
  FT Tree              2% (69.18%)               2% (70.5%)
  NB Tree              4.20% (62.75%)            1% (70%)
  Neural Net (MLP)     2.00% (61.00%)            2.5% (71.5%)
  ID3 Numerical        2.5% (59.5%)              2% (63.5%)
  Bayes                1.40% (54.15%)            2.00% (58.00%)

Moreover, using the feature weights obtained from the Chi-2, IG Ratio, or Info Gain algorithms (without the hybrid model), the return and market return features get the highest weights for predicting real return (90% of the cumulated weight); the high accuracies of the BF Tree and LAD Tree algorithms may be due to these two features. Likewise, for the risk parameter, the return, beta coefficient, and efficiency features get the highest weights (90% of the cumulated weight), so the high accuracy of the LAD Tree, FT Tree, REP Tree, and Rule Induction algorithms could be due to these three features, which these algorithms emphasize more than the others.

A comparison between our method and similar research is given in Table 12: six hybrid methods with excellent return forecasting accuracy on different countries' stock exchanges are compared in terms of input data, base classifier, feature selection, hybrid prediction model, and best accuracy.

Table 12
Comparison of results with other studies.

  Tsai et al. (2011) – Electronics industry in Taiwan; 19 financial ratios and 11 macroeconomic indicators; MLP, CART, logistic regression; no feature selection; bagging and voting; best accuracy 66.67%.
  Huang (2012) – 30 selected companies in Taiwan; 14 financial ratios; SVR with GA; best accuracy 76.71–85%.
  Cheng et al. (2010) – Taiwan; 10 technical indexes and 8 macroeconomic indicators; PNN, C4.5, rough sets; hybrid model; best accuracy 76%.
  Huang et al. (2008) – South Korea and Taiwan; 23 technical indexes; SVM, K-NN, CART, logistic regression, back propagation; wrapper feature selection; voting; best accuracy 76.06% and 80.28%.
  Tsai and Hsiao (2010) – Taiwan; 8 fundamental indexes and 11 macroeconomic indicators; GA, PCA, and CART feature selection; back propagation; best accuracy 79%.
  Tsai, Lu, and Yen (2012) – Taiwan; 61 intangible-asset value variables; MLP; PCA, stepwise regression, decision trees, association rules, and GA feature selection; MLP; best accuracy 75%.
  Zhang et al. (2014) – Shanghai stock exchange; 50 financial and fundamental features; NB, SVM, J48, LR, NN; causal feature selection; best accuracy 55%.
  Recent work (return forecasting in TSE, Iran) – 44 financial ratios and fundamental indexes; CART, REP Tree, LAD Tree, etc.; function-based clustering; hybrid model; best accuracy 80.24%.
  Recent work (risk forecasting in TSE, Iran) – 44 financial ratios and fundamental indexes; DTNB, BF Tree, LAD Tree, etc.; function-based clustering; hybrid model; best accuracy 79.01%.

4.2. The risk results in prediction with selected features

Because of the large number of features extracted by the hybrid feature selection algorithm for risk, the moderate-size trees, such as REP Tree, FT Tree, and LAD Tree, have better accuracy than before. However, BF Tree accuracy decreased because two effective attributes were removed.

This analysis also makes clear that the algorithms that use the beta coefficient, market return, P/E, and efficiency features, such as the LAD Tree, obtained the highest accuracy, which improved up to 80.24%. Large trees such as J48 Tree and J48 Graph get lower prediction accuracy, but the ID3 Numerical results improved. As stated before, the predictions of the Bayes-based algorithms, such as DTNB and NB Tree, deteriorated; the large drop in the NB Tree prediction is due to its dense structure. For the other rule-based algorithms the prediction results improved or remained stable, except for the Decision Table algorithm. This derives from the average number of rules covered by the selected features.

We also applied data dimension reduction methods, including Principal Component Analysis (PCA), Independent Component Analysis (ICA), Factor Analysis (FA), the Discrete Wavelet Transform (DWT), and the Discrete Fourier Transform (DFT), to the data set. Our results show that, despite the long runtime, the accuracy of the prediction algorithms decreased sharply. For instance, after reducing the dimensionality from 44 to 11 with PCA, the LAD Tree accuracy drops to 52.7%, which is very low. Moreover, the reduction itself is very slow on this data: the DWT, for instance, takes 11 h and 32 min in RapidMiner. Using MATLAB the execution time is shorter, but the accuracy of the results does not differ much. Among these 5 methods, ICA, despite its long execution time (approximately 31 h in RapidMiner), obtains the best prediction accuracy: 71% with the LAD Tree.

5. Conclusions

In this study, an approach for the simultaneous prediction of risk and real return was developed by applying data mining techniques to a fundamental data set. To do this, first, through a comprehensive study, the features that can potentially affect risk and return were investigated. Then, after an appropriate database had been developed, the preprocessing step was carried out. To predict the real return and the risk, 20 and 15 different prediction algorithms were applied, respectively, and the strengths and weaknesses of each were investigated by analyzing the size and leaves of the tree algorithms and/or the "If-Then Rules" of the rule-based algorithms. In the next step, using a hybrid feature selection algorithm built on 9 different filter algorithms and the function-based clustering method, the important features were selected and the prediction was repeated with them.

The results show that the number of effective features for the real return parameter is usually smaller than for the risk parameter. With these features the results of most algorithms improved, so this hybrid feature selection method is capable of identifying effective features. The high accuracy of the prediction results indicates that the extracted features explain the behavior of the market very well and can be considered a suitable database for future research. Our findings enable investors to analyze the market and obtain highly accurate results with fewer features, without being confused by the many features that are not necessarily effective. This study differs from previous ones in combining 9 different feature selection algorithms with a function-based clustering algorithm; the hybrid model enjoys the advantages of all the feature selection algorithms and makes robust and accurate decisions. The effectiveness of our model is illustrated by predicting both the risk and the return of stocks and then analyzing the results with and without our hybrid feature selection algorithm, while almost none of the relevant studies in this field pays attention to the prediction of risk. Furthermore, we designed a systematic and efficient methodology for comprehensively searching the potential representative features of the stock market in 3 categories (financial ratios, profit and loss reports, and stock pricing models) rather than arbitrarily choosing likely effective features.

Appendix A. CAPM model

The beta coefficient measures the change of the stock return with respect to the market return and is computed as

  β = Cov(market return, stock return) / Var(market return)                    (A1)

The market expected return gives the market return over a definite period:

  r_m = (P_t − P_{t−1}) / P_{t−1}                                              (A2)

where P_t is the market index at the end of the period (for example, 2013/12/28) and P_{t−1} is the market index at the beginning of the period (for example, 2013/1/1).

The return without risk can also be obtained from

  r_ft = ((p_t − p_{t−1}) + D_t) / p_{t−1} × 100                               (A3)

where p_t is the end-of-period stock price, p_{t−1} the beginning-of-period stock price, and D_t the benefits of stock ownership that belong to the shareholder in period t. If there is a capital increase during the investment period from savings or from receivables and cash income, the formula becomes

  r_ft = (D_t + p_t(1 + a + b) − (p_{t−1} + ca)) / (p_{t−1} + ca) × 100        (A4)

where a is the percentage of the capital increase from receivables and cash income, b is the percentage of the capital increase from savings, and c is the nominal amount paid by the investor for the capital increase from receivables and cash income. We use these formulas to calculate r_m, r_f, and β.
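As a worked sketch of the Appendix A formulas (A1), (A3), and (A4), the quantities can be computed directly; the function and parameter names below are illustrative, not from the paper:

```python
import numpy as np

def simple_return(p_t, p_prev, d_t=0.0):
    # Eq. (A3): period return in percent, including the cash benefit D_t.
    # With d_t = 0 this is also the market return of Eq. (A2), in percent.
    return ((p_t - p_prev) + d_t) / p_prev * 100

def adjusted_return(p_t, p_prev, d_t, a, b, c):
    # Eq. (A4): return adjusted for a capital increase, where `a` and `b`
    # are the increase percentages from receivables/cash income and from
    # savings, and `c` is the nominal amount paid by the investor.
    return (d_t + p_t * (1 + a + b) - (p_prev + c * a)) / (p_prev + c * a) * 100

def beta(stock_returns, market_returns):
    # Eq. (A1): beta = Cov(market return, stock return) / Var(market return).
    # ddof is set explicitly so covariance and variance use the same scaling.
    r_s, r_m = np.asarray(stock_returns), np.asarray(market_returns)
    return np.cov(r_m, r_s, ddof=1)[0, 1] / np.var(r_m, ddof=1)
```

For example, a stock moving from 100 to 110 with a 5-unit cash benefit returns 15%, and a stock whose returns are exactly twice the market's has a beta of 2.
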
Finally, investigating each algorithm from a feature-oriented viewpoint indicates the factors that cause the strengths and weaknesses of that algorithm. Therefore, by examining the properties of the database, we can choose a proper algorithm without implementing all the methods. This idea can be extended not only to quantitative investment but also to other fields in which expert systems and machine learning techniques are used.

The limitation of this method is that collecting all the data and information may be difficult in some real cases.

Future research directions of this paper include, but are not limited to:

1. Combining prediction methods in the framework of fusion models, or optimizing the classification algorithms with metaheuristics, to improve the prediction results.
2. Predicting other important variables (in addition to risk and return), such as liquidity (Barak et al., 2013).
3. Using technical features and textual information, in addition to fundamental features, to build a more comprehensive feature set and to be able to predict the short-term situation of stocks.
4. Customizing the proposed approach for the prediction of risk and return in a particular industry, or testing the accuracy of the procedure on data from other popular stock markets, such as the US stock market, which may open new dimensions in this procedure.
5. Applying different clustering models to our feature selection data set and comparing the results with new feature selection methods, such as CFS (Zhang, Hu, Xie, Wang, et al., 2014), density-based clustering (Shamshirband et al., 2014), or entropy-based clustering for feature selection (Lin, 2013).

References

Araújo, R. d. A., & Ferreira, T. A. (2013). A morphological-rank-linear evolutionary method for stock market prediction. Information Sciences, 237, 3–17.
Barak, S., Abessi, M., & Modarres, M. (2013). Fuzzy turnover rate chance constraints portfolio model. European Journal of Operational Research, 228, 141–147.
Bartholdy, J., & Peare, P. (2005). Estimation of expected return: CAPM vs. Fama and French. International Review of Financial Analysis, 14, 407–427.
Bauer, R., Guenster, N., & Otten, R. (2004). Empirical evidence on corporate governance in Europe: The effect on stock returns, firm value and performance. Journal of Asset Management, 5, 91–104.
Bernstein, L., & Wild, J. (1999). Analysis of financial statements (5th ed.). McGraw-Hill.
Brealey, R. A., Myers, S. C., & Allen, F. (2007). Principles of corporate finance (9th ed.). McGraw-Hill.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on management of data (pp. 93–104). Dallas.
Cao, Q., Leggio, K. B., & Schniederjans, M. J. (2005). A comparison between Fama and French's model and artificial neural networks in predicting the Chinese stock market. Computers & Operations Research, 32, 2499–2512.
Carnes, T. A., & College, B. (2006). Unexpected changes in quarterly financial-statement line items and their relationship to stock prices. Academy of Accounting and Financial Studies Journal, 10, 99–116.
Chang, T.-S. (2011). A comparative study of artificial neural networks, and decision trees for digital game content stocks price prediction. Expert Systems with Applications, 38, 14846–14851.
Chen, Y.-S., & Cheng, C.-H. (2012). A soft-computing based rough sets classifier for classifying IPO returns in the financial markets. Applied Soft Computing, 12, 462–475.
Cheng, J.-H., Chen, H.-P., & Lin, Y.-M. (2010). A hybrid forecast marketing timing model based on probabilistic neural network, rough set and C4.5. Expert Systems with Applications, 37, 1814–1820.
Dastgir, M., & Afshari, M. H. (2004). Evaluating stock pricing models on Tehran stock exchange assets. Accounting Studies, 3, 60–94.
de Oliveira, F. A., Nobre, C. N., & Zárate, L. E. (2013). Applying artificial neural networks to prediction of stock price and improvement of the directional prediction index – case study of PETR4, Petrobras, Brazil. Expert Systems with Applications, 40, 7596–7606.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. Wiley.
Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of the international conference on information and knowledge management (pp. 148–155).
Enke, D., & Thawornwong, S. (2005). The use of data mining and neural networks for forecasting stock market returns. Expert Systems with Applications, 29, 927–940.
Fama, E. F., & French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33, 3–56.
Fama, E. F., & French, K. R. (2012). Size, value, and momentum in international stock returns. Journal of Financial Economics, 105, 457–472.
Frank, E., & Witten, I. H. (1998). Generating accurate rule sets without global optimization. In Fifteenth international conference on machine learning (pp. 144–151).
Gordon, M. J. (1982). The investment, financing, and valuation of the corporation (Vol. 52). Greenwood Press Reprint.
Hall, M. (1998). Correlation-based feature subset selection for machine learning. University of Waikato.
Hall, M., & Frank, E. (2008). Combining Naive Bayes and decision tables. In Proceedings of the 21st Florida artificial intelligence research society conference (FLAIRS). Florida.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11, 10–18.
Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques. Morgan Kaufmann.
Hjalmarsson, E. (2010). Predicting global stock returns. Journal of Financial and Quantitative Analysis, 45, 49–80.
Hofmann, M., & Klinkenberg, R. (2013). RapidMiner: Data mining use cases and business analytics applications. Chapman and Hall/CRC.
Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11, 63–90.
Hong, T.-P., & Wu, C.-W. (2011). Mining rules from an incomplete dataset with a high missing rate. Expert Systems with Applications, 38, 3931–3936.
Huang, C.-F. (2012a). A hybrid stock selection model using genetic algorithms and support vector regression. Applied Soft Computing, 12, 807–818.
Huang, C. F. (2012b). A hybrid stock selection model using genetic algorithms and support vector regression. Applied Soft Computing, 12, 807–818.
Huang, C.-J., Yang, D.-X., & Chuang, Y.-T. (2008). Application of wrapper approach and composite classifier to the stock trend prediction. Expert Systems with Applications, 34, 2870–2878.
Kao, L.-J., Chiu, C.-C., Lu, C.-J., & Chang, C.-H. (2013). A hybrid approach by integrating wavelet-based feature extraction with MARS and SVR for stock index forecasting. Decision Support Systems, 54, 1228–1244.
Lewellen, J. (2004). Predicting returns with financial ratios. Journal of Financial Economics, 74, 209–235.
Li, B. (2006a). A new approach to cluster analysis: The clustering-function-based method. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68, 457–476.
Li, B. (2006b). Sign eigenanalysis and its applications to optimization problems and robust statistics. Computational Statistics & Data Analysis, 50, 154–162.
Lin, H.-Y. (2013). Feature selection based on cluster and variability analyses for ordinal multi-class classification problems. Knowledge-Based Systems, 37, 94–104.
Liu, H., & Setiono, R. (1996). A probabilistic approach to feature selection – a filter solution. In Proceedings of the 13th international conference on machine learning (pp. 319–327).
Mitchell, T. M. (1997). Machine learning. McGraw-Hill.
Mukherji, S., Dhatt, M. S., & Kim, Y. H. (1997). A fundamental analysis of Korean stock returns. Financial Analysts Journal, 53, 75–80.
Ng, W. W., Liang, X.-L., Li, J., Yeung, D. S., & Chan, P. P. (2014). LG-trader: Stock trading decision support based on feature selection by weighted localized generalization error model. Neurocomputing.
Omran, M., & Ragab, A. (2004). Linear versus non-linear relationships between financial ratios and stock returns: Empirical evidence from Egyptian firms. Review of Accounting and Finance, 3, 84–102.
Ou, P., & Wang, H. (2009). Prediction of stock market index movement by ten data mining techniques. Modern Applied Science, 3, 28–42.
Patel, J., Shah, S., Thakkar, P., & Kotecha, K. (2014). Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques. Expert Systems with Applications.
Sadka, G., & Sadka, R. (2009). Predictability and the earnings–returns relation. Journal of Financial Economics, 94, 87–106.
Shamshirband, S., Amini, A., Anuar, N. B., Kiah, M. L. M., Wah, T. Y., & Furnell, S. (2014). D-FICCA: A density-based fuzzy imperialist competitive clustering algorithm for intrusion detection in wireless sensor networks. Measurement.
Shi, H. (2007). Best-first decision tree learning. Master's thesis, University of Waikato.
Soliman, M. T. (2008). The use of DuPont analysis by market participants. The Accounting Review, 83, 823–853.
Svalina, I., Galzina, V., Lujić, R., & Šimunović, G. (2013). An adaptive network-based fuzzy inference system (ANFIS) for the forecasting: The case of close price indices. Expert Systems with Applications, 40, 6055–6063.
Tsai, C.-F., & Hsiao, Y.-C. (2010). Combining multiple feature selection methods for stock prediction: Union, intersection, and multi-intersection approaches. Decision Support Systems, 50, 258–269.
Tsai, C.-F., Lin, Y.-C., Yen, D. C., & Chen, Y.-M. (2011). Predicting stock returns by Kaplan, S. N., & Ruback, R. S. (1995). The valuation of cash flow forecasts: An classifier ensembles. Applied Soft Computing, 11, 2452–2459. empirical analysis. The Journal of Finance, 50, 1059–1093. Tsai, C.-F., Lu, Y.-H., & Yen, D. C. (2012). Determinants of intangible assets value: The Knorr, E. M., & Ng, R. T. (1999). Finding intentional knowledge of distance-based data mining approach. Knowledge-Based Systems, 31, 67–77. outliers. In Proceedings of the 25th international conference on very large data Wang, J.-L., & Chan, S.-H. (2006). Stock market trading rule discovery using two- bases (pp. 211–222). Edinburgh, Scotland. layer bias decision tree. Expert Systems with Applications, 30, 605–611. Kononenko, I. (1994). Estimating attributes: analysis and extensions of relief. In Witten, I. H., & Frank, E. (2011). Data mining: Practical machine learning tools and Proceedings of the seventh European conference on machine learning (pp. 171– techniques (3rd ed.). San Francisco: Morgan Kaufmann. 182). Wu, J.-L., Yu, L.-C., & Chang, P.-C. (2014). An intelligent stock trading system using Lai, R. K., Fan, C.-Y., Huang, W.-H., & Chang, P.-C. (2009). Evolving and clustering comprehensive features. Applied Soft Computing, 23, 39–50. fuzzy decision tree for financial time series data forecasting. Expert Systems with Yu, H., Chen, R., & Zhang, G. (2014). A SVM stock selection model within PCA. Applications, 36, 3761–3773. Procedia Computer Science, 31, 406–412. Lee, W.-S., Tzeng, G.-H., Guan, J.-L., Chien, K.-T., & Huang, J.-M. (2009). Combined Zhang, X., Hu, Y., Xie, K., Wang, S., Ngai, E., & Liu, M. (2014). A causal feature MCDM techniques for exploring stock selection based on Gordon model. Expert selection algorithm for stock prediction modeling. Neurocomputing. Systems with Applications, 36, 6421–6430. Zhang, X., Hu, Y., Xie, K., Zhang, W., Su, L., & Liu, M. (2014). 
An evolutionary trend Levin, N., & Zahavi, J. (2001). Predictive modeling using segmentation. Journal of reversion model for stock trading rule discovery. Knowledge-Based Systems. Interactive Marketing, 15, 2–22.