2009 Fourth International Conference on Internet Monitoring and Protection Behavior-based Proactive Detection of Unknown Malicious Codes Jianguo Ding∗‡ , Jian Jin† , Pascal Bouvry∗ , Yongtao Hu§ and Haibing Guan¶ ∗ Faculty of Science, Technology and Communication (FSTC), University of Luxembourg, L-1359 Luxembourg Email:
[email protected]† School of Information Science and Technology, East China Normal University, Shanghai, 200062, P. R. China ‡ Software Engineering Institute, East China Normal University, Shanghai, 200062, P. R. China § The Third Research Institute of the Ministry of Public Security, P. R. China ¶ School of Information Security Engineering, Shanghai Jiao Tong University, Shanghai 200030, P. R. China Abstract—With the rising popularity of the Internet, the re- efforts. sulting increase in the number of available vulnerable machines, Traditional signature-based anti-virus scanner gets segments and the elevated sophistication of the malicious code itself, the of file content as the technical component. The analytical detection and prevention of unknown malicious codes meet great challenges. Traditional anti-virus scanner employs static features component is just a simple comparison between the segments to detect malicious executable codes and is hard to detect the and the signature-pattern database. This method gives birth to unknown malicious codes effectively. We propose behavior-based a very low false-positive fraction near to zero while it per- dynamic heuristic analysis approach for proactive detection of forms poorly when facing with previously unknown malicious unknown malicious codes. The behavior of malicious codes is executables or variants of existing ones. identified by system calling through virtual emulation and the changes in system resources. A statistical detection model and Current anti-virus scanner involves static heuristic to alle- mixture of expert (MoE) model are designed to analyze the viate this problem. Instead of looking for specific signature behavior of malicious codes. The experiment results demonstrate of a virus, it looks for virus behavior. Each signature is a the behavior-based proactive detection is efficient in detecting generic code sequence that represents a behavior feature and unknown malicious executable codes. a complex comparison is invited in the analytical component. However, this method also drives data form the file content as I. I NTRODUCTION the technical component and can be obfuscated by techniques Malicious code (or malware) is defined as any program such as polymorphism and metamorphism. Although wildcard (including macros and scripts) that is specifically coded to have been added to the code sequence to resolve the obfus- cause an unexpected (and usually unwanted) event on a cation problem, a high false positive fraction comes along user’s PC or a server. Typical examples include viruses, consequently. On the other hand this method depends on aided Trojan horses, Worms, Back doors, Spyware, and Adware, techniques such as unpacking, decryption and disassembly. etc. One reason for the prevalence of malicious code on This paper tries to use dynamic heuristic method to analyze today’s networks is the rising popularity of the Internet and the running behaviors of malicious codes and try to establish the resulting increase in the number of available vulnerable an automatic mechanism to assist classifying and identifying machines because of security-unaware users. Another reason unknown malicious codes. The main contributions are sum- is the elevated sophistication of the malicious code itself [3]. marized as follows: One issue raised was about the behaviour of malicious 1. The characteristic behaviors of malicious codes are code and their sources. Surprisingly, the basic functionality identified based on the behavior features with corresponding of malware has not changed much. The samples that are Win32 API calls and their certain parameters. observed today either steal sensitive information (key loggers, 2. An automatic executable behavior tracing system is password thieves, Bank Trojans), send spam mails, or can implemented to dynamically capture the behavior features we be used to launch denial of service attacks. But the real defined. development of malicious codes make themselves hard to 3. Two approaches are presented for the behavior analysis be detected and identified by obfuscation techniques. For and to establish classification strategies for proactive detec- example, polymorphic viruses would change form each time tion for malicious codes. Experiment results demonstrate that the virus infected a new victim. Metamorphic virus will change the proactive strategies are efficient in detecting previously the structure of the virus body as well as the decryption engine, unknown malicious executables. making it impossible to get a signature match [9]. The rest of this paper is organized as follows: Section Meanwhile, mapping out dark (honeypot) address spaces is 2 describes related work on malicious executables detection an emerging threat. As a result, there is a need to develop based on malicious behavior. Section 3 presents the malicious techniques that can accurately capture emerging threats, since behavior feature definition. Section 4 gives the details of the a good intelligence is a prerequisite for subsequent mitigation dynamic behavior analysis for malicious codes. Section 5 978-0-7695-3612-5/09 $25.00 © 2009 IEEE 72 DOI 10.1109/ICIMP.2009.20 TABLE I demonstrates the proactive detection strategies for malicious R ISK EVALUATION OF C OPYFILE IN 32- BIT W INDOW SYSTEM codes and the experiment results as well. Section 6 concludes and gives an outline for future works. Type of files (I) Type of directories (II) Risk ranking *.dll system-sensitive directories harmful (2) *.exe system-sensitive directories harmful (2) II. R ELATED WORKS IN DETECTION OF MALICIOUS CODES *.dll non-system-sensitive directories less harmful (1) *.exe non-system-sensitive directories less harmful (1) Related works are focused on two aspects most close to our *.txt any directory benign (0) work, the way of getting Win32 API calls and the definition of malicious behavior. [13] uses Win32 API call sequence to reflect the behavior of an executable and a PE binary parser is developed to extract parameters. The risk ranking varies with the combination of static API call sequence. It extracts the CALL instructions and type of files (I) and type of directories (II). finds their target APIs. This method depends on decryption To describe malicious behavior more accurately, the param- and disassembly and research [6] shows that incorrect disas- eters are included with Win32 API to represent the malicious sembly output is produced when confronts with control flow behaviors. 6 classes of malicious behaviors are concluded obfuscation. respectively related to files, processes, windows operation, networking, register, and windows services. [14] presents CWSandbox, which executes malware sam- ples in a simulated environment, monitors all system calls, 1) File-related behavior and automatically generates a detailed report to simplify (1) Creating PE files under system-sensitive directory. and automate the malware analyst’s task. It monitors all the (2) Copying PE files to system-sensitive directory. executed functionality. (3) Writing system-sensitive files such as PE files. Most similarity to our work is the system implemented by 2) Process-related behavior Ryuiti Koike [6]. Their approach captures Win32 API calls (1) Writing memory of other processes. of executables when running within a closed environment. (2) Creating remote thread in other processes. Relation between monitored behavior and API calls is defined. (3) Terminating other processes. However, only two malicious behavior features, file creation 3) Window-related behavior and registry change are monitored. This may be not enough (1) Hooking keyboard. to issue a verdict whether an executable is malicious. (2) Hiding window. From the view of applications, antivirus companies receive 4) Network-related behavior up to several hundred new malware samples every day. Clearly, (1) Binding and listening to port the analysis of these malware samples cannot be performed (2) Initiating http connection. completely manually. Thus an automated solution is needed 5) Register-related behavior for the detection of malicious codes. (1) Creating and Setting register key for automatic running. III. B EHAVIOR IDENTIFICATION OF MALICIOUS CODES (2) Setting register to lower security check. 6) Windows service behavior In general, the way of representing the malicious run- (1) Terminating windows update service. time behavior can be divided into two levels, the machine (2) Terminating windows fire-wall. instruction level and the operating system interface level. For (3) Opening telnet service. the first level, a sequence of CPU instructions is used. For the (4) Opening FTP session. operating system interface level, the Application Programming After many experiments and careful investigations, those Interface (API) which allows applications to exploit the service unsensitive behaviors and common behaviors for both mali- of the operating systems is used. In this paper, our research is cious codes and benign codes are eliminated. As a result, a based on Windows 32-bit operating systems. more detailed 35-dimension feature vector is finally defined Windows 32-bit executables are also known as Portable with each dimension standing for one kind of behavior. See Executables, or PE files. Taking into account that most exe- Table 2. cutables nowadays target 32-bit Windows platform and exploit the service of the operating system, Win32 API is chosen to IV. DYNAMIC ANALYSIS FOR MALICIOUS CODES represent malicious behavior. For the analysis for malicious codes, there are two broad However, the definition of malicious behavior with Win32 categories: static analysis and dynamic analysis techniques. API alone could result in that many normal Win32 API calls Static analysis is the process of analyzing a program’s are treated unjustly. The same Win32 API call with different code without actually executing it. In this process, a binary parameters may have various risk rankings. To copy a .DLL is usually disassembled first, which denotes the process of file or an .EXE file to a system-sensitive directory tends to transforming the binary code into corresponding assembler be considered malicious while copying a .TXT file to user’s instructions. Then, both control flow and data flow analysis document directory is thought to be normal and benign. Table techniques can be employed to draw conclusions about the 1 gives an example of risk ranking of CopyFile with different functionality of the program. 73 TABLE II D EFINITION OF 35- DIMENSION FEATURE VECTOR ( EVENT OPERATION ) FOR MALICIOUS CODES IN 32- BIT W INDOW SYSTEM Event ID Event name Event Operation 1 EVENT SHELL EXECUTE Shell execute hide 2 EVENT WIN EXEC Run Program hide 3 EVENT WRITE PROCESS MEMORY Write memory 4 EVENT ISDEBUG Is Program in debugger 5 EVENT CREATE REMOTE THREAD Create remote thread in another process 6 EVENT TERMINATE PROCESS Terminate process 7 EVENT ENUM PROCESS Enum process 8 EVENT COPY PEFILE Copy pe file 9 EVENT WRITE FILE Write file 10 EVENT READ EMAILFILE Read email file 11 EVENT SETFILETIME Set file time 12 EVENT DELETEFILE Delete file 13 EVENT MODIFY HOSTFILE Modify host file 14 EVENT MODIFY FILEATTR Modify file attrib 15 EVENT MODIFY FILENAME Rename file 16 EVENT CREATE PROCESS Create process 17 EVENT HOOK KEYBOARD Set keyboard hook 18 EVENT HOOK CBT Set computer-based training hook 19 EVENT HIDE WINDOW Hide window 20 EVENT CLICK KEYBOARD Click keyboard 21 EVENT CLICK MOUSE Click mouse 22 EVENT NETWORK BIND Bind port 23 EVENT NETWORK CONNECT Connect network 24 EVENT NETWORK GETHOSTBYNAME Get host by name 25 EVENT NETWORK SPECIALSOCKET Create rawsocket 26 EVENT NETWORK SENDTO Send data 27 EVENT SET REGVALUE Set reg value 28 EVENT REPLACE REG Replace reg 29 EVENT RESTORE REG Restore reg 30 EVENT LOAD REG Load reg 31 EVENT OPEN SERVICE Open service 32 EVENT CREATE SERVICE Create service 33 EVENT MODIFY SERVICE Modify service 34 EVENT NETWORK WINHTTPCONNECT Open http connect 35 EVENT SETSHARE Set share A number of static binary analysis techniques [1] [2] [7] the operating system on a real machine. In implementing the have been introduced to detect different types of malicious experiment, we use virtual machine (VMware) to execute the codes. Static analysis has the advantage that it can cover the dynamic analysis for malicious codes. complete program code and is usually faster than its dynamic To dynamically capture the malicious behavior features, an counterpart. However, a general problem with static analysis automatic executable behavior tracing system is implemented. is that many interesting questions that one can ask about a See Figure 1. program and its properties are undecidable in the general case. In the architecture, the functions of some main components In contrast to static techniques, dynamic techniques analyze are listed as follows: the code during run-time. While these techniques are non- • Auto Input: it automatically inputs PE executables into exhaustive, they have the significant advantage that only those the Sample Database. Rather than simple .exe suffix, PE instructions are analyzed that the code actually executes. Thus, structure is used to exclude non-PE executables. dynamic analysis is immune to obfuscation attempts and • VMware: it is used to run API Tracer and executables. has no problems with self-modifying programs. When using A clean snapshot of VMware is recovered every time, dynamic analysis techniques, the question arises in which when it runs a new PE executables. This aims to protect environment the sample should be executed. the VMware from infection. Running the executable in a virtual machine (that is, a vir- • VM Console: it searches the sample Database to acquire tualized computer), such as one provided by VMware [12], is the path information and transfers PE executables into a a popular choice. In this case, the malware can only affect the VMware. The transfer is realized by set share privilege virtual PC and not the real one. After performing a dynamic on the folder including PE executables. analysis run, the infected hard disk image is simply discarded • API Tracer: it is a tool that can monitor the running and replaced by a clean one (i.e., so called snapshots). Vir- Win32 API calls and translate the living process stack tualization solutions are sufficiently fast. There is almost no to API parameters. This tool is based on windows debug difference to running the executable on the real computer, technologies. and restoring a clean image is much faster than installing • Data Space Mapping: this module maps the captured data 74 Feature Vector PE Executables Space Data Space Auto Input Mapping Sample VM Console Behavior Database Database PE Executables & Commands Captured APIs with Parameters VMware API Tracer Analysis Report Fig. 1. Software architecture for behavior dynamic analysis of malicious codes into the feature vector space. Test data set: V = {v1 , v2 , · · · , vQ }, total number of The output of API Tracer is APIs with their parameters. the data set: Q, No. q sample data in data set: vq = This data space is then mapped to the feature vector space by [vq1 , vq2 , · · · , vq35 ]. Data Space Mapping module. Malicious data set: V(m) , total number of the malicious data (m) (m) Q , No. q sample set: data in malicious data set: vq = V. P ROACTIVE DETECTION APPROACH FOR MALICIOUS (m) (m) vq1 , vq2 , · · · , vq35 . (m) CODES Benign data set: V(b) , total number of the benign data To enable the further analysis for behaviors of malicious (b) (b) Q , No. q sample set: data in benign data set: vq = codes, two behavior feature vector sets are constructed for (b) (b) (b) malicious and benign samples. Both malicious and benign vq1 , vq2 , · · · , vq35 . samples are strictly checked before they are chosen to con- Meanwhile the following principle should be satisfied: struct the dataset. Malicious samples are provided by ANTIY V = V(m) ∪ V(b) laboratory, a member of CNCERT/CC. (National Computer V(m) ∩ V(b) = ∅ Network Emergency Response Technical Team/Coordination 1. Establish standard benchmarks for malicious samples Center of China). Benign samples are collected from the and benign samples Internet and have been filtered by antivirus tools. Currently, the API Tracer only targets PE executables. Non- After the experiment for both training datasets of malicious PE executables will not be included. For the benign samples, they are firstly scanned by Kaspersky [5], Norton Antivirus samples and benign samples, two standard benchmarks V (m) [10] and McAfee [8] to eliminate hidden malicious ones. and V (b) are obtained. The experiment results demonstrates Secondly, repeated samples are deleted according to their MD5 that denseness and sparseness of event statistic exists in the value. Both malicious and benign samples which are failed to final result. Thus a standardize process is given to normalize be traced without any records will not be included either. At the standard benchmarks. They can be calculated as following: last, 8823 malicious and 2821 benign samples are selected as candidates for experiment. V (m) − V (m) V (m) = (1) Both the datasets of malicious samples and benign samples S (m) are divided into two groups for training and test. See table 3. V (m) is the average of event statistic for malicious samples, TABLE III (m) (m) (m) (V −V ) DATASETS OF MALICIOUS SAMPLES AND BENIGN SAMPLES FOR S = 35−1 ; EXPERIMENT V (b) − V (b) Dataset Sum of samples Training dataset Test dataset V (b) = (2) Malicious sample 8823 5000 3823 S (b) Benign samples 2821 2000 821 V (b) is the average of event statistic for benign samples, (b) (b) (b) (V −V ) S = 35−1 . A. Statistical method for behavior analysis of malicious codes Since 35 malicious behavior events (features) are identified, 2. Classification for unknown samples we define: 75 Test datasets are used to evaluate the approaches for the iteratively, once the number of clusters C is fixed. uc,q and classification for samples codes. For unknown samples V (x) , vc are the membership degree of the pattern pq to the c- with the same experiment procedure, the Euclidean distances th cluster and the c-th cluster center respectively. dc,q is d(xm) and d(xb) will be obtained. They can be calculated as Euclidean distance between the pattern pq and the c-th cluster follows: center. Han’s model [15] uses uc,q directly as gc,q , so gc,q can be obtained in both experts training stage and test stage after we have got the cluster centers, and consequently it can be 35 (x) (m) d(xm) = V (x) − V (m) = (v − v )2 (3) seen as fixed. i i i=1 Following is experts training stage. Since the expert network 35 is a single-layer pure linear neural network, the output of the (x) (b) expert can be written as d(xb) = V (x) − V = (v − v )2 (b) i i (4) i=1 ac,q = wcT · pq + bc (8) The classification rules is as follows: or If d(xm) ≤ d(xb) , x is identified as a malicious ac,q = xTc · zq (9) sample; if d(xm) > d(xb) , x is identified as a benign sample. to facilitate calculation where wc and bc are the weights and T the bias of the c-th expert respectively, and xc = wcT , bc , The evaluation results is presented in Section 5.3. T T zq = pq , 1 . Now there only the weights in equation 5 B. Mixture of experts (MoE) model for behavior analysis of are unknown and we can construct a second order function malicious codes according to an approximate steepest decent algorithm and get Another mixture of experts (MoE) model is imported for the weights upgrading rule using Least Mean Square (LMS) the proactive analysis for both malicious samples and benign algorithm as samples. ∆xc = 2α · uc,q · e · zT (10) The architecture of MoE is denoted as Figure 2. where α is a learning rate and e = tq −yq is the error between the target tq and the predicted yq . The final weights of the experts are obtained using equation 10 iteratively. Thus, the training process is over. During the test process an unknown pattern is input to the gating network and we can get the corresponding membership degree with equation (6) and then the final output can be obtained using equation (5). In our experiment, we set the number of gating networks as 3. That means the training datasets for both malicious samples and benign samples belong to 3 groups to some extent and are trained partly in separated neural networks. The training iterations for the whole neural network is set to 500. The Fig. 2. The architecture of MoE detailed algorithm and parameter selection strategies can be The output of the system can be concluded as a function found in our previous work [4]. C C. Experiment results yq (pq ) = (gc,q (pq ) · ac,q (pq )), (5) In the experiment, we use a VMware Workstation 5.5.3 c=1 build on a desktop with Intel Core 2 Duo CPU E4400 @ where gc,q (pq ) is a membership degree of the q-th pattern pq 2GHz, 2GHz. For statistical model, the training speed is to the c-th cluster calculated with FCM algorithm and ac,q (pq ) 0.3054 seconds, the test speed is 0.0435 seconds. For MoE is the output of the c-th expert network. model, the training speed is 68.2880 seconds, the test speed During the gate training stage, C cluster centers can be is 0.3882 seconds. obtained using the formulas To describe the experiment result, we use traditional statis- 1 tical definitions on the error of the evaluation. uc,q = C 2 (6) True Positive (TP): Ratio of malicious executables are dc,q m−1 di,q correctly classified to be malicious. i=1 True Negative (TN): Ratio of benign executables are cor- and rectly classified to be benign. Q um c,q · pq False Negative (FN): Ratio of malicious executables are q=1 vc = (7) incorrectly classified to be benign. Q um False Positive (FP): Ratio of benign executables are incor- c,q q=1 rectly classified to be malicious. 76 Thus TP+FN=1, TN+FP=1. into a preliminary detection system for antivirus systems. The experiment finally produces results for statistical ap- The proposed solutions will improve the efficiency greatly proach and MoE approach for both training dataset and test in current manual based mechanism for unknown malicious dataset. They are demonstrated in Table 4 and Table 5. codes detection. For future work, we intend to do experiment in a larger TABLE IV D ETECTION RATE (%) FOR BEHAVIOR - BASED DETECTION IN BOTH dataset, say 100,000 malicious samples, and try to improve the MALICIOUS SAMPLES AND BENIGN SAMPLES ( TRAINING DATASET ) detection rate. Another work will deal with the noise which may come form the operating system or other application’s Approaches TP FN TN FP interference in the detection system. In the experiment, some Statistical method 98.00 2.00 56.20 43.80 MoE method 79.00 21.00 100.00 0.00 malicious codes are ignored due to the unavailability of the behavior features for certain malicious codes, or they destroy themselves during the behavior detection. An improved ex- periment environment should be included for further behavior TABLE V D ETECTION RATE (%) FOR BEHAVIOR - BASED DETECTION IN BOTH detection for malicious codes. MALICIOUS SAMPLES AND BENIGN SAMPLES ( TEST DATASET ) ACKNOWLEDGEMENT Approaches TP FN TN FP Statistical method 96.01 3.99 65.12 34.88 Many thanks to ANTIY laboratory for their provision of the MoE method 75.20 24.80 99.00 1.00 malicious sample codes and their cooperation in the experi- ment. This work is partly supported by ERCIM (the European Research Consortium for Informatics and Mathematics), the Currently, the industry claims a 70%-80% (TP) detection Shanghai Committee of Science and Technology, China (Grant rate of new and unknown viruses with heuristic scanning, No. 06511502 and No. 08ZR1407200), project MOE-INTEL- which is pretty good considering the complexity of the prob- 08-11 and the financial support by the FNR (Fonds National lem [9]. The statistical method gives the detection rate high to de la Recherche) of Luxembourg. 96.01% (TP), the MoE method gives the detection rate to 75% (TP). Both methods are efficient comparing to the industry R EFERENCES claims. In antivirus society, the FN and FP are often used to evaluate [1] Christodorescu,M., Jha, S.: Static analysis of executables to detect mali- cious patterns. In Proceedings of the 12th USENIX Security Symposiu, the efficiency for antivirus tools. 2003. Actually, FN is more important than FP. Because higher [2] Christodorescu,M., Jha, S., Seshia, S., Song, D.,Bryant, R.: Semantics- FN inclines to bring more threats to system, but a higher aware malware detection. Proceedings of 2005 IEEE Symposium on FP may bring less threats to system. Thus from the view of Security and Privacy, pages 32-46, 2005. [3] Fernandez, J.M.; Bureau, P.-M.: Optimising malware. 25th IEEE Interna- security, the statistical method (FN=3.99%) is optimal to MoE tional Performance, Computing, and Communications Conference, 2006. method (FN=24.8%). However, the high FP (FP=34.88%) in IPCCC 2006. pages 10pp.-586 , 2006. statistical method might be annoying for an end user. Thus the [4] Jian Jin, Guo-Xing Huang, Jianguo Ding, Yong-Tao Hu: Parameters modeling for a modified mixture of experts. Proceedings of 7th World integration of two methods will improve the detection rate. Congress on intelligent Control and Automation, pages 4024-4029, 2008. Theoretically, the detection rate can be improved by tech- [5] http://www.kaspersky.com/ nologies updating, but it can not get to 100%. In current [6] Ryuiti Koike, Naoshi Nakaya and Yuji Koi: Development of System for the Automatic Generation of Unknown Virus Extermination Software. complex network environment and with the increasing diver- Proceedings of the 2007 International Symposium on Applications and sified softwares, some codes (executables), even if this kind the Internet, page 8-8, 2007. of codes is extremely few, have threats to someone (or some [7] Kruegel, C., Robertson,W., Vigna, G.: Detecting Kernel-level rootkits through binary analysis. Proceedings of the 20th Annual Computer system), but not to another one (or another system). Thus Security Applications Conference, pages: 91-100 2004 the definition for malicious codes may be improved with the [8] http://www.mcafee.com/ updated application environments. [9] Jay Munro, Antivirus Research and Detection Techniques. Online publi- cation, http://www.extremetech.com/article2/0,1558,325439,00.asp, 2002. VI. C ONCLUSIONS AND FUTURE WORK [10] http://norton-antivirus-online.com/ [11] Ulrich Bayer, Andreas Moser, Christopher Kruegel and Engin Kirda: In this paper, we propose a malicious executable detecting Dynamic Analysis of Malicious Code. Journal in Computer Virology. Volume 2, pages 67-77, 2006. method using 35-dimension feature vector and define each [12] VMware: server and desktop virtualization, 2008. feature with corresponding Win32 API call and their certain http://www.vmware.com/ parameters. A software system is implemented to dynamically [13] Xu J-Y. , Sung A. H. , Chavez P. , Mukkamala S.: Polymorphic extract the features. Statistical method and MoE methods are Malicious Executable Scanner by API Sequence Analysis. Proceedings of the Fourth International Conference on Hybrid Intelligent Systems, presented to perform the proactive detection for malicious pages 378-383, 2004. codes. Experiment results demonstrate that the methods are [14] Carsten Willems, Thorsten Holz, Felix Freiling: Toward Automated efficiency in proactive detection for unknown malicious codes, Dynamic Malware Analysis Using CWSandbox. IEEE Security and Privacy, vol. 5, no. 2, pp. 32-39, Mar/Apr, 2007 even if they have different efficiency with each other. For [15] B. Han and S. Wang. Fuzzy neural network system modeling. Ship application, it’s possible to integrate the presented methods Science and Technology, pages 51-55, 1998. 77