JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 20, 783-800 (2004)

Short Paper

Case Instance Generation and Refinement for Case-Based Criminal Summary Judgments in Chinese*

CHAO-LIN LIU, CHENG-TSUNG CHANG AND JIM-HOW HO
Department of Computer Science
National Chengchi University
Taipei, 116 Taiwan
E-mail:
[email protected]

Researchers have applied case-based reasoning to legal informatics in recent decades, yet real-world applications of case-based systems have been hampered by the need to manually create the case instances. To alleviate this problem, we propose algorithms for automatically generating and refining case instances from real-world judgment documents for criminal summary judgments. Our algorithms attempt to extract important legal information from the documents of past lawsuits to build case instances, and then refine these case instances by merging similar cases and removing relatively irrelevant information from the cases. Experimental results, obtained by applying the resulting cases to classifying real-world lawsuits, indicate that our algorithms are viable for assisting the task of building case databases for case-based legal reasoning systems.

Keywords: Chinese legal information system, case-based reasoning, weighted k-nearest neighbor, text mining, artificial intelligence, rule-based reasoning

1. INTRODUCTION

In recent years, researchers have applied artificial intelligence techniques to a variety of applications in the legal domain, including computer-assisted sentencing [23]; drafting [3], abstraction [21], and classification [22] of legal documents; and education of legal practitioners [2]. In implementing these applications, researchers have employed such knowledge representation formalisms as artificial neural networks [20], case-based reasoning [15], conceptual networks [16], and decision trees [5]. Inspired partially by recent progress in legal informatics, and partially by the increasing availability of Chinese legal documents on the Internet [1], interest in Chinese legal information systems has burgeoned in recent years [9], and Chang et al. employ human-provided cases in case-based reasoning (CBR) for classifying legal documents in Chinese [8].
Among the different approaches to legal informatics, CBR techniques have attracted special attention from many researchers [5].1 Despite this popularity in the research community, Brüninghaus and Ashley point out that relying on manually constructed cases constitutes a barrier to applying CBR-based systems to real-world problems [5, 6]. Thus, they apply machine-learning methods to abstracting entities in legal documents for automatic indexing of cases [6]. Moens et al. employ an array of text analysis techniques for automatic extraction of cases [21]. Weber extracts information about case attributes using text-template mining algorithms [25].

In this paper, we report results of applying machine-generated cases in a CBR-like system for classifying criminal summary judgments in Taiwan. Currently we process lawsuits that involve one defendant who is accused of one of 12 frequently committed crime types. Given a judgment document (判決書), we sift its contents for potentially useful information, and use the results to build a corresponding case. We then apply the constructed cases to classifying indictment documents (起訴書) based on the description provided by the public prosecutors (檢察官). Although the number of these constructed cases remains manageable at the current stage, it is just a matter of time before we have a large number of raw cases. We must find ways to construct fewer and more representative cases from these raw cases so that we can conduct inference effectively and manage case databases efficiently. This is possible because there are similar lawsuits in practice, so we do not have to keep a case instance for each lawsuit.

(Received March 24, 2003; revised August 14, 2003; accepted October 30, 2003. Communicated by Hsu-Chun Yen. * This manuscript contains material reported in [17, 18].)
Moreover, the contents of judgment documents may contain information that is relatively unimportant for making judgments, so there is room for refining the case instances. To this end, we design a clustering algorithm that merges similar cases and removes subsumed cases. We also design an algorithm for removing supposedly irrelevant information from the case instances. We evaluated our classifiers with real-world lawsuits, and the classifiers offered competitive performance. When we built a case instance for each of the past lawsuits, the classifier that used machine-generated cases performed slightly worse than a classifier that used human-provided cases. The former achieved a correct rate just 6% below that of the latter, while both rejected almost the same number of test lawsuits. Experimental results also indicated that our algorithms were able to refine case instances without significantly sacrificing the performance of the classifiers. When we reduced the number of case instances by more than a half (and the storage space of the case databases by two thirds), the classifier that used the refined cases rejected about 10% of the test lawsuits, and performed worse than the classifier that used human-provided cases by about 10% in precision and recall.

Section 2 provides more information about the criminal summary judgments in Taiwan and the source of our data for constructing and evaluating our case instances. Section 3 describes definitions of human-provided rules and cases in our system, and explains how we apply them to classifying lawsuits. Section 4 discusses our algorithms for the generation and application of machine-generated cases. Section 5 elaborates on our methods of refining the case instances for improving inferential efficiency. Section 6 presents our experimental setups and analyzes the results. Finally, we compare our approach with more related work, and discuss possible extensions of our work.
1 Because the word case can refer to both legal cases in conversation and cases in case-based reasoning, we use lawsuits for legal cases and cases for the data used in case-based reasoning. When the context provides sufficient clues for disambiguation, we use case for both senses.

2. BACKGROUND

2.1 Chinese Legal Document Processing

The following Chinese excerpt from a judgment document describes a drunk driver who caused a traffic accident. For privacy reasons, we use “O” in the text.

吳OO於民國九十年十月二十七日上午十時十分許,在某KTV店內服用酒類,致其反應能力降低,已不能安全駕駛動力交通工具後,仍駕駛車號HY-1234號自用小客車沿板橋市文化路往台北方向行駛,在行經臺北縣板橋市文化路與站前路路口時,撞及自對向車道駛來,欲左轉站前路,由林OO所駕駛之Z3-5678號自用小客車,嗣經警方處理,對吳OO施以酒精測試,其測定值為0.八五MG/L,始循線查知上情。

Despite the myriad of research work on Chinese information processing, there is relatively little work tailored for Chinese legal information. Lai and Huang use a small Chinese legal corpus in demonstrating the applicability of the Dependency Grammar for annotating Chinese documents [14]. Brown builds the CHINATAX system for inference problems that are related to the Commercial law of China [4].

For segmenting Chinese text, we resort to strategies commonly used in Chinese information retrieval systems. We segment consecutive characters into words with the help of a dictionary, and prefer the longer matches of strings. Specifically, we employ HowNet [13] for determining the word boundaries in Chinese legal texts. To compensate for the fact that HowNet is not designed particularly for the legal domain, we also customize HowNet by adding legal terms to it in our experiments.

2.2 Data Source

We select 12 relatively frequent classes of the criminal summary judgments in our experiments.
For easier identification, we encode each class with a code: Offences against Public Safety (公共危險罪, C1), Offences against Morals (妨害風化罪, C2), Offences of Gambling (賭博罪, C3), Offences of Causing Bodily Harm (傷害罪, C4), Offences of Larceny (竊盜罪, C5), Offences of Misappropriation (侵占罪, C6), Offences of Receiving Stolen Property (贓物罪, C7), Offences against the Chattel Secured Transactions Act (違反動產擔保交易法, C8), Offences against the Statute for Narcotics Hazard Control (違反毒品危害防制條例, C9), Offences against the Electronic Game Arcade Business Management Act (違反電子遊戲場業管理條例, C10), Offences against the Child and Juvenile Sexual Transaction Prevention Act (違反兒童與少年性交易防制條例, C11), and Offences against the Statute Governing the Relations between the People of the Taiwan Area and the Mainland Area (違反台灣地區與大陸地區人民關係條例, C12). We use C13 to denote all lawsuits not belonging to the 12 target classes.

We acquired the documents of three consecutive seasons from the court where one of the authors serves as a judge. The majority of these real cases, but not all, belong to the 12 target classes. We used the lawsuits that occurred in the first season as the training data for computers to generate and refine cases, and used the lawsuits that occurred in the second and third seasons as the test data in the experiments. The collected lawsuits reflected the frequencies of lawsuits in real life. Evidently, some crime types are more frequent than others, and the distribution over the crime types was not as uniform as one might like to have for the evaluation of algorithms.

Table 1. Statistics of case quantities.

Class            C1   C2   C3   C4   C5   C6   C7   C8   C9  C10  C11  C12  C13
Training data   158   26   30   14   99   15    9   15   19   16   19    9   74
Test data       271   44   87   33  243   46   15   40   73   23   52   23  146

Each document in our collection contains several sections. The indictment documents and the judgment documents have similar structures.
Both contain a header section that describes the information about the court and the prosecution/judgment date, a section that describes personal information about the defendants, a section that describes the alleged/confirmed criminal activities (SACA), and a section that describes the alleged/judged offended law articles and sentences. We extract the SACA from the documents for training and testing in our work.

3. HUMAN-MANAGED CLASSIFIERS

Since case classification is an ordinary job in the courts, we may ask human experts to specify rules and cases for case classification. In section 6, we compare the performances of these human-provided cases and machine-generated cases.

3.1 Keyword-based Rules: Syntax, Semantics, and Applications

Fig. 1 shows the syntax of our rules for judging prosecution reasons. The line numbers are given for explaining the semantics, but are not part of the rules.

1. prosecution reason
2. activation threshold
3. no
4. taboo-phrase-1
   ……
5. taboo-phrase-n
6. event
7. indicative-phrases-1: weight-1
   ……
8. indicative-phrases-m: weight-m

Fig. 1. Syntax for rules and cases.

The syntax comprises four parts. The first part occupies only the first line, and contains the prosecution reason that will be assigned to the test lawsuit if the rule is fired. The second part occupies only the second line, and contains the threshold that is used to determine whether the rule should be fired. The third and optional part, which spans lines 3 to 5, specifies the phrases that will prohibit the rule from being fired if any of them appears in the indictment documents. Line 3 is included here to indicate the beginning of the list of taboo phrases. The fourth part, which spans lines 6 to 8, specifies the phrases that may activate the rule.
Each indicative phrase has an associated weight that encodes its strength for activating the rule, and phrases listed in the same line have the same weights. Similar to line 3, line 6 signifies the beginning of the indicative lists.

The main purpose of applying the human-provided rules and cases is to use the experimental results as a reference for comparing the effectiveness of machine-generated rules and cases. Hence, we allow human experts to assign the weights freely as long as the rules and cases conform to the syntax and make sense to the experts.

The case classification task boils down to string matching against a rule, given that we are not doing in-depth semantic analysis of Chinese documents yet. We examine whether the required and the forbidden strings appear in the SACA in the indictment documents for the classification task. Because Chinese is flexible in noun and verb generation, it is possible that public prosecutors use slightly different phrases to describe the same concept. For instance, one may use “危害他人生命” and “危害他人性命” to describe the concept of “endanger other’s life.” Although these character strings are different, they actually represent the same meaning. We cope with this problem by comparing strings with their common substring. Given the previous two Chinese strings, we can apply a simple variant of the LCS algorithm [12] to find their common substring “危害他人命”, which returns 5 as the output. Since LCS employs the concept of dynamic programming, we can efficiently conduct string matching for selecting a rule to apply. If the matched substring represents a sufficient portion of a listed string in the rule, we assume that the listed string exists in the SACA. We set a threshold for deciding whether the substring can be considered a match, and it is set to 75% in our experiments. The selection of this threshold was based on results of an experiment that used the training data as the test data.
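The matching step described above can be sketched as follows. This is a minimal illustration of the idea rather than the authors' implementation; the function names are ours, and only the standard LCS dynamic program and the 75% threshold come from the text.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b,
    computed with the standard dynamic-programming table."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def listed_string_matches(listed, saca, threshold=0.75):
    """Treat `listed` as present in the SACA when their common
    subsequence covers at least `threshold` of the listed string."""
    return lcs_len(listed, saca) / len(listed) >= threshold
```

For the two example phrases, lcs_len returns 5 and the match ratio is 5/6, which exceeds the 75% threshold, so the listed string is considered present.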
Once we complete the string-matching step, we may fire the rules that receive the highest scores, applying the weighted k-Nearest-Neighbor (WkNN) [10] principle that we explain in section 4.2. Notice that if any forbidden string of a rule appears in the SACA, the rule will not be fired no matter what. For all other candidate rules, we accumulate the scores of the matched strings. During the accumulation, we allow each listed string to contribute only once to the scores so that duplicated strings will not bias the final decisions. Note that, although the syntax for our rules allows different weights for terms, we assign the same weights to all terms in our rules.

3.2 Case-based Method

The syntax of our cases is the same as that of our rules. The semantics is also similar, except that the ordering of strings in the fourth part of a case specification must be observed, and that the weights for terms may be different. The ordering represents the presumed sequence of events, and so is very important for decision-making using cases. As a result of this required ordering of events, we might not credit scores to all matched strings. The relative locations where the strings appear in the SACA determine whether a word will count.

4. INSTANCE-BASED LEARNING

Building rule and case databases by human experts may provide better classification quality, but the time costs may become unacceptable when we confront real-world problems. Since the judgment documents contain the descriptions of criminal activities and the resulting judges’ decisions, we can conduct supervised learning by extracting relevant information from the judgment documents for constructing case instances. We use these machine-generated case instances to classify the test lawsuits later.

4.1 Instance Generation

There are two alternatives for us to generate instances, and the main difference is whether we generate cases or rules.
Before going into the generation step, we remove part of the given text during the preprocessing step. Treating each character string that is separated by punctuation as a unit, we remove units that contain characters from two special sets. The first set is {年, 月, 日, 時, 分}, and the second set is {市, 縣, 路, 村, 里, 段, 巷, 弄, 號}. The first set carries information about time, and the second carries information about locations. The confirmed criminal activities for cases that involve only one defendant who committed only one crime typically occur during a short interval. Hence, the time stamps of the activities are not very important. We remove the location information for the same reason.

After the preprocessing, we remove commas and periods from the given text, use a dictionary to segment the character strings, and keep only words with two or more characters. Given the Chinese text that appeared in section 2.1, we create the character string in the first box of Fig. 2, and then create the word list in the second box. We cannot explain the meaning of these Chinese texts clearly here, but the passage still describes the main criminal activities, i.e., driving under the influence of alcohol.

在某KTV店內服用酒類 致其反應能力降低 已不能安全駕駛動力交通工具後 撞及自對向車道駛來 欲左轉站前路 由林OO所駕駛之Z3-5678號自用小客車 嗣經警方處理 對吳OO施以酒精測試 其測定值為0.八五MG/L 始循線查知上情

內服 反應 能力 降低 不能 安全駕駛 動力 交通工具 車道 左轉 駕駛 自用 客車 警方 處理 施以 酒精 測試 測定

Fig. 2. Generating an instance of the first type for C1.

The word “安全駕駛” is not listed in the original HowNet dictionary, but is included as a word in our domain-dependent lexicon, so it is identified as one word in our system. We interpret this list of keywords as an ordered list, and use it as a case instance for C1.

The second method for generating decision criteria is similar. We adopt all steps that lead to the generation of the word list. The only difference is that we interpret the word list as an unordered list when we apply these instances for classification.
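The preprocessing pipeline above can be sketched as follows, under stated assumptions: the punctuation set, the greedy longest-match segmenter, and the toy dictionary in the test are ours for illustration, whereas the paper uses HowNet plus a customized legal lexicon.

```python
import re

TIME_CHARS = set("年月日時分")              # units about time are dropped
LOCATION_CHARS = set("市縣路村里段巷弄號")  # units about locations are dropped

def preprocess(text, dictionary, max_word_len=4):
    # 1. Treat punctuation-separated strings as units; remove units
    #    that contain any time or location character.
    units = [u for u in re.split(r"[,,。、;;]", text)
             if u and not (set(u) & TIME_CHARS)
             and not (set(u) & LOCATION_CHARS)]
    # 2. Greedy longest-match segmentation, keeping words of 2+ characters.
    stream = "".join(units)
    words, i = [], 0
    while i < len(stream):
        for length in range(min(max_word_len, len(stream) - i), 1, -1):
            candidate = stream[i:i + length]
            if candidate in dictionary:
                words.append(candidate)
                i += length
                break
        else:
            i += 1  # unmatched single character is dropped
    return words
```

Greedy longest-match ("prefer the longer matches of strings") is the strategy the paper names for segmentation; the output word list becomes a case instance.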
We do not need to store repeated words when we compute the word list, if we would apply instances of this type only for classification. However, since we will merge instances as we discuss in section 5.1, we must keep the repeated words in instances for collecting statistics of word frequencies. Notice that instances of this type are just like the rules discussed in section 3.1, where we use a bag of keywords without concern for their order. Unlike the cases and rules described in section 3, we do not need to assign weights to the words in machine-generated instances. This is because we apply machine-generated instances using a variant of WkNN that we explain right away.

4.2 Instance Application

Given a set of machine-generated instances, we can apply the WkNN principle for classification. To this end, we must define a measure of “similarity” so that we can establish the notion of nearest neighbors of the test lawsuit. We define different similarity measures for the two types of machine-generated instances.

Instances of the first type are lists of ordered keywords. Hence, we convert the SACA of the test lawsuit into an ordered word list using the same method for generating instances of the first type. This ordered list, X, is then compared with each prior case, Y. Taking the order of keywords into consideration, the comparison finds the number of common keywords that appear in the same order in both X and Y. Let OCW be a function that returns the number of ordered, common words in its input word lists, and Counts be a function that returns the number of words in its input word list. OCW(X, Y)/Counts(X) signifies the similarity between the common substring and X, and OCW(X, Y)/Counts(Y) carries a similar interpretation. Since Counts(X) and Counts(Y) may be different, it would be arbitrary to choose only one of these ratios as the similarity.
Hence, we take their average, and the similarity between the test lawsuit and a prior case is defined as follows.

s1(X, Y) = (1/2) [OCW(X, Y)/Counts(X) + OCW(X, Y)/Counts(Y)]

We define the similarity measure for instances of the second type analogously. Let UCW be a function that will return the number of unordered common words in its input word lists. When the SACA of the test lawsuit, X, contains repeated words that also appear in the prior case, Y, the repeated words will be counted as many times as they appear in the SACA. In contrast, repeated words in Y are counted only once. The similarity measure for instances of the second type follows.

s2(X, Y) = (1/2) [UCW(X, Y)/Counts(X) + UCW(X, Y)/Counts(Y)]    (1)

Our WkNN method is a variant of the weighted nearest neighbor methods [10], and Thompson applies another variant of kNN for categorization of case law [24]. We do not set a static k when we use kNN for classification. Instead, we set a static threshold for dynamically determining this k. Only prior cases whose similarity measures are 0.3 or larger may cast votes on the class of the test lawsuit, and we assign the lawsuit to the class that receives the most votes. If more than one class ties for the most votes, we compute the total of the similarity scores of all voting cases in the competing classes. The class that has the highest total score wins.

Employing the minimal threshold for voting allows our systems to classify test lawsuits into C13, and affects the rejection rate that is explained in section 6.1. Had we set this threshold to 0, all prior cases would be allowed to vote on the crime type of the test lawsuit, and we would classify all test lawsuits into one of the 12 target crime types. Doing so would be unreasonable because we knowingly test our systems with lawsuits that do not belong to the 12 target classes.
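Under the definitions above, the two measures can be sketched in a few lines. Computing OCW as a longest common subsequence over word lists, and counting repeats in X but not in Y for UCW, is our illustrative reading of the text, not the authors' code.

```python
def ocw(x, y):
    """Number of ordered common words: the longest common
    subsequence of the two word lists."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x[i] == y[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def s1(x, y):
    """Similarity for instances of the first type (ordered word lists)."""
    c = ocw(x, y)
    return 0.5 * (c / len(x) + c / len(y))

def s2(x, y):
    """Similarity for instances of the second type (unordered lists);
    repeated words in the test list x count every time they appear."""
    shared = set(y)
    c = sum(1 for w in x if w in shared)
    return 0.5 * (c / len(x) + c / len(y))
```

Averaging the two ratios keeps the measure symmetric in spirit even when the two lists have different lengths, which is the motivation stated above.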
If no prior case is sufficiently similar to the test lawsuit, we should allow our systems to express this abnormality. We set the minimal threshold to prevent our systems from making random guesses. We selected the value of this threshold based on results obtained from an experiment that used the training cases as the test data.

5. CASE DATABASE REFINEMENT

One problem with our instance-generation methods is the monotonically growing size of the case database. If we keep adding new cases into the case database, then, without smarter ways for efficient case retrieval [21], classifying a new case takes a long time because of the case comparisons involved. We propose two alleviating methods in this section.

5.1 Instance Clustering

Clearly, some past lawsuits may be very similar to each other, and so are the case instances generated from them. Hence, not all of the generated case instances are required, and, if we can identify and keep those that are more representative than others, we may achieve the same accuracy of classification at reduced computational costs. To this end, we must define a measure of “similarity” for merging instances. Since our original instances are already tagged with the crime classes, i.e., the prosecution reasons, we just merge instances that belong to the same classes. We merge two instances, X and Y, using the Merge2Instances procedure, where π is a percentage threshold selected by the instance database manager.
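The threshold-based voting scheme described in section 4.2 can be sketched as below. The 0.3 cutoff and the tie-breaking by total similarity follow the text; `sim` stands for either s1 or s2, and all names are ours.

```python
from collections import defaultdict

def classify(test_words, prior_cases, sim, threshold=0.3):
    """prior_cases: iterable of (word_list, crime_class) pairs.
    Returns the winning class, or 'unknown' (C13) when no prior
    case reaches the similarity threshold."""
    votes = defaultdict(int)
    total_sim = defaultdict(float)
    for words, crime_class in prior_cases:
        s = sim(test_words, words)
        if s >= threshold:              # only similar-enough cases vote
            votes[crime_class] += 1
            total_sim[crime_class] += s
    if not votes:
        return "unknown"
    most = max(votes.values())
    tied = [c for c, v in votes.items() if v == most]
    # Ties are broken by the total similarity of the voting cases.
    return max(tied, key=lambda c: total_sim[c])
```

Because the number of voters varies with the test lawsuit, this is a threshold-determined k rather than a fixed k, exactly the dynamic-k behavior the text describes.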
Procedure Merge2Instances(X, Y)
  if ((Counts(OCW(X,Y)) ≥ π*Counts(X)) and (Counts(OCW(X,Y)) ≥ π*Counts(Y))) {
    Remove X and Y from the instance database;
    Add OCW(X,Y) into the instance database;
  }
  else if ((Counts(OCW(X,Y)) < π*Counts(X)) and (Counts(OCW(X,Y)) ≥ π*Counts(Y)))
    Remove Y from the instance database;
  else if ((Counts(OCW(X,Y)) ≥ π*Counts(X)) and (Counts(OCW(X,Y)) < π*Counts(Y)))
    Remove X from the instance database;

When X and Y are really similar to each other, their common part will constitute a significant portion of both. In this case, we replace both cases with their common portion. If the common portion exceeds a significant portion of just one of the original cases, say Y, we can remove Y and keep X, which still carries the common information.

To merge all instances in the same class, we temporarily assign an identification code to each case. Let IDB be the instance database that contains the instances being considered for merging, I[j] be the jth instance in the IDB, and S be the current number of instances in the IDB. We use the MergeAllInstances procedure to merge all instances in the IDB. Notice that S may decrease as we remove instances from the IDB when we call Merge2Instances.

Procedure MergeAllInstances(IDB)
  S = current size of the IDB
  do {
    for j = 1 to S-1
      for k = j+1 to S
        Merge2Instances(I[j], I[k]);
  } while (new cases were added to the IDB);

There are two alternatives for merging instances of the second type. We collect statistics about the keyword frequencies in the learned instances, and keep those keywords that are neither too frequent nor too infrequent. This method is reminiscent of the term frequency-inverse document frequency heuristic used in the information retrieval community [19]. Let n(i) be the number of instances of crime type Ci.
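A runnable sketch of the two procedures, with the bookkeeping made explicit: here OCW returns the common word list itself (as the pseudocode's "Add OCW(X, Y)" implies), the instance database for one class is a plain Python list, and the do-while loop is rendered as a restart-on-change loop. These representation choices are ours; the branch conditions and π are from the pseudocode.

```python
def ocw_words(x, y):
    """Longest common subsequence of two word lists, as a list."""
    m, n = len(x), len(y)
    dp = [[[] for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if x[i] == y[j]:
                dp[i + 1][j + 1] = dp[i][j] + [x[i]]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[m][n]

def merge_all_instances(instances, pi):
    """Merge same-class instances until no pair can be merged."""
    db = [list(inst) for inst in instances]
    changed = True
    while changed:
        changed = False
        for j in range(len(db) - 1):
            for k in range(j + 1, len(db)):
                common = ocw_words(db[j], db[k])
                big_j = len(common) >= pi * len(db[j])
                big_k = len(common) >= pi * len(db[k])
                if big_j and big_k:   # replace both with the common part
                    db[j] = common
                    del db[k]
                elif big_k:           # common covers most of k only: drop k
                    del db[k]
                elif big_j:           # common covers most of j only: drop j
                    del db[j]
                else:
                    continue
                changed = True
                break                 # indices shifted; restart the scan
            if changed:
                break
    return db
```

Restarting the pairwise scan after every merge is a simple, safe way to honor the pseudocode's requirement that S may shrink during the loop.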
Also, we assume that each distinct keyword contained in these n(i) instances is assigned an ID number, and that k(m, i) is the total number of times the mth distinct keyword appears in these n(i) instances. We define the average occurrence frequency (AOF) of the mth keyword in cases of class Ci by AOF(m, i) = k(m, i)/n(i). We merge all instances of Ci into just one instance that contains the keywords that meet the following criterion.

0.5 ≤ AOF(m, i) ≤ 1.5

Since our current study covers only the 12 crime types, we will have one prototypical instance for each of the crime types. Notice that this is almost as if, for each crime type, we mechanically generated one rule using the syntax and semantics defined in section 3.1, except that we do not generate the rule threshold and the term weights.

An alternative to this very extreme strategy is to take back some of the removed instances with the help of the 12 typical instances. We recover the instances whose keyword lists cover no more than 50% of the keywords of the typical instance of the same class. We then use the typical and recovered instances in the experiments.

5.2 Instance Refinement

Removing irrelevant keywords from case instances provides another chance for improving system efficiency. Using the AOF values just defined, we remove the gth keyword from case instances of Ci when it meets the following two conditions, for a selected τ: (1) AOF(g, i) < τ, and (2) the gth keyword appears in case instances of Cj, for some j ≠ i. In other words, we remove keywords that are both relatively infrequent and may suggest multiple prosecution reasons. Using a bigger τ allows us to remove keywords from the case instances more aggressively. We will set τ to several values, i.e., 0, 0.5, 1, 1.5, and 2, in our experiments to get more insight into this intuition.

We can combine the applications of clustering cases and removing irrelevant keywords in different ways.
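The AOF-based prototype construction and the τ-based keyword removal amount to the following sketch; the data layout (a dict from class label to that class's instances) and all names are our assumptions for illustration.

```python
from collections import Counter

def aof_table(instances):
    """AOF(m, i) for one class: total occurrences of each keyword
    divided by the number of instances of that class."""
    counts = Counter(w for inst in instances for w in inst)
    n = len(instances)
    return {w: k / n for w, k in counts.items()}

def prototype(instances, lo=0.5, hi=1.5):
    """Merge a class's instances into one keyword list whose
    keywords satisfy lo <= AOF <= hi."""
    return sorted(w for w, f in aof_table(instances).items() if lo <= f <= hi)

def refine(db_by_class, tau):
    """Remove keywords with AOF < tau that also occur in another class."""
    aofs = {c: aof_table(insts) for c, insts in db_by_class.items()}
    refined = {}
    for c, insts in db_by_class.items():
        elsewhere = set()
        for d, table in aofs.items():
            if d != c:
                elsewhere |= table.keys()
        refined[c] = [[w for w in inst
                       if not (aofs[c][w] < tau and w in elsewhere)]
                      for inst in insts]
    return refined
```

Note that `refine` computes the AOF tables once over the input database; as the following paragraph discusses, computing them before or after clustering are both defensible choices.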
Removing irrelevant keywords may occur before or after merging similar cases. In addition, when we remove keywords after merging cases, the timing of calculating the AOF values influences the performance of the resulting system. We can calculate the AOF values before or after instance clustering takes place. Calculating AOFs before merging cases gives us AOFs that will not be affected by the clustering operations, and calculating AOFs afterwards makes the AOF values sensitive to the results of the clustering operations. One may argue that the former alternative is more reasonable because one should determine the relevancy of keywords using their frequencies in the original data. Others may argue otherwise because multiple occurrences of keywords in similar cases should not count, and the relevancy should be determined dynamically based on the merged cases. We will compare the performances of these alternatives experimentally.

6. EXPERIMENTAL RESULTS

In the preceding sections, we discussed several possible decision factors for constructing a CBR-like system for the classification task. We summarize these factors, and show the structure of our experiments in Fig. 3. The first factor is whether we use rules or cases, and the second is whether we build the rules and cases by human experts or algorithms. If we algorithmically generate cases, we also segment Chinese character strings into word lists. The main difference between cases and rules is whether the orders of the words matter in applications. We also have to consider how to merge the instances. Specifically, we must determine the percentage threshold, π, in Merge2Instances. Similarly, we have picked two particular ways to merge the keyword-based rules in section 5.1. Once we have a case or a rule database, we may apply the WkNN principle for classification. Rectangles in Fig.
3 show the distinguishing features of the experiments, and ovals show the identification codes of the experiments that we use when we discuss the experimental results. In E4, E5, E6, E7, E8, and E9, we set π to 0.7, 0.6, 0.5, 0.4, 0.3, and 0.2, respectively. We evaluated the generated instances by applying them to classifying test lawsuits. In the experiments, we let the classifiers choose to put the test lawsuit into the “unknown” class, which represented the fact that the classifiers knew that the test lawsuit did not belong to any of the 12 target crime types.

[Fig. 3. Structure of our experiments: a tree of design factors (keywords vs. cases; human-provided vs. machine-generated; with vs. without segmentation and merging; one rule vs. many rules; π = 0.7 to 0.2 in steps of -0.1) whose leaves are the experiment codes E1, E2, E3, E4-E9, E10, E11, and E12.]

6.1 Performance Measures

The evaluation was based on the standard definitions of precision and recall [19]. For each crime type Ci, i = 1, 2, …, 13, we computed the precision and recall rates, pi and ri, respectively. Upon collecting the raw experimental results, we calculated these performance measures with the help of confusion matrices [19].

pi = (number of test cases correctly classified as Ci) / (total number of test cases classified as Ci);
ri = (number of test cases correctly classified as Ci) / (number of test cases that actually belong to Ci)

One may consider precisions more important than recalls, or vice versa. If the classifiers put a lawsuit in the “unknown” class, human experts could classify these unclassified cases. Should the classifiers misclassify a lawsuit, human intervention would be required to detect and correct the errors. We can combine and weight precisions and recalls using the F measure. The F measure, fi(β) = ((β² + 1)piri)/(β²pi + ri) [19], for each crime type Ci, is a nonlinear combination of the corresponding precision and recall.
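The per-class measures can be computed from a confusion matrix as sketched below; the nested-dict layout and function name are assumptions for illustration, while the precision, recall, and F-measure formulas are the ones defined above.

```python
def class_measures(confusion, target, beta=1.0):
    """confusion[actual][predicted] holds counts over one class set.
    Returns (precision, recall, F) for the target class."""
    classes = list(confusion)
    tp = confusion[target][target]
    n_predicted = sum(confusion[a][target] for a in classes)
    n_actual = sum(confusion[target][p] for p in classes)
    precision = tp / n_predicted if n_predicted else 0.0
    recall = tp / n_actual if n_actual else 0.0
    f = ((beta ** 2 + 1) * precision * recall /
         (beta ** 2 * precision + recall)) if precision + recall else 0.0
    return precision, recall, f
```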
When β is 1, pi and ri are weighted equally; when β approaches infinity and zero, fi approaches ri and pi, respectively. In the following reports, we set β to 1.

Table 2 shows the precisions and recalls when we used human-provided cases and rules. Notice that we do not include the “%” sign for precisions and recalls in this table. The left portion of the EXPID column indicates whether the statistics are for cases or rules by the codes used in Fig. 3, and the right portion indicates whether the statistics are precisions (P) or recalls (R).

Table 2. Detailed performance statistics for rules and cases.

EXPID   C1   C2   C3   C4   C5   C6   C7   C8   C9  C10  C11  C12  C13
E1-P   100  100   94   67   98   81  100  100   99  100   98   88   76
E2-P   100   97   90  100   99   25  100  100   98   50  100   83   52
E1-R    94   93   98   91   97   83   47   85   96   78  100   91   88
E2-R    83   77   84   48   59   96   53   90   86  100  100   87   68

When comparing two approaches for constructing classifiers, we found that a classifier seldom outperformed the other for all crime types, although one may outperform the other most of the time. Data in Table 2 support this observation. Comparisons between E1 and E2 suggest that cases are better than rules for classifying most crime types. In addition, because descriptions of some crime types, e.g., C5, C6, and C7, can be difficult to tell apart using general rules or cases, the current version of human-provided cases and rules did not perform very well in all crime types.

For this reason, and due to the page limits on this paper, we will not examine the precision, recall, and F measures for individual crime types henceforth. Instead, we analyze their averages. We report the arithmetic average AP and the weighted average WP of precisions. When computing WP, we weight each individual precision pi by the number of cases, ci, that belong to the ith crime type in the test data. The case quantities are shown in Table 1. We define the other averages, i.e., AR, WR, AF, and WF, analogously.
AP = (1/13) Σ(i=1 to 13) pi ;  WP = Σ(i=1 to 13) ci pi / Σ(j=1 to 13) cj

Using the arithmetic averages is tantamount to saying that all crime types are equally important; the relative frequencies of the crime types do not matter. In contrast, using the weighted averages presumes that all individual lawsuits are equally important.

To provide a different aspect of evaluation, we also use the rejection rate and the correct rate as performance measures. Confining the test lawsuits to those that belong to the 12 target types, we define the rejection rate (RR) as the portion of test lawsuits that the classifiers decline to classify. The correct rate (CR) is the portion of correctly classified test lawsuits among those test lawsuits that are classified.

6.2 Merging Similar Cases

Table 3 shows the statistics for the experiments that are shown in Fig. 3. The SIZE column shows the number of cases or rules used in the corresponding classifiers. The parenthesized numbers in the EXPID column show the values of the parameter π used in Merge2Instances. Hence, the sizes decrease with decreasing π.

Table 3. Experimental results for lawsuit classification.
EXPID     SIZE  AP  AR  AF  WP  WR  WF  CR  RR
E1        24    92  88  89  94  93  93  98  4
E2        12    84  79  78  88  76  79  86  10
E3        428   84  79  79  87  87  86  92  3
E4 (0.7)  282   86  80  81  90  88  88  94  5
E5 (0.6)  214   84  80  81  89  88  88  95  6
E6 (0.5)  133   83  74  76  88  83  84  96  14
E7 (0.4)  91    84  63  68  86  77  78  95  21
E8 (0.3)  47    73  53  56  80  56  60  92  44
E9 (0.2)  18    59  45  43  73  42  42  79  56
E10       428   75  68  67  80  80  76  88  1
E11       337   63  66  62  75  79  74  86  1
E12       12    62  63  56  76  61  58  64  5

The weighted averages make the performances of all classifiers look better than the arithmetic averages do. This is because, most of the time, the classifiers achieved higher performances on crime types that include relatively more lawsuits for instance learning and testing. Overall, classifiers that used cases outperformed classifiers that used rules.
This is supported by the statistics of E1 and E2, where human experts provided the cases and rules. Also, despite the detailed differences between experiments E3 through E9 and experiments E10 through E12, the former group, which used machine-generated cases, achieved better performances than the latter group, which used machine-generated rules. The statistics for E9 and E12 suggest that only when we merged machine-generated cases aggressively might the machine-generated rules offer better performances. Statistics for E1 and for E3 through E9 suggest that classifiers that used human-provided cases outperformed classifiers that used machine-generated cases, but statistics for E2 and for E10 through E12 are not conclusive: AF and WF suggest that human-provided rules beat machine-generated rules, but CR and RR suggest otherwise.

Merging the cases does not always reduce the accuracy, owing to the existence of similar prior lawsuits. The performances in E5 and E11 were similar to those in E3 and E10, respectively, although we cut the sizes of the databases by about a half and a quarter, respectively. Using an appropriate π, our algorithm was able to construct case instances that offered satisfactory performance. The results therefore suggest that our algorithms are viable for assisting the task of case-instance construction.

When we merged cases more and more aggressively, the performances degraded almost monotonically from E3 through E9. When we set π to extremely low, and impractical, values, i.e., 0.2 and 0.3, we obtained approximately as many cases as the human-provided case database contained. The differences in the performance measures shed light on the range of difference between human experts and a fully automated case generation mechanism. In terms of the averages of precisions, recalls, and F measures, the differences are 40% (E1 vs. E9) for cases and 18% (E2 vs. E12) for rules.
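To make the role of π concrete, here is a minimal sketch of threshold-based case merging. The Jaccard similarity and the union-style merge are simplifying assumptions for illustration, not the actual Merge2Instances procedure, and the keyword sets are invented:

```python
# Illustrative sketch only: merge keyword-set case instances of one crime type
# whenever their similarity reaches the threshold pi. The Jaccard measure and
# union-style merging are assumptions, not the paper's exact algorithm.
def jaccard(a, b):
    # similarity of two keyword sets: |intersection| / |union|
    return len(a & b) / len(a | b)

def merge_cases(cases, pi):
    merged = []
    for kws in cases:
        for i, existing in enumerate(merged):
            if jaccard(kws, existing) >= pi:
                merged[i] = existing | kws  # fold the new case into a similar one
                break
        else:
            merged.append(set(kws))         # no similar case yet; keep as new instance
    return merged

cases = [{"theft", "vehicle"}, {"theft", "vehicle", "night"}, {"forgery", "seal"}]
db = merge_cases(cases, pi=0.5)
```

Under this reading, lowering π merges more aggressively and shrinks the case base, consistent with Table 3, where the number of cases falls from 282 at π = 0.7 to 18 at π = 0.2.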
6.3 Removing Irrelevant Terms

As described in section 5.2, there are three different integrations of merging similar cases and removing irrelevant keywords. We conducted experiments for these setups and recorded the statistics of the results. The charts in Fig. 4 show the trends of the WF measures for different setups of π and τ in the three integrations. Specifically, we set π to 1, 0.7, 0.5, and 0.3, and τ to 0, 0.5, 1.0, 1.5, and 2. Recall that we did not remove keywords when we set τ to 0; similarly, we did not merge cases when we set π to 1. In each chart, the title shows how we combined the operations of merging cases and removing keywords, the vertical axis shows the WF values, and the abscissa shows the π parameter in Merge2Instances. The notes in the parenthesized titles, i.e., before and after, indicate whether we computed the AOF values before or after we merged similar cases. The charts share the legend for the values of τ, which is contained in the middle chart, and each curve shows the trend of the WF values when we set τ to a particular value.

[Fig. 4. Performance in WF measures for the three integrations of removing irrelevant keywords and merging similar cases. Panels: the two Merge->Remove integrations (AOF computed before and after merging) and the Remove->Merge integration; the abscissa shows π decreasing from 1 to 0.2, and the vertical axes show WF from 25% to 95%.]

The charts in Fig. 4 show that performance degraded as we progressively merged similar cases and removed irrelevant keywords. All curves move from the upper left corner toward the lower right corner. The curves marked with solid squares show the results of not removing any keywords in the experiments, and can serve as a standard for comparing the effects of removing irrelevant keywords. Comparisons of the curves in the left two charts with those in the rightmost chart in Fig.
4 suggest that merging similar cases before removing irrelevant keywords provided better performance. A careful examination of these curves reveals more interesting trends. A comparison between the leftmost and the middle charts shows that, given that we decided to remove irrelevant keywords after merging similar cases, calculating the AOF values before we merged cases offered better performance, while these alternatives show similar rejection rates in the corresponding charts in Fig. 5. AOF values calculated from the original cases seem to provide better term frequencies for the lawsuits.

[Fig. 5. Performance in rejection rates for the three integrations. Panels: the two Merge->Remove integrations (AOF computed before and after merging) and the Remove->Merge integration; the abscissa shows π decreasing from 1 to 0.2, the vertical axes show rejection rates from 0% to 90%, and the curves correspond to τ values of 0.0, 0.5, 1.0, 1.5, and 2.0.]

The curve marked with solid diamonds in the leftmost chart in Fig. 4 suggests that, at appropriate levels of π and τ, it is possible to simplify the original case databases without significantly sacrificing performance. The charts in Fig. 5 further suggest that merging similar cases before removing keywords offers better performances at relatively lower rejection rates.

Correct and rejection rates provide a different aspect for evaluating the integrations. The statistics in Table 4 show that the correct rates for most setups are above 90%. It may be surprising that the classifiers maintained the high correct rates even when we aggressively removed irrelevant keywords and merged similar cases. We believe that this is because, at small π and large τ, the resulting case instances become conservatively typical. Consequently, we obtained high correct and rejection rates simultaneously. Because of the high correct rates, we focus on the rejection rates for the different setups, and show the trends in Fig. 5.

The number of surviving cases decreased as we decreased π.
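To illustrate how a threshold like τ can control keyword removal, the sketch below drops keywords whose frequency score falls below τ. The scoring function (average occurrences per case within one crime type) is only a stand-in for the AOF values, and the keywords are invented:

```python
# Schematic sketch: prune keywords whose score falls below tau. The score used
# here (average occurrences per case of the crime type) merely stands in for
# the paper's AOF values; it is an assumption for illustration.
from collections import Counter

def keyword_scores(cases):
    # cases: list of keyword lists belonging to one crime type
    totals = Counter()
    for kws in cases:
        totals.update(kws)
    return {w: totals[w] / len(cases) for w in totals}

def remove_keywords(cases, tau):
    scores = keyword_scores(cases)
    # tau = 0 keeps every keyword, matching the "no removal" setting
    return [[w for w in kws if scores[w] >= tau] for kws in cases]

cases = [["theft", "vehicle", "alley"], ["theft", "vehicle"], ["theft", "night"]]
pruned = remove_keywords(cases, tau=1.0)
```

As in our experiments, τ = 0 removes nothing, while larger τ prunes more keywords and thus shrinks the stored cases even when the number of cases stays the same.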
Similarly, we reduced the total storage space for the cases as we increased τ. Comparing the statistics for π = 0.7 and τ = 0.5 with the statistics for π = 1 and τ = 0, we reduced the number of cases and the storage space by about one half. Unfortunately, we cannot include the detailed statistics due to page limits. This desirable result should not surprise us, because the instance-generation method, discussed in section 4.1, is far from perfect: it inevitably includes unimportant keywords in the cases, and constructs redundant cases for similar lawsuits. Our refinement strategies remedy these shortcomings of the generation procedure.

Table 4. Correct and rejection rates for combining the merging (M) and removing (R) methods.
            M->R (before)   M->R (after)    R->M
τ    π      CR    RR        CR    RR        CR    RR
0    1      92    3         92    3         92    3
0    0.7    94    5         94    5         94    5
0    0.5    96    14        96    14        96    14
0    0.3    92    44        92    44        92    44
0.5  1      94    4         94    4         94    4
0.5  0.7    94    4         93    5         95    6
0.5  0.5    96    10        95    13        94    21
0.5  0.3    93    31        91    44        93    60
1.0  1      91    7         91    7         91    7
1.0  0.7    90    9         88    8         90    12
1.0  0.5    82    17        91    19        96    39
1.0  0.3    70    38        76    46        77    55
1.5  1      96    12        96    12        96    12
1.5  0.7    96    17        96    18        97    23
1.5  0.5    100   53        98    48        98    58
1.5  0.3    96    68        95    72        100   76
2.0  1      99    22        99    22        99    22
2.0  0.7    98    27        88    28        99    38
2.0  0.5    100   53        98    55        99    63
2.0  0.3    100   77        97    74        98    84

Notice that, when we increased τ and kept π fixed, the number of surviving cases did not necessarily decrease when we removed keywords before merging similar cases. Removing keywords may make originally similar cases dissimilar, thereby making case clustering less likely. Nevertheless, we reduced the sizes of the surviving cases, and thus reduced the total storage space. The preceding analyses indicate that removing irrelevant keywords and merging similar cases together offer a viable way to refine automatically generated case instances for lawsuit classification problems.
7. CONCLUDING REMARKS

We report experimental results and analyses from applying our algorithms to the construction of a CBR-like system for lawsuit classification problems. The proposed algorithms generate and refine case instances from judgment documents. Experimental results indicate that, with appropriate parameters, our algorithms are able to construct case instances that offer competitive performance for lawsuit classification. Unlike GANIC [22], which attempts to find the best combination of factors from a bag of candidates by genetic algorithms, our work attempts to identify distinguishing words directly from the raw documents, and removes relatively irrelevant keywords.

The precision and recall measures achieved by the current system are better than those achieved by Thompson's [24]. We believe that this is partially due to the fact that, at this moment, we consider lawsuits that have only one defendant who committed only one crime. Because of this constraint, we allow ourselves to choose only one prosecution reason in the experiments, and this may contribute to the high classification accuracy for some crime types. However, exactly this constraint may also mislead our systems. The prosecution reasons are not exclusively applicable to a single criminal activity; there is not always a clean cut between the descriptions for C5, C6, and C7 (or C3 and C10), for instance. Prosecutors may use the same words to describe these crime types, and thus inevitably mislead our classifiers. When such ambiguities occur, our classifiers are forced to classify the lawsuits into a category that might not agree with the judges' decisions. We plan to further expand the sizes of the training and test data and the coverage of prosecution reasons in the future. We are also comparing more complex formalisms for representing case instances.
Counting duplicated words only once in Counts(Y) in (1) in E12 was an arbitrary design decision. Multiple occurrences of a keyword may signal an important clue to the judgment of the test lawsuit, so assigning higher weights to such keywords may be more appropriate. The problem is how much weight should be granted to these keywords so that their repeated occurrences receive appropriate recognition while repeated keywords are not allowed to dominate the final judgment. This selection of weights is entangled with the consideration of the length of the CASA, and is a topic for our future study.

The reported results support our hope to apply our classifiers as a stepping stone toward assisting the drafting or proofreading of judgment documents. Experience shows that, in Taiwan, about 2% of judgment documents are not well written, in that the cited criminal activities do not perfectly support the criminal law articles that are applied against the defendants. This problem can be alleviated if we proofread the judgment documents. Our techniques should be helpful for assisting human experts to build such a proofreading system, particularly after we expand the current system to process lawsuits that involve multiple defendants who cooperatively commit one or more crimes.

Our generation and refinement of the cases depend only on lexical information, and leave much room for improvement. One possibility is to index cases with the help of summarization techniques [21], and we are ready to apply our experience to this field [26]. Adding more reasoning capability for inferring the contextual information in the lawsuits will also help [11]. In addition, we believe that the lexical information contained in thesauri can be helpful for processing Chinese legal documents [7].

ACKNOWLEDGMENTS

We thank several anonymous reviewers for their invaluable comments on both the writing and technical problems in previous versions of this paper.
This research was supported in part by Grants NSC-91-2213-E-004-013 and NSC-92-2213-004-004 from the National Science Council of Taiwan.

REFERENCES

1. Some active sites, as of 12 September 2003: The Judicial Yuan (http://wjirs.judicial.gov.tw/jirs/); The Legislative Yuan (http://www.ly.gov.tw/); The Ministry of Justice (http://www.moj.gov.tw/); Lawbank Inc. (http://www.lawbank.com.tw/); http://www.ordos.nm.cn/haoxia/navigation/zhengfa.htm.
2. V. Aleven, Teaching Case-based Argumentation Through a Model and Examples, Ph.D. Dissertation, Intelligent Systems Program, University of Pittsburgh, U.S.A.
3. L. K. Branting, J. Lester, and C. Callaway, "Automating judicial document drafting: a discourse-based approach," Artificial Intelligence and Law, Vol. 6, 1998, pp. 111-149.
4. G. Brown, "CHINATAX: exploring isomorphism with Chinese law," in Proceedings of the 4th International Conference on Artificial Intelligence and Law, 1993, pp. 175-179.
5. S. Brüninghaus and K. D. Ashley, "Toward adding knowledge to learning algorithms for indexing legal cases," in Proceedings of the 7th International Conference on Artificial Intelligence and Law, 1999, pp. 9-17.
6. S. Brüninghaus and K. D. Ashley, "Improving the representation of legal case texts with information extraction methods," in Proceedings of the 8th International Conference on Artificial Intelligence and Law, 2001, pp. 42-51.
7. A. Cammelli and F. Socci, "A thesaurus for improving information retrieval in an integrated legal expert system," in Proceedings of the 9th International Workshop on Database and Expert Systems Applications, 1998, pp. 619-624.
8. C. T. Chang, J. H. Ho, and C. L. Liu, "Decision support for criminal summary judgment," in Proceedings of the TAAI 7th Conference on AI and Applications, 2002, pp. 178-183.
9. C. S. Chen, "Legal informatics in Germany," China Law Journal, Vol. 46, 2001, pp. 79-85.
10. S. Cost and S.
Salzberg, "A weighted nearest neighbor algorithm for learning with symbolic features," Machine Learning, Vol. 10, 1993, pp. 57-78.
11. C. Hafner and D. Berman, "The role of context in case-based legal reasoning: teleological, temporal and procedural," Artificial Intelligence and Law, Vol. 10, 2002, pp. 19-64.
12. D. S. Hirschberg, "Algorithms for the longest common subsequence problem," Journal of the ACM, Vol. 24, 1977, pp. 664-675.
13. HowNet, http://www.keenage.com.
14. T. B. Y. Lai and C. Huang, "Dependency-based syntactic analysis of Chinese and annotation of parsed corpus," in Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, 2000, pp. 255-262.
15. D. B. Leake (ed.), Case-based Reasoning, MIT Press, 1996.
16. J. Legrand, "A contribution to indexing in legal information retrieval," in Proceedings of the 8th International Workshop on Database and Expert Systems Applications, 1997, pp. 213-218.
17. C. L. Liu and C. T. Chang, "Some case-refinement strategies for case-based criminal summary judgments," Lecture Notes in Artificial Intelligence 2871, 2003, pp. 285-291.
18. C. L. Liu, C. T. Chang, and J. H. Ho, "Classification and clustering for case-based criminal summary judgments," in Proceedings of the 9th International Conference on Artificial Intelligence and Law, 2003, pp. 252-261.
19. C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
20. D. Merkl, E. Schweighofer, and W. Winiwarter, "Exploratory analysis of concept and document spaces with connectionist networks," Artificial Intelligence and Law, Vol. 7, 1999, pp. 185-209.
21. M. F. Moens, C. Uyttendaele, and J. Dumortier, "Abstracting of legal cases: the SALOMON experience," in Proceedings of the 6th International Conference on Artificial Intelligence and Law, 1997, pp. 114-122.
22. A. S.
Pannu, "Using genetic algorithms to inductively reason with cases in the legal domain," in Proceedings of the 5th International Conference on Artificial Intelligence and Law, 1995, pp. 175-184.
23. U. J. Schild, "Intelligent computer systems for criminal sentencing," in Proceedings of the 5th International Conference on Artificial Intelligence and Law, 1995, pp. 229-238.
24. P. Thompson, "Automatic categorization of case law," in Proceedings of the 8th International Conference on Artificial Intelligence and Law, 2001, pp. 70-77.
25. R. Weber, "Intelligent jurisprudence research: a new concept," in Proceedings of the 7th International Conference on Artificial Intelligence and Law, 1999, pp. 164-172.
26. C. W. Wu and C. L. Liu, "Ontology-based text summarization for business news articles," in Proceedings of the ISCA 18th International Conference on Computers and Their Applications, 2003, pp. 389-392.

Chao-Lin Liu (劉昭麟) serves as a faculty member in the Department of Computer Science of the National Chengchi University (NCCU); his research interests include probabilistic reasoning, machine learning, natural language processing, and intelligent transportation systems. http://www.cs.nccu.edu.tw/~chaolin.

Cheng-Tsung Chang (張正宗) received his M.S. degree in Computer Science from the National Chengchi University (NCCU) in 2003, and is serving his compulsory military service in Taiwan.

Jim-How Ho (何君豪) received his doctoral degree in Law from the National Chengchi University (NCCU) in 2000, and is pursuing his M.S. degree in Computer Science. He also serves as a judge at the Pan-Chiao District Court.