Proceedings of the 23rd International. Conference on Computational Linguistics,. Beijing, China, August 2010. Learning to Predict Readability using Diverse Linguistic Features Rohit J. Kate1 Xiaoqiang Luo2 Siddharth Patwardhan2 Martin Franz2 Radu Florian2 Raymond J. Mooney1 Salim Roukos2 Chris Welty2 1 Department of Computer Science The University of Texas at Austin {rjkate,mooney}@cs.utexas.edu 2 IBM Watson Research Center {xiaoluo,spatward,franzm,raduf,roukos,welty}@us.ibm.com Abstract retrieved documents thus improving user satisfac- tion. Readability judgements can also be used In this paper we consider the problem of for automatically grading essays, selecting in- building a system to predict readability structional reading materials, etc. If documents of natural-language documents. Our sys- are generated by machines, such as summariza- tem is trained using diverse features based tion or machine translation systems, then they are on syntax and language models which are prone to be less readable. In such cases, a read- generally indicative of readability. The ability measure can be used to automatically fil- experimental results on a dataset of docu- ter out documents which have poor readability. ments from a mix of genres show that the Even when the intended consumers of text are predictions of the learned system are more machines, for example, information extraction or accurate than the predictions of naive hu- knowledge extraction systems, a readability mea- man judges when compared against the sure can be used to filter out documents of poor predictions of linguistically-trained expert readability so that the machine readers will not ex- human judges. The experiments also com- tract incorrect information because of ambiguity pare the performances of different learn- or lack of clarity in the documents. ing algorithms and different types of fea- ture sets when used for predicting read- As part of the DARPA Machine Reading Pro- ability. gram (MRP), an evaluation was designed and con- ducted for the task of rating documents for read- 1 Introduction ability. In this evaluation, 540 documents were rated for readability by both experts and novice An important aspect of a document is whether it human subjects. Systems were evaluated based on is easily processed and understood by a human whether they were able to match expert readabil- reader as intended by its writer, this is termed ity ratings better than novice raters. Our system as the document’s readability. Readability in- learns to match expert readability ratings by em- volves many aspects including grammaticality, ploying regression over a set of diverse linguistic conciseness, clarity, and lack of ambiguity. Teach- features that were deemed potentially relevant to ers, journalists, editors, and other professionals readability. Our results demonstrate that a rich routinely make judgements on the readability of combination of features from syntactic parsers, documents. We explore the task of learning to language models, as well as lexical statistics all automatically judge the readability of natural- contribute to accurately predicting expert human language documents. readability judgements. We have also considered In a variety of applications it would be useful to the effect of different genres in predicting read- be able to automate readability judgements. 
For ability and how the genre-specific language mod- example, the results of a web-search can be or- els can be exploited to improve the readability pre- dered taking into account the readability of the dictions. 2 Related Work task more general. The data includes human writ- ten as well as machine generated documents. The There is a significant amount of published work task and the data has been set this way because it on a related problem: predicting the reading diffi- is aimed at filtering out documents of poor quality culty of documents, typically, as the school grade- for later processing, like for extracting machine- level of the reader from grade 1 to 12. Some early processable knowledge from them. Extracting methods measure simple characteristics of docu- knowledge from openly found text, such as from ments like average sentence length, average num- the internet, is becoming popular but the quality ber of syllables per word, etc. and combine them of text found “in the wild”, like found through using a linear formula to predict the grade level of searching the internet, vary considerably in qual- a document, for example FOG (Gunning, 1952), ity and genre. If the text is of poor readability then SMOG (McLaughlin, 1969) and Flesh-Kincaid it is likely to lead to extraction errors and more (Kincaid et al., 1975) metrics. These methods problems downstream. If the readers are going do not take into account the content of the doc- to be humans instead of machines, then also it is uments. Some later methods use pre-determined best to filter out poorly written documents. Hence lists of words to determine the grade level of a identifying readability of general text documents document, for example the Lexile measure (Sten- coming from various sources and genres is an im- ner et al., 1988), the Fry Short Passage measure portant task. We are not aware of any other work (Fry, 1990) and the Revised Dale-Chall formula which has considered such a task. (Chall and Dale, 1995). The word lists these Secondly, we note that all of the above ap- methods use may be thought of as very simple proaches that use language models train a lan- language models. More recently, language mod- guage model for each difficulty level using the els have been used for predicting the grade level training data for that level. However, since the of documents. Si and Callan (2001) and Collins- amount of training data annotated with levels Thompson and Callan (2004) train unigram lan- is limited, they can not train higher-order lan- guage models to predict grade levels of docu- guage models, and most just use unigram models. ments. In addition to language models, Heilman In contrast, we employ more powerful language et al. (2007) and Schwarm and Ostendorf (2005) models trained on large quantities of generic text also use some syntactic features to estimate the (which is not from the training data for readabil- grade level of texts. ity) and use various features obtained from these Pitler and Nenkova (2008) consider a differ- language models to predict readability. Thirdly, ent task of predicting text quality for an educated we use a more sophisticated combination of lin- adult audience. Their system predicts readabil- guistic features derived from various syntactic ity of texts from Wall Street Journal using lex- parsers and language models than any previous ical, syntactic and discourse features. Kanungo work. We also present ablation results for differ- and Orr (2009) consider the task of predicting ent sets of features. 
Fourthly, given that the doc- readability of web summary snippets produced by uments in our data are not from a particular genre search engines. Using simple surface level fea- but from a mix of genres, we also train genre- tures like the number of characters and syllables specific language models and show that including per word, capitalization, punctuation, ellipses etc. these as features improves readability predictions. they train a regression model to predict readability Finally, we also show comparison between var- values. ious machine learning algorithms for predicting Our work differs from this previous research in readability, none of the previous work compared several ways. Firstly, the task we have consid- learning algorithms. ered is different, we predict the readability of gen- 3 Readability Data eral documents, not their grade level. The doc- uments in our data are also not from any single The readability data was collected and re- domain, genre or reader group, which makes our leased by LDC. The documents were collected from the following diverse sources or genres: ability of documents by training on expert hu- newswire/newspaper text, weblogs, newsgroup man judgements of readability. The evaluation posts, manual transcripts, machine translation out- was then designed to compare how well machine put, closed-caption transcripts and Wikipedia arti- and naive human judges predict expert human cles. Documents for newswire, machine transla- judgements. In order to make the machine’s pre- tion and closed captioned genres were collected dicted score comparable to a human judge’s score automatically by first forming a candidate pool (details about our evaluation metrics are in Sec- from a single collection stream and then randomly tion 6.1), we also restricted the machine scores to selecting documents. Documents for weblogs, integers. Hence, the task is to predict an integer newsgroups and manual transcripts were also col- score from 1 to 5 that measures the readability of lected in the same way but were then reviewed the document. by humans to make sure they were not simply This task could be modeled as a multi-class spam articles or something objectionable. The classification problem treating each integer score Wikipedia articles were collected manually, by as a separate class, as done in some of the previ- searching through a data archive or the live web, ous work (Si and Callan, 2001; Collins-Thompson using keyword and other search techniques. Note and Callan, 2004). However, since the classes that the information about genres of the docu- are numerical and not unrelated (for example, the ments is not available during testing and hence score 2 is in between scores 1 and 3), we de- was not used when training our readability model. cided to model the task as a regression problem A total of 540 documents were collected in this and then round the predicted score to obtain the way which were uniformly distributed across the closest integer value. Preliminary results verified seven genres. Each document was then judged that regression performed better than classifica- for its readability by eight expert human judges. tion. Heilman et al. (2008) also found that it These expert judges are native English speakers is better to treat the readability scores as ordinal who are language professionals and who have than as nominal. 
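As an illustration of this modeling choice, the following sketch (Python with scikit-learn standing in for the Weka regressors actually used; the feature vectors and the averaged expert scores are placeholders) trains a regression model and then rounds its output to the closest integer score between 1 and 5.

```python
# Minimal sketch: regression over document features, with the prediction
# rounded to the closest integer readability score and clipped to [1, 5].
# scikit-learn's bagged regression trees stand in for the Weka regressors;
# X (features) and y (mean expert scores) are placeholders.
import numpy as np
from sklearn.ensemble import BaggingRegressor

def train_readability_regressor(X_train, y_train):
    # BaggingRegressor bags decision-tree regressors by default.
    model = BaggingRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    return model

def predict_integer_scores(model, X_test):
    # Regression output, rounded to the nearest integer and clipped to 1..5.
    return np.clip(np.rint(model.predict(X_test)), 1, 5).astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(390, 12))          # placeholder feature vectors
    y = rng.uniform(1, 5, size=390)         # placeholder mean expert scores
    model = train_readability_regressor(X[:300], y[:300])
    print(predict_integer_scores(model, X[300:310]))
```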
We take the average of the ex- specialized training in linguistic analysis and an- pert judge scores for each document as its gold- notation, including the machine translation post- standard score. Regression was also used by Ka- editing task. Each document was also judged for nungo and Orr (2009), although their evaluation its readability by six to ten naive human judges. did not constrain machine scores to be integers. These non-expert (naive) judges are native En- We tested several regression algorithms avail- glish speakers who are not language professionals able in the Weka1 machine learning package, and (e.g. editors, writers, English teachers, linguistic in Section 6.2 we report results for several which annotators, etc.) and have no specialized language performed best. The next section describes the analysis or linguistic annotation training. Both ex- numerically-valued features that we used as input pert and naive judges provided readability judg- for regression. ments using a customized web interface and gave a rating on a 5-point scale to indicate how readable 5 Features for Predicting Readability the passage is (where 1 is lowest and 5 is highest Good input features are critical to the success of readability) where readability is defined as a sub- any regression algorithm. We used three main cat- jective judgment of how easily a reader can extract egories of features to predict readability: syntac- the information the writer or speaker intended to tic features, language-model features, and lexical convey. features, as described below. 4 Readability Model 5.1 Features Based on Syntax We want to answer the question whether a Many times, a document is found to be unreadable machine can accurately estimate readability as due to unusual linguistic constructs or ungram- judged by a human. Therefore, we built a 1 machine-learning system that predicts the read- http://www.cs.waikato.ac.nz/ml/weka/ matical language that tend to manifest themselves mum entropy statistical parer, average constituent in the syntactic properties of the text. There- scores etc., however, they slightly degraded the fore, syntactic features have been previously used performance in combination with the rest of the (Bernth, 1997) to gauge the “clarity” of written features and hence we did not include them in text, with the goal of helping writers improve their the final set. One possible explanation could be writing skills. Here too, we use several features that averaging diminishes the effect of low scores based on syntactic analyses. Syntactic analyses caused by ungrammaticality. are obtained from the Sundance shallow parser (Riloff and Phillips, 2004) and from the English 5.2 Features Based on Language Models Slot Grammar (ESG) (McCord, 1989). A probabilistic language model provides a predic- Sundance features: The Sundance system is a tion of how likely a given sentence was generated rule-based system that performs a shallow syntac- by the same underlying process that generated a tic analysis of text. We expect that this analysis corpus of training documents. In addition to a over readable text would be “well-formed”, adher- general n-gram language model trained on a large ing to grammatical rules of the English language. body of text, we also exploit language models Deviations from these rules can be indications of trained to recognize specific “genres” of text. If a unreadable text. 
We attempt to capture such de- document is translated by a machine, or casually viations from grammatical rules through the fol- produced by humans for a weblog or newsgroup, lowing Sundance features computed for each text it exhibits a character that is distinct from docu- document: proportion of sentences with no verb ments that go through a dedicated editing process phrases, average number of clauses per sentence, (e.g., newswire and Wikipedia articles). Below average sentence length in tokens, average num- we describe features based on generic as well as ber of noun phrases per sentence, average number genre-specific language models. of verb phrases per sentence, average number of Normalized document probability: One obvi- prepositional phrases per sentence, average num- ous proxy for readability is the score assigned to ber of phrases (all types) per sentence and average a document by a generic language model (LM). number of phrases (all types) per clause. Since the language model is trained on well- ESG features: ESG uses slot grammar rules to written English text, it penalizes documents de- perform a deeper linguistic analysis of sentences viating from the statistics collected from the LM than the Sundance system. ESG may consider training documents. Due to variable document several different interpretations of a sentence, be- lengths, we normalize the document-level LM fore deciding to choose one over the other inter- score by the number of words and compute the pretations. Sometimes ESG’s grammar rules fail normalized document probability N P (D) for a to produce a single complete interpretation of a document D as follows: sentence, in which case it generates partial parses. ¡ ¢ 1 N P (D) = P (D|M) |D| , (1) This typically happens in cases when sentences are ungrammatical, and possibly, less readable. where M is a general-purpose language model Thus, we use the proportion of such incomplete trained on clean English text, and |D| is the num- parses within a document as a readability feature. ber of words in the document D. In case of extremely short documents, this propor- Perplexities from genre-specific language mod- tion of incomplete parses can be misleading. To els: The usefulness of LM-based features in account for such short documents, we introduce categorizing text (McCallum and Nigam, 1998; a variation of the above incomplete parse feature, Yang and Liu, 1999) and evaluating readability by weighting it with a log factor as was done in (Collins-Thompson and Callan, 2004; Heilman (Riloff, 1996; Thelen and Riloff, 2002). et al., 2007) has been investigated in previous We also experimented with some other syn- work. In our experiments, however, since doc- tactic features such as average sentence parse uments were acquired through several different scores from Stanford parser and an in-house maxi- channels, such as machine translation or web logs, we also build models that try to predict the genre modern LMs often have a very large vocabulary, of a document. Since the genre information for to get meaningful OOV rates, we truncate the vo- many English documents is readily available, we cabularies to the top (i.e., most frequent) 3000 trained a series of genre-specific 5-gram LMs us- words. For the purpose of OOV computation, a ing the modified Kneser-Ney smoothing (Kneser document D is treated as a sequence of tokenized and Ney, 1995; Stanley and Goodman, 1996). Ta- words {wi : i = 1, 2, · · · , |D|}. 
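The following sketch illustrates equation (1): the document's log-probability under a generic language model is averaged over its tokens and exponentiated. The per-token log-probabilities are assumed to be supplied by whatever LM is used; the function name is illustrative.

```python
# Sketch of equation (1): NP(D) = P(D|M)^(1/|D|), computed in log space.
# token_logprobs stands for the per-token values log P(w_i | h_i; M)
# produced by the generic language model M.
import math

def normalized_doc_probability(token_logprobs):
    if not token_logprobs:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)    # identical to P(D|M) ** (1 / |D|)

# Toy example with made-up natural-log probabilities for a three-word document.
print(normalized_doc_probability([-2.3, -1.7, -4.1]))
```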
Its OOV rate ble 1 contains a list of a base LM and genre- with respect to a (truncated) vocabulary V is then: specific LMs. PD I(wi ∈/ V) Given a document D consisting of tokenized OOV (D|V) = i=1 , (4) word sequence {wi : i = 1, 2, · · · , |D|}, its per- |D| plexity L(D|Mj ) with respect to a LM Mj is where I(wi ∈ / V) is an indicator function taking computed as: value 1 if wi is not in V, and 0 otherwise. ¡ P|D| ¢ Ratio of function words: A characteristic of doc- − 1 log P (wi |hi ;Mj ) L(D|Mj ) = e |D| i=1 , (2) uments generated by foreign speakers and ma- chine translation is a failure to produce certain where |D| is the number of words in D and hi are function words, such as “the,” or “of.” So we pre- the history words for wi , and P (wi |hi ; Mj ) is the define a small set of function words (mainly En- probability Mj assigns to wi , when it follows the glish articles and frequent prepositions) and com- history words hi . pute the ratio of function words over the total Posterior perplexities from genre-specific lan- number words in a document: guage models: While perplexities computed from PD genre-specific LMs reflect the absolute probabil- I(wi ∈ F) RF (D) = i=1 , (5) ity that a document was generated by a specific |D| model, a model’s relative probability compared to where I(wi ∈ F) is 1 if wi is in the set of function other models may be a more useful feature. To this words F, and 0 otherwise. end, we also compute the posterior perplexity de- Ratio of pronouns: Many foreign languages that fined as follows. Let D be a document, {Mi }G i=1 are source languages of machine-translated docu- be G genre-specific LMs, and L(D|Mi ) be the ments are pronoun-drop languages, such as Ara- perplexity of the document D with respect to Mi , bic, Chinese, and romance languages. We conjec- then the posterior perplexity, R(Mi |D), is de- ture that the pronoun ratio may be a good indica- fined as: tor whether a document is translated by machine L(D|Mi ) or produced by humans, and for each document, R(Mi |D) = PG . (3) we first run a POS tagger, and then compute the j=1 L(D|Mj ) ratio of pronouns over the number of words in the We use the term “posterior” because if a uni- document: form prior is adopted for {Mi }G PD i=1 , R(Mi |D) can I(P OS(wi ) ∈ P) be interpreted as the posterior probability of the RP (D) = i=1 , (6) |D| genre LM Mi given the document D. where I(P OS(wi ) ∈ F) is 1 if the POS tag of wi 5.3 Lexical Features is in the set of pronouns, P , and 0 otherwise. The final set of features involve various lexical Fraction of known words: This feature measures statistics as described below. the fraction of words in a document that occur Out-of-vocabulary (OOV) rates: We conjecture either in an English dictionary or a gazetteer of that documents containing typographical errors names of people and locations. (e.g., for closed-caption and web log documents) 6 Experiments may receive low readability ratings. Therefore, we compute the OOV rates of a document with re- This section describes the evaluation methodol- spect to the various LMs shown in Table 1. 
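A compact sketch of the language-model and lexical ratio features defined in equations (2) through (6) is given below. The tokenization, the truncated vocabulary, the function-word list, the POS tags, and the per-token log-probabilities are all assumed to come from external tools, and the pronoun tag set shown is only illustrative.

```python
# Sketch of the features in equations (2)-(6). Tokenization, the truncated
# vocabulary, the function-word list, and POS tags are assumed to come from
# external tools; the pronoun tag set below is only illustrative.
import math

def perplexity(token_logprobs):
    # Equation (2): exponential of the negative mean token log-probability.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def posterior_perplexities(genre_perplexities):
    # Equation (3): each genre LM's perplexity normalized by the sum over genres.
    total = sum(genre_perplexities)
    return [p / total for p in genre_perplexities]

def oov_rate(tokens, vocabulary):
    # Equation (4): fraction of tokens outside the (truncated) vocabulary.
    return sum(1 for w in tokens if w not in vocabulary) / len(tokens)

def function_word_ratio(tokens, function_words):
    # Equation (5): fraction of tokens that are pre-defined function words.
    return sum(1 for w in tokens if w in function_words) / len(tokens)

def pronoun_ratio(pos_tags, pronoun_tags=("PRP", "PRP$")):
    # Equation (6): fraction of tokens whose POS tag marks a pronoun.
    return sum(1 for t in pos_tags if t in pronoun_tags) / len(pos_tags)

# Toy usage with made-up values.
tokens = ["the", "report", "of", "they", "went"]
print(oov_rate(tokens, {"the", "of", "went"}))
print(function_word_ratio(tokens, {"the", "of", "a", "in", "to"}))
print(posterior_perplexities([perplexity([-2.1, -3.3]), perplexity([-1.2, -0.9])]))
```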
Since ogy and metrics and presents and discusses our Genre Training Size(M tokens) Data Sources base 5136.8 mostly LDC’s GigaWord set NW 143.2 newswire subset of base NG 218.6 newsgroup subset of base WL 18.5 weblog subset of base BC 1.6 broadcast conversation subset of base BN 1.1 broadcast news subset of base wikipedia 2264.6 Wikipedia text CC 0.1 closed caption ZhEn 79.6 output of Chinese to English Machine Translation ArEn 126.8 output of Arabic to English Machine Translation Table 1: Genre-specific LMs: the second column contains the number of tokens in LM training data (in million tokens). experimental results. The results of the official Algorithm Correlation evaluation task are also reported. Bagged Decision Trees 0.8173 Decision Trees 0.7260 6.1 Evaluation Metric Linear Regression 0.7984 The evaluation process for the DARPA MRP read- SVM Regression 0.7915 ability test was designed by the evaluation team Gaussian Process Regression 0.7562 led by SAIC. In order to compare a machine’s Naive Judges predicted readability score to those assigned by Upper Critical Value 0.7015 the expert judges, the Pearson correlation coef- Distribution Mean 0.6517 ficient was computed. The mean of the expert- Baselines judge scores was taken as the gold-standard score Uniform Random 0.0157 for a document. Proportional Random -0.0834 To determine whether the machine predicts scores closer to the expert judges’ scores than Table 2: Comparing different algorithms on the readability what an average naive judge would predict, a task using 13-fold cross-validation on the 390 documents us- ing all the features. Exceeding the upper critical value of the sampling distribution representing the underlying naive judges’ distribution indicates statistically significantly novice performance was computed. This was ob- better predictions than the naive judges. tained by choosing a random naive judge for every document, calculating the Pearson correlation co- efficient with the expert gold-standard scores and used stratified 13-fold cross-validation in which then repeating this procedure a sufficient number the documents from various genres in each fold of times (5000). The upper critical value was set was distributed in roughly the same proportion as at 97.5% confidence, meaning that if the machine in the overall dataset. We first conducted experi- performs better than the upper critical value then ments to test different regression algorithms using we reject the null hypothesis that machine scores all the available features. Next, we ablated various and naive scores come from the same distribution feature sets to determine how much each feature and conclude that the machine performs signifi- set was contributing to making accurate readabil- cantly better than naive judges in matching the ex- ity judgements. These experiments are described pert judges. in the following subsections. 6.2 Results and Discussion 6.2.1 Regression Algorithms We evaluated our readability system on the dataset We used several regression algorithms available of 390 documents which was released earlier dur- in the Weka machine learning package and Table 2 ing the training phase of the evaluation task. We shows the results obtained. The default values Feature Set Correlation the results. The language-model feature set per- Lexical 0.5760 forms the best, but performance improves when it Syntactic 0.7010 is combined with the remaining features. 
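One plausible reading of this evaluation procedure is sketched below: correlations are computed for 5000 random assemblies of naive-judge scores (one randomly chosen naive judge per document), and the 97.5th percentile of that sampled distribution serves as the upper critical value against which the machine's correlation is compared. All score matrices here are synthetic placeholders.

```python
# Sketch of the Section 6.1 evaluation: the machine's Pearson correlation with
# mean expert scores is compared against the 97.5% upper critical value of a
# sampled distribution of naive-judge correlations. Scores are placeholders.
import numpy as np

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def naive_upper_critical_value(expert_mean, naive_scores, n_samples=5000, seed=0):
    # naive_scores: (num_docs, num_naive_judges) matrix of ratings.
    rng = np.random.default_rng(seed)
    num_docs, num_judges = naive_scores.shape
    correlations = []
    for _ in range(n_samples):
        # Pick one random naive judge per document and correlate with experts.
        picks = naive_scores[np.arange(num_docs), rng.integers(num_judges, size=num_docs)]
        correlations.append(pearson(expert_mean, picks))
    return np.percentile(correlations, 97.5)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    expert_mean = rng.uniform(1, 5, size=390)
    naive = np.clip(np.rint(expert_mean[:, None] + rng.normal(0, 1, (390, 6))), 1, 5)
    machine = expert_mean + rng.normal(0, 0.5, 390)
    print("machine correlation:", pearson(expert_mean, machine))
    print("upper critical value:", naive_upper_critical_value(expert_mean, naive))
```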
The lex- Lexical + Syntactic 0.7274 ical feature set by itself performs the worst, even Language Model based 0.7864 below the naive distribution mean (shown in Ta- All 0.8173 ble 2); however, when combined with syntactic features it performs well. Table 3: Comparison of different linguistic feature sets. In our second ablation experiment, we com- pared the performance of genre-independent and in Weka were used for all parameters, changing genre-based features. Since the genre-based fea- these values did not show any improvement. We tures exploit knowledge of the genres of text used used decision tree (reduced error pruning (Quin- in the MRP readability corpus, their utility is lan, 1987)) regression, decision tree regression somewhat tailored to this specific corpus. There- with bagging (Breiman, 1996), support vector re- fore, it is useful to evaluate the performance of the gression (Smola and Scholkopf, 1998) using poly- system when genre information is not exploited. nomial kernel of degree two,2 linear regression Of the lexical features described in subsection 5.3, and Gaussian process regression (Rasmussen and the ratio of function words, ratio of pronoun words Williams, 2006). The distribution mean and the and all of the out-of-vocabulary rates except for upper critical values of the correlation coefficient the base language model are genre-based features. distribution for the naive judges are also shown in Out of the language model features described in the table. the Subsection 5.2, all of the perplexities except Since they are above the upper critical value, all for the base language model and all of the poste- algorithms predicted expert readability scores sig- rior perplexities3 are genre-based features. All of nificantly more accurately than the naive judges. the remaining features are genre-independent. Ta- Bagged decision trees performed slightly better ble 4 shows the results comparing these two fea- than other methods. As shown in the following ture sets. The genre-based features do well by section, ablating features affects predictive accu- themselves but the rest of the features help fur- racy much more than changing the regression al- ther improve the performance. While the genre- gorithm. Therefore, on this task, the choice of re- independent features by themselves do not exceed gression algorithm was not very critical once good the upper critical value of the naive judges’ dis- readability features are used. We also tested two tribution, they are very close to it and still out- simple baseline strategies: predicting a score uni- perform its mean value. These results show that formly at random, and predicting a score propor- for a dataset like ours, which is composed of a mix tional to its frequency in the training data. As of genres that themselves are indicative of read- shown in the last two rows of Table 2, these base- ability, features that help identify the genre of a lines perform very poorly, verifying that predict- text improve performance significantly.4 For ap- ing readability on this dataset as evaluated by our plications mentioned in the introduction and re- evaluation metric is not trivial. lated work sections, such as filtering less readable documents from web-search, many of the input 6.2.2 Ablations with Feature Sets documents could come from some of the common We evaluated the contributions of different fea- genres considered in our dataset. ture sets through ablation experiments. 
Bagged In our final ablation experiment, we evaluated decision-tree was used as the regression algorithm 3 Base model for posterior perplexities is computed using in all of these experiments. First we compared other genre-based LMs (equation 3) hence it can not be con- syntactic, lexical and language-model based fea- sidered genre-independent. 4 tures as described in Section 5, and Table 3 shows We note that none of the genre-based features were trained on supervised readability data, but were trained on 2 Polynomial kernels with other degrees and RBF kernel readily-available large unannotated corpora as shown in Ta- performed worse. ble 1. Feature Set Correlation System Correl. Avg. Diff. Target Hits Genre-independent 0.6978 Our (A) 0.8127 0.4844 0.4619 Genre-based 0.7749 System B 0.6904 0.3916 0.4530 All 0.8173 System C 0.8501 0.5177 0.4641 Upper CV 0.7423 0.0960 0.3713 Table 4: Comparison of genre-independent and genre- based feature sets. Table 6: Results of the systems that participated in the DARPA’s readability evaluation task. The three metrics used Feature Set By itself Ablated were correlation, average absolute difference and target hits from All measured against the expert readability scores. The upper critical values are for the score distributions of naive judges. Sundance features 0.5417 0.7993 ESG features 0.5841 0.8118 Perplexities 0.7092 0.8081 puted a score inversely proportional to that width. Posterior perplexities 0.7832 0.7439 The final target hits score was then computed by Out-of-vocabulary rates 0.3574 0.8125 averaging it across all the documents. The upper All 0.8173 - critical values for these metrics were computed in a way analogous to that for the correlation met- Table 5: Ablations with some individual feature sets. ric which was described before. Higher score is better for all the three metrics. Table 6 shows the results of the evaluation. Our system performed the contribution of various individual feature sets. favorably and always scored better than the up- Table 5 shows that posterior perplexities perform per critical value on each of the metrics. Its per- the strongest on their own, but without them, the formance was in between the performance of the remaining features also do well. When used by other two systems. The performances of the sys- themselves, some feature sets perform below the tems show that the correlation metric was the most naive judges’ distribution mean, however, remov- difficult of the three metrics. ing them from the rest of the feature sets de- grades the performance. This shows that no indi- 7 Conclusions vidual feature set is critical for good performance Using regression over a diverse combination of but each further improves the performance when syntactic, lexical and language-model based fea- added to the rest of the feature sets. tures, we built a system for predicting the read- ability of natural-language documents. The sys- 6.3 Official Evaluation Results tem accurately predicts readability as judged by An official evaluation was conducted by the eval- linguistically-trained expert human judges and uation team SAIC on behalf of DARPA in which exceeds the accuracy of naive human judges. three teams participated including ours. The eval- Language-model based features were found to be uation task required predicting the readability of most useful for this task, but syntactic and lexical 150 test documents using the 390 training docu- features were also helpful. We also found that for ments. 
Besides the correlation metric, two addi- a corpus consisting of documents from a diverse tional metrics were used. One of them computed mix of genres, using features that are indicative for a document the difference between the aver- of the genre significantly improve the accuracy of age absolute difference of the naive judge scores readability predictions. Such a system could be from the mean expert score and the absolute dif- used to filter out less readable documents for ma- ference of the machine’s score from the mean ex- chine or human processing. pert score. This was then averaged over all the documents. The other one was “target hits” which Acknowledgment measured if the predicted score for a document This research was funded by Air Force Contract fell within the width of the lowest and the highest FA8750-09-C-0172 under the DARPA Machine expert scores for that document, and if so, com- Reading Program. References Quinlan, J. R. 1987. Simplifying decision trees. Interna- tional Journal of Man-Machine Studies, 27:221–234. Bernth, Arendse. 1997. Easyenglish: A tool for improv- ing document quality. In Proceedings of the fifth con- Rasmussen, Carl and Christopher Williams. 2006. Gaussian ference on Applied Natural Language Processing, pages Processes for Machine Leanring. MIT Press, Cambridge, 159–165, Washington DC, April. MA. Breiman, Leo. 1996. Bagging predictors. Machine Learn- Riloff, E. and W. Phillips. 2004. An introduction to the Sun- ing, 24(2):123–140. dance and Autoslog systems. Technical Report UUCS- 04-015, University of Utah School of Computing. Chall, J.S. and E. Dale. 1995. Readability Revisited: The New Dale-Chall Readability Formula. Brookline Books, Riloff, Ellen. 1996. Automatically generating extraction Cambridge, MA. patterns from untagged text. In Proc. of 13th Natl. Conf. on Artificial Intelligence (AAAI-96), pages 1044–1049, Collins-Thompson, Kevyn and James P. Callan. 2004. A Portland, OR. language modeling approach to predicting reading diffi- culty. In Proc. of HLT-NAACL 2004, pages 193–200. Schwarm, Sarah E. and Mari Ostendorf. 2005. Reading level assessment using support vector machines and statistical Fry, E. 1990. A readability formula for short passages. Jour- language models. In Proc. of ACL 2005, pages 523–530, nal of Reading, 33(8):594–597. Ann Arbor, Michigan. Gunning, R. 1952. The Technique of Clear Writing. Si, Luo and James P. Callan. 2001. A statistical model for McGraw-Hill, Cambridge, MA. scientific readability. In Proc. of CIKM 2001, pages 574– 576. Heilman, Michael, Kevyn Collins-Thompson, Jamie Callan, and Maxine Eskenazi. 2007. Combining lexical and Smola, Alex J. and Bernhard Scholkopf. 1998. A tutorial grammatical features to improve readability measures for on support vector regression. Technical Report NC2-TR- first and second language texts. In Proc. of NAACL-HLT 1998-030, NeuroCOLT2. 2007, pages 460–467, Rochester, New York, April. Stanley, Chen and Joshua Goodman. 1996. An empirical Heilman, Michael, Kevyn Collins-Thompson, and Maxine study of smoothing techniques for language modeling. In Eskenazi. 2008. An analysis of statistical models and fea- Proc. of the 34th Annual Meeting of the Association for tures for reading difficulty prediction. In Proceedings of Computational Linguistics (ACL-96), pages 310–318. the Third Workshop on Innovative Use of NLP for Build- ing Educational Applications, pages 71–79, Columbus, Stenner, A. J., I. Horabin, D. R. Smith, and M. Smith. 1988. Ohio, June. Association for Computational Linguistics. 
The Lexile Framework. Durham, NC: MetaMetrics.

Kanungo, Tapas and David Orr. 2009. Predicting the readability of short web summaries. In Proc. of WSDM 2009, pages 202–211, Barcelona, Spain, February.

Kincaid, J. P., R. P. Fishburne, R. L. Rogers, and B. S. Chissom. 1975. Derivation of new readability formulas for navy enlisted personnel. Technical Report Research Branch Report 8-75, Millington, TN: Naval Air Station.

Kneser, Reinhard and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proc. of ICASSP-95, pages 181–184.

McCallum, Andrew and Kamal Nigam. 1998. A comparison of event models for naive Bayes text classification. In Papers from the AAAI-98 Workshop on Text Categorization, pages 41–48, Madison, WI, July.

McCord, Michael C. 1989. Slot grammar: A system for simpler construction of practical natural language grammars. In Proceedings of the International Symposium on Natural Language and Logic, pages 118–145, May.

McLaughlin, G. H. 1969. SMOG grading: A new readability formula. Journal of Reading, 12:639–646.

Pitler, Emily and Ani Nenkova. 2008. Revisiting readability: A unified framework for predicting text quality. In Proc. of EMNLP 2008, pages 186–195, Waikiki, Honolulu, Hawaii, October.

Thelen, M. and E. Riloff. 2002. A bootstrapping method for learning semantic lexicons using extraction pattern contexts. In Proc. of EMNLP 2002, Philadelphia, PA, July.

Yang, Yiming and Xin Liu. 1999. A re-examination of text categorization methods. In Proc. of 22nd Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 42–48, Berkeley, CA.