Journal of Emerging Trends in Computing and Information Sciences, ISSN 2079-8407, Vol. 3, No. 1, January 2012
http://www.cisjournal.org/journalofcomputing/archive/vol3no1/vol3no1_7.pdf

OCR POST-PROCESSING ERROR CORRECTION ALGORITHM USING GOOGLE'S ONLINE SPELLING SUGGESTION

Youssef Bassil, Mohammad Alwani
LACSC – Lebanese Association for Computational Sciences
Registered under No. 957, 2011, Beirut, Lebanon

[email protected], [email protected]

ABSTRACT

With the advent of digital optical scanners, a lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into an electronic version that can be manipulated by a computer. For this purpose, OCR, short for Optical Character Recognition, was developed to translate scanned graphical text into editable computer text. Unfortunately, OCR is still imperfect, as it occasionally mis-recognizes letters and falsely identifies scanned text, leading to misspellings and linguistic errors in the OCR output text. This paper proposes a post-processing context-based error correction algorithm for detecting and correcting OCR non-word and real-word errors. The proposed algorithm is based on Google's online spelling suggestion, which harnesses an internal database containing a huge collection of terms and word sequences gathered from all over the web, convenient for suggesting possible replacements for words that have been misspelled during the OCR process. Experiments carried out revealed a significant improvement in OCR error correction rate. Future research can improve upon the proposed algorithm so that it can be parallelized and executed over multiprocessing platforms.

Keywords: Optical Character Recognition, Error Correction, Google Spelling Suggestion, Post-processing

1. INTRODUCTION

The drastic introduction of modern computers into every area of life has radically led to a paradigm shift in the way people trade, communicate, learn, share knowledge, and get entertained. Present-day computers are electronic and digital, and thus they can only process data in digital format. Given that, anything that requires computer processing must first be transformed into a digital form. For instance, the Boston Public Library, which features more than 6.1 million books [1], all open to the public, inevitably has to convert all its paper-based books into digital documents so that they can be stored on a computer's hard drive. In the same context, it has been estimated that more than 200 million books are published every year [2], many of which are distributed and printed on traditional paper [3]. In view of that, it is impossible to store all these books on a computer and manage them using software applications unless they are first converted into a digital form.

OCR, short for Optical Character Recognition, is the process of converting scanned images of text into editable digital documents that can be processed, edited, searched, saved, and copied an unlimited number of times without any degradation or loss of information using a computer. Although OCR sounds perfect for transforming a traditional library into an e-library, it is subject to errors and shortcomings. Practically, the error rate of OCR systems can become fairly high, occasionally close to 10% [4], if the papers being scanned have numerous defects such as bad physical condition, poor printing quality, discolored materials, and old age. When an OCR system fails to recognize a character, an OCR error is produced, commonly causing a spelling mistake in the output text. For instance, the character "B" can be improperly converted into the number "8", the character "S" into the number "5", the character "O" into the number "0", and so forth. To remedy this problem, humans can manually review and correct the OCR output text by hand. To a certain extent, this procedure is considered costly, time-consuming, laborious, and error-prone, as the human eye may miss some mistakes. A better approach could be automating the correction of misspelled words using computer software such as spell checkers. This solution consists of using a lookup dictionary to search for misspelled words and correct them suitably. While this technique tries to solve the actual problem, it in fact introduces another, yet more awkward, problem. In effect, the dictionary approach looks at the misspelled word in isolation, in the sense that it does not take into consideration the context in which the error has occurred. For this reason, linguistic context-based error correction techniques were proposed to detect and correct OCR errors with respect to their grammatical and semantic context [5, 25]. As a result, the net outcome using context-based error correction can be noteworthy, as it greatly improves the OCR error correction rate [6].

Obviously, all of the aforementioned methods still have a common drawback: they all require the integration of a vast dictionary that covers almost every single word in the target language. Additionally, this dictionary should encompass proper nouns, names of countries and locations, scientific terminologies, and technical keywords. Finally, the content of this dictionary should be constantly updated so as to include new emerging words in the language. Since in practice it is almost impossible to compile such a wide-ranging dictionary, it would be wise to use a web of online text corpuses containing all possible words, terms, expressions, jargons, and terminologies that have ever occurred in the language. This web of words can be seamlessly provided by the Google search engine [30].

This paper proposes a new post-processing method for OCR error correction based on spelling suggestion and the "did you mean" feature of Google's online web search engine. The goal of this approach is to automate the proofreading of OCR text and provide context-based detection and correction of OCR errors. The process starts by chunking the OCR output text B, possibly containing spelling mistakes, into blocks of five words each. Then, every single block in B = {b0, b1, b2…bn} is submitted as a search query to Google's web search engine; if the search returns "did you mean: ci", where ci is the alternative spelling suggestion for block bi, then block bi is considered misspelled and is replaced by the suggested block ci. Otherwise, in case no suggestion is returned, block bi remains intact and is appended to the list of correct blocks. Eventually, the fully corrected OCR text is the collection of the correct blocks, formally represented as C = {c0, c1, c2…cn}.
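The chunking step described above can be sketched as follows; this is a minimal illustration, and the function name and parameter are ours, not the paper's:

```python
def chunk_into_blocks(text, block_size=5):
    """Split whitespace-delimited text into consecutive blocks of block_size words."""
    words = text.split()
    return [" ".join(words[i:i + block_size])
            for i in range(0, len(words), block_size)]

# An OCR output of seven words yields one full block and one short final block.
blocks = chunk_into_blocks("Huw is your day I am fine", 5)
# blocks == ["Huw is your day I", "am fine"]
```

Each resulting block would then be submitted as a single search query, so that the surrounding four words give the suggestion engine the context it needs.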
2. OPTICAL CHARACTER RECOGNITION

Optical Character Recognition (OCR) is the process of translating images of handwritten or typewritten text into machine-editable text [4]. These images are commonly captured using computer scanners or digital cameras. The quality of the images being scanned plays a critical role in determining the error rate in the recognized text. For instance, OCR systems may produce poor and insignificant results if their input source is physically out of condition, of old age, of low printing quality, or contains imperfections and distortions such as rips, stains, blots, and discolorations [7, 8].

Two types of optical character recognition systems exist. The first type is the offline OCR system, which extracts data from scanned images through optical scanners and cameras; the second type is the online OCR system, which employs special digitizers to capture the user's writing in real time according to the order of the lettering, the speed, and the pen movements and strokes.

Technically speaking, every OCR system undergoes a sequence of stages in order to convert a paper text document into computer digital text. This process consists of the image acquisition stage, which captures the input document; the pre-processing stage, which improves the quality of and removes artifacts from the input document; the feature extraction and classification stage, which extracts similar objects from the input document and groups them into classes so that they can be recognized as characters and words; and finally the post-processing stage, which refines the OCR output text by correcting linguistic misspellings.

3. OCR POST-PROCESSING

As discussed in the previous section, post-processing is the last activity to occur in the series of OCR processing stages. Chiefly, the goal of post-processing is to detect and correct linguistic misspellings in the OCR output text after the input image has been scanned and completely processed. Fundamentally, there are two types of OCR errors: non-word errors and real-word errors [9]. A non-word error is a word that is recognized by the OCR system but does not correspond to any entry in the lexicon. For instance, when "How is your day" is recognized by the OCR system as "Huw is your day", then "Huw" is said to be a non-word error because "Huw" is not defined in the English language. In contrast, a real-word error is a word that is recognized by the OCR system and does correspond to an entry in the lexicon, albeit it is grammatically incorrect with respect to the sentence in which it has occurred. For instance, when "How is your day" is recognized by the OCR system as "How is you day", then "you" is considered a real-word error because "you", although syntactically correct (available in the English language), is used in a grammatically incorrect way in this sentence.

Typically, non-word and real-word errors fall under three classes of errors: deletion, insertion, and substitution errors. A deletion error occurs when one or more characters are discarded or removed from within the original word, for example, mis-recognizing the word "House" as "Hose", "Huse", "Hse", or even "ouse". An insertion error occurs when one or more extra characters are added or stuffed into the original word, for instance, mis-recognizing the word "Science" as "Sciencce" or "Sciience". A substitution error occurs when one or more characters are accidentally changed in the original word, such as changing the character "m" in "Computer" to "n" or changing the character "g" in "Against" to "q".

The poor condition of the papers being processed is by far the lone culprit for producing OCR errors, consequently causing OCR systems either to operate imprecisely or to fail utterly. Therefore, countless post-processing approaches and algorithms were proposed in an attempt to detect and correct OCR errors. In sum, they can be broadly broken down into three major categories: manual error correction, dictionary-based error correction, and context-based error correction.

3.1 Manual Error Correction

Intuitively, the easiest way to correct OCR errors is to hire a group of people to sit down and edit the OCR output text manually. This approach is often known as proofreading and, although straightforward, it requires continuous manual human intervention. Distributed Proofreaders (DP) [10], initiated by Charles Franks in 2000 and originally meant to assist Project Gutenberg (PG) [11], is a web-based project designed to facilitate the collaborative conversion and proofreading of paper books into e-books. The idea of DP is to employ volunteers from all around the world to compare scanned documents with their corresponding OCR texts. Proofreading and correction of OCR errors are done through several rounds by several people as necessary. Once the process is completed, the verified OCR texts are assembled and added to the Project Gutenberg archive.

Despite the fact that proofreading is achievable, it is still considered error-prone, as humans may unintentionally overlook or miss some mistakes. Furthermore, manual correction is to some degree regarded as a laborious, costly, and time-consuming practice.
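The three error classes above can be illustrated by aligning the intended word against the recognized word. The sketch below uses Python's standard difflib for the alignment; it is an illustration of the taxonomy, not part of the paper's method:

```python
from difflib import SequenceMatcher

def error_classes(original, recognized):
    """Classify character-level OCR damage as deletion, insertion, or
    substitution by aligning the intended word with the recognized word."""
    classes = set()
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, original, recognized).get_opcodes():
        if tag == "delete":        # characters of the original are missing
            classes.add("deletion")
        elif tag == "insert":      # extra characters appear in the output
            classes.add("insertion")
        elif tag == "replace":     # characters were swapped for others
            classes.add("substitution")
    return classes

error_classes("House", "Hse")         # {"deletion"}
error_classes("Science", "Sciencce")  # {"insertion"}
error_classes("Computer", "Conputer") # {"substitution"}
```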
3.2 Dictionary-Based Error Correction

In a relentless effort to find a way to better detect and correct misspelled words in OCR text, researchers conceived the dictionary-based error correction methodology, also known as lexical error correction. In this approach, a lexicon or a lookup dictionary is used to spell-check OCR-recognized words and correct them if they are misspelled [12]. In some cases, a list of candidates is generated to assist in the correction of misspelled words. For instance, the correction candidates for the error word "poposd" can be "opposed", "proposed", "pops", and "popes".

In point of fact, several non-trivial dictionary-based error correction algorithms exist, one of which is the string matching algorithm that weights the words in a text using a distance metric representing various costs. The correction candidate with the lowest distance with respect to the misspelled word is the best fit as a correction [13]. Another algorithm [14] demonstrated that using the syntactic properties of the language and the n-gram model can speed up the process of generating correction candidates and ultimately picking the best matching candidate. [15] proposed an OCR post-processing error correction method based on pattern learning, wherein a list of correction candidates is first generated from a lexicon, and then the most proper candidate is selected as a correction based on the vocabulary and grammar characteristics surrounding the error word. [16] proposed a statistical method for auto-correction of OCR errors; this approach uses a dictionary to generate a list of correction candidates based on the n-gram model. Then, all words in the OCR text are grouped into a frequency matrix that identifies the existing sequences of characters and their counts, and the correction candidate having the highest count in the frequency matrix is selected to substitute the error word. [17] proposed an improved design that employs a clustering technique to build a set of groups containing all correction candidates. Several iterations of word frequency analysis are then executed on these clusters to eliminate the unlikely candidate words; in due course, only a single candidate survives to replace the misspelled word. [18] proposed the use of a topic model to correct the OCR output text. It is a global word probability model in which documents are labeled with a semantic topic having a specific independent vocabulary distribution. In other words, every scanned document is semantically classified according to its topic using an unsupervised training model, and every misspelled word is then corrected by selecting the correction candidate that belongs to the same class as the actual error. [19] proposed a divergent approach based on syntactic and semantic correction of OCR errors; the idea pivots around the analysis of sentences to deduce whether or not they are syntactically and semantically correct. If a suspicious sentence is encountered, possible correction candidates are generated from a dictionary and grouped top-down with respect to their strongest syntactic and semantic constraints. In the long run, the candidate at the top of each group is the one that substitutes the corresponding OCR error. [20] proposed the idea of using a Hidden Markov Model (HMM) to integrate syntactic information into post-processing error correction; the suggested model achieved a higher rate of error correction due to its statistical nature in selecting the most probable candidate for a particular misspelled word. [21] introduced an intelligent autonomic model capable of self-learning, self-configuring, and self-adapting; the idea behind it is that as the system operates, its ability to self-find and self-correct errors increases. [22] proposed a blend of post-processing tools that help fight spelling errors. In this method, the OCR text is sent through a series of filters with the intention of correcting misspellings via multiple passes. On every pass, a spell checker tool intervenes to detect and correct misspelled words; after several passes, the number of OCR errors is exponentially reduced.

3.3 Context-Based Error Correction

Hypothetically, dictionary-based error correction techniques are reasonably plausible and successful. However, they are unable to correct errors based on their context, i.e. based on their grammatical occurrence in the sentence. Context-based error correction techniques, on the other hand, perform error detection and correction based on the error's grammatical and sometimes semantic context. This would solve the previous dilemma of correcting real-word errors such as in the sentence "How is you day", because according to the context in which "you" has occurred, it is unlikely to have a personal pronoun followed by a noun; rather, it is more likely to have a possessive pronoun followed by a noun.

In order to bring context-based error correction into practice, several innovative solutions were considered, the majority of them grounded on statistical language models (SLMs) and feature-based methods. [23] described a context-sensitive word-error correction system based on confusion mapping that uses confusion probabilities to identify frequently wrong sequences and convert them into the most probable correct sequence; in other terms, it models how likely one letter is to have been misinterpreted as another. [24] applied a part-of-speech (POS) tagger and grammatical rules of the English language to capture real-word errors in the OCR text. For instance, one of these rules states that a verb can be followed by a gerund object but cannot be followed by a second verb, while another rule states that a third-person verb in the present tense must always take an "s". The aggregate of these rules drove the logic of the algorithm and achieved reasonable context-based OCR error correction. [25] used word trigrams to capture and correct non-word and real-word errors; the idea is to use a combination of a lookup dictionary to correct non-word errors and a statistical model to correct real-word errors according to their context. [26] proposed a Bayesian classifier that treats real-word errors as ambiguous and then tries to find the actual target word by calculating the most likely candidate based on probabilistic relationships between the error and the candidate word. [27] joined all the previous ideas into a concrete solution: a POS tagger enhanced by a word trigram model and a statistical Bayesian classifier developed to correct real-word errors in OCR text. Overall, the mixture of these techniques hugely improved the OCR post-processing error correction rate.
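The distance-metric candidate selection attributed to [13] can be sketched as follows. This is a toy illustration with unit edit costs and a hypothetical four-word candidate list, not the weighted cost scheme of the original algorithm:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # delete ca
                            curr[j - 1] + 1,              # insert cb
                            prev[j - 1] + (ca != cb)))    # substitute ca -> cb
        prev = curr
    return prev[-1]

def best_candidate(error, candidates):
    """Pick the candidate with the lowest distance to the error word."""
    return min(candidates, key=lambda w: edit_distance(error, w))

best_candidate("poposd", ["opposed", "proposed", "pops", "popes"])
# -> "proposed" (two edits away, the lowest distance in this toy list)
```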
4. LIMITATIONS OF DICTIONARY-BASED ERROR CORRECTION

Although dictionary-based error correction techniques are easy to implement and use, they still have various limitations and drawbacks that prevent them from being the perfect solution for OCR post-processing error correction.

The first limitation is that the dictionary-based approach requires a wide-ranging dictionary that covers every single word in the language. For instance, the Oxford dictionary [28] embraces 171,476 words in current use and 47,156 obsolete words, in addition to around 9,500 derivatives. This suggests that there are, at the very least, a quarter of a million distinct English words. Besides, spoken languages may have one or more varieties, each with dissimilar words; for instance, the German language has two varieties, a new-spelling variant and an old-spelling variant. Likewise, the Armenian language has three varieties, each with a number of deviating words: Eastern Armenian, Western Armenian, and Grabar. The Arabic language also follows the same norm, as it has many assortments and dialects that diverge broadly from country to country, from region to region, and from era to era [29]. For instance, the ancient Arabic language that was used before 600 A.D. in the north and south regions of the Arabian Peninsula is totally different from the classical Arabic that is used in the present day. Therefore, it is obvious that languages are not uniform, in the sense that they are not standardized and thereby cannot be supported by a single dictionary.

The second limitation is that regular dictionaries normally target a single specific language and thus cannot support multiple languages simultaneously. For instance, the Oxford and the Merriam–Webster dictionaries only target the English language, the Larousse dictionary targets the French language, while the Al Munjid dictionary targets the Arabic language. Henceforth, it is unquestionably impossible to create an international dictionary pertaining to all languages of the world.

The third limitation is that conventional dictionaries do not support proper and personal names; names of countries, regions, geographical locations, and historical sites; technical keywords; domain-specific terms; and acronyms. For instance, an ordinary dictionary could falsely detect "Thomas Jefferson", "Machu Picchu", and "São Tomé" as incorrect words. Similarly, scientific terminologies such as "RAM", "CPU", and "pixel", and names of diseases such as "AIDS", "Hypothermia", and "Malaria" could incorrectly be detected as misspellings. In total, it is nearly unviable to compile a universal dictionary with words from all existing domains and disciplines.

The fourth and last limitation is that the content of a standard dictionary is static, in the sense that it is not updated with new emerging words unless manually edited, and thus it cannot keep pace with the immense dynamic breeding of new words and terms.

For all the above reasons, attaining better OCR post-processing dictionary-based error correction results greatly requires finding a universal, multi-language, and dynamic dictionary embracing a colossal volume of entries, words, terms, proper nouns, expressions, jargons, and terminologies that could possibly occur in a text.
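The third limitation can be demonstrated with a naive lookup. The miniature lexicon below is hypothetical, but the failure mode is exactly the one a full static dictionary exhibits for proper nouns and acronyms:

```python
# Hypothetical miniature lexicon; a real dictionary is far larger
# but has the same blind spot for names and acronyms.
lexicon = {"the", "computer", "is", "on", "desk", "memory", "science"}

def flag_unknown(words, lexicon):
    """Naive dictionary spell check: any word outside the lexicon is flagged."""
    return [w for w in words if w.lower() not in lexicon]

flag_unknown(["USB", "RAM", "Machu", "Picchu", "computer"], lexicon)
# every valid term except "computer" is falsely flagged as a misspelling
```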
5. PROPOSED SOLUTION

This paper proposes a new post-processing method and algorithm for OCR error correction based on the "did you mean" spelling suggestion feature of Google's online web search engine [30]. The idea centers on using Google's massive indexed data to detect and correct misspelled words in the OCR output text. The algorithm starts by chopping the OCR text into several tokens of words. Each token is then sent as a search query to Google's search engine so that it gets processed. In case the query contains a misspelled word, Google will suggest a possible correction via its "did you mean" feature. Consequently, this spelling suggestion is considered as a correction for the misspelled query.

5.1 Inner Workings of Google's Spelling Suggestion

According to [31, 32], Google's spelling suggestion system can suggest an alternative correct spelling for often-made typos, misspellings, and keyboarding errors. Under the hood, Google has a titanic database of millions of public web pages containing trillions of term collections and n-gram words that can be used as groundwork for all kinds of linguistic applications such as machine translation, speech recognition, and spell checking, as well as other types of text processing problems. Inherently, Google's spelling suggestion scheme is based on the probabilistic n-gram model originally proposed by [33] for predicting the next word in a particular sequence of words. In short, an n-gram is simply a collocation of words that is n words long; for instance, "The boy" is a 2-gram phrase, also referred to as a bigram, "The boy scout" is a 3-gram phrase, also referred to as a trigram, "The boy is driving his car" is a 6-gram phrase, and so forth. Google's algorithm automatically examines every single word in the search query for any possible misspelling. It first tries to match the query, basically composed of an ordered association of words, with any alike occurrence in Google's index database. If the number of occurrences is high, then the query is considered correct and no correction takes place. However, if the query was not found, Google tries to infer the next possible correct word in the query based on the n-gram statistics deduced from its database of indexed webpages. In due course, a suggestion for the whole misspelled query is generated and displayed to the user in the form of "did you mean: spelling-suggestion". For example, searching for the word "conputer" drives Google's search engine to suggest "did you mean: computer". Likewise, searching for "conputer on the tesk" drives Google's search engine to suggest "did you mean: computer on the desk". Also, searching for the proper name "maw tsi toung" drives Google's search engine to suggest "did you mean: mao tse tung". Figures 1-3 show the different suggestions returned by Google's search engine when searching for the misspelled queries "conputer", "conputer on the tesk", and "maw tsi toung" respectively.

Figure 1: Spelling suggestion for "conputer"
Figure 2: Spelling suggestion for "conputer on the tesk"
Figure 3: Spelling suggestion for "maw tsi toung"

5.2 Proposed Solution Specifications

The proposed solution is a context-based error correction algorithm built on Google's spelling suggestion technology and meant to be integrated into OCR systems as a post-processor to detect and correct non-word and real-word errors. Specifically, after the image document has been scanned and digitally processed to produce computer text, the proposed algorithm breaks down the OCR recognized output text into a collection of blocks, each containing five words (five words were chosen to provide Google with enough insight about the context of every block). These blocks are fed one by one to Google's search engine as search parameters. If Google returns a successful response without the "did you mean" expression, then it is evident that the query contains no misspelled words and thus no correction is needed for this particular block of words. Contrariwise, if Google responds with a "did you mean: spelling-suggestion" expression, then the query definitely contains some misspelled words and a correction is required for this particular block of words. The actual correction consists of replacing the original block in the OCR output text with Google's alternative suggested correction. Figure 4 is a variation of the generic OCR system proposed by [4], upgraded with an additional post-processing layer using the proposed error correction algorithm.

Figure 4: OCR system enhanced with the proposed algorithm

5.3 The Proposed Algorithm

The proposed algorithm comprises several steps to be executed in order to correct OCR misspellings. The algorithm takes off by dividing the OCR output text B into a series of blocks b0, b1, b2…bn, each made out of five words. Subsequently, the blocks in B = {b0, b1, b2…bn} are sent sequentially, one by one, to Google's online web search engine as search query parameters. The search results returned by Google are then parsed to identify whether or not they contain a "did you mean: ci" expression, where ci is the spelling suggestion for block bi. If true, then the block bi must contain a misspelled word, and hence ci is extracted from "did you mean: ci" and appended to a text file C, which will eventually hold the entire set of correct blocks C = {c0, c1, c2…cn}. In contrast, if the search results did not contain any "did you mean" expression, then the block bi is said to contain no misspelled words, and thus the original bi is added intact to the text file C. Ultimately, when all blocks get validated, the OCR post-processing stage finishes and the algorithm halts. The text file C holding the complete corrected OCR text can now be safely handled appropriately, i.e. printed, saved, or edited using a word processor. Figure 5 is a flowchart summarizing the various computational steps of the proposed algorithm, executed to detect and correct misspellings in OCR text.

Figure 5: The executional steps of the proposed algorithm

5.4 The Pseudo-Code

The following pseudo-code is a high-level, platform-independent description of the entire logic behind the proposed algorithm.

// The purpose of this function is to correct the spelling errors in the OCR
// output text using Google's spelling suggestion feature.
// INPUT: OCR plain text received from the previous character recognition stage.
// OUTPUT: Corrected OCR text containing the least possible misspellings.
BEGIN
Function Post-Correction (ocr_text)
{
    // break the ocr_text into blocks of 5 words each
    B ← Tokenize(ocr_text, 5)

    for (i ← 0 to N)  // iterate until all blocks are exhausted
    {
        // send every B[i] to the Google search engine
        result ← GoogleSearch(B[i])

        if (result contains "did you mean")
        {
            // some misspellings exist in B[i]: parse the search results,
            // extract the suggestion, and append it to the output file C
            C ← C + ExtractCorrection(result)
        }
        else
        {
            // no misspellings were found, so add the original B[i] to C
            C ← C + B[i]
        }
    }
    RETURN C  // file C now holds the complete corrected OCR text
}
END
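The pseudo-code above can be sketched in Python as follows. Google exposes no public "did you mean" endpoint for this purpose, so the suggestion lookup is stubbed out with a hypothetical callable that stands in for the GoogleSearch and ExtractCorrection steps; a real deployment would replace it with an actual query-and-parse routine:

```python
def chunk(text, size=5):
    """Split the OCR text into consecutive blocks of `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def post_correction(ocr_text, suggest):
    """Replace each block with its spelling suggestion, if one is returned.

    `suggest` stands in for the "did you mean" lookup: it returns the
    corrected block, or None when the block needs no correction.
    """
    corrected = []
    for block in chunk(ocr_text):
        suggestion = suggest(block)
        corrected.append(suggestion if suggestion is not None else block)
    return " ".join(corrected)

# Toy stand-in for the suggestion service (hypothetical data).
fake_suggestions = {"Huw is your day I": "How is your day I"}
result = post_correction("Huw is your day I am fine",
                         lambda b: fake_suggestions.get(b))
# result == "How is your day I am fine"
```

As in the pseudo-code, blocks without a suggestion pass through unchanged, so the output text is simply the concatenation of validated blocks.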
complexity of the algorithm is described as follows: n Table II: ∑ 1 =n and thus the algorithm is of time complexity O(n) The results obtained after applying the proposed error correction i=0 algorithm Since the basic operation is to be executed n times irrespectively of the content of the input OCR text, If you‟ve just updated a Windows 95 system to Window 98, CBest(n)= C Worst(n)= CAverage(n)= n an important Twain driver file might have been replaced. Use the Microsoft windows 98 Version conflict Mage 6. EXPERIMENTS & RESULTS ,(VCM) to check for a changed file called TWAIN.DLL, and replace the one installed during Windows 98 upgrade In the experiments, OCR was performed on two low- with the one you were using (that worked!). quality image documents, each in a different language: With newer hut-swap technologies, such as USB and IEEE English [34] and Arabic [35]. Google.com was used to -1394, make sure that your system is ready for the scanner spell-check the English document, while Google.ae was by using the following; used to spell-check the Arabic document. Additionally, 1. Enable the USB port, or install an IEEE-1394 or a USB one of the most renowned proprietary OCR software card. if you need to install a USB card, I recommend a USB solutions, the OmniPage version 17 by Nuance 2.o - compatible card. Communications [36] with English and Arabic support 2. Use an operating system that supports the port type. Windows 95 Me 2000 XP are required for IEEE 1394, and was utilized to carry out the OCR process. The proposed USB devices work best with these versions of Windows, algorithm was implemented using MS C# 4.0 under the although late releases of Windows 95 do support some USB MS .NET Framework 4.0 and the MS Visual Studio 2010. devices. Figure 6 shows the original English document to be processed. The subsequent Table I delineates all misspellings (underlined) generated during the OCR The OCR text delineated in Table I comprehended 27 process. 
Table II outlines the same OCR text of Table I, error-corrected using the proposed OCR post-processing error correction algorithm.

The OCR text delineated in Table I contained 27 misspelled words out of 126 (the total number of words in the whole original text), making the error rate close to E = 27/126 = 0.214 = 21.4%. Several of these errors were proper names such as "Microsoft" and "IEEE", while others were technical words such as "USB" and "TWAIN". The remaining errors were regular English words such as "recommend", "compatible", "work", "might", etc. Table II shows the results of post-processing the OCR output text in Table I using the proposed error correction algorithm. 23 misspelled words out of 27 were corrected, leaving only 4 non-corrected errors: "Mmage" was falsely corrected as "Mage", "agoing" as "using", whereas "Hut" and "2.o" were not corrected at all. As a result, the error rate using the proposed algorithm dropped to E = 4/126 = 0.031 = 3.1%. Consequently, the improvement can be calculated as I = 0.214/0.031 = 6.90 = 690%, that is, an increase in the error correction rate by a factor of 6.9.

Figure 6: Low-quality English image document

The following is the Arabic document to be processed and tested. Figure 7 depicts the original document, while Table III delineates the OCR results along with the numerous misspelled words that were generated. Table IV shows the results of post-processing the OCR output text of Table III using the proposed error correction algorithm.
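The error-rate bookkeeping used in this section reduces to a few lines of arithmetic. One detail worth noting: the 6.90 factor follows from dividing the rounded rates (0.214/0.031), while dividing the raw error counts directly gives 27/4 = 6.75. A minimal sketch:

```python
def error_rate(errors, total_words):
    """Error rate E = misspelled words / total words, as defined above."""
    return errors / total_words

# English test document: 27 OCR errors before correction, 4 after.
e_before = error_rate(27, 126)   # ~0.214 -> 21.4%
e_after = error_rate(4, 126)     # ~0.032, reported as 0.031 in the text

# Improvement factor I = E_before / E_after.
improvement = round(0.214 / 0.031, 2)   # the rounded rates give 6.9
exact = 27 / 4                          # the raw counts give 6.75
```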
Figure 7: Low-quality Arabic image document

Table III: The results of performing OCR on the Arabic document

.ً‫ عهى اساس دسخىر فدران‬7171 ‫قايج ا َدونت االححادٌت انًركر ٌت انعاو‬
‫ فإذا كاٌ َصانق انعاو‬.‫وحطىّر االححاد انفدرانً األيٍركً حض ىرا كبٍرا‬
‫ وكاٌ نحرب‬.ٍٍ‫ فقد بهغ انٍىو خًسٍٍ والٌت ويقاطعخ‬،‫ والٌت‬71 ٍ‫ ي‬7171
‫ َىر سافر فً حىسٍع رقع ى‬،‫ ونُخائجها‬، 7781 ‫االَفصال انخً اَدنعج انعاو‬
‫ وٌبهغ انٍى عى عدد سكاٌ انىالٌاث انًخّحدة‬.‫االححاد نٍشًم والٌاث جدٌدة‬
2578750 ‫ يهٍىٌ َسًت يُخشرٌٍ عهى يسًاحاث جغرافٍت حبهغ‬051 ً‫حىان‬
‫كهى‬

Table IV: The results obtained after applying the proposed error correction algorithm

.ً‫ عهى اساس دسخىر فدران‬7171 ‫قايج اندونت االححادٌت انًركزٌت انعاو‬
7171 ‫ فإذا كاٌ َصف انعاو‬.‫وحطىّر االححاد انفدرانً األيٍركً حطىرا كبٍرا‬
‫ وكاٌ نحرب االَفصال‬.ٍٍ‫ فقد بهغ انٍىو خًسٍٍ والٌت ويقاطعخ‬،‫ والٌت‬71 ٍ‫ي‬
‫ َىر سافر فً حىسٍع رقعت االححاد‬،‫ ونُخائجها‬، 7781 ‫انخً اَدنعج انعاو‬
051 ً‫نٍشًم والٌاث جدٌدة وٌبهغ انٍىو عدد سكاٌ انىالٌاث انًخّحدة حىان‬
‫ كهى‬2578750 ‫يهٍىٌ َسًت يُخشرٌٍ عهى يساحاث جغرافٍت حبهغ‬

The OCR text delineated in Table III contained 8 misspelled words out of 64 (the total number of words in the whole original text), making the error rate close to E = 8/64 = 0.125 = 12.5%. The majority of these errors were regular Arabic words such as "اندونت", "انًركزٌت", "يساحاث", etc. Table IV shows the results of post-processing the OCR output text in Table III using the proposed error correction algorithm. 6 misspelled words out of 8 were corrected, leaving only 2 non-corrected errors: "َصانق" was falsely corrected as "َصف", and "َىر" was not corrected at all. As a result, the error rate using the proposed algorithm dropped to E = 2/64 = 0.031 = 3.1%. Consequently, the improvement can be calculated as I = 0.125/0.031 = 4.03 = 403%, that is, an increase in the error correction rate by a factor of 4.

7. EXPERIMENTS EVALUATION

The experiments conducted on the proposed OCR post-processing error correction algorithm clearly revealed an error detection and correction improvement of 690% for English text and 403% for Arabic text. In other words, 6.9 times more English errors and 4 times more Arabic errors were detected and corrected. On average, the proposed algorithm improved the error correction rate by I = (690% + 403%) / 2 = 546.5%, that is, increasing the overall error correction rate by a factor of about 5.5. Table V roughly sketches the head-to-head OCR experimental results between the proposed algorithm and the OmniPage software suite.

Table V: Head-to-head comparison between the proposed algorithm and the OmniPage suite

                                            English document    Arabic document
    Total words                             126                 64
    Errors using OmniPage 17                27                  8
    Errors using the proposed algorithm     4                   2
    OmniPage error rate                     21.4%               12.5%
    Proposed algorithm error rate           3.1%                3.1%
    Proposed algorithm improvement ratio    6.9 (690%)          4.03 (403%)

8. CONCLUSIONS

This paper presented a new post-processing technique for OCR error detection and correction based on Google's online spelling suggestion. Since Google is a giant warehouse of indexed real-world pages, articles, blogs, forums, and other sources of text, it can suggest common spellings for words not found in standard dictionaries. The proposed algorithm exploited Google's "did you mean" technology with the purpose of using query spelling suggestions to correct non-word and real-word errors in OCR output text. Experiments undertaken showed a sharp improvement in the OCR error correction rate, as a higher number of misspellings and linguistic errors were detected and corrected using the proposed method compared with other traditional existing ones.

9. FUTURE WORK

As further research, the proposed algorithm can be redesigned to support multiprocessing platforms so as to operate in a parallel fashion over a set of concurrent processors or even over a set of distributed computing machines. The expected result would be a faster algorithm of time complexity O(n/m), where n is the total number of word tokens to be spell-checked and m is the total number of processors.

ACKNOWLEDGMENTS

This research was funded by the Lebanese Association for Computational Sciences (LACSC), Beirut, Lebanon under the "Web-Scale OCR Research Project – WSORP2011".

REFERENCES

[1] The Boston Public Library, (January 19, 2010), http://www.bpl.org/general/about/bpl_an_overview_2010.pdf

[2] L. Vincent, Google Book Search: Document Understanding on a Massive Scale, Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR), pp. 819-823, (2007).

[3] S. T. Klein and M. Kopel, A Voting System for Automatic OCR Correction, Proceedings of the SIGIR 2002 Workshop on Information Retrieval and OCR, (2002).

[4] Mohamed Cheriet, Nawwaf Kharma, Cheng-Lin Liu, Ching Y. Suen, Character Recognition Systems: A Guide for Students and Practitioners, Wiley-Interscience Publication, (2007).

[5] Jurafsky D., Martin J., Speech and Language Processing, 2nd ed., Prentice Hall, (2008).

[6] Tong and Evans, A Statistical Approach to Automatic OCR Error Correction in Context, In Proceedings of the Fourth Workshop on Very Large Corpora, pp. 88-100, Copenhagen, Denmark, (1996).

[7] Roger T. Hartley, Kathleen Crumpton, Quality of OCR for Degraded Text Images, DL '99 Proceedings of the Fourth ACM Conference on Digital Libraries, New York, USA, (1999).

[8] David J. Ittner, David D. Lewis, David D. Ahn, Text Categorization of Low Quality Images, In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, (1995).

[9] K. Kukich, Techniques for Automatically Correcting Words in Text, ACM Computing Surveys, vol. 24, no. 4, pp. 377-439, (1992).

[10] Charles Franks, Distributed Proofreaders (DP), (2000), http://www.pgdp.net/c/

[11] Marie Lebert, Project Gutenberg (1971-2008), E-book, (2008).

[12] M. Bokser, Omnifont Technologies, Proceedings of the IEEE, vol. 80, no. 7, pp. 1066-1078, (1992).

[13] Levenshtein V.I., Binary Codes Capable of Correcting Deletions, Insertions, and Reversals, Cybernetics and Control Theory, 10(8), pp. 707-710, (1966).

[14] I. Guyon, F. Pereira, Design of a Linguistic Postprocessor Using Variable Memory Length Markov Models, in Proceedings of the 3rd International Conference on Document Analysis and Recognition, pp. 454-457, Montreal, Canada, (1995).

[15] Hisao Niwa, Kazuhiro Kayashima, Yasuham Shimeki, Postprocessing for Character Recognition Using Keyword Information, IAPR Workshop on Machine Vision Applications, Dec. 7-9, Tokyo, (1992).

[16] Lon-Mu Liu, Yair M. Babad, Wei Sun, Ki-Kan Chan, Adaptive Post-Processing of OCR Text via Knowledge Acquisition, Proceedings of the 19th Annual Conference on Computer Science, (1991).

[17] Kazem Taghva, Julie Borsack, Allen Condit, Results of Applying Probabilistic IR to OCR Text, Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (1994).

[18] Michael L. Wick, Michael G. Ross, Erik G. Learned-Miller, Context-Sensitive Error Correction: Using Topic Models to Improve OCR, In Proceedings of the 9th International Conference on Document Extraction and Analysis, pp. 1168-1172, ISSN: 1520-5363, (2007).

[19] K. Kise, T. Shiraishi, S. Takamatsu, K. Fukunaga, A Method of Post-Processing for Character Recognition Based on Syntactic and Semantic Analysis of Sentences, Journal of Systems and Computers, Japan, ISSN 0882-1666, vol. 27, no. 9, pp. 94-107, (1996).

[20] J. Hull, Incorporating Language Syntax in Visual Text Recognition with a Statistical Model, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(12), (1996).

[21] Wei Sun, Lon-Mu Liu, Weining Zhang, John Craig Comfort, Intelligent OCR Processing, Journal of the American Society for Information Science, vol. 43, issue 6, pp. 422-431, (1992).

[22] Eugene Borovikov, Ilya Zavorin, Mark Turner, A Filter Based Post-OCR Accuracy Boost System, Proceedings of the 1st ACM Workshop on Hardcopy Document Processing, Conference on Information and Knowledge Management, pp. 23-28, (2004).

[23] Evans D., Zhai C., Tong X., Milic-Frayling N., OCR Correction and Query Expansion for Retrieval on OCR Data, CLARIT TREC-5 Confusion Track Report, vol. 52, (1996).

[24] Atwell E., Elliott S., Dealing with Ill-formed English Text, The Computational Analysis of English: A Corpus-Based Approach, New York, Longman, (1987).

[25] Mays E., Damerau F.J., Mercer R.L., Context-Based Spelling Correction, Information Processing and Management, 27(5), pp. 517-522, (1991).

[26] Golding R.A., A Bayesian Hybrid Method for Context-Sensitive Spelling Correction, Proceedings of the Third Workshop on Very Large Corpora, Cambridge, MA, pp. 39-53, (1995).

[27] Golding R.A., Schabes Y., Combining Trigram-Based and Feature-Based Methods for Context-Sensitive Spelling Correction, 34th Annual Meeting of the Association for Computational Linguistics, pp. 71-78, (1996).

[28] The Oxford Dictionary, http://www.oxforddictionaries.com

[29] Kees Versteegh, The Arabic Language, New York, Columbia University Press, (1997).

[30] Google Online Web Search Engine, http://www.google.com

[31] Google Spell Checker Guide, (2007), http://www.googleguide.com/spelling_corrections.html

[32] Thorsten Brants, Alex Franz, Web 1T 5-gram Version 1, Linguistic Data Consortium, Philadelphia, PA, (2006).

[33] Markov A.A., Essai d'une Recherche Statistique Sur le Texte du Roman "Eugène Oneguine", Bull. Acad. Imper. Sci. St. Petersburg, 7, (1913).

[34] Scott Mueller, Upgrading and Repairing PCs, 16th ed., Que Publishers, (2004).

[35] Fouad Bitar, ‫ انقاَىٌ ان ّدسخىري وانًؤسساث انسٍاسٍت‬, Paulist Press, (1996).

[36] OmniPage OCR Solution, http://www.nuance.com/for-business/by-product/omnipage/index.htm
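As a rough illustration of the parallel redesign proposed in Section 9, note that the per-block lookups are mutually independent, so they can be fanned out over m workers with a standard thread pool, pushing wall-clock cost toward O(n/m). Here `search` is the same hypothetical suggestion stub used earlier, not an actual Google client:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_post_correction(blocks, search, workers=4):
    """Run the per-block spelling lookups concurrently.

    search -- hypothetical stub returning a corrected block, or None
              when no suggestion exists
    workers -- the m in the O(n/m) estimate of Section 9
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the corrected text reads correctly
        results = pool.map(lambda b: search(b) or b, blocks)
        return " ".join(results)
```

Threads suffice here because each lookup is I/O-bound (a network request); a distributed version would replace the pool with a cluster of machines, as the future work suggests.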