(PDF) Synchronous Multi-Stream Hidden Ma

Author’s Accepted Manuscript Synchronous Multi-Stream Hidden Markov Model for Offline Arabic Handwriting Recognition without Explicit Segmentation Khaoula Jayech, Mohamed Ali Mahjoub, Najoua Essoukri Ben Amara www.elsevier.com/locate/neucom PII: DOI: Reference: S0925-2312(16)30763-9 http://dx.doi.org/10.1016/j.neucom.2016.07.020 NEUCOM17375 To appear in: Neurocomputing Received date: 30 August 2015 Revised date: 30 April 2016 Accepted date: 11 July 2016 Cite this article as: Khaoula Jayech, Mohamed Ali Mahjoub and Najoua Essoukri Ben Amara, Synchronous Multi-Stream Hidden Markov Model for Offline Arabic Handwriting Recognition without Explicit Segmentation, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2016.07.020 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. Synchronous Multi-Stream Hidden Markov Model for Offline Arabic Handwriting Recognition without Explicit Segmentation Khaoula Jayech*, Mohamed Ali Mahjoub, Najoua Essoukri Ben Amara SAGE Research Unit, National Engineering School of Sousse, University of Sousse, Tunisia *

[email protected]

Abstract Arabic handwriting recognition is still a challenging task due especially to the unlimited variation in human handwriting, the large variety of Arabic character shapes, the presence of ligature between characters and overlapping of the components. In this paper, we propose an offline Arabic-handwritten recognition system for Tunisian city names. A review of the literature shows that the Hidden Markov Model (HMM) adopting the sliding window approach are the mainly used models, which gives good results when a relevant feature-extraction process is performed. However, these models are utilized especially to model one dimensional signal. Consequently, to model bidimensional signals or multiple features, a solution based on combining multi-classifiers and then a post-treatment selecting the best hypothesis is applied. The problem considered in this case consists in searching the best way to combine the contribution of these classifiers. In this study, we put forward an extension of the HMM, which can surmount this problem. Our proposed system is based on a synchronous multi-stream HMM which has the advantage of efficiently modelling the interaction between multiple features. These features are composed of a combination of statistical and structural ones, which are extracted over the columns and rows using a sliding window approach. In fact, two word models are implemented based on the holistic and analytical approaches without any explicit segmentation. In the first approach, all the classes share the same architecture nevertheless, the parameters are different. In the second approach, each class has its own model by concatenating their components models. The results carried out on the IFN/ENIT database show that the analytical approach performs better than the holistic one and that the data fusion model is more efficient than the state fusion model. Keywords: Multi-Stream Hidden Markov Model, Implicit segmentation, Arabic Handwriting Recognition, Dynamic Bayesian Network. 1. Introduction In the last years, a great interest has been devoted to offline Arabic handwriting recognition, which has become a very popular research topic. It has been applied in many fields, including education, finance, government agencies, healthcare, legal industry and banking by form editing and processing, automatic sorting of postal mail, bills processing, passport validation, check processing and, more recently, text to speech applications, helping blinds to read and recognizing handwritten historical documents to facilitate their indexation and storage [1, 2, 3, 4]. Generally, Arabic handwriting recognition consists in developing systems that are able to convert human writing into text. According to the literature, there have been several approaches which deal with printed text lines, words and isolated characters. To recognize these latter, different structural, statistical and stochastic approaches have shown their effectiveness [5]. However, handwritten word recognition is still a challenging task. Thus, to 1 surmount this problem, two main approaches have been suggested in the literature: the holistic and the analytical approaches. In the holistic approach, text components are recognized as a whole without segmenting the text into smaller units. This approach is easy to implement, however, it is sensitive to noise, small variations of the text, and therefore, it can be operational for small lexicons. Whereas, when dealing with a large lexicon, the analytical approach is, generally, more efficient. In fact, the text components are further segmented into smaller units known as graphemes and components are represented in the form of structure of these units. This approach carries certain advantages over holistic one. First, it is less sensitive to noise and small variations of the grapheme. Second, to overcome the insufficient training data problem, component (sub-word, character, sub-character, stroke, etc) model training can be applied and the class models are performed by combining their component-class models. Third, the vocabulary can be easily enlarged and new class models can be efficiently added by a components-class model concatenation, so, it has ability to accommodate large number of classes. However, when dealing with Arabic handwriting, the segmentation process is very challenging and difficult task for many reasons related to the morphological complexity of the Arabic script which is summarized by the following [1, 6]:  The variation in human handwriting styles is unlimited. Therefore, the characters written by different people representing the same character are not identical but can vary in both size and shape (Fig.1.a). 2  Multiple characters can be combined vertically to form a ligature (Fig.1.b).  Some Arabic characters have diacritic marks (a diacritic may be placed above or below the body of the character). These diacritics can be overlapped (Fig.1.c).  An Arabic word usually consists of one or more connected components (sub-words), and each one contains one or more characters that can be overlapped with other characters or diacritics (Fig.1.d).  Each character can take up to four different forms (beginning, middle, end, isolated) depending on its position in the (sub-) word (Fig.1.e). Fig. 1. Complexities in Arabic handwriting recognition A recent survey on Arabic handwritten text recognition was presented in [1, 2, 3]. The challenging nature of Arabic script recognition has impressed the attention of researchers from both industry and academic circles but these efforts have not achieved good results until now. In fact, many artificial intelligence techniques such as the neural network [5], the Support Vector Machine (SVM) [4] and the recurrent neural network [6] have been proposed in the literature. However, these techniques have pros and cons. In fact, Laurrence et al. [4] proved that the SVM classifier are less robust to degradation than the HMM classifier in case of highly broken characters. Thus, to model and absorb the distortion and the high variability of handwriting, a lot of works have been developed, based especially on HMMs that have become an effective paradigm for modelling stochastic processes and pattern sequences [7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. The standard HMM is a statistical model in which the modelled system is supposed to be a Markov process with unknown parameters, and the problem is to determine the hidden parameters from the observable ones [9]. It is an efficient tool for 3 modelling and classifying one-dimensional signals. However, they can be inappropriate in the following cases: modelling a lot of information streams coming from different sources [17, 18], modelling the interdependency that can exist between variables, reducing and simplifying the system complexity by extracting the repeated components, etc. In order to surmount these limits and to effectively handle the multiple signals and information streams, many approaches based on multi-stream models that focus on taking benefit of complementary information have been developed. These streams can be multi-model, such as audio and visual information in speech recognition or simply a different set of features extracted from the same pattern in a different condition. The main problem is to find the best way to fuse multiple features in multi-stream modelling. This fusion of features can be made in different ways [15, 16] but the most cited strategies in the literature are: feature fusion or direct identification (known as early integration), decision fusion or separate identification (known as late integration), and model fusion (early/intermediate integration). Fig. 2. Fusion levels In the feature fusion (see Fig.2), feature vectors can be concatenated and the resulting features are used by a conventional HMM [17]. This approach has the disadvantage of treating equally-important heterogeneous features. In addition, it is not able to model timing synchronicity between different features [18]. Yet, in the decision fusion (Fig.2), different HMMs are trained using multiple features and the outputs of each classifier are merged to 4 decide the final classification [13, 14]. This approach does not handle the correlation between features and enable modelling the synchronicity between the streams. Furthermore, it needs a lot of calculation because it implies two decision layers [18]. Nevertheless, in the model fusion (Fig.2), an extension of HMM is built up in order to represent and model the correlation between features and the loose synchronicity between streams. In this context, several extensions of the HMM have been put forward, like the coupled HMM, the multistream HMM [4], the hierarchical HMM [19, 20], the factorial HMM, the planar HMM [21], etc [22] (Fig.3). However, the reader is invited to refer to [22] in order to know more details about these extensions, including their types, history, motivations for use, characteristics, and applications. These extensions are considered as a subclass of Bayesian Networks (BNs) known as Dynamic Bayesian Networks (DBNs), which are a class of temporal graphical probabilistic models that have become a standard tool for modelling various stochastic time varying phenomena. The advantages of DBNs include the ability to represent stochastic model, to incorporate prior knowledge, and to handle hidden variables and missing data in a principled way. In fact, Mahjoub et al. [23] have applied the coupled HMM models for Arabic offline handwriting recognition and also Laurrence et al. [4] have used these models for degraded character recognition and showed that state-coupled and auto-regressive coupled HMMs are more robust to degradation than the SVM classifier in case of highly broken characters. Hierarchical, planar, factorial and coupled HMM structures assign a state sequence to each stream and allow asynchrony between sequences. A Multi-Stream HMM (MSHMM) is an extension of an HMM-based structure that can deal with many feature streams for temporal data [24, 25]. Two types can be distinguished: asynchronous MSHMMs presented by Kessentini et al. in [16] for Arabic handwriting recognition which computes low level feature streams and combine them at a character level and synchronous MSHMM which is widely used in gesture recognition and, as far as we know, these models have not been used in handwriting recognition. The MSHMMs were applied when the features (streams) are synchronous and independent. 5 Fig. 3. Extension of HMMs: (a) Planar [21] (b) State coupled [4] (c) Asynchronous multi-stream [16] (d) Synchronous multi-stream (e) Coupled [23] (f) Hierarchical [20], HMM 1.1 Motivation and contributions In this work, we propose a new approach based on the synchronous MSHMM to recognize offline Arabic handwriting which is considered until now a very challenging 1 and active research topic. This model provides many advantages due to its properties and characteristics. First, the multi-stream paradigm gives an interesting and efficient framework for the integration of multiple sources of information. Thus, the MSHMM can learn from multiple features in order to achieve a powerful discrimination. Second, the topology of the MSHMM may be adapted to each source of observation and so, may handle multiple modalities for temporal data interaction between two or more streams of continuous or/and discrete observations in a synchronous or asynchronous manner. Third, when there are many observation symbols, it significantly minimizes the number of parameters. As a result, a reduction in the number of samples is required to learn the model. Finally, it opens the door for selecting relevant features. The present paper is an extension and an improvement of a previous work [28] based on a data fusion MSHMM. The main contributions of this work can be summarized as follows: (i) Two synchronous multi-stream HMM; data and state fusion models have been proposed and applied for the first time to Arabic handwriting recognition. (ii) To avoid the difficulty of Arabic handwriting segmentation, a sliding window approach has been used to extract the feature. (iii) Also, the holistic and analytical approaches have been developed based on two MSHMMs for Arabic handwriting recognition. 1 Many competitions have been organized in the international conference ICDAR and ICFHR in order to encourage the researchers and so make a great advance in this field. 6 The rest of the paper is organized as follows. In the second section, the main recent related works for offline Arabic handwriting recognition is reviewed. In the third section, the definition of the dynamic Bayesian network is recalled and the formalism of the MSHMM is detailed: modelling, training and recognition steps. In the fourth section, the diagram of the proposed recognition system is described. In the fifth section, the realized experimentations to adjust the model to data and discuss the obtained results are presented. Finally, a conclusion and some perspectives for future works are given. 2. Related works In this section, a brief state of the art is exposed concerning the recent works on multiscripter Arabic off-line handwriting recognition using specially the IFN/ENIT database. Ramy El Hajj et al. in [14] developed an offline Arabic handwritten recognition system based on the combination of three HMM classifiers. The baseline-independent features and baselinedependent ones were extracted using a sliding window in three directions (-a, 0, +a) in order to cope with the problem of writing inclination, overlapping ascenders and descenders, and shifted position of some diacritical marks. Then, for each direction, an HMM classifier was put forward to classify the extracted features. After that, a combination of the results given by the three homogenous HMMs was done at the decision level. The results showed that this combination gave better results than a single HMM, and the neural network-based combination was superior. The obtained results given by the proposed model was higher than 90 percent accuracy. To efficiently model multiple features using a single classifier, a coupled model based on the HMMs, that were considered as a special case of the DBNs, was introduced for the first time by Likeforman et al. in [4] to model and recognize degraded characters; and then it was enhanced by Mahjoub et al. in [23] to recognize Tunisian city names extracted from the IFN/ENIT database. The coupled HMMs efficiently modelled the interaction between the two processes by jointly considering the two information streams. Indeed, the hidden state of each process at the time t depended on the hidden states of two processes at the time t-1. This model was tested on a corpus of IFN/ENIT database and recorded good results. To reduce the complexity of the recognition process by allowing the partial recognition, Jayech et al. in [20] developed a dynamic hierarchical Bayesian network for Arabic handwriting recognition. In fact, recognizing characters and sub-characters was easier than word recognition. Hence, after the preprocessing step, an explicit segmentation based on the smoothed vertical histogram projection and integrated into the recognition process was 7 suggested. Afterwards, a set of statistical features were extracted for each character/subcharacter using the invariant moment, such as Hu and Zernike, which are invariant to rotation, translation and scaling. After that, the characters were modelled and recognized by static Bayesian networks. Basically, the sub-characters were estimated at the lowest level of the Bayesian network, and the characters were estimated at the highest level of the Bayesian network. The overall Arabic words were processed by a dynamic hierarchical Bayesian network. This developed approach was tested using the IFN / ENIT database, where the experiment results were very promising. Structural techniques that were largely unexplored in Arabic handwriting recognition were put forward by Parvez et al. in [27]. Thus, based on the nature of Arabic writing and using a polygonal approximation algorithm, a text line segmentation procedure into words and sub/words was developed. Fuzzy polygons and a novel fuzzy polygon matching algorithm were later applied to model and recognize the Arabic word. To select the best hypotheses of a sequence of recognized characters, a dynamic programming was utilized. The obtained recognition rates were comparable to multi-classifier implementations. A system based on Bernoulli HMMs was presented by Giménez et al. in [11]. Indeed, to surmount the feature extraction process and ensure that no discriminative information was filtered, the columns of raw were directly fed into the embedded Bernoulli HMMs. This one was defined as an embedded HMM in which the emission probabilities were modelled with the Bernoulli mixtures. The column bit vectors were extracted by applying a sliding window of an adequate width in a horizontal and vertical direction. This system achieved good results on the IFN/ENIT database of Arabic handwritten words, in the order of 90 percent accuracy. In the same context, we proposed in [28] a synchronous Multi-Stream Hidden Markov Model (MSHMM) for offline Arabic handwriting word recognition. Compared to the model presented by Kessentini et al. in [16], our proposed model has the advantage of efficiently modelling the temporal interaction between multiple features with a synchronous manner. Actually, the two extracted feature streams are combined at each t time (Fig. 3.d). These features are composed of a combination of statistical and structural ones, which are extracted over the columns and rows resulting from the vertical and horizontal uniform segmentation. Then, the class models are implemented based on the holistic and analytical approaches. In the first approach, all the class models share the same architecture but the parameters are different. Nevertheless, in the second approach, each class model has its own architecture by concatenating its component-class models. The suggested model has been tested on the IFN/ENIT database and has recorded a recognition rate about 82.91%. 8 From this brief survey, it is noticeable that the main proposed approaches are based on HMMs adopting the implicit segmentation by using the technique of sliding window to avoid the problem of Arabic handwriting segmentation as explained in the introduction section. Actually, to model and recognize Arabic handwriting, two main solutions are presented. The first one searches to combine the results given by multi-classifiers [14, 12]. However, the problem considered in this case is based on finding the best way to combine the contribution of these classifiers. To overcome this problem, the second solution has been suggested and it consists in using an extension of the HMMs such as the asynchronous multi-stream HMMs [16], the coupled HMMs [4, 23], and the hierarchical HMMs [20]. In this current study, a data and state fusion models have been proposed based on a synchronous multi-stream HMMs which are all special cases of a general class of DBNs models. The success of DBNs is due to its efficient and rapid inference and learning techniques. In fact, to build a DBN, the structure learning problems and parameters must be resolved. Moreover, structure learning which try to learn the structure of the model, is often a very difficult task. Consequently, to surmount this step, we fix the structure of DBN by adopting the MSHMM. Yet, to the best of our knowledge, this is the first work that treats this type of model for offline Arabic handwriting recognition. In the next section, we first recall briefly the theoretical foundations of DBN and, then, we detail the modelling, training and classification with MSHMM. 3. Synchronous MSHMM modelling, training and classification The static BNs are probabilistic graphical models which combine the graphic and probabilistic theories [29, 30]. Given a set of random variables, O=( a BN is defined by the pair B={G, }, where: G={O, E} is an oriented acyclic graph composed of a set of nodes denoted by O and a set of arcs denoted by E, represents a set of parameters encoding the conditional probabilities of each node variable given its parents; in the case of discrete variables, these parameters represent a Conditional Probabilistic Table (CPT), while, when a node represents a continuous variable, these parameters represent a Conditional Probabilistic Distribution (CPD) [31, 32, 30]. The joint probability distribution associated to the variables corresponding to the graph node follows the generalized Bayes theory given by Eq. (1) : ( Where ( ∏ ( | ( is the parent of (1) in the graph G. The BNs are applied in many fields such as online isolated character recognition and writer's writer identification [4]. 9 The DBNs are a generalization of the BNs to model the temporal processes occurring at the discrete times t They concatenate a set of the BNs in which a causal link of a time step is added to the other. Each one contains a set of random variables that constitute both hidden and observation nodes of the process. Considering Fig.4, for each observation sequence, the BNs is repeated as many times as necessary. Fig.4-b shows an example of unrolled DBNs for an observation sequence of length T=3: the initial network is repeated T times. The model parameters are introduced by the CPTs and the CPDs. The three CPTs are the distribution of the initial state given by ( state ( | , and the distribution of the transition state ( CPDs are given by [22]. , the distribution of the conditional ( | | . Whereas, the two ) with i=1, 2. The most simple case of the DBNs is the HMM Fig. 4. (a)DBNs represented by two first slices called 2TBN (T=2); (b) DBN unrolled and represented for three slices of time T=3. In this work, multiple streams are merged into single-DBN. This merging is performed through various DBN architectures (graphical representations), which combine two basic HMMs: the vertical HMM whose inputs are the features of the image columns and the horizontal HMM whose inputs are the image rows. In the proposed modelling, the interaction between the two observations is carried out in two models. In the first model, at each time step t, a hidden state can generate two observations. This model, which can be viewed as synchronous MSHMMs, has given good results in [33] by Arriaga for gesture recognition and in other domains. For the second model, each stream has been separately modelled by an HMM, where both HMMs are interacted by another hidden layer. The parameters, the training and the recognition algorithms of the proposed MSHMMs are detailed in the next section. 10 3.1. Description of MSHMM } } and OH={ Let OV={ be two sequences of feature vectors representing features of image columns and image rows of Tunisian city name image to be recognized. The first model is composed of three processes: One is hidden and the others are observed. In fact, at each time t, the hidden state can generate two symbols and , where 1 . The model has a probability function given by Eq. (2): ∏ (OV, OH| )=∑ ∏ ( where the sum is over all possible state sequences state), { (final state), MSHMM. }, ( (2) such that (start or initial with T being the number of regular states of the Thus, for any regular state, the parameters of the first data fusion MSHMM are ( ) where denotes the initial probabilities at j, probabilities from i to j, and denotes the transition denotes the observation probabilities function at j. These parameters are given by Eq.(3): ( ( { ( { ( ( ( ∑ ( | ) | | ) ) ) ∑ } (3) ∑ ( ∑ ( ( To obtain the architecture of the second model, the two processes of the vertical and horizontal observations have been separated using an intermediate interface layer. According to [22], this model is a schematic for two “processes” (represented by single nodes), interact only with each other via an interface layer | { { { { . The parameters of the model are ( ( ( ( ( ( ∑ ( ( ( | | ( ( | ∑ | | ∑ ) ( ( ( ∑ and , both of which . In addition, ( | and ) are given by Eq.(4): (4) ( character "‫ " ﺷ‬is shown with a binary image. [ 11 ] [ ] ∑ ( ∑ ( In Fig.5, an MSHMM example for Fig. 5. Horizontal and vertical observation sequences obtained by scanning the character "‫ "ﺷ‬from right to left and from top to bottom with the structure of developed data and state fusion MSHMM. Each component-class (character or sub-character) model has a right-left topology, five hidden states and two observations generated from each hidden state. The two observations are assigned to model the vertical and horizontal frames. This frame-by-frame emission of feature vectors, in horizontal and vertical directions, aims to better model horizontal and vertical distortions at the character level and model the interaction between the two observation sequences in a synchronous manner. 3.2. Modelling 12 The proposed handwriting recognition system is built and trained at two levels. Firstly, a holistic approach has been used to extract the features from the class model representing Tunisian city name and train the whole class model (Fig.6). Fig. 6. The holistic MSHMM model for the city name "‫"قرقنة‬ Secondly, the analytical approach has been retained at the component-class level, and the class is recognized by the concatenation of its component-class models (Fig.7). 13 Fig. 7. The analytical MSHMM model for the city name "‫"قرقنة‬ The first approach consists in creating a different MSHMM for each class from the samples labeled by the class identity. Indeed, the models of all classes share the same MSHMM structure (empirically fixed), but their parameters are different and change from one class to another. For such approach, the MSHMM may suffer from poor discriminative abilities when they are a poorly estimated class model, e.g. due to insufficient training samples or insufficient complexity. In the IFN/ENIT database some classes are well represented through a few hundred of samples; whereas, other are only represented with three samples. Therefore, to surmount this problem, the second approach has occurred. It consists in following the analytical training model by performing component-class model training. Hence, the MSHMM at the class level is built from a shared embedded MSHMM at the component-class level. In order to model Arabic characters/sub-characters, 160 MSHMM components models are built up related to 28 Arabic basic characters. Character models are built in way that takes into account the different shapes of characters according to their position in the words (beginning, middle, end and isolated). Moreover, the characters with additional marks (Hamza, chedda, etc) and ligatures are modelled and added. Each entry of the lexicon is modelled by concatenating the appropriate character models. Considering Fig.7, an embedded MSHMM for the word " ‫ " قرقنة‬is shown, which is the result of concatenating the MSHMM of the character "‫"ق‬, character "‫"ﺭ‬, character "‫"ق‬, character "‫ "ﻨ‬and character "‫ "ة‬in that order. Note that the MSHMM for the different characters share the same topology and characteristics, which are empirically determined and detailed in section 5.2. 14 3.3. MSHMM Training Training the MSHMM consists in estimating its parameters, i.e. the state transition probability, the emission probability, the optimal number of the codebook, and the optimal number of the hidden states. It is performed independently, model by model, by appplying the Expectation/Maximization (EM) algorithm which is an iterative approach of maximum likelihood estimators. The EM algorithm is used in a straightforward manner to train the set of parameters in two ways: either by a whole model training or an analytical model training, using the embedded training. This approach of training has two important advantages: The first one is that the characters are modelled when being part of a word without any explicit segmentation, which is very hard in the case of an Arabic script. The second one is that this approach performs a statistical training with less training samples than in the case of wordbased models (through a cross training from images and their transcription), as shown in Fig.8. Fig. 8. Cross training of Arabic character through different word images using the data fusion model 3.4. Recognition process After feature extraction, we have proposed to process the classification of candidate Tunisian city name images (class) in two steps: Pre-classification step: After computing the features, a decision tree based on a discrete feature is used to reduce the number of potential classes that the candidate city name image may belong to. This preclassification is very important and used in order to accelerate the recognition process by reducing the search space. Classification step: It consists in computing the probability that a candidate city name image belongs to a class by computing the likelihood of each model. The task may be formulated as the one of searching the class model (Eq.(5)), which can reach the maximum posterior probability given a sequence of the observation O=(OV, OH): 15 ( | (5) where is the set of all possible word hypotheses. This can be computed using the Bayes theorem defined by Eq.(6): ( | where ( ( ( (6) is independent from the model , so it can be ignored from calculating a priori probability ( , and the is assumed to be equal for all the possible class hypotheses. Thus, the recognition process consists in determining the model (Fig.9.) that maximizes the likelihood ( | , i.e. maximizing the parametres in (1). This can be done using an exact inference algorithm based on the junction tree algorithm. Fig. 9. Recognition process 4. Word recognition system 4.1. System overview Arabic handwritten recognition systems typically involve five steps: preprocessing, segmentation, feature extraction, vector quantization in which the patterns are represented by a set of features, and classification in which decision rules for separating pattern classes are defined. The architecture of the developed system is illustrated by Fig.10. 16 Fig. 10. System architecture showing the main stages which must be carried out to identify the word image. As shown in Fig.10, the input image goes through the pre-processing step to normalize the word image. Afterwards, an implicit segmentation is used to divide the word images into horizontal and vertical window strips, and the segmentation decision is confirmed in the recognition step giving high recognition accuracy. After that, a combination of structural and statistical features are extracted, quantized and fed to the MSHMMs to classify the word image. These steps are detailed in the following sections. Furthermore, in order to reduce the possible word candidate and accelerate the recognition process, we have minimized the space of search by computing discrete features. Hence, the lexicon is tree structured into a sublexicon based on the number of pseudo-words in each word. Each sub-lexicon is composed of all words that contain the same number of pseudo-words. 4.2. Preprocessing In order to reduce the difficulties related to the recognition of an Arabic handwritten word provided by the IFN/ENIT database and to extract its different features, the following preprocessing have been applied: (i) thinning: which consists in normalizing the line thickness to one pixel in order to reduce the number of pixels to the minimum necessary for subsequent operations; (ii) removing the vertical and horizontal spaces and removing the horizontal elongation based on the vertical and horizontal histogram projection to reduce the writers' writing variability, thus, minimizing the within-class pattern variability while increasing the 17 between-class discrimination; (ii) baseline estimation, which is used later to extract the topological features, estimated as follows. Baseline estimation: The estimation process of the baseline is proven to be very useful and has a critical importance in extracting accurate information such as the position of possible ascenders and descenders of a letter, the writing directions, the differentiation between the diacritic dots according to their position from the baseline (above or under), and the selection of a minima to be a border point [11]. To estimate the baseline, most proposed handwritten recognition systems are based on the horizontal projection histogram. The projection method is robust and easy to implement [34], and in the current study, some feature extraction is based on it. 4.3. Implicit segmentation using a sliding window To surmount the problem of segmenting an Arabic handwritten word into characters, which is very difficult, a lot of approaches have used the sliding window/frame technique [35, 28, 4, 23, 14, 13, 12, 11]. In this study, this technique has been retained. Therefore, in this step, the feature vectors for each word image are performed by scanning the word image in two directions to obtain the characteristics over the lines and columns. Hence, a right-to-left sliding window has been applied, having the same height of the word image and a fixed width (determined empirically) to compute the characteristics over the columns. Similarly, a top-tobottom sliding window has been utilized, having the same width of the word image and a fixed height (determined empirically) to obtain the characteristics over the lines. Consequently, the word image is divided into a fixed number of non overlapping uniform horizontal and vertical frame/window strips. Finally, the feature vector sequences are calculated for each window strip. 4.4. Feature extraction and vector quantization The feature extraction goal is to give an efficient representation of the word image by a set of characteristics. These characteristics can be grouped into three categories: (i) high-level ones, which are retained from the whole word image, (ii) medium-level ones, obtained from the characters, (iii) and low-level ones, extracted from sub-characters. Also, as given by [1, 2, 6], the features can be arranged into two different types: structural features describing the topological and geometrical characteristics of a pattern, such as strokes, end points, junction points, inflexion points, cusp points, intersections of line segments, loops, dots and their position related to the baseline, etc, and statistical features which are numerical measures computed over the whole word images or regions of images and obtained from the statistical distribution of pixels, such as zoning, invariant moments, pixel densities, histogram of chain 18 code direction, Fourier descriptor, etc. These features can highlight global or local properties of the pattern according to their parameterization. So as to overlay the intra-class and interclass pattern variability, we have extracted two kinds of features which show their effectiveness in the literature related to the OCR in order to describe the characters/subcharacters. Statistical features: Hu invariant moment: The features from f1 to f7 are computed from each window strip based on the seven invariant moments proposed by Hu. These features are statistical measures of the pixel distribution around the center of gravity of the pattern and give global pattern shape information. Furthermore, these moments are broadly known to be invariant to rotation, translation and scaling of the pattern. They are computed using equations as in [35] and normalized using Eq.(7): Chain code histogram: ( (7) The features from f8 to f15 are computed from the chain-code histogram (Fig. 11) as in the work of [36]. We have generated from the chain-code sequence representation of each window strip an histogram that represents the results of counting the occurrence N(i) for each possible digit ( { (∑ } in the chain code using Eq.(8): (8) Then each possible value in the histogram I(i) will be normalized, which will represent the intensity of a direction. For example, (‫ (ا‬has a very high intensity for I(6). Fig. 11. Chain code histogram generated from a sub-character Topological features: 19 Eight additional topological features from f16 to f23, based on pixel density, have been computed from each window strip. To extract these features, a zoning has been applied, splitting each vertical window strip into two parts above and below the baseline. In addition, each of the upper and lower zones is divided into four zones. The same is for the horizontal window strips, but dividing them in eight equal zones. The feature vector represents the count of black pixels in each of the eight zones, normalized by the respective zone area. Structural features: Many researches on handwriting recognition have proven that the combination of statistical and structural features remarkably improves the recognition system performance. Therefore, four structural features, f24 to f27, have been computed from the thinned image as follows (Fig. 12): Feature point: It represents the black pixels in the word skeleton having a neighbour number different from 0 and 2. Two types are distinguished: (i) End points: They represent the beginning/ending of a line segment. The endpoints correspond to the points in the skeleton with only one neighbour and they indicate the ends of strokes. (ii) Junction points: They correspond to the connection of two or more strokes and are simply found in the skeleton as the points with more than two neighbours. They are split into cross and branch points. Inflexion points: They represent the curvature sign change in the word skeleton. Loop: It represents the skeleton's inner contours with the information reflecting their partial or complete existence inside the window strip. Fig. 12. Structural feature extraction Vector quantization: In this study, the discrete MSHMMs have been used, which require discrete values. Thus, a discrete symbol has to be generated for each continuous feature vector describing a window strip. This generation is done by a Vector Quantization (VQ) process. This method consists 20 in mapping vectors from a vector space to a finite number of regions in that space. These regions or clusters are represented by their central vector or centroids. The set of centroids describing the whole vector space is known as a codebook. Our VQ technique implements the well-known LBG algorithm [37], which is a variant of the K-means algorithm and which has the advantage of being simple and not requiring excessive computation. The choice of the codebook is made empirically and is fixed to 64, which yields to a high recognition accuracy. 5. Experiments, results and analysis This section describes the IFN/ENIT database and presents the experiments achieved and the results obtained with the corresponding analysis and discussion. All the experiments were done using the BayesNet Toolbox for Matlab. 5.1. IFN/ENIT database description The proposed models were tested on the IFN/ENIT database. It contains 946 Tunisian town/village names. It is a widely used database to compare Arabic handwriting recognition systems. It is composed of three versions, as it is described in Table 1. Each image was scanned with 300 dpi and converted to a binary image. In addition, for each binary image corresponding to the name of the city, there is a text file containing the following information:(i) the text included in the image, (ii) the postcode, (iii) the sequence and format of characters, (iv) the location of baselines, (v) the quality of the baseline, (vi) the number of words, (vii) the number of PAWs, (viii) the number of characters, (ix) and the quality of writing. Table 1. IFN/ENIT database description IFN/ENIT Version v1.0 p1 V2.0 p1e v 1.0 p3 Data set a, b, c, d e f, s Number of images 26,459 6,033 10,244 Number of writers 411 87(added) unknown Number of pseudo-words 115,000 unknown unknown Number of characters 212,000 unknown unknown 5.2. Training and recognition The developed model training is based on the EM algorithm and the cross validation process. The main focus of the cross validation is to search and optimize the parameters of the developed models during the training process. For training models, we used training database composed of (a, b, c, d) sub sets from IFN/ENIT. Then, we conducted cross validation experiments in the following way: the sub sets was split into F=4 sets. Each model was 21 trained on F−1 sets and tested on the remaining set, F times. Within each model, the parameters yielding the best cross-validation recognition performance were selected for testing. Initially, for both the holistic and the analytical approaches, the training process was realized with different values of parameters until the best set of parameters has been determined. After the preprocessing phase, several experiments were performed to assess the effect of the following parameters on the recognition performance of the system. We tried different values of the uniform segmentation by dividing the image into a different number of window strips/frames, as well as a different number of codebook sizes, denoted by O (8, 16, 32, 64, 128), and a different number of hidden states, denoted by Q (6, 12, 18, 24, 30). Fig.13 shows the mean results of these tests as a function of the number of frames, codebook size and the number of hidden state obtained by the holistic and the analytical approaches for both models (data and state fusion models). Each recognition rate estimate was obtained by a cross validation with the first four standard folds (abcd). They were performed by choosing each time three datasets for training and one dataset for testing. It is noted that increasing the number of frames improves the results until the system reaches its saturation (Fig.13). Precisely, for the data fusion model, the best results recorded for the holistic approach is 83.4 where the Number of Frames (NF) is 13, and for the analytical approach is 91.1 where the Number of Frames per Character/sub-Character (NFC) is 5. Whereas, for the state fusion model, the best results recorded for the holistic approach is 75.8 with NF=11 and for the analytical approach is 87.4 with NFC=4. Based on these settings, a second series of experiments were conducted on the training and validation data, in which different values for the number of codebooks and the number of hidden states were tested. Therefore, as it is shown in Fig. 13.c, we tried a different number of codebooks, varying from 8 to 128, and the recognition rate recorded by both models is summarized in this figure. It can be noticed that the high recognition rates for both the holistic and analytical approaches are reached when the codebook size becomes 64. As a result, an optimal codebook is fixed to 64 to find a compromise between the high recognition rate and the most reasonable time execution. Finally, to search the optimal number of the hidden states used in the MSHMM, we have tried a different number varying from 6 to 30. As it is shown in Fig. 13.d, it is noted that the highest recognition rate is reached at 24. 22 (a) (b) (c) (d) Fig. 13. Recognition rate recorded by data and state fusion models for both the holistic and analytical approaches as a function of (a) the number of frames per image, (b) the number of frames per character/subcharacter, (c) the number of codebook, (d) the number of hidden states 23 5.3. Discussion and comparison with other systems In all the performed tests and fixing the best parameters determined in the training process, it is clear that the analytical model training technique is more efficient than the holistic one. This can be attributed to the insufficient training data for some classes. In fact, as it is shown by fig.14, some words are well represented in the database. However, other words are represented by 3 samples even missing. As a result, their models are poorly trained. ‫المنصورة السواسي‬ Number of occurence 150 ‫رأس الذراع‬ ‫المنصورة‬ ‫الربط‬ ‫الرضاع‬ f s 100 50 0 a b c d e Set of database Fig. 14. Distribution of some words in the IFN/ENIT database As it is represented by Table 2, the proposed analytical approach based on the MSHMM with the combination of statistical and structural features achieves good results compared to other systems suggested in the literature. Table 2. Results obtained from the last four editions of the Arabic handwriting recognition competition. Systems are based on HMM, NN (Neural networks) or a combination of both. Recognition rate (%) System Technology Conference Abandah et RF * al. [38] Dreuw et al. [39] Graves et al. [40] Menasri [41] Schambach et al. [42] Set d Set e Set f Set s 75.49 63.75 63.86 49.75 ICDAR 2011 99,91 99,79 96,72 99.67 98,71 98,29 91,25 98.61 85,51 85,69 83,9 92.20 71,33 72,54 65,99 84.55 542,12 Unknown HMM * 94.12 86.62 79.03 68.44 Unknown HMM ICFHR 2010 99.38 98.03 92.20 84.62 Unknown NN ICDAR 2009 HMM ICDAR 2009 99,72 99,6 99,94 93,9 94,92 97.02 98,64 97,6 99,44 87,25 82,21 91.68 91,43 91,37 93,37 85,58 82,21 89.42 78,83 78,89 81,06 70,44 66,45 76.66 115,24 114,61 371,85 1056,98 519,61 2583,64 94.58 87.77 87.22 73.94 109.406 1 2 HMM 3 4 HMM + NN Hamdani et al. [12] Giménez et al. [11] 1 2 3 1 2 3 Time execution (ms) HMM ICDAR 2009 ICDAR 2007 24 1564.75 17845,12 Kessentini et al. [16] Al-Hajj et al.[14] Proposed System MSHMM 1 2 3 4 HMM MSHMM ICDAR 2009 93.04 85.46 82.09 74.51 143269.81 ICDAR 2009 92,52 89,06 89,84 92,59 85,38 81,85 83,52 86,28 82,07 78,16 79,55 83,98 69,99 65,61 67,83 72,28 812,69 2365,48 2236,58 2154,48 * 91.10 82.91 70.8 65.4 1240.86 Most of the recognition errors of the developed system can be explained by the poor quality of some writers' writing. As a consequence, some words are badly written. Furthermore, the vector quantization procedure which searches to quantize the extracted continuous features may reduce their discrimination. As a result, this can alter the discriminative power of the recognizers. To reduce these errors and ameliorate the performance of the developed system, the feature extraction process may be enhanced by computing other relevant features which can be suitable to describe the Arabic characters. In addition, instead of using synchronous 2-stream HMMs, the model might be generalized by applying n-stream HMMs. Also, to avoid the quantization process and keep the natural aspect of our features, a multivariate Gaussian HMM can be used. Conclusion This work presents a two-recognizer system for offline recognition of an Arabic handwritten Tunisian city name based on the discrete synchronous multi-stream HMM. The Arabic script is semi-cursive and has a complex morphology. Thus, segmentation becomes a very important and challenging task. Therefore, to overcome this problem, an implicit segmentation has been applied. Indeed, a sliding window has been performed in a horizontal direction scanning the image from the right to the left and in a vertical direction from the top to the bottom. The sliding window is used to compute a combination of statistical and structural features, which have proven their effectiveness in the literature, according to the columns and the lines. Then two training approaches are achieved: the whole model training and the analytical model training, utilizing the embedded training. In the first approach, the training process directly picks out the image as a whole, so all the classes have the same optimized architecture, but their parameters are different. Whereas, the second method carries out the training of the character/sub-character shapes based on the word samples without an explicit segmentation into characters, and the class model is constructed by combining its components models. Consequently, each class has its own architecture and parameters. The two main recognizers have been considered as a dynamic Bayesian network. The first one 25 focuses on combining simultaneously the extracted features at each slice time t. However, the second approach models the two observations separately, using an HMM for each stream, which are interacted with another hidden layer. The results show the effectiveness of the analytical approach and the reliability of the data-fusion model in the Arabic handwriting recognition task. Most of the recognition errors can be attributed to the poor quality of some data samples. Also, utilizing the vector quantization to quantize computed features, which are naturally continuous, can reduce their discriminative power. As a future work, we plan to conduct some experiments to assess the recognition rate of models using multiple streams of observations. Moreover, we will use a multivariate Gaussian HMM to avoid the quantization process and keep the natural aspect of our features. Reference [1] A. Lawgali, A survey on Arabic character recognition, International Journal of Signal Processing, Image Processing and Pattern Recognition 8(2) (2015) 401-426. [2] M. Shatnawi, Off-line handwritten Arabic character recognition: a survey, Proceedings of the International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV), 2015. [3] Y. M. Alginahi, A survey on Arabic character segmentation, International Journal on Document Analysis and Recognition (IJDAR) 16(2) (2013) 105-126. [4] L. Likforman-Sulem, M. Sigelle, Recognition of degraded characters using dynamic Bayesian networks, Pattern Recognition 41(10) (2008) 3092-3103. [5] N. Murru, R. Rossini, A Bayesian approach for initialization of weights in back propagation neural net with application to character recognition, Neurocomputing (2016). [6] S. Naz, A. I. Umar, R. Ahmad, Offline cursive Urdu-Nastaliq script recognition using multidimensional recurrent neural networks, Neurocomputing (2015). [7] Q. Guo, F. Wang, J. Lei, et al., Convolutional feature learning and Hybrid CNN-HMM for scene number recognition, Neurocomputing (2015). [8] G. Zhai, J. Chen, C. LI, et al., Pattern recognition approach to identify loose particle material based on modified MFCC and HMMs, Neurocomputing 155 (2015) 135-145. [9] D.D. Mauá, A. Antonucci, C.P. de Campos, Hidden Markov models with set-valued parameters, Neurocomputing (2015). [10] S. Naz, A.I. Umar, S.H. Shirazi, et al., Segmentation techniques for recognition of Arabiclike scripts: A comprehensive survey, Education and Information Technologies (2015) 1-17. 26 [11] A. Giménez, I. Khoury, J. Andrés-Ferrer, Handwriting word recognition using windowed Bernoulli HMMs, Pattern Recognition Letters 35 (2014) 149-156. [12] M. Hamdani, H. El Abed, M. Kherallah, Combining multiple HMMs using on-line and offline features for off-line Arabic handwriting recognition, Proceedings of the International Conference on Document Analysis and Recognition (2009) 201-205. [13] J.H. AlKhateeb, Word-based handwritten Arabic scripts recognition using dynamic Bayesian network, Proceedings of the International Conference on Information Technology, Amman-Jordan, 2011. [14] R. El-Hajj Mohamad, L. Likforman-Sulem, C. Mokbel, Combining slanted-frame classifiers for improved HMM-based Arabic handwriting recognition, IEEE Transaction on Pattern Analysis and Machine Intelligence 31 (7) (2009) 1165-1177. [15] M. Glodek, F. Honold, T. Geier, et al., Fusion paradigms in cognitive technical systems for human–computer interaction, Neurocomputing 161 (2015) 17-37. [16] Y. Kessentini, T. Paquet, A.B. Hamadou, Off-line handwritten word recognition using multi-stream Hidden Markov Models, Pattern Recognition Letters 31 (1) (2010) 60-70. [17] S. Poria, E. Cambria, N. Howard, Fusing audio, visual and textual clues for sentiment analysis from multimodal content, Neurocomputing 174 (2016) 50-59. [18] O. Missaoui, H. Frigui, P. Gader, Multi-stream continuous Hidden Markov Models with application to landmine detection, EURASIP Journal on Advances in Signal Processing 1 (2013) 1-23. [19] N. Raman, S.J. Maybank, Activity Recognition using a supervised non-parametric hierarchical HMM, Neurocomputing (2016). [20] K. Jayech, N. Trimech, M.A. Mahjoub, Dynamic hierarchical Bayesian network for Arabic handwritten word recognition, Proceedings of the International Conference on Information and Communication Technology and Accessibility (2013)1-6. [21] S.M. Touj, N. Essoukri Ben Amara, H. Amiri, Arabic handwritten words recognition based on a planar Hidden Markov Model, International Arab Journal of Information Technology (IAJIT) 2 (4) (2005) 318-325. [22] K.P. Murphy, Dynamic Bayesian networks: representation, inference and learning, Doctoral dissertation, University of California, Berkeley, 2002. [23] M. Mahjoub, N. Ghanmy, K. Jayech, Multiple models of Bayesian networks applied to offline recognition of Arabic handwritten city names, International Journal of Imaging and Robotics 9 (1) (2013) 84-105. 27 [24] J. Al Abodi, L. Xue, An effective approach to offline Arabic handwriting recognition, Computers & Electrical Engineering 40(6) (2014) 1883-1901. [25] O. A. Arqub, Z. Abo-Hammour, Numerical solution of systems of second-order boundary value problems using continuous genetic algorithm, Information sciences 279 (2014) 396-415. [26] A. Benouareth, A. Ennaji, M. Sellami, Semi-continuous HMMs with explicit state duration for unconstrained Arabic word modelling and recognition, Pattern Recognition Letters 29(12) (2008) 1742-1752. [27] M. T. Parvez, S.A. Mahmoud, Arabic handwriting recognition using structural and syntactic pattern attributes, Pattern Recognition 46(1) (2013) 141-154. [28] K. Jayech, M.A. Mahjoub, N. Essoukri Ben Amara, Arabic handwriting recognition based on synchronous multi-stream HMM without explicit segmentation, Hybrid Artificial Intelligent System (2015) 22-24. [29] V. Edwin, C.D. Maciel, Efficient methods for learning Bayesian network superstructures, Neurocomputing 123 (2014) 3-12. [30] A. Jarraya, L. Philippe, A. Masmoudi, Discrete exponential Bayesian networks: definition, learning and application for density estimation, Neurocomputing 137 (2014) 142-149. [31] P.A.P. Ramirez, I.B. Utne, Use of dynamic Bayesian networks for life extension assessment of ageing systems, Reliability Engineering & System Safety 133 (2015) 119-136. [32] R. Donat, P. Leray, L. Bouillaut, A dynamic Bayesian network to represent discrete duration models, Neurocomputing 73(4) (2010) 570-577. [33] HH. Avilés-Arriaga, L. Sucar-Succar, C. Mendoza-Durán, A comparison of dynamic naive Bayesian classifiers and Hidden Markov Models for gesture recognition, Journal of applied research and technology 9(1) (2011) 81-102. [34] M. Pechwitz, V. Maergner, H. El Abed, Comparison of two different feature sets for offline recognition of handwritten Arabic words, Proceedings of the International Workshop on Frontiers in Handwriting Recognition, 2006. [35] J. H. AlKhateeb, J. Jiang, J. Ren, Multiclass classification of unconstrained handwritten Arabic words using machine learning approaches, Open Signal Processing Journal 2 (2009) 21-28. [36] L. Dinges, A. Al-Hamadi, M. Elzobi, Offine automatic segmentation based recognition of handwritten Arabic words, International Journal of Signal Processing, Image Processing and Pattern Recognition 4(4) (2011) 131-143. 28 [37] A. Ameen Khan, N.U. Reddy, M. Rao, Speaker recognition system using combined vector quantization and discrete Hidden Markov model, International Journal of Computational Engineering 2(3) (2012) 692-696. [38] G. Abandah, F. Jamour, Recognizing handwritten Arabic script through efficient skeletonbased grapheme segmentation algorithm, Proceedings of the International Conference on Intelligent Systems Design and Applications (2010) 977–982. [39] P. Dreuw, G. Heigold, H. Ney, Confidence-based discriminative training for model adaptation in offline Arabic handwriting recognition, Proceedings of the International Conference on Document Analysis and Recognition (2009) 596-600. [40] A. Graves, J. Schmidhuber, Offline handwriting recognition with multidimensional recurrent neural networks, Advances in Neural Information Processing Systems (2009) 545-552. [41] F. Menasri, Contributions à la reconnaissance de l’écriture arabe manuscrite, Thèse de doctorat, Université Paris Descartes, 2008. [42] M. P. Schambach, Model length adaptation of an HMM based cursive word recognition system, Proceedings of the International Conference on Document Analysis and Recognition (2003). Khaoula JAYECH is a Phd student in the Department of Computer Science at National Engineering School of Sfax and member of research unit SAGE (Advanced System in Electrical Engineering), team signals, image and document in the National Engineering School of Sousse, University of Sousse. Her research interests include mainly Arabic optical character recognition, dynamic bayesian network, HMM, and data retrieval. Her main resarch have been published in international journals and conferences. Mohamed Ali MAHJOUB is an associate professor in Signal and Image processing at the National Engineering School of Sousse (university of Sousse) and member of research unit Sage (Advanced System in Electrical Engineering), team signals, image and document. His research interests include dynamic bayesian network, computer vision, pattern recognition, HMM, and data retrieval. He is a member of IEEE and his main results have been published in international journals and conferences. Najoua ESSOUKRI BEN AMARA received the MSc, PhD, and HDR degrees in electrical engineering, signal processing and system analysis, from the National School of Engineers of Tunis, Tunisia, in 1985, 1986, 1999, 2004 respectively. From 1985 to 1989, she was a 29 researcher at the Regional Institute of Informatics Sciences and Telecommunications, Tunis, Tunisia. In September 1989, she joined the Electrical Engineering Department of the National School of Engineers of Monastir, Tunisia, as an assistant professor. In July 2004, she becomes a senior lecturer at the National School of Engineers of Sousse, Tunisia. Her research interests include mainly optical character recognition applied to arabic documents, image processing, compression, ancient document processing, biometric and the use of stochastic models and hybrid approaches in the above domains. 30 31

(PDF) Synchronous Multi-Stream Hidden Markov Model for offline Arabic handwriting recognition without explicit segmentation