…Lucene search engine.

…tion of the text content of PDF documents in a variety of encodings. The main drawback of the text extractor is that it does not always preserve the original text order.

5.2 Training the Logical Document Structure Identifier

As mentioned in Section 5, we use…