Patent Number: 7,827,484

Title: Text correction for PDF converters

Abstract: To correct at least one extraneous or missing space in a document, weights are assigned to tokens contained in a dictionary. Each token is defined by an ordered sequence of non-space symbols. The weights are assigned based on at least one of a token length and frequency of occurrence of the token in the document. Corrected text is generated from text of the document by applying an ordered sequence of symbol-level transformations selected from a group of symbol-level transformations including at least (i) deleting a space, (ii) inserting a space, and (iii) copying a symbol. The ordered sequence of symbol-level transformations is optimized respective to an objective function dependent upon the weights of tokens of the corrected text.
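
Illustrative sketch (not taken from the patent text): the abstract's idea can be approximated in Python as a dynamic program over the non-space symbols of the document. Choosing where to place token boundaries is equivalent to choosing, at each position, among deleting a space, inserting a space, or copying a symbol. The function names, the weight formula (squared token length plus a log-frequency term), and the unknown_penalty parameter are all assumptions made for illustration; the abstract only states that weights depend on token length and/or frequency and that an objective function over token weights is optimized.

```python
import math
from collections import Counter


def assign_weights(dictionary, document):
    """Weight each dictionary token by length and by frequency in the document.

    The formula is an assumption: squared length favors one long token over
    several short ones, and log1p adds a mild bonus for frequent tokens.
    """
    counts = Counter(document.split())
    return {tok: len(tok) ** 2 + math.log1p(counts[tok]) for tok in dictionary}


def correct_spaces(text, weights, unknown_penalty=-1.0):
    """Re-space `text` so that the summed weight of its tokens is maximal.

    Only the space character is treated as a separator. Symbols not covered
    by any dictionary token fall back to single-symbol tokens, each charged
    `unknown_penalty` (a hypothetical parameter).
    """
    symbols = text.replace(" ", "")  # every original space is a deletion candidate
    n = len(symbols)
    max_len = max(map(len, weights), default=1)

    best = [float("-inf")] * (n + 1)  # best[i]: best score for symbols[:i]
    back = [0] * (n + 1)              # back[i]: start of the last token in that solution
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = symbols[start:end]
            if piece in weights:
                score = best[start] + weights[piece]
            elif end - start == 1:
                score = best[start] + unknown_penalty
            else:
                continue
            if score > best[end]:
                best[end], back[end] = score, start

    # Walk the back-pointers to recover the chosen tokens.
    tokens = []
    end = n
    while end > 0:
        tokens.append(symbols[back[end]:end])
        end = back[end]
    return " ".join(reversed(tokens))


dictionary = {"text", "correction", "for", "pdf", "converters"}
damaged = "te xt correc tion forpdf converters"
print(correct_spaces(damaged, assign_weights(dictionary, damaged)))
# prints: text correction for pdf converters
```

In this sketch the extraneous spaces in "te xt" and "correc tion" are removed and the missing space in "forpdf" is restored. A fuller implementation would need to handle out-of-dictionary words more gracefully than splitting them into single symbols.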

Inventors: Dejean; Herve (Grenoble, FR), Kempe; Andre (Grenoble, FR)

Assignee: Xerox Corporation

International Classification: G06F 17/00 (20060101)

Expiration Date: 2019-11-02