Patent Number: 8,880,525

Title: Full and semi-batch clustering

Abstract: A method for clustering documents is provided. Each document is represented by a multidimensional data point. Initially, each data point is assigned to a respective cluster and serves as that cluster's initial representative point. Thereafter, in an iterative process, the data points are assigned to the clusters based on a comparison measure between each data point and the cluster or its representative point, and on a threshold of the comparison measure. Based on this clustering, a new representative point is computed for each of the clusters. Optionally, overlapping clusters are merged. The new representative points serve as the representative points for the next iteration. An assignment of the documents to the clusters is output, based on the clustering of the data points in the latest iteration. Multiple batches may be processed, retaining the initial clusters to which the original batch was assigned.
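
Illustrative sketch (not taken from the patent): the iterative procedure summarized in the abstract might look roughly like the Python below. Several details are assumptions made only for illustration, since the abstract does not fix them: cosine similarity as the comparison measure, a threshold of 0.9, the arithmetic mean of a cluster's members as its new representative point, merging only those clusters whose membership is identical, and a final assignment of each document to its most similar representative. The names cosine, mean, and threshold_cluster are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; stands in for the unspecified comparison measure."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Component-wise mean, used here as the new representative point."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def threshold_cluster(points, threshold=0.9, max_iter=20):
    # Every data point starts as the representative of its own cluster.
    reps = [list(p) for p in points]
    members = []
    for _ in range(max_iter):
        new_members = []
        for r in reps:
            # A cluster collects every point whose similarity to its
            # representative meets the threshold (clusters may overlap).
            m = frozenset(i for i, p in enumerate(points)
                          if cosine(p, r) >= threshold)
            # Assumed merge rule: clusters with identical membership collapse.
            if m and m not in new_members:
                new_members.append(m)
        if not new_members or new_members == members:
            break  # nothing left to update, or memberships are stable
        members = new_members
        # Recompute one representative per surviving cluster.
        reps = [mean([points[i] for i in sorted(m)]) for m in members]
    # Assumed output rule: each document goes to its most similar representative.
    labels = [max(range(len(reps)), key=lambda k: cosine(p, reps[k]))
              for p in points]
    return labels, reps

# Toy input: two documents near the x-axis, two near the y-axis.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, reps = threshold_cluster(docs, threshold=0.9)
print(labels)  # [0, 0, 1, 1]
```

On this toy input the four singleton clusters collapse into two after the first pass and the memberships are stable on the second. The sketch covers a single batch only; retaining earlier clusters across later batches is not shown.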

Inventors: Galle; Matthias (St-Martin d'Heres, FR), Renders; Jean-Michel (Quaix-en-Chartreuse, FR)

Assignee: Xerox Corporation

International Classification: G06F 17/30 (20060101)

Expiration Date: 2019-11-04 0:00:00