Patent Number: 6,490,582

Title: Iterative validation and sampling-based clustering using error-tolerant frequent item sets

Abstract: Iterative validation for efficiently determining error-tolerant frequent itemsets is disclosed. A description of the application of error-tolerant frequent itemsets to efficiently determining clusters, as well as to initializing clustering algorithms, is also given. In one embodiment, a method determines a sample set of error-tolerant frequent itemsets (ETF's) within a uniform random sample of data within a database. This sample set of ETF's is independently validated, so that, for example, spurious ETF's and spurious dimensions within the ETF's can be removed. The validated sample set of ETF's is added to the set of ETF's for the database. This process is repeated with additional uniform samples that are mutually exclusive from prior uniform samples, to continue building the database's set of ETF's, until no new sample sets can be found. The method is significantly more efficient than disk-based methods in the prior art, and the data clusters found are often not discovered by traditional clustering algorithms in the prior art.
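
The sample, validate, accumulate loop described in the abstract might be sketched in Python as follows. This is an illustrative sketch only, not the patented method: the abstract does not define error tolerance, the mining procedure, or the validation step. The sketch assumes that a row supports an itemset when it lacks at most `max_missing` of its items, uses a toy brute-force miner over fixed-size itemsets, and validates against a single held-out sample. All function names, parameters, and the example data are hypothetical, and the removal of spurious dimensions is omitted.

```python
import random
from itertools import combinations


def et_support(itemset, rows, max_missing):
    """Fraction of rows lacking at most `max_missing` items of `itemset`.

    Assumed notion of error tolerance; the abstract does not define one.
    """
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if len(itemset - row) <= max_missing)
    return hits / len(rows)


def mine_sample(rows, size, min_support, max_missing):
    """Toy miner: test each `size`-item combination of frequent single items."""
    counts = {}
    for row in rows:
        for item in row:
            counts[item] = counts.get(item, 0) + 1
    frequent = sorted(i for i, c in counts.items() if c >= min_support * len(rows))
    return {
        frozenset(combo)
        for combo in combinations(frequent, size)
        if et_support(frozenset(combo), rows, max_missing) >= min_support
    }


def iterative_etfs(data, sample_size, size=3, min_support=0.5,
                   max_missing=1, seed=0):
    """Accumulate validated itemsets over mutually exclusive random samples."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    # Chunks of a shuffled list are disjoint uniform random samples.
    chunks = [rows[i:i + sample_size] for i in range(0, len(rows), sample_size)]
    # Assumption: the first chunk serves as an independent validation set.
    holdout, samples = chunks[0], chunks[1:]
    found = set()
    for sample in samples:
        candidates = mine_sample(sample, size, min_support, max_missing)
        validated = {c for c in candidates
                     if et_support(c, holdout, max_missing) >= min_support}
        new = validated - found
        if not new:  # stop once a sample contributes nothing new
            break
        found |= new
    return found


# Hypothetical data: 90 rows built around the items a, b, c. The first 30
# rows each miss one of the three, and every row carries a unique noise item.
core = ["a", "b", "c"]
data = []
for i in range(90):
    items = set(core)
    if i < 30:
        items.discard(core[i % 3])
    items.add(f"noise{i}")
    data.append(frozenset(items))

print(iterative_etfs(data, sample_size=30))
```

In this toy data only two thirds of the rows contain all of a, b, and c, yet every row contains at least two of them. The itemset therefore has full support once one missing item is tolerated, which is the kind of pattern an exact-match frequent-itemset search can undercount.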

Inventors: Fayyad; Usama M. (Mercer Island, WA), Yang; Cheng (Mountain View, CA), Bradley; Paul S. (Seattle, WA)


International Classification: G06F 17/30 (20060101); G06F 017/30 ()

Expiration Date: 12/02015