<i>k</i>-Means+++: Outliers-Resistant Clustering

oleh: Adiel Statman, Liat Rozenberg, Dan Feldman

Format: Article
Diterbitkan: MDPI AG 2020-11-01

Deskripsi

The <i>k</i>-means problem is to compute a set of <i>k</i> centers (points) that minimizes the sum of squared distances to a given set of <i>n</i> points in a metric space. Arguably, the most common algorithm to solve it is <i>k</i>-means++ which is easy to implement and provides a provably small approximation error in time that is linear in <i>n</i>. We generalize <i>k</i>-means++ to support outliers in two sense (simultaneously): (i) nonmetric spaces, e.g., M-estimators, where the distance <inline-formula><math display="inline"><semantics><mrow><mi>dist</mi><mo>(</mo><mi>p</mi><mo>,</mo><mi>x</mi><mo>)</mo></mrow></semantics></math></inline-formula> between a point <i>p</i> and a center <i>x</i> is replaced by <inline-formula><math display="inline"><semantics><mrow><mo movablelimits="true" form="prefix">min</mo><mfenced separators="" open="{" close="}"><mi>dist</mi><mo>(</mo><mi>p</mi><mo>,</mo><mi>x</mi><mo>)</mo><mo>,</mo><mi>c</mi></mfenced></mrow></semantics></math></inline-formula> for an appropriate constant <i>c</i> that may depend on the scale of the input. (ii) <i>k</i>-means clustering with <inline-formula><math display="inline"><semantics><mrow><mi>m</mi><mo>≥</mo><mn>1</mn></mrow></semantics></math></inline-formula> outliers, i.e., where the <i>m</i> farthest points from any given <i>k</i> centers are excluded from the total sum of distances. This is by using a simple reduction to the <inline-formula><math display="inline"><semantics><mrow><mo>(</mo><mi>k</mi><mo>+</mo><mi>m</mi><mo>)</mo></mrow></semantics></math></inline-formula>-means clustering (with no outliers).