MLLIB.CLUSTER(imputer, n_clusters, n_iter, columns)
K-means is one of the most commonly used clustering algorithms that clusters the data points into a predefined number of clusters. It includes a paralleled variant of the k-means++ for clusters initialization K-means method.
imputer – Strategy for dealing with null values:
0 – Replace null values with ‘0'
1 – Assign null values to a designated ‘-1' cluster
number_of_clusters – Number of clusters which the algorithm should find, integer.
number_of_iterations – Maximum number of iterations (recalculations of centroids) in a single run, integer.
columns – Dataset columns or custom calculations.
Example: MLLIB.CLUSTER(0, 3, 20, sum([Gross Sales]), sum([No of customers])) used as a calculation for the Color field of the Scatterplot visualization.
- Size of input data is not limited.
- Without missing values.
- Character variables are transformed to numeric with label encoding.
- Column of integer values starting with 0, where each number corresponds to a cluster assigned to each record (row) by the K-means algorithm.
Key usage points
- Fast and computationally efficient, very high scalability.
- Practically works well, even if some of its assumptions are broken.
- General-purpose clustering algorithm.
- When the approximate number of clusters is known.
- When there is a low number of outliers.
- When the clusters are spherical, with approximately same number of observations, density and variance.
- Euclidean distances tend to be more inflated with higher number of variables (curse of dimensionality).
- By calculating Euclidean distance, the algorithm makes assumption of only numeric input variables. One-hot encoding of categorical variables is a workaround suitable for a relatively low number of categories to encode.
- K-means makes an assumption that we deal with spherical clusters and that each cluster has roughly equal numbers of observations, density and variance, otherwise the results might be misleading.
- It always finds clusters in the data, even if no natural clusters are present.
- All data points are assigned to a cluster, even though some of them might be just random noise.
- Sensitivity to outliers.
For the whole list of algorithms, see Data science built-in algorithms.