MLLIB.CLUSTER(imputer, n_clusters, n_iter, columns)

K-means is one of the most commonly used clustering algorithms. It partitions the data points into a predefined number of clusters. The implementation includes a parallelized variant of the k-means++ method for cluster initialization.
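To make the initialization step more concrete, here is a minimal sketch of sequential k-means++ seeding in plain Python. It is not the parallelized variant mentioned above, and the function name kmeans_pp_init and its seed parameter are hypothetical. The sketch only illustrates the idea of choosing initial centroids that lie far apart, and it assumes the data contains at least k distinct points.

```python
import random
from math import dist

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids with sequential k-means++ seeding (illustrative only)."""
    rng = random.Random(seed)
    # The first centroid is drawn uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest chosen centroid,
        # so far-away points are more likely to become the next centroid.
        weights = [min(dist(p, c) ** 2 for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

points = [(1.0, 1.0), (1.2, 0.8), (9.0, 9.0), (9.1, 8.9), (5.0, 0.5)]
print(kmeans_pp_init(points, 3))
```

Points that are already centroids have weight zero, so they are not selected a second time.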

  • imputer – Strategy for dealing with null values:

    • 0 – Replace null values with '0'

    • 1 – Assign null values to a designated '-1' cluster

  • n_clusters – Number of clusters that the algorithm should find, integer.

  • n_iter – Maximum number of iterations (recalculations of centroids) in a single run, integer.

  • columns – Dataset columns or custom calculations.

Example: MLLIB.CLUSTER(0, 3, 20, sum([Gross Sales]), sum([No of customers])) used as a calculation for the Color field of the Scatterplot visualization.
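The way the parameters interact can be sketched in plain Python. The function cluster below is a hypothetical stand-in, not the actual implementation behind MLLIB.CLUSTER. It assumes that a row counts as null if any of its values is missing, and it uses a naive initialization (the first distinct points) instead of k-means++ to keep the example short and deterministic.

```python
from math import dist

def cluster(imputer, n_clusters, n_iter, rows):
    """Hypothetical stand-in for MLLIB.CLUSTER; rows are tuples that may contain None."""
    if imputer == 0:
        # Strategy 0: replace every null with 0 and cluster all rows.
        prepared = [tuple(0 if v is None else v for v in row) for row in rows]
        skipped = set()
    else:
        # Strategy 1: rows with nulls are not clustered and keep the label -1.
        prepared = list(rows)
        skipped = {i for i, row in enumerate(rows) if None in row}
    active = [i for i in range(len(rows)) if i not in skipped]

    # Naive initialization (an assumption of this sketch, not k-means++):
    # take the first n_clusters distinct points as starting centroids.
    centroids = []
    for i in active:
        if prepared[i] not in centroids:
            centroids.append(prepared[i])
        if len(centroids) == n_clusters:
            break

    labels = [-1] * len(rows)
    for _ in range(n_iter):
        # Assignment step: each active row joins its nearest centroid.
        new_labels = list(labels)
        for i in active:
            new_labels[i] = min(range(len(centroids)),
                                key=lambda c: dist(prepared[i], centroids[c]))
        if new_labels == labels:
            break  # converged before reaching n_iter
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [prepared[i] for i in active if labels[i] == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

rows = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (9.1, 8.9), (None, 5.0)]
print(cluster(1, 2, 20, rows))  # the row with a null gets label -1
print(cluster(0, 2, 20, rows))  # the null is replaced with 0 before clustering
```

Note that n_iter is only an upper bound: the loop stops early once the assignments no longer change.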

Input data
  • Size of input data is not limited.
  • Without missing values.
  • Character variables are transformed to numeric with label encoding.
Result
  • Column of integer values starting with 0, where each number corresponds to the cluster assigned to each record (row) by the K-means algorithm.
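The label encoding applied to character variables can be illustrated with a short sketch. The helper label_encode is hypothetical, and assigning codes in order of first appearance is an assumption; the source does not state which order the product uses.

```python
def label_encode(values):
    """Map each distinct string to an integer code, in order of first appearance."""
    codes = {}
    # setdefault returns the existing code, or stores and returns the next free one.
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes

encoded, codes = label_encode(["north", "south", "north", "east"])
print(encoded)
print(codes)
```

The resulting integers carry an artificial ordering, which matters because K-means measures Euclidean distance between the encoded values.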
Key usage points
  • Fast and computationally efficient, with very high scalability.
  • Works well in practice, even if some of its assumptions are violated.
  • General-purpose clustering algorithm.
  • Suitable when the approximate number of clusters is known.
  • Suitable when there is a low number of outliers.
  • Suitable when the clusters are spherical, with approximately the same number of observations, density and variance.
  • Euclidean distances tend to become inflated as the number of variables grows (curse of dimensionality).
  • Because it calculates Euclidean distance, the algorithm assumes that all input variables are numeric. One-hot encoding of categorical variables is a workaround suitable for a relatively low number of categories.
  • K-means assumes that the clusters are spherical and that each cluster has a roughly equal number of observations, density and variance; otherwise the results might be misleading.
  • It always finds clusters in the data, even if no natural clusters are present.
  • All data points are assigned to a cluster, even though some of them might be just random noise.
  • It is sensitive to outliers.
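The one-hot encoding workaround mentioned in the list above can be sketched as follows. The helper one_hot and the sample values are hypothetical. The example compares Euclidean distances under an illustrative label encoding with distances under one-hot encoding.

```python
from math import dist

def one_hot(values):
    """Return one 0/1 vector per value, with one position per distinct category."""
    categories = sorted(set(values))
    return [tuple(1.0 if v == c else 0.0 for c in categories) for v in values]

regions = ["north", "south", "east"]
label_codes = [(0.0,), (1.0,), (2.0,)]  # illustrative label encoding of regions
vectors = one_hot(regions)

# Label encoding: "north" appears twice as far from "east" as from "south".
print(dist(label_codes[0], label_codes[1]), dist(label_codes[0], label_codes[2]))
# One-hot encoding: every pair of distinct categories is equally far apart.
print(dist(vectors[0], vectors[1]), dist(vectors[0], vectors[2]))
```

Each category adds one dimension, which is why the workaround fits only a relatively low number of categories.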

For the full list of algorithms, see Data science built-in algorithms.
