MLLIB.BICLUSTER(imputer, n_clusters, seed, columns)
Bisecting K-means is a hierarchical clustering algorithm that takes a divisive (top-down) approach: all observations start in one cluster, and splits are performed recursively while moving down the hierarchy. Each split applies regular K-means with K = 2 to the cluster with the highest SSE (sum of squared errors). The algorithm runs with 20 iterations to split clusters.
Bisecting K-means can often be much faster than regular K-means, but it will generally produce a different clustering.
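The splitting procedure described above can be sketched in plain Python. This is an illustrative sketch only, not the implementation behind MLLIB.BICLUSTER: the function names (bisecting_kmeans, two_means, sse, mean) are hypothetical, the initial centers are drawn at random from the cluster being split, and the 20 iterations are assumed to apply to each individual split.

```python
import random

def sse(points, center):
    """Sum of squared Euclidean distances from each point to the center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def two_means(points, rng, iterations=20):
    """Split points into two groups with plain K-means (K = 2)."""
    centers = rng.sample(points, 2)  # assumption: random initial centers
    groups = [points, []]
    for _ in range(iterations):
        new_groups = [[], []]
        for p in points:
            d = [sum((x - c) ** 2 for x, c in zip(p, ctr)) for ctr in centers]
            new_groups[0 if d[0] <= d[1] else 1].append(p)
        if not new_groups[0] or not new_groups[1]:
            break  # degenerate split; keep the previous assignment
        groups = new_groups
        centers = [mean(g) for g in groups]
    return groups

def bisecting_kmeans(points, n_clusters, seed):
    """Repeatedly bisect the cluster with the highest SSE."""
    rng = random.Random(seed)
    clusters = [list(points)]  # all observations start in one cluster
    while len(clusters) < n_clusters:
        candidates = [c for c in clusters if len(c) >= 2]
        if not candidates:
            break  # nothing left that can be split
        worst = max(candidates, key=lambda c: sse(c, mean(c)))
        left, right = two_means(worst, rng)
        if not left or not right:
            break  # could not split further (e.g. duplicate points)
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters

# Made-up sample data: nine 2-D points, three requested clusters.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
clusters = bisecting_kmeans(points, n_clusters=3, seed=555)
# One integer label per point, starting with 0.
labels = {p: i for i, c in enumerate(clusters) for p in c}
```

Because the initial centers are random, fixing the seed makes the result reproducible, but a different seed may yield a different clustering.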
Parameters
imputer – strategy for dealing with null values:
0 – Replace null values with ‘0’
1 – Assign null values to a designated ‘-1’ cluster
n_clusters – Number of clusters the algorithm should find, integer.
seed – Random seed, integer.
columns – Dataset columns or custom calculations.
Example: MLLIB.BICLUSTER(0, 3, 555, sum([Gross Sales]), sum([No of customers])) used as a calculation for the Color field of the Scatterplot visualization.
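To make the two imputer strategies concrete, here is a minimal Python sketch of how null values might be handled before clustering. The helper name split_by_imputer is hypothetical, and the sketch assumes that under strategy 1 any row containing at least one null is set aside with the label -1; the product's actual handling may differ.

```python
def split_by_imputer(rows, imputer):
    """Return (rows_to_cluster, fixed_labels) for the given strategy.

    fixed_labels maps a row index to -1 for rows that bypass clustering.
    """
    if imputer == 0:
        # Strategy 0: replace every null with 0 and cluster all rows.
        cleaned = [[0 if v is None else v for v in row] for row in rows]
        return cleaned, {}
    if imputer == 1:
        # Strategy 1 (assumed): rows with a null go to the "-1" cluster.
        keep, fixed = [], {}
        for i, row in enumerate(rows):
            if any(v is None for v in row):
                fixed[i] = -1
            else:
                keep.append(row)
        return keep, fixed
    raise ValueError("imputer must be 0 or 1")

# Made-up sample rows: the first one has a missing second value.
rows = [[1.0, None], [2.0, 3.0]]
zero_filled = split_by_imputer(rows, 0)
set_aside = split_by_imputer(rows, 1)
```

With strategy 0 the nulls become ordinary values and can influence the cluster centers; with strategy 1 they are kept out of the distance calculations altogether.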
Input data
- Size of input data is not limited.
- Without missing values.
- Character variables are transformed to numeric with label encoding.
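As an illustration of label encoding, the following Python sketch maps each distinct character value to an integer code. The function name label_encode is hypothetical, and assigning codes in order of first appearance is an assumption; the order used by the product is not stated here.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    # setdefault returns the existing code, or stores and returns a new one.
    return [codes.setdefault(v, len(codes)) for v in values]

# Made-up sample column.
regions = ["North", "South", "North", "East"]
encoded = label_encode(regions)  # [0, 1, 0, 2]
```

Note that the integer codes carry an arbitrary ordering, and K-means treats them as numeric distances, so encoded character columns should be interpreted with care.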
Result
- Column of integer values starting with 0, where each number corresponds to a cluster assigned to each record (row) by the Bisecting K-means algorithm.
Key usage points
- Less sensitivity to initialization than regular K-means.
- Tends to produce clusters of similar sizes, whereas K-means often produces empty clusters when k is large.
- Lower computational time.
- Use it when you want to avoid converging to a local minimum.
Drawbacks
- If the number of clusters is poorly chosen, the results can deviate substantially from the ideal clustering.
For the full list of algorithms, see Data science built-in algorithms.