DCPY.AGGLOCLUST(n_clusters, affinity, linkage, columns)
Agglomerative clustering is a hierarchical clustering algorithm that builds nested clusters by merging or splitting them successively. This implementation uses a bottom-up approach: each observation starts in its own cluster, and clusters are then successively merged based on the linkage criterion.
Parameters
- n_clusters – Number of clusters that the algorithm should find, integer (default 2).
- affinity – Metric used to compute the linkage (default euclidean).
  Possible values: euclidean, l1, l2, manhattan, cosine, precomputed.
  If the linkage parameter is set to ward, only euclidean is supported.
- linkage – Determines which distance to use between sets of data points when merging clusters (default ward). Possible values:
  ward – Minimizes the variance of the clusters being merged.
  average – Uses the average distance between two sets.
  complete – Uses the maximum distance between two sets.
Example: DCPY.AGGLOCLUST(2, 'euclidean', 'ward', sum([Gross Sales]), sum([No of customers])) used as a calculation for the Color field of the Scatterplot visualization.
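The parameters map directly onto scikit-learn's AgglomerativeClustering, so the call above can be sketched in Python as follows. The sample values are made up, and the assumption that DCPY behaves exactly like scikit-learn here is illustrative, not documented.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # Hypothetical two-column input: gross sales and number of customers.
    X = np.array([[120000, 340], [95000, 210], [410000, 980], [385000, 1020]])

    model = AgglomerativeClustering(
        n_clusters=2,        # number of clusters to find
        metric="euclidean",  # named `affinity` in older scikit-learn releases
        linkage="ward",      # ward supports only the euclidean metric
    )
    labels = model.fit_predict(X)  # one integer cluster id per row, starting at 0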
Input data
- Numeric variables are automatically scaled to zero mean and unit variance.
- Character variables are transformed to numeric values using one-hot encoding.
- Dates are treated as character variables, so they are also one-hot encoded.
- The size of the input data is not limited, but character or date variables with many categories rapidly increase the dimensionality.
- Rows that contain missing values in any of their columns are dropped (see the preprocessing sketch after this list).
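A minimal preprocessing sketch consistent with the rules above, using pandas and scikit-learn; the column names are hypothetical and the actual DCPY pipeline may differ in detail.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({
        "gross_sales": [120000, 95000, None, 385000],  # numeric, has a missing value
        "region": ["EMEA", "APAC", "EMEA", "AMER"],    # character variable
    })

    df = df.dropna()                                   # rows with any missing value are dropped
    numeric = df.select_dtypes("number")
    scaled = StandardScaler().fit_transform(numeric)   # zero mean, unit variance
    encoded = pd.get_dummies(df.select_dtypes(exclude="number"))  # one-hot encoding
    X = pd.concat(
        [pd.DataFrame(scaled, index=df.index, columns=numeric.columns), encoded],
        axis=1,
    )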
Result
- Column of integer values starting at 0, where each number corresponds to the cluster assigned to that record (row) by the agglomerative clustering algorithm.
- Rows that were dropped from the input data (because they contained missing values) are not assigned to a cluster, as illustrated in the sketch below.
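A sketch of how the labels map back onto the original rows, with dropped rows left unassigned (shown here as NaN); this illustrates the described behavior and is not DCPY's internal code.

    import pandas as pd
    from sklearn.cluster import AgglomerativeClustering

    df = pd.DataFrame({"x": [1.0, None, 3.0, 4.0, 10.0],
                       "y": [2.0, 5.0, None, 8.0, 12.0]})
    clean = df.dropna()                                   # rows 1 and 2 are removed
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(clean)
    df["cluster"] = pd.Series(labels, index=clean.index)  # dropped rows stay NaN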
Key usage points
- Like K-means, this method is a good choice for spherical clusters and normally distributed variables.
- It can yield better clustering results than K-means.
- Use it when you expect a relatively high number of clusters and the number of data points does not exceed a few thousand.
- When the clusters are ellipsoidal rather than spherical (variables are correlated within a cluster), the clustering result may be misleading.
- The algorithm has high time complexity and does not scale well to large numbers of records.
- The number of clusters must be specified in advance; a dendrogram can help estimate it (see the sketch after this list).
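Because the cluster count must be fixed up front, plotting a dendrogram with SciPy's hierarchical-clustering utilities is a common way to estimate it. This is a general technique applied to synthetic data, not a DCPY feature.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (30, 2)),  # two synthetic blobs
                   rng.normal(6, 1, (30, 2))])

    Z = linkage(X, method="ward")  # same merge criterion as linkage='ward'
    dendrogram(Z)                  # large vertical gaps suggest a natural cluster count
    plt.show()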
For the whole list of algorithms, see Data science built-in algorithms.