DCPY.AFFINITYPROPCLUST(max_iter, convergence_iter, damping, preference, columns)
The affinity propagation clustering algorithm is based on the concept of 'message passing' between data points. Unlike clustering algorithms such as K-means or K-medoids, it does not require the user to specify the number of clusters. It estimates the optimal number of clusters from the similarities between data points and cluster exemplars (representatives). This estimate is controlled by the input preference, which by default is set to the median of all input similarities. Exemplars play the role of centroids, except that an exemplar is not the average of all objects in its group but an actual observed data point that represents its closest neighbors. The similarity measure is the negative squared Euclidean distance.
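The similarity measure and the preference can be illustrated with a minimal sketch in plain Python. It is not the DCPY implementation: the function name `similarity_matrix` is hypothetical, and taking the median over the off-diagonal similarities only is an assumption made for illustration.

```python
from statistics import median

def similarity_matrix(points, preference=None):
    """Negative squared Euclidean distance between every pair of points.

    The diagonal holds the preference, i.e. how strongly each point
    is considered as a candidate exemplar.
    """
    n = len(points)
    s = [[-sum((a - b) ** 2 for a, b in zip(points[i], points[k]))
          for k in range(n)] for i in range(n)]
    if preference is None:
        # Assumption: median of the similarities between distinct points.
        off_diagonal = [s[i][k] for i in range(n) for k in range(n) if i != k]
        preference = median(off_diagonal)
    for i in range(n):
        s[i][i] = preference
    return s

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
S = similarity_matrix(points)
print(S[0][1])  # similarity of the two nearby points: -1.0
print(S[0][0])  # median preference on the diagonal: -41.0
```

A lower (more negative) preference makes every point a less attractive exemplar, so fewer clusters are found.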
Parameters
Parameter | Description | Data type |
---|---|---|
max_iter | Maximum number of iterations. | Integer (default 200) |
convergence_iter | Number of consecutive iterations with no change in the estimated number of clusters after which the algorithm is considered converged. | Integer (default 15) |
damping | Damping factor, which needs to be tuned carefully to avoid numerical oscillations as messages are passed between data points, and to control the estimated number of clusters. Values closer to 1 tend to find fewer clusters. | Float (0.5 to 1), (default 0.5) |
preference | Preference for each point to be chosen as an exemplar. By default, it is set to the median of the input similarities. Tune it to find the optimal number of clusters. | Float (default 0) |
columns | Columns based on which you want to create the clusters. | User input |
Example: DCPY.AFFINITYPROPCLUST(200, 15, 0.5, 0, sum([Profit]), sum([Gross Sales])), used as a calculation for the Color field of the Scatterplot visualization.
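To show how max_iter, convergence_iter, damping, and preference interact, here is a minimal sketch of the standard affinity propagation message-passing updates in plain Python. It is an illustration under assumptions, not the code behind DCPY.AFFINITYPROPCLUST: the function name is hypothetical, convergence is checked on the set of exemplars, and at least two points are expected.

```python
def affinity_propagation(points, preference, max_iter=200,
                         convergence_iter=15, damping=0.5):
    """Return (exemplar_indices, labels) for a list of numeric tuples."""
    n = len(points)
    # Similarity: negative squared Euclidean distance, preference on the diagonal.
    S = [[-sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
         for p in points]
    for i in range(n):
        S[i][i] = preference
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    previous, stable, exemplars = None, 0, []
    for _ in range(max_iter):
        # Responsibility: how well suited k is as exemplar for i,
        # compared with the best competing candidate.
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            best = max(range(n), key=scores.__getitem__)
            second = max(scores[k] for k in range(n) if k != best)
            for k in range(n):
                rival = second if k == best else scores[best]
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # Availability: how much support candidate k has collected.
        for k in range(n):
            support = [max(0.0, R[i][k]) for i in range(n)]
            support[k] = R[k][k]  # self-responsibility is not clipped
            total = sum(support)
            for i in range(n):
                value = total - support[i]
                if i != k:
                    value = min(0.0, value)
                A[i][k] = damping * A[i][k] + (1 - damping) * value
        # A point is an exemplar when its combined self-message is positive.
        exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
        stable = stable + 1 if exemplars and exemplars == previous else 0
        previous = exemplars
        if stable >= convergence_iter:
            break
    if not exemplars:
        return [], [None] * n
    labels = []
    for i in range(n):
        if i in exemplars:
            labels.append(exemplars.index(i))
        else:
            labels.append(max(range(len(exemplars)),
                              key=lambda c: S[i][exemplars[c]]))
    return exemplars, labels

# Two well-separated groups on a line; the preference -20 is an arbitrary choice.
data = [(0.0,), (1.0,), (3.0,), (20.0,), (21.0,), (23.0,)]
exemplars, labels = affinity_propagation(data, preference=-20.0)
print(exemplars, labels)
```

The damping factor blends each new message with its previous value, which is what suppresses oscillations; the loop stops early once the exemplar set has stayed the same for convergence_iter iterations.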
Input data
Numeric variables are automatically scaled to zero mean and unit variance.
Character variables are transformed to numeric values using one-hot encoding.
Dates are treated as character variables, so they are also one-hot encoded.
The size of the input data is not limited, but many categories in character or date variables rapidly increase the dimensionality.
Rows that contain missing values in any of their columns are dropped.
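The preparation steps above can be sketched as follows. This is only an approximation of the described behavior: the helper name `prepare` is hypothetical, missing values are represented by `None`, and the use of the population standard deviation for scaling is an assumption.

```python
from statistics import mean, pstdev

def prepare(rows):
    """Drop incomplete rows, scale numeric columns, one-hot encode the rest.

    Returns the indices of the rows that were kept and their feature vectors.
    """
    kept = [i for i, row in enumerate(rows) if all(v is not None for v in row)]
    complete = [rows[i] for i in kept]
    features = [[] for _ in complete]
    for column in zip(*complete):
        if all(isinstance(v, (int, float)) for v in column):
            # Numeric column: zero mean and unit variance.
            mu, sigma = mean(column), pstdev(column)
            for vector, v in zip(features, column):
                vector.append((v - mu) / sigma if sigma else 0.0)
        else:
            # Character or date column: one indicator per distinct category.
            categories = sorted(set(str(v) for v in column))
            for vector, v in zip(features, column):
                vector.extend(1.0 if str(v) == c else 0.0 for c in categories)
    return kept, features

rows = [(1.0, "a"), (3.0, "b"), (None, "a"), (5.0, "a")]
kept, features = prepare(rows)
print(kept)         # [0, 1, 3]
print(features[1])  # [0.0, 0.0, 1.0]
```

Each distinct category adds one feature, which is why columns with many categories inflate the dimensionality so quickly.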
Result
- A column of integer values starting at 0, where each number identifies the cluster assigned to a record (row) by the algorithm.
- Rows that were dropped from the input data because they contain missing values receive a missing value instead of an assigned cluster.
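The shape of the result column can be pictured with a short sketch. The helper `align_labels` is hypothetical, and `None` stands in for a missing value.

```python
def align_labels(n_rows, kept, labels):
    """Place cluster labels back at their original row positions."""
    result = [None] * n_rows  # dropped rows keep a missing value
    for row_index, label in zip(kept, labels):
        result[row_index] = label
    return result

# Four input rows, of which row 2 was dropped for a missing value.
aligned = align_labels(4, kept=[0, 1, 3], labels=[0, 1, 0])
print(aligned)  # [0, 1, None, 0]
```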
Key usage points
- The number of clusters does not need to be specified explicitly; it is controlled through the preference and damping parameters.
- Tends to outperform other clustering algorithms when the data set contains a large number of small clusters.
- Works with data sets containing clusters of different sizes.
- Has high time and memory complexity, which makes it suitable only for small to medium-sized data sets.
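A rough back-of-the-envelope estimate shows why memory grows so fast. The sketch assumes a dense implementation that keeps three n-by-n matrices (similarities, responsibilities, availabilities) of 8-byte floats; the actual memory layout used by the function is not documented here.

```python
def dense_state_bytes(n_rows, matrices=3, bytes_per_value=8):
    """Approximate memory for the dense n-by-n matrices (assumed layout)."""
    return matrices * n_rows * n_rows * bytes_per_value

for n in (1_000, 10_000, 100_000):
    print(f"{n} rows: about {dense_state_bytes(n) / 1e9} GB")
```

Under these assumptions, 10,000 rows already need about 2.4 GB, and every tenfold increase in rows multiplies that by one hundred.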
For the whole list of algorithms, see Data science built-in algorithms.