DCPY.DBSCANCLUST(min_samples, eps, columns)
The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm views clusters as high-density areas separated by low-density areas. It works by finding core samples and expanding clusters from them. Euclidean distance is used to measure the distance between data points.
min_samples – The minimum number of data points in a neighborhood required for a point to be considered a core point, integer (default 5).
eps – Maximum distance between two samples for them to be considered part of the same neighborhood, float (default 0.5).
columns – Dataset columns or custom calculations.
Example: DCPY.DBSCANCLUST(5, 0.5, sum([Gross Sales]), sum([No of customers])) used as a calculation for the Color field of the Scatterplot visualization.
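The way core samples are found and clusters are expanded from them can be illustrated with a minimal pure-Python sketch. This is not the implementation behind DCPY.DBSCANCLUST; the function name dbscan and the sample points are hypothetical. It assumes that a point counts as its own neighbor when compared against min_samples (the convention used by scikit-learn), and it numbers clusters from 1 and marks noise as -1.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1

def dbscan(points, eps=0.5, min_samples=5):
    """Return one label per point: 1, 2, ... for clusters, -1 for noise."""
    n = len(points)
    # Neighborhood of every point; assumption: the point itself is included.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is also core: keep expanding
    return labels

# Two dense groups and one isolated point (made-up data).
points = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1),
          (5, 5), (5, 5.1), (5.1, 5), (5.1, 5.1),
          (10, 0)]
labels = dbscan(points, eps=0.5, min_samples=3)
print(labels)  # [1, 1, 1, 1, 2, 2, 2, 2, -1]
```

The sketch compares every pair of points, so its cost grows quadratically with the number of records.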
- Numeric variables are automatically scaled to zero mean and unit variance.
- Character variables are transformed to numeric values using one-hot encoding.
- Dates are treated as character variables, so they are also one-hot encoded.
- The size of the input data is not limited, but many categories in character or date variables rapidly increase the dimensionality.
- Rows that contain missing values in any of their columns are dropped.
- Column of integer values starting with 1, where each number identifies the cluster assigned to a record (row) by the algorithm. Data points that do not belong to any cluster are considered noise (or outliers) and are assigned the value -1.
- Rows that were dropped from the input data because they contain a missing value have a missing value instead of an assigned cluster number.
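The input handling and the shape of the result described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names preprocess and align_labels and the sample rows are hypothetical, rows with missing values are assumed to be dropped before scaling statistics are computed, and the population standard deviation is assumed for scaling. Dates would be passed as strings in categorical_cols, since they are treated as character variables.

```python
from statistics import mean, pstdev

def preprocess(rows, numeric_cols, categorical_cols):
    """Return (kept_indices, feature_vectors) for rows given as dicts."""
    cols = numeric_cols + categorical_cols
    # Drop rows that have a missing value (None) in any column.
    kept = [i for i, r in enumerate(rows)
            if all(r[c] is not None for c in cols)]
    data = [rows[i] for i in kept]
    features = [[] for _ in data]
    # Numeric variables: scale to zero mean and unit variance.
    for c in numeric_cols:
        values = [r[c] for r in data]
        mu, sigma = mean(values), pstdev(values)
        for f, v in zip(features, values):
            f.append((v - mu) / sigma if sigma else 0.0)
    # Character (and date) variables: one-hot encode, one column per category.
    for c in categorical_cols:
        categories = sorted({r[c] for r in data})
        for f, r in zip(features, data):
            f.extend(1.0 if r[c] == cat else 0.0 for cat in categories)
    return kept, features

def align_labels(n_rows, kept, labels):
    """Put cluster labels back on the original rows; dropped rows stay None."""
    out = [None] * n_rows
    for i, label in zip(kept, labels):
        out[i] = label
    return out

rows = [
    {"sales": 10.0, "region": "East"},
    {"sales": 20.0, "region": "West"},
    {"sales": None, "region": "East"},   # dropped: missing value
    {"sales": 30.0, "region": "North"},
]
kept, features = preprocess(rows, ["sales"], ["region"])
print(kept)              # [0, 1, 3]
print(len(features[0]))  # 4: one scaled column plus three one-hot columns
# Labels as a clustering step might return them for the three kept rows.
result = align_labels(len(rows), kept, [1, 1, -1])
print(result)            # [1, 1, None, -1]
```

Each distinct category adds one feature column, which is why variables with many categories inflate the dimensionality so quickly.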
Key usage points
- Automatically estimates the optimum number of clusters, which can be controlled with the min_samples and eps parameters.
- Not all data points are assigned to a cluster. Data points that do not belong to any cluster are considered as noise (or outliers).
- Clusters can be of any shape, but should be of similar density.
- Can find clusters completely surrounded by different clusters.
- Robust towards outliers (noise).
- Sensitive to the order of the data.
- Does not work well if clusters vary in their density.
- Does not scale well with the number of records, and memory usage is inefficient.
- Results are very sensitive to min_samples and eps parameters.
- Suffers from the 'curse of dimensionality', which may produce misleading results when the number of variables is high.
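The sensitivity to eps can be seen even without running a full clustering, by counting how many points qualify as core points for different eps values. The sketch below is illustrative only: count_core_points and the sample points are hypothetical, and a point is assumed to count as its own neighbor.

```python
from math import dist  # Euclidean distance, Python 3.8+

def count_core_points(points, eps, min_samples):
    """Count points with at least min_samples points (itself included) within eps."""
    return sum(
        1 for p in points
        if sum(1 for q in points if dist(p, q) <= eps) >= min_samples
    )

# Four evenly spaced points and one distant point (made-up data).
points = [(0.0, 0.0), (0.3, 0.0), (0.6, 0.0), (0.9, 0.0), (3.0, 0.0)]
for eps in (0.2, 0.35, 0.7):
    print(eps, count_core_points(points, eps, min_samples=3))
# 0.2 0
# 0.35 2
# 0.7 4
```

With eps = 0.2 no point is a core point, so every record would be labeled as noise; a slightly larger eps is enough to produce a cluster.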
For the whole list of algorithms, see Data science built-in algorithms.