# DCPY.DBSCANCLUST(min_samples, eps, columns)

The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm views clusters as high-density areas separated by low-density areas. It finds core samples and expands clusters from them. Euclidean distance is used to measure the distance between data points.
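The core-sample and expansion idea can be illustrated with a minimal pure-Python sketch. This is not the implementation behind DCPY.DBSCANCLUST; the function names `region_query` and `dbscan` are hypothetical, and the sketch assumes that a point counts itself as a member of its own neighborhood. Cluster numbers start at 1 and noise is labeled -1, matching the result format described on this page.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, min_samples=5, eps=0.5):
    """Label each point with a cluster number (1, 2, ...) or NOISE (-1)."""
    labels = [None] * len(points)      # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue               # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels

# Two tight groups of four points and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, 0.0),
]
labels = dbscan(points, min_samples=3, eps=0.5)
print(labels)  # [1, 1, 1, 1, 2, 2, 2, 2, -1]
```

The isolated point has only itself in its neighborhood, so it never reaches `min_samples` and stays labeled as noise.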

###### Parameters

min_samples – The number of data points in a neighborhood for a point to be considered as a core point, integer (default 5).

eps – Maximum distance between two samples for them to be considered part of the same neighborhood, float (default 0.5).

columns – Dataset columns or custom calculations.

Example: DCPY.DBSCANCLUST(5, 0.5, sum([Gross Sales]), sum([No of customers])) used as a calculation for the Color field of the Scatterplot visualization.

###### Input data

- Numeric variables are automatically scaled to zero mean and unit variance.
- Character variables are transformed to numeric values using one-hot encoding.
- Dates are treated as character variables, so they are also one-hot encoded.
- The size of the input data is not limited, but many categories in character or date variables rapidly increase the dimensionality.
- Rows that contain missing values in any of their columns are dropped.
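The preparation steps listed above can be sketched roughly as follows. This is an illustration only, not the function's actual code: the helper `prepare`, the column names, and the sample rows are hypothetical, missing values are assumed to be represented as `None`, and the population standard deviation is assumed for scaling.

```python
import statistics

def prepare(rows, numeric_cols, categorical_cols):
    """Drop incomplete rows, scale numeric columns, one-hot encode the rest.

    Returns (kept_indices, feature_vectors).
    """
    # Keep only rows with no missing value in any column.
    kept = [i for i, row in enumerate(rows)
            if all(value is not None for value in row.values())]
    complete = [rows[i] for i in kept]

    # Mean and standard deviation per numeric column (population std assumed).
    stats = {}
    for col in numeric_cols:
        values = [row[col] for row in complete]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values) or 1.0  # avoid division by zero
        stats[col] = (mean, std)

    # One output column per distinct category value.
    categories = {col: sorted({row[col] for row in complete})
                  for col in categorical_cols}

    features = []
    for row in complete:
        vec = [(row[col] - stats[col][0]) / stats[col][1] for col in numeric_cols]
        for col in categorical_cols:
            vec.extend(1.0 if row[col] == cat else 0.0 for cat in categories[col])
        features.append(vec)
    return kept, features

rows = [
    {"sales": 10.0, "region": "North"},
    {"sales": 20.0, "region": "South"},
    {"sales": None, "region": "North"},   # dropped: missing value
    {"sales": 30.0, "region": "North"},
]
kept, features = prepare(rows, ["sales"], ["region"])
print(kept)              # [0, 1, 3]
print(len(features[0]))  # 3: one scaled numeric column plus two one-hot columns
```

Each distinct category adds one feature column, which is why character or date variables with many categories inflate the dimensionality so quickly.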

###### Result

- Column of integer values starting with 1, where each number corresponds to a cluster assigned to each record (row) by the algorithm. Data points that do not belong to any cluster are considered noise (or outliers) and are assigned **-1**.
- Rows that were dropped from the input data because they contain a missing value have a missing value instead of an assigned cluster number.
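As a rough illustration of the result format, the sketch below places cluster labels back at their original row positions and leaves a missing value for dropped rows. The helper `align_labels` and the sample values are hypothetical, and `None` is assumed to stand in for a missing value.

```python
def align_labels(n_rows, kept_indices, labels):
    """Return one value per original row: a cluster label, or None if the row was dropped."""
    result = [None] * n_rows
    for index, label in zip(kept_indices, labels):
        result[index] = label
    return result

# Five input rows; row 2 contained a missing value and was dropped before clustering.
column = align_labels(5, [0, 1, 3, 4], [1, 1, -1, 2])
print(column)  # [1, 1, None, -1, 2]

noise_rows = [i for i, label in enumerate(column) if label == -1]
print(noise_rows)  # [3]
```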

###### Key usage points

- Automatically estimates the optimum number of clusters, which can be controlled with the *min_samples* and *eps* parameters.
- Not all data points are assigned to a cluster. Data points that do not belong to any cluster are considered noise (or outliers).
- Clusters can be of any shape, but should be of similar density.
- Can find clusters completely surrounded by different clusters.
- Robust towards outliers (noise).
- Sensitive to the order of the data.
- Does not work well if clusters vary in their density.
- Does not scale well with the number of records and uses memory inefficiently.
- Results are very sensitive to the *min_samples* and *eps* parameters.
- Suffers from the 'curse of dimensionality', which may lead to misleading results when the number of variables is high.
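The sensitivity to *eps* can be seen on a tiny one-dimensional dataset. The sketch below is illustrative only: `cluster_count` is a hypothetical helper that counts clusters as connected groups of core points, again assuming that a point counts itself in its own neighborhood.

```python
import math

def cluster_count(points, min_samples, eps):
    """Count clusters as connected groups of core points (hypothetical helper)."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_samples for nb in neighbors]
    seen = set()
    clusters = 0
    for i in range(n):
        if not core[i] or i in seen:
            continue
        clusters += 1                  # unvisited core point: a new cluster
        seen.add(i)
        stack = [i]
        while stack:
            k = stack.pop()
            for j in neighbors[k]:
                if core[j] and j not in seen:
                    seen.add(j)        # core points within eps share a cluster
                    stack.append(j)
    return clusters

# Two groups of three points, separated by a gap of 0.6.
points = [(0.0,), (0.2,), (0.4,), (1.0,), (1.2,), (1.4,)]
counts = [(eps, cluster_count(points, min_samples=3, eps=eps))
          for eps in (0.1, 0.3, 0.7)]
print(counts)  # [(0.1, 0), (0.3, 2), (0.7, 1)]
```

With the same data and the same *min_samples*, a small *eps* marks every point as noise, a moderate one finds two clusters, and a slightly larger one merges them into a single cluster.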

For the whole list of algorithms, see Data science built-in algorithms.
