DCPY.ENVELOPE(contamination, columns)
Elliptic Envelope is a multivariate outlier detection technique that strongly assumes the underlying data follow a Gaussian distribution. Under this assumption, it uses a robust covariance estimate of the data to identify the samples that lie far from the bulk of the distribution as outliers.
Parameters
contamination – Approximate proportion of outliers in the dataset, used as the threshold for the decision function. Float in the interval (0;1), default 0.1.
columns – Dataset columns or custom calculations.
Example: DCPY.ENVELOPE(0.1, sum([Gross Sales]), sum([No of customers])) used as a calculation for the Color field of the Scatterplot visualization.
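To illustrate how the contamination parameter acts as a threshold, here is a minimal sketch in Python. It assumes scikit-learn's EllipticEnvelope as a stand-in for the technique; this page does not state how DCPY.ENVELOPE is implemented internally, and the data below are synthetic, hypothetical values.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two aggregated columns (e.g. sales and customer counts).
inliers = rng.normal(loc=[100.0, 20.0], scale=[10.0, 3.0], size=(195, 2))
# Five points planted far away from the Gaussian cloud.
planted = np.array([
    [200.0, 5.0], [10.0, 45.0], [190.0, 40.0], [15.0, 2.0], [210.0, 21.0],
])
X = np.vstack([inliers, planted])

# contamination=0.1 asks the model to flag roughly 10% of the rows as outliers.
model = EllipticEnvelope(contamination=0.1, random_state=0)
labels = model.fit_predict(X)  # 1 = inlier, -1 = outlier

n_outliers = int((labels == -1).sum())
planted_labels = labels[-5:]
print(n_outliers, planted_labels)
```

Because the threshold is a proportion, about 10% of the rows are flagged even if fewer are truly anomalous, so contamination is best set close to the share of outliers you actually expect.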
Input data
Numeric variables are automatically scaled to zero mean and unit variance.
Character variables are transformed to numeric values using one-hot encoding.
Dates are treated as character variables, so they are also one-hot encoded.
The size of the input data is not limited, but character or date variables with many categories rapidly increase the dimensionality.
Rows that contain missing values in any of their columns are dropped.
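The preparation steps above can be sketched with pandas as follows. This is only an approximation under stated assumptions: the column names and values are hypothetical, and the exact preprocessing performed by the product may differ in detail (for example, in the order of the steps).

```python
import pandas as pd

# Hypothetical input with a numeric, a character, and a date column.
df = pd.DataFrame({
    "sales": [120.0, 95.0, None, 130.0, 110.0],
    "region": ["North", "South", "North", None, "East"],
    "order_date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-03", "2024-01-02"]
    ),
})

# 1. Drop rows with a missing value in any column.
complete = df.dropna()

# 2. Scale numeric columns to zero mean and unit variance (population std).
numeric = complete[["sales"]]
scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)

# 3. Treat dates as character values, then one-hot encode the non-numeric columns.
categorical = complete[["region", "order_date"]].astype(str)
encoded = pd.get_dummies(categorical)

features = pd.concat([scaled, encoded], axis=1)
print(features.shape)
```

Here one numeric column and two non-numeric columns become six feature columns for only three remaining rows, which shows how quickly one-hot encoding can inflate the dimensionality.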
Result
- Column of values: 1 for an inlier and -1 for an outlier.
- Rows that were dropped from the input data because they contain missing values receive a missing value instead of an inlier/outlier label.
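The shape of the result can be sketched as follows: labels are computed on the complete rows only and then aligned back to the original rows. This again assumes scikit-learn's EllipticEnvelope and pandas as stand-ins, with hypothetical column names and synthetic data.

```python
import numpy as np
import pandas as pd
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(1)

# Hypothetical numeric input; two rows are given a missing value.
df = pd.DataFrame(rng.normal(size=(100, 2)), columns=["x", "y"])
df.loc[3, "x"] = np.nan
df.loc[7, "y"] = np.nan

# Fit and label only the rows without missing values.
complete = df.dropna()
labels = EllipticEnvelope(contamination=0.1, random_state=0).fit_predict(complete)

# Reindex to the original rows: dropped rows get NaN instead of 1 / -1.
result = pd.Series(labels, index=complete.index).reindex(df.index)
print(result.head(10))
```

The output has one value per input row, so it can be used directly alongside the original data.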
Key usage points
The data needs to be Gaussian distributed; otherwise, the method loses reliability.
Works well when the dataset does not contain many variables.
For the full list of algorithms, see Data science built-in algorithms.