Local Correlation Integral (LOCI) - Outlier Detection Algorithm
Local Correlation Integral (LOCI) is a density based approach for outlier analysis. It is local in nature, i.e., uses only nearby data points in terms of distance to compute density of a point. In this algorithm we have one tunable parameter - $latex \delta&s=2 $. Personally, I believe that we need to tune $latex k&s=2 $ also according to data distribution. LOCI works with following steps
- Compute density, $latex M(X,\epsilon)&s=2 $ of data point $latex X&s=2 $ as the number of neighbours within distance $latex \epsilon&s=2 $. Here, density is known as counting neighbourhood of data point $latex X&s=2 $
$latex
M(X,\epsilon) = COUNT_{(Y:dist(X,Y) \leq \epsilon; Y \in datapoints )} Y&s=2$ - Compute average density, $latex AM(X,\epsilon,\delta)&s=2 $ of data point $latex X&s=2 $ as the MEAN(density of neighbours of $latex X&s=2$ within distance, $latex \delta&s=2$). Here, $latex \delta&s=2$ is known as sampling neighbourhood of $latex X&s=2$
$latex AM(X,\epsilon,\delta) = MEAN_{(Y:dist(X,Y) \leq \delta)} M(Y,\epsilon)&s=2$The value of $latex \epsilon&s=2$ is always set to be half of $latex \delta&s=2$ in order to enable fast approximation. Therefore, we need to tune $latex \delta&s=2$ for accuracy without touching $latex \epsilon&s=2$
- Compute Multi-Granularity Deviation Factor (MDEF) at distance, $latex \delta&s=2$ as $latex {MDEF}(X,\epsilon,\delta) = \frac{AM(X,\epsilon,\delta) - M(X,\epsilon)}{AM(X,\epsilon,\delta)}&s=3 $
This factor shows the deviation of $latex M&s=2$ from $latex AM&s=2$ for $latex X&s=2$. Since this computation only considers local/neighbour points, therefore LOCI is referred as local in nature. The larger the value of MDEF, the greater is the outlier score. We use multiple values of $latex \delta&s=2$ to compute MDEF. Mostly we start with a radius containing 20 points to a maximum of radius spanning most of data.
- In this step, the deviation of $latex M&s=2$ from $latex AM&s=2$ is converted into binary label, i.e., whether $latex X&s=2$ is outlier or not. For this, we use $latex \sigma(X,\epsilon,\delta)&s=2$ metric as
$latex \begin{aligned}{\sigma}(X,\epsilon,\delta) = \frac{STD_{(Y:dist(X,Y) \leq \delta)}M(Y,\epsilon)}{AM(X,\epsilon,\delta)}&s=2 \end{aligned}$
Here, STD refers to standard deviation. - A data point,$latex X&s=2$ is declared as an outlier if its MDEF value is greater than $latex k. \sigma(X,\epsilon,\delta)&s=2 $, where $latex k&s=2$ is chosen to be 3.
Reference:
- I have understood this algorithm from book: Outlier Analysis by Charu Aggarwal