
Chapter 1

1.3 Basic Outlier Detection Models

Z-value test

Consider a set of 1-dimensional quantitative data observations $$X_1 \ldots X_N$$ with mean $$\mu$$ and standard deviation $$\sigma$$. The Z-value of the data point $$X_i$$ is defined as $$Z_i = \frac{|X_i - \mu|}{\sigma}$$

  • Enough Data In cases where the mean and standard deviation of the distribution can be accurately estimated, a good “rule-of-thumb” is to use $$Z_i \ge 3$$ as a proxy for the anomaly (see the sketch after this list).
  • Few Data at Hand However, in scenarios in which very few samples are available, the mean and standard deviation of the underlying distribution cannot be estimated robustly. In such cases, the results from the Z-value test need to be interpreted more carefully with the use of the (related) Student’s t-distribution rather than a normal distribution. This issue will be discussed in Chapter 2.
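
A minimal sketch of the Z-value test with numpy (an illustration with invented data, not the book's code); the threshold of 3 is the rule of thumb above.

```python
import numpy as np

def z_value_outliers(x, threshold=3.0):
    """Return the Z-values and a boolean mask of points flagged as outliers."""
    mu, sigma = x.mean(), x.std()
    z = np.abs(x - mu) / sigma          # Z_i = |X_i - mu| / sigma
    return z, z >= threshold            # rule of thumb: Z_i >= 3

x = np.array([1.2, 0.9, 1.1, 1.0, 0.8, 9.5])
z, is_outlier = z_value_outliers(x)
print(z.round(2), is_outlier)           # the last value should be flagged
```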

1.3.1 Feature Selection in Outlier Detection

Kurtosis measure

A common way of measuring the non-uniformity of a set of univariate points is the Kurtosis measure. The values are first standardized: $$z_i = \frac{|x_i -\mu|}{\sigma}$$

$$K(z_1 \ldots z_N) = \frac{\sum_{i=1}^{N} z_i^4}{N}$$

Feature distributions that are very non-uniform show a high level of Kurtosis.
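
A minimal sketch of using the Kurtosis measure to rank features (illustrative; the heavy-tailed third column is an invented example):

```python
import numpy as np

def kurtosis_score(col):
    z = (col - col.mean()) / col.std()   # standardize the feature values
    return np.mean(z ** 4)               # K = sum(z_i^4) / N

X = np.random.randn(1000, 3)
X[:, 2] = np.random.standard_t(df=3, size=1000)   # heavier-tailed (non-uniform) feature
print([round(kurtosis_score(X[:, j]), 2) for j in range(3)])
# the third feature should show a clearly higher Kurtosis value
```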

1.3.2 Extreme-Value Analysis

One dimensional data

The most basic form of outlier detection is extreme-value analysis of 1-dimensional data. These are very specific types of outliers in which it is assumed that the values that are either too large or too small are outliers.

Multivariate data

Although extreme-value analysis is naturally designed for univariate (one-dimensional) data, it is also possible to generalize it to multivariate data, by determining the points at the multidimensional outskirts of the data.

1.3.3 Probabilistic and Statistical Models

In probabilistic and statistical models, the data is modeled in the form of a closed-form probability distribution, and the parameters of this model are learned. Thus, the key assumption here is about the specific choice of the data distribution with which the modeling is performed.

advantage

A major advantage of probabilistic models is that they can be easily applied to virtually any data type (or mixed data type), as long as an appropriate generative model is available for each mixture component.

drawback

A drawback of probabilistic models is that they try to fit the data to a particular kind of distribution, which may sometimes not be appropriate.

1.3.4 Linear Models

These methods model the data along lower-dimensional subspaces with the use of linear correlations.

1.3.5 Proximity-Based Models

The idea in proximity-based methods is to model outliers as points that are isolated from the remaining data on the basis of similarity or distance functions.

Proximity-based methods may be applied in one of three ways: clustering methods, density-based methods, and nearest-neighbor methods.

clustering and density-based method

The main difference between clustering and density-based methods is that clustering methods segment the data points, whereas the density-based methods such as histograms segment the data space

nearest-neighbor

In nearest-neighbor methods [317, 456], the distance of each data point to its kth nearest neighbor is reported as its outlier score

By selecting a value of k > 1, small groups of tightly-knit points that are far away from the remaining data set can be identified and scored as outliers.

time complexity

As a result, the binary version of outlier detection often allows faster algorithms than the score-wise version of the problem. The latter is always quadratically related to the number of points in computational complexity.

clustering methods

Therefore, it is usually advisable to cluster the data multiple times and average the scores obtained from the different runs [184, 406]. The results of such an approach are often surprisingly robust.

1.3.7 High-Dimensional Outlier Detection

The data becomes increasingly sparse, and all pairs of data points become almost equidistant from one another. As a result, the outlier scores become less distinguishable from one another.

subspace outlier detection

In such cases, outliers are best emphasized in a lower-dimensional local subspace of relevant attributes. This approach is referred to as subspace outlier detection [4], which is an important class of algorithms in the field of outlier analysis.

Assumption

outliers are often hidden in the unusual local behavior of low-dimensional subspaces, and this deviant behavior is masked by full-dimensional analysis.

1.4 Outlier Ensembles

Such meta-algorithms combine the outputs of multiple algorithms and are referred to as ensembles.

1.4.1 Sequential Ensembles

In the first-phase, an outlier detection algorithm is used in order to remove the obvious outliers. In the second phase, a more robust normal model is constructed after removing these obvious outliers. Thus, the outlier analysis in the second stage is more accurate because many of the training points that contaminate the model of normal data have been removed.

1.4.2 Independent Ensembles

The broad principle of independent ensembles is that different algorithms might perform better on different parts of the data. Independent ensembles are used frequently for high-dimensional outlier detection, because they enable the exploration of different subspaces of the data in which different types of deviants may be found.

1.7 Outlier Evaluation Techniques

external validity measures

This trade-off can be measured in terms of precision and recall, which are commonly used for measuring the effectiveness of set-based retrieval.


Chapter 2

2.3.4 Distance Distribution-based Techniques: The Mahalanobis Method

The Mahalanobis distance of a point is similar to its Euclidean distance from the centroid of the data, except that it normalizes the data on the basis of the inter-attribute correlations.
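
A minimal sketch of the Mahalanobis method with numpy (the correlated example data are invented for illustration): each point is scored by its distance from the centroid after normalizing with the inverse covariance matrix.

```python
import numpy as np

def mahalanobis_scores(X):
    mu = X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))   # inverse of the covariance matrix
    diff = X - mu
    # sqrt of (X_i - mu) Sigma^{-1} (X_i - mu)^T for every row i
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

X = np.vstack([np.random.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], 300),
               [[2.0, -2.0]]])                          # a point that breaks the correlation
print(mahalanobis_scores(X).argmax())                   # should be the last index (300)
```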

2.4 Probabilistic Mixture Modeling for Outlier Analysis


Chapter 3 Linear Models for Outlier Detection

The main assumption in linear models is that the (normal) data is embedded in a lower-dimensional subspace. Data points that do not naturally fit this embedding model are, therefore, regarded as outliers.

statistical regression modeling

The first class of models uses statistical regression modeling between dependent and independent variables in order to determine specific types of dependencies in the data. Such forms of modeling are more useful when some of the attributes are naturally predictive of others. For example, it is natural to predict the last value of a time-series based on a window of previous history of values in the time-series. In such cases, dependency-oriented regression modeling is leveraged by creating a derived data set of dependent and independent variables and quantifying deviations in the observed value of the dependent variable from its predicted value. Even for multidimensional data, we will show in section 7.7 of Chapter 7 that dependency-oriented models for regression can be used to decompose the unsupervised outlier detection problem into d different regression modeling problems for d-dimensional data.

principal component analysis

The second class of models uses principal component analysis to determine the lower-dimensional subspaces of projection. These models are useful for traditional multidimensional data because all attributes are treated in a homogeneous way with no distinction between dependent and independent variables.

3.2 Linear Regression Models

The basic idea here is that the (normal) data are assumed to lie on a lower-dimensional hyperplane in the feature space. The normalized deviation of a point from this lower-dimensional hyperplane (in a direction perpendicular to the hyperplane) is used to compute the outlier score

3.2.1 Modeling with Dependent Variables (Quite Useless)

$$y$$ is dependent on $$x_1, \ldots, x_d$$; one variable is treated as special (the dependent variable).

3.2.2 Linear Modeling with Mean-Squared Projection Error

A more general form of regression modeling is one in which all variables are treated in a similar way, and the optimal regression plane is determined to minimize the projection error of the data to the plane.

  1. It is assumed that the data is mean-centered, and therefore the constant term $$w_{d+1}$$ is missing in this case. Each variable is associated with a coefficient, and the “special” (dependent) variable y is also missing: $$w_1\cdot x_1+\ldots+w_d\cdot x_d = 0$$
  2. Therefore, we need to allow for an error value $$\epsilon_j$$: $$\overline{W}\cdot \overline{X_j}=\epsilon_j$$ for each of the N instances corresponding to the d-dimensional data points.
  3. Minimize $$\sum_{j=1}^N\epsilon_j^2$$ subject to the normalization constraint $$||\overline{W}||^2=\sum_{i=1}^dw_i^2=1$$ (see the sketch after this list).
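
A minimal sketch of this minimization (an illustration under the stated constraint, not the book's code): the optimal $$\overline{W}$$ is the eigenvector of the covariance matrix with the smallest eigenvalue, and the squared projection of each mean-centered point onto it is its error $$\epsilon_j^2$$.

```python
import numpy as np

X = np.random.randn(500, 3)
X[:, 2] = 0.5 * X[:, 0] - 0.2 * X[:, 1] + 0.05 * np.random.randn(500)   # near-planar data

Xc = X - X.mean(axis=0)                                       # mean-center the data
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))   # eigenvalues in ascending order
w = eigvecs[:, 0]                                             # smallest-variance direction = optimal W

eps_sq = (Xc @ w) ** 2                     # squared deviation from the optimal hyperplane
print(eps_sq.argsort()[-5:])               # indices of the five strongest outliers
```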

3.3 PCA

  • Hard way: In the aforementioned method, we remove the k largest eigenvectors in a hard way and compute a weighted sum of the squared distances along the remaining (d − k) directions as the outlier score.
  • Soft approach: A simpler special case would be to use a soft approach that weights the distances along all the different eigenvectors, rather than selecting a particular set of eigenvectors in a hard way. Algorithms for computing the weights are in the book.

3.3.2 HARD versus SOFT PCA (Mahalanobis method is better)

The Mahalanobis method can be viewed as a type of soft PCA in which the principal components are weighted rather than pre-selected. The author thinks the Mahalanobis method is better.

3.3.8 Extension to Nonlinear Data Distributions

Use kernel tricks to apply PCA in the nonlinear setting.

  • explicit polynomial transformations of the data.
  • implicitly, with the use of carefully designed N × N similarity matrices between all pairs of data points (kernel trick)
  1. Construct the positive semi-definite N × N kernel similarity matrix S using an off-the-shelf kernel function or other similarity matrix computation (e.g., the sparsified and normalized similarity matrix used in spectral methods [378]).
  2. Diagonalize the matrix $$S = Q\Lambda^2 Q^T$$ and set the new N × k embedding D′ to the first k columns of QΛ corresponding to the largest eigenvalues. The default assumption is to set k so that all nonzero eigenvalues of S are included. Note that k can be larger than d in the nonlinear setting.
  3. Standardize each column of D′ to unit variance by dividing it with its standard deviation.
  4. For each row of D′, report its (squared) Euclidean distance from the centroid of D′ as its outlier score.
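
A minimal sketch of the four steps above, assuming a Gaussian kernel whose bandwidth is set to the median pairwise distance (a common heuristic; the example data are invented):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def kernel_mahalanobis_scores(X, eps=1e-10):
    d = squareform(pdist(X))                     # pairwise Euclidean distances
    sigma = np.median(d[d > 0])                  # bandwidth heuristic
    S = np.exp(-(d ** 2) / (2 * sigma ** 2))     # step 1: N x N kernel similarity matrix

    lam, Q = np.linalg.eigh(S)                   # step 2: diagonalize S = Q Lambda^2 Q^T
    keep = lam > eps                             # retain all nonzero eigenvalues
    D_embed = Q[:, keep] * np.sqrt(lam[keep])    # embedding D' = Q Lambda

    D_embed = D_embed / D_embed.std(axis=0)      # step 3: scale each column to unit variance
    centroid = D_embed.mean(axis=0)
    return ((D_embed - centroid) ** 2).sum(axis=1)   # step 4: squared distance to the centroid

X = np.vstack([np.random.randn(100, 2), [[6.0, 6.0]]])
print(kernel_mahalanobis_scores(X).argmax())     # the isolated point (index 100) should score highest
```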

3.4 One-Class Support Vector Machines

Very sensitive to parameters

Both these problems can be partially solved using variable subsampling [32], which works particularly well in the context of unpredictable data size-sensitive parameters. The basic idea is to repeatedly sample a variable number of training points from the data, which vary between nmin = 50 and nmax = 1000.

3.4.3 Connections to Support Vector Data Description and Other Kernel Models

The support-vector data description (SVDD) [539] is a different kernel SVM method in which data is enclosed in a hypersphere of radius R in the transformed feature space (rather than a linear separator from the origin). On the other hand, the SVDD approach performs poorly with polynomial kernels because the data tends to be elongated in particular directions.

A key difference between the kernel Mahalanobis method and other one-class methods is the use of feature-space scaling of all transformed dimensions to unit variance in the former. This type of normalization is especially helpful because it prevents high-variance directions from masking the outliers.

In fact, the use of a circular separator (like SVDD) on the normalized kernel representation works reasonably well with a polynomial kernel, which is generally known to work poorly with SVDD.

Among all these models, the kernel Mahalanobis method has the advantage of requiring the least number of free parameters (which only correspond to the kernel setting).

The Mahalanobis method is a soft approach that focuses on finding scores and (appropriately) leaves the hard labeling to the end, when more insight is available on the score distribution.

Chapter 4 Proximity-Based Outlier Detection

Three categories: cluster-based, distance-based, and density-based.

4.2 Clusters and Outliers: The Complementary Relationship

A well-known complementary relationship exists between clustering and outlier detection. A simplistic view would be that every data point is either a member of a cluster or an outlier.

However, it is important to understand that using only the complementary relationship of points to clusters in order to define outliers results in the discovery of weak outliers or noise.

Outlier score: distances of data points to cluster centroids.

A simple definition of the outlier score may be constructed by using the distances of data points to cluster centroids. Specifically, the distance of a data point to its closest cluster centroid may be used as a proxy for the outlier score of a data point. Since clusters may be of different shapes and orientations, an excellent distance measure to use is the Mahalanobis distance, which scales the distance values by local cluster variances along the directions of correlation.

Mahalanobis distance

It is noteworthy that the use of the Mahalanobis distance achieves similar goals of local normalization as achieved by some other local density-based methods discussed later in this chapter (like LOF and LOCI).

cluster cardinality

In addition to distance-based criteria, it is common to use cluster cardinality as a component of the outlier score. For example, the negative logarithm of the fraction of points in the nearest cluster can be used as a component of the outlier score. Incorporation of cluster cardinality into the scores is helpful in distinguishing small groups of clustered outliers from normal points occurring in larger clusters. The identification of clustered anomalies can be achieved even more effectively by using a minimum threshold on the number of data points in each cluster.
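
A sketch combining the two ingredients above (an illustration, not the book's exact algorithm), assuming scikit-learn's k-means and that each cluster has enough members to estimate a covariance matrix: the local Mahalanobis distance to the point's cluster centroid plus the negative log of the cluster's fraction of points.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_outlier_scores(X, n_clusters=4):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    labels, N = km.labels_, len(X)
    scores = np.zeros(N)
    for c in range(n_clusters):
        members = X[labels == c]
        mu = members.mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(members, rowvar=False))   # local covariance of the cluster
        diff = members - mu
        maha = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
        scores[labels == c] = maha - np.log(len(members) / N)     # distance + cardinality component
    return scores

X = np.vstack([np.random.randn(200, 2), np.random.randn(5, 2) + 8])
print(cluster_outlier_scores(X).argsort()[-5:])   # indices of the five highest-scoring points
```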

clustering methods much better

In general, clustering methods are much better than histogram-based methods in handling clustered anomalies because they partition the data points rather than the data space in a more flexible way.

4.2.1 Extensions to Arbitrarily Shaped Clusters

Although the Mahalanobis computation is effective for the case of elliptically shaped (Gaussian) clusters, it is not quite as effective for the case of clusters that are of arbitrary and non-convex shape.

global embeddings

As in the case of nonlinear PCA, we can use the largest eigenvectors of an N × N kernel similarity matrix $$S = [s_{ij}]$$ to embed the data into a multidimensional space in which Euclidean distance functions can be used effectively for clustering. The $$(i,j)th$$ entry of S is equal to the kernel similarity between the ith and jth data points. In this sense, the kernel similarity functions described in section 3.3.8.1 of Chapter 3 provide a good starting point. The Gaussian kernel is typically preferred and the bandwidth is set to the median distance between sampled pairs of data points. However, these methods are best suited to discovering global embeddings of the data for cases like Figure 4.3(a) in which the entire data set is arranged in a single distribution.

Local data distribution

These modifications, which are derived from the notion of spectral clustering [378], are those of similarity matrix sparsification and local normalization [297, 378]:

4.3 Distance-Based Outlier Analysis

Distance-based methods work with the natural assumption that the k-nearest neighbor distances of outlier data points are much larger than those of normal data points.

  • Advantages:
    1. Distance-based methods generally have a higher granularity of analysis as compared to clustering-based methods. This property of distance-based methods can enable a more refined ability to distinguish between weak and strong outliers in noisy data sets.
    2. Distance-based methods are also able to identify isolated clusters of closely related outliers.
  • Disadvantages: Time complexity
    1. However, if the outlier score of each data point is required, then the (vanilla version of the) algorithm requires operations proportional to $$N^2$$. In the binary decision version of identifying whether a data point is an outlier, it is possible to use various types of pruning and indexing structures to substantially speed up the approach.

4.3.1 Scoring Outputs for Distance-Based Methods

Exact k-Nearest Neighbor Score

The outlier score of any data point Xi is equal to its kth nearest neighbor distance to points in D − {Xi}.

The main problem with this definition is that it is difficult to know the “correct” value of k for any particular data point.

Average k-Nearest Neighbor Score

The outlier score of any data point Xi is equal to its average distance to its k nearest neighbors in D − {Xi}.

Correct k

However, in unsupervised problems like outlier detection, it is impossible to know the correct value of k for any particular algorithm, and an analyst might use a range of values of k. For example an analyst might test the algorithm with equally spaced values of k in [1,N/10].
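
A minimal sketch of the exact and average k-nearest-neighbor scores, assuming scikit-learn; k + 1 neighbors are queried so that each point is excluded from its own neighbor list (i.e., distances are with respect to D − {Xi}).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_scores(X, k=5):
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    exact = dist[:, k]                   # distance to the k-th nearest neighbor (self excluded)
    average = dist[:, 1:].mean(axis=1)   # average distance over the k nearest neighbors
    return exact, average

X = np.vstack([np.random.randn(300, 2), [[7.0, 7.0]]])
exact, average = knn_scores(X, k=5)
print(exact.argmax(), average.argmax())  # both should point to the isolated point (index 300)
```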

Harmonic k-Nearest Neighbor Score (Rarely used)

Repeated points must first be removed from the data set (otherwise the harmonic mean of the distances would always be zero).
Simply set k = N.

The main advantage with the harmonic average is that we now have a parameter-free detector by setting k = N, and the results are still of high quality.

Computing the scores of all the data points is generally computationally intensive.

A more effective approach is to pre-select a sample of the data points. All N data points are scored with respect to this sample after excluding the candidate point from the sample (if needed). The results can be averaged over various samples in order to improve the quality of the results with the ensemble [35]. This is especially the case for the less stable variants such as harmonic averaging.

4.3.2 Binary Outputs for Distance-Based Methods

More efficient

To determine outliers:

  1. Fraction Threshold-Based Distance Outliers (a fraction f of the objects in D lies at a distance greater than $$\beta$$ from the object O)
  2. Score Threshold-Based Distance Outliers (the exact kth-nearest neighbor distance is at least $$\beta$$)
  3. Rank Threshold-Based Distance Outliers

4.3.2.1 Cell-Based Pruning

4.3.2.2 Sampling-Based Pruning

4.3.2.3 Index-Based Pruning

4.3.3 Data-Dependent Similarity Measures

The k-nearest neighbor method can be paired with other types of similarity and distance functions. Unlike the Euclidean distance, data-dependent similarity measures depend not just on the pair of points at hand but also on the statistical distribution of the other points.

shared nearest neighbor similarity measure (locality-sensitive similarity measure)

Idea:

The basic idea here is that in a dense region, two points have to be very close in order to have a large number of common nearest neighbors, whereas in a sparse region it is possible for reasonably distant points to have a larger number of common neighbors.
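
A minimal sketch of the shared-nearest-neighbor similarity, assuming scikit-learn for the neighbor queries (a quadratic brute-force computation for illustration): the similarity of two points is the number of points they share among their k nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=10):
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    neighbors = [set(row[1:]) for row in idx]     # k nearest neighbors, excluding the point itself
    N = len(X)
    S = np.zeros((N, N), dtype=int)
    for i in range(N):
        for j in range(i + 1, N):
            S[i, j] = S[j, i] = len(neighbors[i] & neighbors[j])   # count of shared neighbors
    return S

X = np.random.randn(50, 2)
print(snn_similarity(X, k=10)[0, 1])              # shared-neighbor count between points 0 and 1
```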

Disadvantages:
  1. Two parameters to tune
  2. Time complexity high
Solution:

It is possible to compute this similarity in a robust and efficient way by repeatedly sampling the data, computing the measure, and averaging the measure over different samples. When computing the similarity with sampled data, the shared nearest neighbor distances are computed between all pairs of points (whether they are sampled or not), but only the number of shared neighbors within the sample are counted.

pairwise Mahalanobis distance (correlation-sensitive distance measure)

Idea:

This measure is equivalent to computing the Euclidean distance between pairs of points after transforming the data to uncorrelated directions with PCA and normalizing each dimension in the transformed data to unit variance.

Random forests (data-dependent similarity)

Random forests can be used to compute data-dependent similarity because of their ability [359] to define data locality in a distribution-dependent way.

The main advantage of this approach is that it is very efficient to compute the similarity between a pair of instances once the random forest has been constructed.

4.3.4 ODIN: A Reverse Nearest Neighbor Approach

Definition 4.3.7 A data point p is a reverse k-nearest neighbor of q, if and only if q is among the k-nearest neighbors of p.

The reverse nearest-neighbor method is also referred to as Outlier Detection using In-degree Number (ODIN)
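
A minimal sketch of ODIN, assuming scikit-learn: the score is derived from the in-degree of each point in the k-nearest-neighbor graph, so points that appear in few other points' neighbor lists are flagged.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def odin_scores(X, k=5):
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    in_degree = np.bincount(idx[:, 1:].ravel(), minlength=len(X))   # reverse k-NN counts
    return -in_degree                              # lower in-degree -> higher outlier score

X = np.vstack([np.random.randn(200, 2), [[6.0, 6.0]]])
scores = odin_scores(X)
print(scores[200], scores.max())                   # the isolated point should be at or near the maximum
```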

Disadvantages:

This approach requires the determination of all the k-nearest neighbors of each node. Furthermore, distance-based pruning is no longer possible since the nearest neighbors of each node need to be determined explicitly. Thus, the approach may potentially require $$O(N^2)$$ time for construction of the k-nearest neighbor graph. The other problem with the approach is that many data points might be tied in terms of the number of reverse nearest neighbors (outlier score).

4.3.5 Intensional Knowledge of Distance-Based Outliers

The idea is to explain the outlier behavior of objects in terms of subsets of attributes. Thus, in this case, a minimal bounding box on the subsets of attributes is presented in order to explain the outlier behavior of the data points.

Outliers that are defined by minimal combinations of attributes are generally considered stronger from an intensional perspective.

4.3.6 Discussion of Distance-Based Method

more detailed granularity

Distance-based methods have a number of qualitative advantages over clustering-based techniques because of the more detailed granularity of the analysis.

  1. For example, distance-based algorithms can distinguish between noise and anomalies much better than cluster-based techniques.
  2. distance-based methods can also find isolated groups of outliers just like clustering methods.
  3. clustering methods have the advantage that they can provide insights about the local distributions of data points for defining distances.

4.4 Density-Based Outliers

4.4.1 LOF: Local Outlier Factor

4.5 Limitations of Proximity-Based Detection

Curse of dimensionality


Chapter 5 High-Dimensional Outlier Detection

5.1 Introduction

A main cause of the dimensionality curse is the difficulty in defining the relevant locality of a point in the high-dimensional case.

In such cases, the outliers are lost in the random distributions within these views, when the distance measurements are performed in full dimensionality. This situation is often naturally magnified with increasing dimensionality.

locally relevant subspaces (Like warn & error)

In other words, the outliers are often embedded in locally relevant subspaces.

Too many features may mask the outliers.

--> Projected outlier Detection / Subspace outlier detection

In general, selecting a single relevant subspace for each data point can cause unpredictable results, and therefore it is important to combine the results from multiple subspaces. In other words, subspace outlier detection is inherently posed as an ensemble-centric problem.

  1. Rarity-based: subspaces are selected because specific points are rare (have abnormally low density) in them.
  2. Unbiased: subspaces are sampled without using the data distribution (e.g., feature bagging in section 5.2.3).
  3. Aggregation-based: aggregate statistics of the data (e.g., contrast or non-uniformity) are used only as hints to select subspaces (e.g., HiCS in section 5.2.7).

5.2 Axis-Parallel Subspaces

  • In one class of methods, points are examined one by one and their relevant outlying subspaces are identified; this is computationally expensive.
  • In the second class of methods, outliers are identified by building a subspace model up front (feature bagging, rotated bagging, subspace histograms, and isolation forests); this is sometimes more robust.

5.2.1 Genetic Algorithms for Outlier Detection

Subspace outliers are identified by finding localized regions of the data in low-dimensional space that have abnormally low density.

5.2.1.1 Defining Abnormal Lower-Dimensional Projections

An abnormal lower-dimensional projection is one in which the density of the data is exceptionally lower than average.

A grid-based approach is used in order to identify rarely populated local subspace regions.

5.2.1.2 Defining Genetic Operators for Subspace Search

Hard to search all

An exhaustive search of all the subspaces is impractical because of exponential computational complexity.

Genetic algorithms / evolutionary algorithms (Stupid)

5.2.3 Feature Bagging: A Subspace Sampling Perspective

Randomly choose r dimensions. Repeat several times to capture outlier scores in different subspaces. The scores need to be normalized before combining (see the sketch below).
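
A minimal sketch of feature bagging, using scikit-learn's LocalOutlierFactor as the base detector (the original method also uses LOF) and standardizing the scores before averaging; the subspace size is drawn between d/2 and d − 1, and the example data are invented.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def feature_bagging(X, n_rounds=20, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    total = np.zeros(N)
    for _ in range(n_rounds):
        r = rng.integers(d // 2, d)                     # subspace size in [d/2, d-1]
        dims = rng.choice(d, size=r, replace=False)     # random subset of dimensions
        lof = LocalOutlierFactor(n_neighbors=10)
        lof.fit(X[:, dims])
        s = -lof.negative_outlier_factor_               # higher = more outlying
        total += (s - s.mean()) / s.std()               # normalize before combining
    return total / n_rounds

X = np.vstack([np.random.randn(300, 6), np.full((1, 6), 5.0)])
print(feature_bagging(X).argmax())                      # expected: the isolated point (index 300)
```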

5.2.4 Projected Clustering Ensembles

Projected clustering methods define clusters as sets of points together with sets of dimensions in which these points cluster well.

  • Use a randomized projected clustering method like PROCLUS [5] on the data set to create a set of projected clusters.
  • Quantify the outlier score of each point based on its similarity to the cluster to which it belongs. Examples of relevant scores include the size, dimensionality, (projected) distance to cluster centroid, or a combination of these factors. The proper choice of measure is sensitive to the specific clustering algorithm that is used.

5.2.5 Subspace Histograms in Linear Time

A linear-time implementation of subspace histograms with hashing is provided in [476] and is referred to as RS-Hash. The basic idea is to repeatedly construct grid-based histograms on data samples of size s and combine the scores in an ensemble-centric approach.

5.2.6 Isolation Forests

An isolation forest is an ensemble combination of a set of isolation trees. In an isolation tree, the data is recursively partitioned with axis-parallel cuts at randomly chosen partition points in randomly selected attributes, so as to isolate the instances into nodes with fewer and fewer instances until the points are isolated into singleton nodes containing one instance.
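
Isolation forests are available off the shelf in scikit-learn; the sketch below (with invented example data) shows typical usage, where points that are isolated after only a few random splits receive the highest anomaly scores.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.vstack([np.random.randn(500, 3), [[6.0, 6.0, 6.0]]])
iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = -iso.score_samples(X)        # negate so that higher means more anomalous
print(scores.argmax())                # expected: the isolated point (index 500)
```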

5.2.7 Selecting High-Contrast Subspaces (HiCS)

Although the high-contrast subspaces are obtained using aggregation-based statistics, these statistics are only used as hints in order to identify multiple subspaces for greater robustness.

Assumption:

rare patterns are statistically more likely to occur in subspaces where there is significant non-uniformity and contrast.

Better than feature bagging method

Therefore, the only difference from feature bagging is in how the subspaces are selected; the algorithms are otherwise identical. It has been shown in [308] that this approach performs better than the feature bagging method.

Procedure

  1. The first step is to select discriminative subspaces using an Apriori-like [37] exploration, which is described at the end of this section. These are high-contrast subspaces. Furthermore, the subspaces are also pruned to account for redundancy among them.
    • An important part of this exploration is to be able to evaluate the quality of the candidate subspaces during exploration. This is achieved by quantifying the contrast of a subspace.
  2. Once the subspaces have been identified, exactly the same approach as the feature bagging method is used. The LOF algorithm is executed after projecting the data into these subspaces and the scores are combined as discussed in section 5.2.3.

Apriori-like exploration and the corresponding quantification of the contrast

The conditional probability P(x1|x2 ...xp) for an attribute value x1 is the same as its unconditional probability P(x1) for the case of uncorrelated data. High-contrast subspaces are likely to violate this assumption because of non-uniformity in data distribution. In our earlier example of the young Alzheimer patients, this corresponds to the unexpected rarity of the combination of youth and the disease. In other words P(Alzheimer = 1) is likely to be very different from P(Alzheimer = 1|Age ≤ 30). The idea is that subspaces with such unexpected non-uniformity are more likely to contain outliers, although it is treated only as a weak hint for pre-selection of one of multiple subspaces.

Summary

The HiCS technique is notable for the intuitive idea that statistical selection of relevant subspaces is more effective than choosing random subspaces. It is computationally intensive.

5.2.8 Local Selection of Subspace Projections

The work in [402] uses local statistical selection of relevant subspace projections in order to identify outliers. In other words, the selection of the subspace projections is optimized to specific data points, and therefore the locality of a given data point matters in the selection process.

The OUTRES method [402] examines the density of lower-dimensional subspaces in order to identify relevant projections. The basic hypothesis is that for a given data point X, it is desirable to determine subspaces in which the data is sufficiently non-uniformly distributed in its locality.

5.2.9 Distance-Based Reference Sets

Too bad, also another parameter to tune. XD This is because the approach completely ignores the interactions among various dimensions. Another problem is the use of a single subspace to score outliers.

5.3 Generalized Subspaces

These algorithms are generalizations of the following two classes of algorithms:

  • The PCA-based linear models discussed in Chapter 3 find the global regions of correlation in the data. For example, in the case of Figure 5.3, the outliers can be effectively identified by determining these global directions of correlation. However, no such global directions of correlation exist in Figure 5.4.
  • The axis-parallel subspace outliers discussed earlier in this chapter can find deviants, when the data is naturally aligned along low dimensional axis-parallel subspace clusters. However, this is not the case in Figure 5.4, in which the data is aligned along arbitrary directions of correlation.

The goal in generalized subspace analysis is to combine the ideas in these two types of algorithms. In other words, it is desired to determine the arbitrarily oriented subspaces simultaneously with outlier discovery.

5.3.1 Generalized Projected Clustering Approach

The method discussed in [7] has a built-in mechanism in order to determine the outliers in addition to the clusters. Such outliers are naturally data points that do not align with the clusters.

Clearly, methods are required for properly distinguishing between the strong and weak outliers by using an outlier scoring mechanism to distinguish between various points. The simplest approach is to compute the local Mahalanobis distance of every candidate outlier to each cluster centroid. The computation of the local Mahalanobis distance of a point to a cluster centroid uses only the mean and covariance matrix of that cluster.

To improve robustness, one should use randomized methods to cluster the data in multiple ways, and average the point-wise scores from the different models. It is very important to combine scores from multiple subspaces, because the individual subspaces discovered using subspace clustering do not tell us much about the relevant subspaces for outlier detection.

5.3.2 Leveraging Instance-Specific Reference Sets

The main difference from earlier generalized subspace clustering methods is that local reference sets are specific to the various data points, whereas clusters provide a fixed set of reference sets that are used to score all points.

Time complexity high!

5.3.3 Rotated Subspace Sampling (rotated bagging)

The basic idea in rotated bagging is to sample randomly rotated subspaces in lower-dimensional space, and score each point in this low-dimensional space. The scores from various subspaces can be combined to provide the final result. In particular, the approach uses subspaces of dimensionality $$r = 2 + \lceil \sqrt{d}/2 \rceil$$, which is much lower than the typical dimensionality of the subspaces used in feature bagging.
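
A minimal sketch of rotated bagging (an illustration, not the paper's exact procedure), using an average k-NN distance score as the base detector and scikit-learn for the neighbor queries; each round projects the data onto a random orthogonal basis of r directions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rotated_bagging(X, n_rounds=20, k=5, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    r = 2 + int(np.ceil(np.sqrt(d) / 2))                   # r = 2 + ceil(sqrt(d)/2)
    total = np.zeros(N)
    for _ in range(n_rounds):
        Q, _ = np.linalg.qr(rng.standard_normal((d, r)))   # random orthogonal r-dimensional basis
        proj = X @ Q                                       # rotated low-dimensional view of the data
        dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(proj).kneighbors(proj)
        s = dist[:, 1:].mean(axis=1)                       # average k-NN distance score
        total += (s - s.mean()) / s.std()                  # standardize before combining
    return total / n_rounds

X = np.vstack([np.random.randn(300, 10), np.full((1, 10), 4.0)])
print(rotated_bagging(X).argmax())                         # expected: the isolated point (index 300)
```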

5.3.4 Nonlinear Subspaces


Chapter 7

7.1 Introduction

The general recommendation for outlier analysis is to always use supervision where possible.

How is the supervised outlier detection problem different from classification?

  • Class imbalance
  • Contaminated normal class examples
  • Partial training information (semi-supervision or novel class detection)

7.2 Full Supervision: Rare Class Detection

Assumptions:

The proper evaluation mechanisms in these settings weigh the errors of anomalous instances differently from those of normal instances. The basic assumption is that it is more costly to misclassify anomalous instances as compared to normal instances.

Therefore, the model should optimize a cost-weighted variant of the accuracy. The underlying algorithms are also changed to incorporate these modified modeling assumptions.

  • Cost-sensitive learning The objective function of the classification algorithm is modified in order to weight the classification errors in a differential way for the normal classes and the rare classes. The typical assumption is that the misclassification of a rare class incurs higher costs. In many cases, this change in the objective function requires only minor changes to existing classification models.
  • Adaptive re-sampling The data are re-sampled so as to magnify the relative proportion of the rare classes. Such an approach can be considered an indirect form of cost-sensitive learning, since data re-sampling is equivalent to implicitly assuming higher costs for misclassification of rare classes. After all, the presence of a larger relative number of instances of a particular class in the sample (with respect to original data) tends to bias the prediction algorithms in favor of that class.

7.2.1 Cost-Sensitive Learning

Goal:

In cost-sensitive learning, the goal is to learn a classifier that maximizes the weighted accuracy over the different classes.

Objective function

$$J=\sum_{i=1}^kc_i\cdot n_i$$

  1. $$n_i$$ is the number of examples misclassified for the ith class.
  2. The choice of $$c_i$$ is regulated by application-specific requirements and is therefore a part of the input. However, if $$c_i$$ is set proportional to $$1/N_i$$, the aggregate impact of the instances of each class on the weighted misclassification rate is the same (see the sketch after this list).
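
A minimal sketch of cost-sensitive learning with scikit-learn: class weights proportional to $$1/N_i$$ are what class_weight='balanced' computes, so misclassifying the rare class is penalized much more heavily (the data below are invented).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (980, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 980 + [1] * 20)                      # 2% rare (anomalous) class

clf = LogisticRegression(class_weight='balanced').fit(X, y)   # c_i proportional to 1/N_i
print((clf.predict(X[y == 1]) == 1).mean())             # training recall on the rare class
```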

7.2.1.1 MetaCost: A Relabeling Approach (Complicated)

Idea:

In this method, the idea is to relabel some of the training instances in the data, by using the costs, so that normal training instances that have a reasonable probability of classifying to the rare class are relabeled to that rare class. This suggests that the effect of cost weighting can sometimes be overwhelmed by the erroneous skews in the probability estimation attained by bagging. In general, it is possible for some classification algorithms to work effectively with meta-cost under the following conditions:

  1. It is extremely important to use an unstable algorithm as the base classifier. Stable algorithms will lead to probability estimates that are close to 0 or 1.
  2. Even though the approach in [175] proposes the use of smaller bootstrapped samples solely for efficiency considerations, an unobserved side-effect is that it will reduce overlap among different samples. Reducing overlap among samples is helpful for reducing correlation among classifiers, which is likely to improve the probability estimation. Smaller samples will also lead to unstable classifiers.

7.2.1.2 Weighting Methods

Most classification algorithms can be modified in natural ways to account for costs. The primary driving force behind these modifications is to implicitly treat each training instance with a weight, where the weight of the instance corresponds to its misclassification cost.

7.2.2 Adaptive Re-sampling

Either the rare class can be oversampled, or the normal class can be under-sampled, or both types of sampling can be simultaneously executed.

Undersampling has several advantages over oversampling.

Synthetic Over-sampling: SMOTE
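
A minimal sketch of SMOTE-style over-sampling, assuming the imbalanced-learn (imblearn) package is available: synthetic rare-class points are interpolated between existing rare-class points and their nearest rare-class neighbors (the data below are invented).

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (980, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 980 + [1] * 20)

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)   # over-sample the rare class
print(np.bincount(y), np.bincount(y_res))                 # class counts before and after re-sampling
```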

7.2.3 Boosting Methods

7.3 Semi-Supervision: Positive and Unlabeled Data

Therefore, a simple solution is to use the sampled background collection as the unlabeled class for training, and simply tolerate the presence of positive contaminants.

Consider it as noise.

How to deal with unlabeled data

In the first class of methods, heuristics are used in order to identify (unlabeled) training examples that are negative (i.e., normal). Subsequently, a classifier is trained on the positive examples, together with the unlabeled examples that have been reliably predicted to be negative. A less common approach is to assign weights to the unlabeled training examples [348, 362]. The second case is a soft version of the first, because the first setting can be viewed as an approach in which binary weights are used.

Solution:

Therefore, it may be better to simply discard the negative examples and build the model only from the anomalous examples.

7.4 Semi-Supervision: Partially Observed Classes

7.4.2 One-Class Learning with Normal Examples

In general, when working with inductive learning methods for unsupervised outlier detection (like one-class SVMs), it is always a good practice to distinguish between training and test data (by cross-validation) even when they are not distinguished in the original data.

The main difference is that the training and test data are distinguished from one another, and the outlier score is computed for a test instance with respect to the training data.

7.5 Unsupervised Feature Engineering in Supervised Methods

Example

Gaussian distribution -> a linear SVM does not work because the classes are not linearly separable.

Mean distance

But if we use the distance from the mean as a feature, we get it right.

Kernels may cause overfitting

Overfitting is a common problem when there is a limited number of instances of any class; this is common in rare-class learning because of the paucity of anomalous examples. Kernel methods also cannot control the opaque representation created by the kernel function and are therefore unable to perform feature selection in a controlled way.

LOF -> local outliers; k-nearest neighbor -> global outliers

Redundant features that are highly correlated

a linear support-vector machine or logistic regressor with L1-regularization will automatically remove the effect of redundant or irrelevant features from the model.

Both in supervised or unsupervised

Supervised

It is noteworthy that feature engineering methods can be either supervised or unsupervised. For example, this approach shares a number of similarities with a supervised feature engineering and ensemble method, referred to as stacking [33]. In stacking, the features are learned using the output of supervised classification algorithms rather than unsupervised outlier detection algorithms.

Unsupervised

Unsupervised feature engineering methods are often more ef- fective when the amount of training data is small. If a lot of labeled data is available, then it is possible to partition the data between the feature learning part and the second-phase of prediction for better results (with stacking).

In the rare-class setting, the amount of training data for at least one of the classes is very little. As a result, an unsupervised approach, such as the use of outlier scores, is appropriate in this setting.

7.6 Active Learning (Only suggestions; might be used later in the project after all data are collected)

Let experts label the data points whose labels are not clear.

7.7 Supervised Models for Unsupervised Outlier Detection

Attribute-wise learning for scoring outliers (ALSO)

Idea:

Use one of the attributes as the dependent variable in a regression model. The basic idea [429] is to repeatedly apply a regression model Mk by selecting the kth attribute as the dependent variable and the remaining attributes as the independent variables. The squared error in the prediction of the dependent variable is used to compute the score for the instance for that iteration. This approach is repeated with each attribute as the dependent variable and a weighted average is used to construct the final score.

Standardization:

It is important to standardize the data to unit variance as a preprocessing step in order to avoid problems associated with relative scaling of various attributes.

Choose the weight of each component (regression model) when combining the scores (see the sketch below).
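
A minimal sketch of the ALSO idea, assuming scikit-learn's linear regression and uniform weights across components (the weighting scheme in [429] is more refined): fit one model per attribute and accumulate the squared prediction errors per point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def also_scores(X):
    X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize to unit variance first
    N, d = X.shape
    scores = np.zeros(N)
    for k in range(d):
        y = X[:, k]                              # k-th attribute as the dependent variable
        Z = np.delete(X, k, axis=1)              # remaining attributes as independent variables
        err = y - LinearRegression().fit(Z, y).predict(Z)
        scores += err ** 2                       # accumulate squared prediction errors
    return scores / d

X = np.random.randn(400, 4)
X[:, 3] = X[:, 0] + 0.1 * np.random.randn(400)   # a strong inter-attribute dependency
X[0, 3] = 5.0                                    # one point violates the dependency
print(also_scores(X).argmax())                   # expected: index 0
```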


Chapter 9

9.1

Contextual or collective anomalies

Outliers in a time series are either contextual or collective anomalies. Outliers are contextual when the values at specific time stamps suddenly change with respect to their temporally adjacent values, whereas outliers are collective when entire time series or large subsequences within a time series have unusual shapes.

About labels and supervised

Labels may be available to supervise the anomaly detection process in both the time- series or multidimensional outlier detection settings. In the time-series setting, the labels may be associated with time-instants, with time intervals, or they may be associated with the entire series. In the multidimensional setting, labels are associated with the individual data points. In such cases, the problem reduces to a special case of rare-class detection (cf. Chapter 7). In general, supervised methods almost always perform better than unsupervised methods because of their ability to discover application-specific abnormalities. The general recommendation is to always use supervision when it is available.


9.2 Prediction-Based Outlier Detection in Streaming Time Series

  • Correlations across time: This is the same principle as temporal continuity, which is typically implemented using autoregressive modeling and forecasting. Significant deviations from the expected (i.e., forecasted) predictions are defined as outliers. Such significant deviations are therefore defined by violations of temporal continuity.
  • Correlations across series: Many sensor applications result in time series that are often closely correlated with one another. For example, a bird call at one sensor will typically also be recorded by a nearby sensor. In such cases, one series can frequently be used in order to predict another. Deviations from such expected predictions can be reported as outliers.

9.2.1 Autoregressive Models

  • Autoregressive (AR) model
  • Moving-average (MA) model
  • Autoregressive Moving Average (ARMA) model
  • Non-stationarity -> can be de-trended by first differencing the time series before ARMA modeling. Such a modeling approach is referred to as the Autoregressive Integrated Moving Average (ARIMA) model (see the sketch after this list).
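
A minimal sketch of prediction-based scoring with a plain AR(p) model fit by least squares (statsmodels provides full ARIMA machinery; plain numpy suffices to illustrate the idea, and the injected spike is an invented example): large forecast residuals mark contextual outliers.

```python
import numpy as np

def ar_residual_scores(x, p=3):
    # Lagged design matrix: predict x[t] from x[t-1], ..., x[t-p] plus an intercept.
    Z = np.column_stack([x[p - i - 1:len(x) - i - 1] for i in range(p)] + [np.ones(len(x) - p)])
    y = x[p:]
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)    # least-squares AR coefficients
    resid = y - Z @ coef                            # one-step-ahead forecast errors
    return np.abs(resid) / resid.std()              # standardized deviation at each time stamp

t = np.arange(500)
x = np.sin(t / 10) + 0.1 * np.random.randn(500)
x[250] += 3.0                                       # inject a contextual anomaly
print(ar_residual_scores(x).argmax() + 3)           # expected: near time stamp 250 (offset by p)
```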


9.2.2 Multiple Time Series Regression Models

Lag Correlations

Also use time stamps from past windows of other time series.

Time-Series Selection Methods

  1. One can select a subset of streams in order to perform the regression with respect to a smaller set of variables.
  2. A fundamentally different approach is to use the notion of hidden variables to decompose the multivariate forecasting problem into a (more easily solvable) set of univariate forecasting problems.

9.2.3 Relationship between Unsupervised Outlier Detection and Prediction

9.2.4 Supervised Point Outlier Detection in Time Series

primary abnormal events.

These need to be found. A composite alarm level is used to maximize the discrimination between the alarms at the primary events and during normal periods.

9.3 Time-Series of Unusual Shapes

  • Full-series anomaly: In this case, the shape of the entire series is treated as an anomaly. This shape is compared against a database of similar time series. However, in most cases, unless the database of sequences corresponds to a relatively short segment of time-stamps, the noise variations within the series will mask the anomalous shape. This is analogous to the problems of noise encountered in detecting outliers in high-dimensional data.

  • Subsequence-based anomaly: If the time series is collected over long periods of time, then we might have a single time series that shows typical trends over shorter time-periods. Therefore, the anomalous shape is detected over small windows of the time series as deviations from these typical trends.

  • Normalized (mean = 0, std = 1): In all cases, it is assumed that the series has been normalized to zero mean and unit standard deviation because it is not meaningful to compare series of different means and amplitudes. Furthermore, in many models, it is assumed that all the underlying series are of the same length n.

9.3.1 Transformation to Other Representations

Numeric multidimensional transformations -> signal processing... Emm, not useful

  • Discrete Wavelet Transform (DWT)
  • Discrete Fourier Transform (DFT) The simplest possible transformation would be to consider each window of length n in the time series as a multidimensional vector of length n. Other methods such as the Discrete Wavelet Transform (DWT) and Discrete Fourier Transform (DFT) are available for compressing the series with the multi-scale approach into numeric coefficients. The most common type of wavelet is the Haar wavelet. The global average of the series is stored as the first coefficient.

Note that each coefficient represents the difference between the first half and second half of a particular segment of a series.

The wavelet representation can be treated as a multidimensional data set rather than as a dependency-oriented data set because the short-term and long-term dependencies in the data are already encoded inside the wavelet coefficients.

Discrete Sequence Transformations

A commonly used discretization technique is the Symbolic Aggregate Approximation (SAX) [361]. In this method, Piecewise Aggregate Approximations (PAA) are used in order to represent the time series.
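
A minimal sketch of SAX (an illustration; the window length, word size, and alphabet are arbitrary choices here): the series is z-normalized, averaged over w equal-width segments (PAA), and each segment average is mapped to a symbol using equi-probable breakpoints of the standard normal distribution.

```python
import numpy as np
from scipy.stats import norm

def sax(x, w=8, alphabet="abcd"):
    x = (x - x.mean()) / x.std()                             # z-normalize the series
    paa = x[: len(x) // w * w].reshape(w, -1).mean(axis=1)   # piecewise aggregate approximation
    breakpoints = norm.ppf(np.linspace(0, 1, len(alphabet) + 1)[1:-1])   # equi-probable bins
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

x = np.sin(np.arange(64) / 5) + 0.05 * np.random.randn(64)
print(sax(x))           # an 8-symbol word over the alphabet {a, b, c, d}
```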

Leveraging Trajectory Representations of Time Series

This is an extremely difficult problem because it combines trajectory analysis with multivariate time-series analysis, and is still an open problem in the literature.

9.3.2 Distance-Based Methods

Care must be taken to compare a window of values with non-overlapping windows, in order to minimize the self-similarity bias of overlapping windows.

9.3.3 Probabilistic Models

Probabilistic models use a generative process to model the generation of a set of time series. For example, one can extract a set of subsequences from a given time series and then model these time series to be generated from one of the components of this mixture. The most common generative model for clustering time series is that of a hidden Markov model (HMM).

9.3.4 Linear Models

Univariate Series

PCA based method.

Multivariate Series
