Learning Different Techniques of Anomaly Detection


We used StandardScaler to standardize our data (zero mean, unit variance per column) and plotted a scatter graph.
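A minimal sketch of this step, assuming a pandas DataFrame with two illustrative numeric columns (the column names and data here are placeholders, not the article's dataset):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Illustrative data; substitute the dataset used in the article
df = pd.DataFrame({'x1': np.random.randn(100), 'x2': np.random.randn(100)})

scaler = StandardScaler()
X = scaler.fit_transform(df[['x1', 'x2']])  # standardize: zero mean, unit variance per column

plt.scatter(X[:, 0], X[:, 1])
plt.xlabel('x1 (scaled)')
plt.ylabel('x2 (scaled)')
plt.show()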

Now we will import DBSCAN to assign points to clusters. Points that cannot be assigned to any cluster are labeled -1 (noise).
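A hedged sketch of this step, continuing from the scaled array X above; the eps and min_samples values are illustrative choices, not the article's exact parameters:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)   # illustrative parameters
labels = db.fit_predict(X)            # one cluster label per point; -1 means noise (no cluster)
print(labels)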

Now we have the results, but how do we check the minimum and maximum label values and whether any points were labeled -1? We can use argmin to locate the smallest label in the results.
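One way to inspect the labels, continuing from the sketch above (variable names are assumptions):

import numpy as np

print(labels.min(), labels.max())   # smallest and largest cluster label
print(labels.argmin())              # index of the first point with the smallest label
print(np.sum(labels == -1))         # how many points DBSCAN flagged as noise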

From the result, we can see six points labeled -1, i.e., noise points.

Let's now plot a scatter graph.
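A sketch of the cluster plot, again assuming X and labels from the snippets above:

plt.scatter(X[:, 0], X[:, 1], c=labels)   # noise points (label -1) get their own color
plt.title('DBSCAN clusters with noise points')
plt.show()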

The methods we applied above address univariate outliers.

For multivariate outlier detection, we first need to understand what multivariate outliers are.

For example, consider car readings. A car has two meters: the speedometer, which measures the speed at which the vehicle is moving, and the rpm gauge, which records the number of rotations per minute.

Suppose the speedometer reads in the range 0-60 mph and the rpm gauge in the range 0-750. We expect the two values to correlate with each other. If the speedometer shows a speed of 50 and the rpm reads 0, the readings are inconsistent: a speed greater than zero means the car is moving, so the rpm should also be well above zero, yet here it shows 0. Such a combination is a multivariate outlier.
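A toy illustration of the idea (the numbers below are invented): each reading looks fine on its own axis, but the combination in the last row is impossible.

import pandas as pd

readings = pd.DataFrame({'speed_mph': [0, 30, 50, 50],
                         'rpm':       [0, 400, 700, 0]})
# speed 50 with rpm 0 lies within both univariate ranges,
# yet the pair is jointly inconsistent: a multivariate outlier.
print(readings)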

C. Mahalanobis Distance Method

In DBSCAN, we used the Euclidean distance metric, but here we are talking about the Mahalanobis distance method. We can also use Mahalanobis distance with DBSCAN:

DBSCAN(eps=0.5, min_samples=3, metric='mahalanobis', metric_params={'V': np.cov(X, rowvar=False)}, algorithm='brute', leaf_size=30, n_jobs=-1)

Why is Euclidean distance unfit when the variables are correlated with each other? Euclidean distance gives a misleading measure of how close two points really are.

The Mahalanobis method measures the distance between a point and a distribution, that is, the clean data. Euclidean distance is measured between two points, and its z-score is calculated as x minus the mean divided by the standard deviation: z = (x − mean) / std. In the Mahalanobis case, the covariance matrix takes the place of the standard deviation: D²(x) = (x − μ)ᵀ Σ⁻¹ (x − μ), where μ is the mean vector and Σ is the covariance matrix.

So what effect does dividing by the covariance matrix have? If the variables in your dataset are highly correlated, the covariance values will be high, and dividing by them scales the distance down accordingly.

Similarly, if the covariance values are low because the data are not correlated, the distance is not significantly reduced. In this way, Mahalanobis distance addresses both the scale and the correlation of the variables.
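A short sketch contrasting the two distances on correlated data; the covariance matrix and the two points below are invented for illustration:

import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])          # strongly correlated variables (assumed values)
VI = np.linalg.inv(cov)               # inverse covariance, as mahalanobis expects

p = np.array([2.0, 2.0])              # lies along the correlation direction
q = np.array([2.0, -2.0])             # goes against the correlation

print(euclidean(p, mean), euclidean(q, mean))              # identical Euclidean distances
print(mahalanobis(p, mean, VI), mahalanobis(q, mean, VI))  # q is far larger in Mahalanobis terms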

Code

The dataset can be taken from Anomaly-/caret.csv at main · aster28/Anomaly- (github.com).

import pandas as pd

df = pd.read_csv('caret.csv').iloc[:, [0, 4, 6]]
df.head()

We define the function distance with arguments x=None, data=None, and cov=None. Inside the function, we subtract the mean of the data from x; if a covariance matrix is not passed in, we compute it from the data. T stands for transpose.

For example, if we draw five samples of two variables, the array has shape (5, 2); to get one row per variable, which is what np.cov expects, we need to transpose the matrix.

np.random.multivariate_normal(mean, cov, size=5)
array([[0.0509196, 0.536808 ],
       [0.1081547, 0.9308906],
       [0.4545248, 1.4000731],
       [0.9803848, 0.9660610],
       [0.8079491, 0.9687909]])

np.random.multivariate_normal(mean, cov, size=5).T
array([[ 0.0586423,  0.8538419,  0.2910855,  5.3047358,  0.5449706],
       [ 0.6819089,  0.8020285,  0.7109037,  0.9969768, -0.7155739]])

We use sp.linalg, SciPy's linear algebra module, which provides many linear algebra routines, including the inv function for matrix inversion. NumPy's dot performs matrix multiplication.

import numpy as np
import scipy as sp
import scipy.linalg  # make the linalg submodule available as sp.linalg

def distance(x=None, data=None, cov=None):
    x_m = x - np.mean(data)           # subtract the column means
    if cov is None:
        cov = np.cov(data.values.T)   # covariance matrix of the variables
    inv_cov = sp.linalg.inv(cov)      # inverse covariance matrix
    left = np.dot(x_m, inv_cov)
    m_distance = np.dot(left, x_m.T)
    return m_distance.diagonal()      # squared Mahalanobis distance per row
df_g= df[['carat', 'depth', 'price']].head(50)
df_g['m_distance'] = distance(x=df_g, data=df[['carat', 'depth', 'price']])
df_g.head()
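One simple way to use these distances, under the same assumptions as the code above: the rows with the largest m_distance are the strongest multivariate outlier candidates.

df_g.sort_values('m_distance', ascending=False).head()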

B. Tukey’s Method for Outlier Detection

The Tukey method is also often called the Box-and-Whisker or box plot method.

The Tukey method uses an upper range and a lower range.

Upper range = 75th percentile + k * IQR

Lower range = 25th percentile − k * IQR
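For example, if the 25th percentile is 10 and the 75th percentile is 30, then IQR = 20; with k = 1.5 the lower range is 10 − 1.5 × 20 = −20 and the upper range is 30 + 1.5 × 20 = 60, so any value below −20 or above 60 would be flagged as an outlier (numbers chosen purely for illustration).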

Let us look at the age variable of our Titanic data using a box plot.

import seaborn as sns

sns.boxplot(titanic['age'].values)

We can see in the image that the box plot created by Seaborn shows many dots between the ages of 55 and 80; these points fall outside the whiskers and are outliers. We will detect the lower and upper range by writing a function outliers_detect.

def outliers_detect(x, k=1.5):
    x = np.array(x).copy().astype(float)
    first = np.quantile(x, .25)
    third = np.quantile(x, .75)
    # IQR calculation
    iqr = third - first
    # Upper range and lower range
    lower = first - (k * iqr)
    upper = third + (k * iqr)
    return lower, upper
outliers_detect(titanic['age'], k = 1.5)
(2.5, 54.5)
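A possible follow-up, assuming the same titanic DataFrame used above: keep only the rows whose age falls outside the computed range.

lower, upper = outliers_detect(titanic['age'], k=1.5)
age_outliers = titanic[(titanic['age'] < lower) | (titanic['age'] > upper)]
print(age_outliers[['age']].head())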

D. Detection by PyCaret

We will be using the same dataset for detection by PyCaret.

from pycaret.anomaly import *
setup_anomaly_data = setup(df)

PyCaret is an open-source machine learning library whose anomaly module uses unsupervised learning models to detect outliers. It has a get_data function for loading the example datasets bundled with PyCaret and a setup function for the preprocessing tasks before detection; setup usually takes a data frame but also accepts many other arguments, such as ignore_features.
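As an aside, a hedged sketch of get_data; the dataset name 'anomaly' is an assumption used only for illustration:

from pycaret.datasets import get_data

sample = get_data('anomaly')   # hypothetical example: one of PyCaret's bundled datasets
sample.head()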

Another function, create_model, creates and trains a model with the chosen algorithm. We will first use Isolation Forest.

ifor = create_model("iforest")
plot_model(ifor)
ifor_predictions = predict_model(ifor, data = df)
print(ifor_predictions)
ifor_anomaly = ifor_predictions[ifor_predictions["Anomaly"] == 1]
print(ifor_anomaly.head())
print(ifor_anomaly.shape)

An Anomaly value of 1 indicates an outlier, and 0 indicates a normal point.

The yellow color here indicates outliers.

Now let us see another algorithm, K Nearest Neighbors (KNN).

knn = create_model("knn")
plot_model(knn)
knn_pred = predict_model(knn, data = df)
print(knn_pred)
knn_anomaly = knn_pred[knn_pred["Anomaly"] == 1]
knn_anomaly.head()
knn_anomaly.shape

Now we will use a clustering algorithm.

clus = create_model("cluster")
plot_model(clus)
clus_pred = predict_model(clus, data = df)
print(clus_pred)
clus_anomaly = clus_pred[clus_pred["Anomaly"] == 1]
print(clus_anomaly.head())
clus_anomaly.shape

E. Anomaly Detection by PyOD

PyOD is a Python library for detecting outliers in multivariate data. It supports both supervised and unsupervised learning.

from pyod.models.iforest import IForest
from pyod.models.knn import KNN 

We imported the library and algorithm.

from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

train = 300
test = 100
contaminate = 0.1
X_train, X_test, y_train, y_test = generate_data(n_train=train, n_test=test, n_features=2, contamination=contaminate, random_state=42)

cname_alg = 'KNN'  # the name of the algorithm is K Nearest Neighbors
c = KNN()
c.fit(X_train)  # Fit the algorithm
y_train_pred = c.labels_
y_train_scores = c.decision_scores_
y_test_pred = c.predict(X_test)
y_test_scores = c.decision_function(X_test)
print("Training Data:")
evaluate_print(cname_alg, y_train, y_train_scores)
print("Test Data:")
evaluate_print(cname_alg, y_test, y_test_scores)
visualize(cname_alg, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred, show_figure=True, save_figure=True)

We will use the IForest algorithm.

fname_alg = 'IForest'  # the name of the algorithm is Isolation Forest
f = IForest()
f.fit(X_train)  # Fit the algorithm
y_train_pred = f.labels_
y_train_scores = f.decision_scores_
y_test_pred = f.predict(X_test)
y_test_scores = f.decision_function(X_test)
print("Training Data:")
evaluate_print(fname_alg, y_train, y_train_scores)
print("Test Data:")
evaluate_print(fname_alg, y_test, y_test_scores)
visualize(fname_alg, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred, show_figure=True, save_figure=True)

F. Anomaly Detection by Prophet

We will use the air passengers time series dataset from prophet/example_air_passengers.csv at main · aster28/prophet (github.com).

import prophet
from prophet import forecaster
from prophet import Prophet
m = Prophet()
data = pd.read_csv('air_pass.csv')
data.head()
data.columns = ['ds', 'y']
data['y'] = np.where(data['y'] != 0, np.log(data['y']), 0)

Taking the log of the y column keeps the values from going negative and stabilizes the variance. We split our data into train and test sets and store the prediction in the variable forecast.

from sklearn.model_selection import train_test_split

train, test = train_test_split(data, random_state=42)
m.fit(train[['ds', 'y']])
forecast = m.predict(test)

def detect(forecast):
    forecast = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].copy()
    forecast['real'] = data['y']
    forecast['anomaly'] = 0
    forecast.loc[forecast['real'] > forecast['yhat_upper'], 'anomaly'] = 1
    forecast.loc[forecast['real'] < forecast['yhat_lower'], 'anomaly'] = -1
    # Importance: how far outside the prediction interval the real value lies
    forecast['imp'] = 0
    in_range = forecast['yhat_upper'] - forecast['yhat_lower']
    forecast.loc[forecast['anomaly'] == 1, 'imp'] = (forecast['real'] - forecast['yhat_upper']) / in_range
    forecast.loc[forecast['anomaly'] == -1, 'imp'] = (forecast['yhat_lower'] - forecast['real']) / in_range
    return forecast

detect(forecast)

Here anomaly = 1 marks values above the prediction interval and anomaly = -1 marks values below it.
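A short usage sketch, reusing the detect function defined above (the variable names are assumptions):

result = detect(forecast)
anomalies = result[result['anomaly'] != 0]   # rows outside the prediction interval
print(anomalies[['ds', 'real', 'yhat_lower', 'yhat_upper', 'anomaly']].head())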

Conclusion

The process of finding outliers in a given dataset is called anomaly detection. Outliers are data objects that stand out from the rest of the values in the dataset and do not behave normally.

Anomaly detection tasks can use distance-based and density-based clustering methods to identify outliers as a cluster.

Here we discussed various anomaly detection methods and explained them with code on three datasets: Titanic, Air Passengers, and Caret.

Key Points

1. Outliers or anomalies can be detected using the Box-and-Whisker (Tukey) method or with DBSCAN.

2. The Euclidean distance method is used when the variables are not correlated with each other.

3. The Mahalanobis distance method is used for multivariate outliers.

4. Not all flagged values or points are outliers. Some are noise that should be discarded, while outliers are valid data points that need to be handled appropriately.

5. We used PyCaret for outlier detection with different algorithms, where points with Anomaly = 1 are outliers (shown in yellow) and points with Anomaly = 0 are normal.

6. We used PyOD, the Python Outlier Detection library, which has more than 40 algorithms and supports both supervised and unsupervised techniques.

7. We used Prophet and defined the function detect to flag the outliers.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
