A Detailed Guide to Outlier Detection Methods

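Outliers come up often in data analysis: they are data points that differ markedly from the majority of the data. An outlier may be a genuine anomaly, a recording error, or an observation that does not fit the data model. For this reason, data analysis usually includes a step that detects outliers and removes them. This article walks through the main outlier detection methods.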

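I. Definition of an Outlier

Outliers are a recurring concern in the statistical analysis of a dataset. In general, an outlier can be defined as a single observation that deviates markedly from the other observations in the dataset. Such values are usually either genuine anomalies or errors in the data.

Depending on how the anomalous values are distributed, outliers are usually divided into two types: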

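Point anomaly: a single data point that stands out sharply from the rest of the dataset. These points are often errors or noise in the data, for example abnormal readings caused by a failed sensor.
Contextual anomaly: a value that is abnormal only within a particular context, such as a specific period or region of the dataset, rather than relative to the dataset as a whole. For example, daily bank transaction volumes normally stay within a regular range, but right after a large trading event the volume may deviate markedly from that range.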
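
To make the distinction concrete, here is a minimal sketch rather than a definitive implementation. It assumes a made-up set of daily transaction counts tagged as weekday or weekend, and scores each value with a modified z-score (median and median absolute deviation, with the commonly used 0.6745 constant and 3.5 cutoff). The names `modified_z_scores`, `counts`, and `context` are hypothetical.

```python
import numpy as np

def modified_z_scores(values):
    """Modified z-score based on the median and the median absolute deviation (MAD)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return 0.6745 * (values - median) / mad

# Hypothetical daily transaction counts; the last weekend value (100) is the one of interest.
counts = np.array([98, 102, 101, 99, 100, 103, 97, 19, 21, 20, 22, 18, 100], dtype=float)
context = np.array(["weekday"] * 7 + ["weekend"] * 6)

# Global check: score every value against the whole dataset.
global_flags = np.abs(modified_z_scores(counts)) > 3.5

# Contextual check: score each value only against values from the same context.
contextual_flags = np.zeros(len(counts), dtype=bool)
for name in np.unique(context):
    mask = context == name
    contextual_flags[mask] = np.abs(modified_z_scores(counts[mask])) > 3.5

print("Last value flagged globally:   ", global_flags[-1])
print("Last value flagged in context: ", contextual_flags[-1])
```

In this toy data a count of 100 is unremarkable when compared with the whole dataset, but it stands out among the weekend values, which is what makes it a contextual anomaly.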

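II. Outlier Detection Methods

1. Simple Statistical Methods

Simple statistical methods are the most basic way to detect outliers. They measure how far each data point deviates from a central value such as the mean or the median. A data point that falls above an upper threshold or below a lower threshold is flagged as an outlier.

This approach is fast, but it only works well when the data is roughly normally distributed, the deviations are pronounced, and outliers are rare. Below is a Python implementation that derives the thresholds from the quartiles and the interquartile range (IQR):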

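import numpy as np

def outlier_bounds(points, threshold=3.5):
    """
    Compute outlier cutoffs from the lower quartile, the upper quartile,
    and the interquartile range (IQR).
    :param points: 1-D or 2-D array of observations (one column per feature)
    :param threshold: multiple of the IQR allowed beyond each quartile
    :return: (outlier_min, outlier_max), one value per column
    """
    points = np.asarray(points, dtype=float)
    if points.ndim == 1:
        # treat a 1-D input as a single column
        points = points[:, None]
    # calculate the median value (printed for reference only)
    median = np.median(points, axis=0)
    # calculate the 25th and 75th percentiles (lower and upper quartiles)
    q25, q75 = np.percentile(points, [25, 75], axis=0)
    q_dist = q75 - q25
    print("Median: ", median)
    print("Q25: ", q25)
    print("Q75: ", q75)
    print("Q Distance: ", q_dist)
    # calculate the outlier cutoff
    cut_off = threshold * q_dist
    outlier_min = q25 - cut_off
    outlier_max = q75 + cut_off
    return outlier_min, outlier_max

# create dataset
np.random.seed(0)
x = np.random.randn(100)
# add in some outliers
x = np.concatenate((x, np.array([-10, 10, -15, 15])))
# compute the cutoffs, then list the points that fall outside them
min_out, max_out = outlier_bounds(x, threshold=2)
print("Minimum cutoff: ", min_out)
print("Maximum cutoff: ", max_out)
print("Outliers: ", x[(x < min_out) | (x > max_out)])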
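
The text also mentions measuring deviation from the mean. As a hedged sketch of that variant, the snippet below flags points whose z-score (distance from the mean in units of standard deviation) exceeds 3 in absolute value. The helper name `zscore_outliers` and the cutoff of 3 are illustrative assumptions, not part of the original code.

```python
import numpy as np

def zscore_outliers(points, threshold=3.0):
    """Return a boolean mask marking points more than `threshold` standard deviations from the mean."""
    points = np.asarray(points, dtype=float)
    z = (points - points.mean()) / points.std()
    return np.abs(z) > threshold

# Same kind of data as in the listing above: 100 standard-normal samples plus four injected extremes.
rng = np.random.default_rng(0)
x = np.concatenate((rng.standard_normal(100), np.array([-10.0, 10.0, -15.0, 15.0])))

mask = zscore_outliers(x)
print("Flagged values: ", x[mask])
```

Because the mean and the standard deviation are themselves pulled by extreme values, this variant can miss outliers when there are many of them; quartile-based cutoffs are less sensitive to that effect.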

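2. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a density-based clustering algorithm. It groups the points in high-density regions into clusters and labels the points in low-density regions as noise. Because its output contains both clusters and noise, it is more descriptive than the simple statistical methods.

Its main advantage is that it follows the density structure of the dataset: clusters can take any shape, and the number of clusters does not need to be specified in advance. Below is a Python implementation using DBSCAN: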

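from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Import data (assumes data.csv contains only numeric feature columns)
data = pd.read_csv("data.csv")

# Standardize the data
X = StandardScaler().fit_transform(data)

# DBSCAN Clustering
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_

# results: the label -1 marks noise points, so it is not counted as a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Data points: ", len(X))
print("No. clusters:", n_clusters)
print("No. noise points (outliers):", int((labels == -1).sum()))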
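
Since the listing above depends on a local data.csv, the following self-contained sketch shows the same idea on synthetic data. The blob positions, the isolated points, and the `eps` and `min_samples` values are all assumptions chosen for illustration, and the data is left unscaled to keep the example short.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus three isolated points (all values are made up for illustration).
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
isolated = np.array([[10.0, -10.0], [-8.0, 8.0], [2.5, 12.0]])
X = np.vstack((blob_a, blob_b, isolated))

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# DBSCAN assigns the label -1 to noise; every other label is a cluster id.
noise_mask = labels == -1
n_clusters = len(set(labels) - {-1})
print("Clusters found:", n_clusters)
print("Noise points:\n", X[noise_mask])
```

The points labeled -1 are the outlier candidates. In practice `eps` and `min_samples` need tuning for each dataset, because they define what counts as a dense region.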

3. Elliptic Envelope

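The Elliptic Envelope method models the data in order to determine the boundary between normal points and outliers. It fits an ellipse to the data by estimating the center and covariance of the data (scikit-learn's EllipticEnvelope uses a robust covariance estimate for this) and flags points that lie far from the center of the ellipse as outliers.

Because the algorithm assumes Gaussian-distributed data, it may not be suitable for data that is not Gaussian. Below is a Python implementation using Elliptic Envelope: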

from sklearn.covariance import EllipticEnvelope
import numpy as np

rng = np.random.RandomState(42)
# Generate data
n_samples = 200
n_outliers = 20
X = 0.3 * rng.randn(n_samples, 2)

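# Replace the first n_outliers samples with outliers centered at (2, 2)
X[:n_outliers] = 2 + 0.3 * rng.randn(n_outliers, 2)

# Fit the Elliptic Envelope; contamination is the expected share of outliers (20 / 200)
clf = EllipticEnvelope(random_state=rng, contamination=0.1)
clf.fit(X)

# Predict outliers: predict returns 1 for inliers and -1 for outliers
y_pred = clf.predict(X)
print("Outliers detected: ", int((y_pred == -1).sum()))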
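
The idea of flagging points that lie far from the center of the ellipse can be sketched with plain NumPy. This is a simplified illustration under stated assumptions, not how scikit-learn's EllipticEnvelope is implemented: it uses the ordinary sample mean and covariance instead of a robust estimate, the data is synthetic, and the 97.5% cutoff is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)
# 195 inliers around the origin plus 5 injected outliers around (2, 2); all values are made up.
X = 0.3 * rng.standard_normal((200, 2))
X[:5] = 2 + 0.3 * rng.standard_normal((5, 2))

# Classical (non-robust) estimates of the center and covariance.
center = X.mean(axis=0)
cov = np.cov(X, rowvar=False)

# Squared Mahalanobis distance of every point from the center.
diff = X - center
d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

# For 2-D Gaussian data d2 approximately follows a chi-square distribution with
# 2 degrees of freedom, whose quantile at probability p is -2 * ln(1 - p).
cutoff = -2.0 * np.log(1.0 - 0.975)
is_outlier = d2 > cutoff
print("Cutoff on squared distance:", round(cutoff, 3))
print("Points flagged:", int(is_outlier.sum()))
```

With only a handful of outliers the classical estimate is adequate. As the share of outliers grows, they distort the estimated mean and covariance, which is the motivation for a robust fit.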

4. Isolation Forest

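Isolation Forest is a tree-based method that is suitable for high-dimensional data. It builds random trees by repeatedly choosing a random feature and a random split value, then scores each point by the average path length needed to isolate it across the trees: outliers are isolated in fewer splits, so their average path is shorter. Its main advantages are that it scales well to large datasets and has a low computational cost.

Below is a Python implementation using Isolation Forest: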

from sklearn.ensemble import IsolationForest
import pandas as pd

# Load the data and drop the first column (assumed here to be an index or ID column)
data = pd.read_csv("data.csv").iloc[:, 1:]

isof = IsolationForest(n_estimators=50, max_samples='auto', contamination='auto', max_features=1.0,
                       bootstrap=False, n_jobs=1, random_state=None, verbose=0)

isof.fit(data)
# predict returns 1 for inliers and -1 for outliers
pred = isof.predict(data)

print("Outlier rows:\n", data[pred == -1])
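
To illustrate why a short path signals an outlier, the sketch below counts how many random splits it takes to isolate a value in one dimension. It is a deliberately simplified toy, not the scikit-learn algorithm: a real isolation forest also picks random features, works on subsamples, and normalizes the path length. The function name `isolation_depth` and the data are hypothetical.

```python
import numpy as np

def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed until `target` is alone on its side (1-D case)."""
    current = np.asarray(values, dtype=float)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        low, high = current.min(), current.max()
        if low == high:
            break  # remaining values are identical and cannot be separated
        split = rng.uniform(low, high)
        # Keep only the side of the split that still contains the target.
        current = current[current < split] if target < split else current[current >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
data = np.append(rng.standard_normal(100), 10.0)  # one injected extreme value
typical = np.median(data)  # with 101 values the median is an actual data point

# Average the depth over many random "trees".
depth_outlier = np.mean([isolation_depth(data, 10.0, rng) for _ in range(200)])
depth_typical = np.mean([isolation_depth(data, typical, rng) for _ in range(200)])
print("Average depth for 10.0:      ", depth_outlier)
print("Average depth for the median:", depth_typical)
```

The extreme value is separated after far fewer splits on average than a value in the middle of the data, and that gap is what the anomaly score is built on.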

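Summary

This article introduced several outlier detection methods: simple statistical methods, DBSCAN, Elliptic Envelope, and Isolation Forest. Each method has its own strengths and limitations, so the right choice depends on the characteristics of the dataset at hand.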

Original article by 小藍. If you repost it, please credit the source: https://www.506064.com/zh-tw/n/277127.html