
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf

# Download time series data using yfinance
data = yf.download('AAPL', start='2018-01-01', end='2023-06-30')

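Understanding Time Series Data

Before diving into anomaly detection techniques, here is a brief introduction to the characteristics of time series data. Time series data typically has the following properties:

Trend: a long-term increase or decrease in the data values over time.
Seasonality: patterns or cycles that repeat at fixed intervals.
Autocorrelation: the correlation between the current observation and previous observations.
Noise: random fluctuations or irregularities in the data.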
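Autocorrelation is the least intuitive of these properties, so here is a minimal sketch of how it can be computed by hand. The function name `autocorrelation` and the two synthetic series are assumptions made for illustration; they are not part of the AAPL data. pandas also offers `Series.autocorr`, which uses a slightly different normalization.

```python
import math


def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at a positive lag."""
    n = len(values)
    mean = sum(values) / n
    # Denominator: total squared deviation from the mean.
    denom = sum((v - mean) ** 2 for v in values)
    # Numerator: co-movement between the series and its lagged copy.
    num = sum((values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n))
    return num / denom


# Hypothetical series: a pure upward trend and a pure 7-step cycle.
trend = [0.1 * t for t in range(140)]
seasonal = [math.sin(2 * math.pi * t / 7) for t in range(140)]

print("trend, lag 1:   ", round(autocorrelation(trend, 1), 3))
print("seasonal, lag 7:", round(autocorrelation(seasonal, 7), 3))
print("seasonal, lag 3:", round(autocorrelation(seasonal, 3), 3))
```

A trending series is strongly correlated with its immediate past, while a seasonal series is strongly correlated at a lag equal to its period and negatively correlated near half a period.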

Let's visualize the downloaded time series data:

# Plot the time series data
plt.figure(figsize=(12, 6))
plt.plot(data['Close'])
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.title('AAPL Stock Price')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

The plot shows the stock price trending upward over time. There are also periodic fluctuations, which suggest seasonality, and consecutive closing prices appear to be autocorrelated.

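Preprocessing Time Series Data

Preprocessing the time series data is essential before applying anomaly detection techniques. It involves handling missing values, smoothing the data, and removing outliers.

Missing Values

Missing values can appear in time series data for various reasons, such as data collection errors or gaps in the data. They need to be handled appropriately to avoid biasing the analysis.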

# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)

The stock data used here contains no missing values. If missing values are present, they can be handled by imputing them or by dropping the corresponding time points.
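As a rough illustration of what imputation does, the sketch below fills gaps in a small, made-up price list in two ways. The helper names `forward_fill` and `linear_interpolate` are hypothetical; on a real DataFrame, the pandas methods `ffill()`, `interpolate()`, and `dropna()` cover the same ground.

```python
def forward_fill(values):
    """Replace each None with the most recent observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def linear_interpolate(values):
    """Fill interior None gaps on a straight line between observed neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical closing prices with missing observations.
prices = [100.0, None, None, 106.0, 107.0, None, 109.0]

print(forward_fill(prices))
print(linear_interpolate(prices))
```

Forward filling keeps the last known price flat across a gap, while linear interpolation assumes the price moved steadily between the two neighbouring observations. Which assumption is more reasonable depends on the data.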

Smoothing the Data

Smoothing time series data helps reduce noise and highlight underlying patterns. A common smoothing technique is the moving average.

# Smooth the time series data using a moving average
window_size = 7
data['Smoothed'] = data['Close'].rolling(window_size).mean()

# Plot the smoothed data
plt.figure(figsize=(12, 6))
plt.plot(data['Close'], label='Original')
plt.plot(data['Smoothed'], label='Smoothed')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.title('AAPL Stock Price (Smoothed)')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)
plt.show()

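The plot shows the original closing prices and the smoothed version obtained with the moving average. Smoothing makes the overall trend easier to see and reduces the influence of short-term fluctuations.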
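To make the arithmetic behind the rolling mean concrete, here is a small sketch in plain Python. The function `moving_average` and the sample prices are assumptions for illustration; the function is meant to mirror a trailing window, returning `None` where pandas would return `NaN`.

```python
def moving_average(values, window):
    """Trailing mean over `window` points; None until the window is full."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result


# Hypothetical closing prices with a single spike at index 4.
closes = [10.0, 12.0, 11.0, 13.0, 30.0, 12.0, 11.0]
smoothed = moving_average(closes, 3)
print(smoothed)
```

The spike of 30 is damped in the smoothed series, but it also lifts every window that contains it, so its effect is spread over the next few points. A larger window smooths more and spreads the effect further.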

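Removing Outliers

Outliers can significantly affect the performance of anomaly detection algorithms. Identifying and removing them before applying anomaly detection techniques is essential.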

# Calculate z-scores for each data point
z_scores = (data['Close'] - data['Close'].mean()) / data['Close'].std()

# Define a threshold for outlier detection
threshold = 3

# Identify outliers (absolute z-score above the threshold)
outliers = data[z_scores.abs() > threshold]

# Remove outliers from the data
data = data.drop(outliers.index)

# Plot the data without outliers
plt.figure(figsize=(12, 6))
plt.plot(data['Close'])
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.title('AAPL Stock Price (Without Outliers)')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

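The plot above shows the time series after the identified outliers have been removed. By reducing the influence of extreme values, removing outliers helps improve the accuracy of anomaly detection algorithms.

Some readers may ask: isn't the whole point to detect outliers, so why delete them? The reason is that the outliers removed here are the very obvious ones. This preprocessing step is an initial, coarse screening. With the obvious values removed, the model can better judge the values that are hard to classify.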
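The effect of this coarse screening can be shown on a toy example. In the sketch below, the prices are made up and `zscore_outliers` is a hypothetical helper: one gross spike inflates the standard deviation so much that a subtler anomaly goes unnoticed until the spike has been removed.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return indices whose absolute z-score exceeds `threshold`."""
    mu, sigma = mean(values), stdev(values)  # stdev divides by n - 1, like pandas .std()
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical prices: a steady 99-101 band, one gross spike, one subtle bump.
prices = [99.0, 100.0, 101.0] * 10
prices[10] = 500.0  # obvious outlier
prices[20] = 112.0  # subtle anomaly

first_pass = zscore_outliers(prices)

# Drop the obvious outlier, then score the remaining points again.
kept = [(i, v) for i, v in enumerate(prices) if i not in first_pass]
second_pass = [kept[j][0] for j in zscore_outliers([v for _, v in kept])]

print("first pass: ", first_pass)
print("second pass:", second_pass)
```

The first pass catches only the spike at index 10. Once it is gone, the standard deviation shrinks and the bump at index 20 stands out in the second pass.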

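Statistical Methods

Statistical methods provide the foundation for anomaly detection in time series data. We will explore two commonly used statistical techniques: the z-score and the moving average.

Z-Score

The z-score measures how many standard deviations an observation lies from the mean. By computing the z-score of every data point, we can identify observations that deviate markedly from expected behavior.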

# Calculate z-scores for each data point
z_scores = (data['Close'] - data['Close'].mean()) / data['Close'].std()

# Plot the z-scores
plt.figure(figsize=(12, 6))
plt.plot(z_scores)
plt.xlabel('Date')
plt.ylabel('Z-Score')
plt.title('Z-Scores for AAPL Stock Price')
plt.xticks(rotation=45)
plt.axhline(y=threshold, color='r', linestyle='--', label='Threshold')
plt.axhline(y=-threshold, color='r', linestyle='--')
plt.legend()
plt.grid(True)
plt.show()

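The plot shows the computed z-score of each data point. Observations whose z-score lies beyond the threshold (red dashed lines) in either direction can be treated as anomalies.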

Moving Average

Another statistical approach to anomaly detection is based on the moving average. By computing the moving average and comparing it with the original data, we can identify deviations from expected behavior.

# Calculate the moving average
window_size = 7
moving_average = data['Close'].rolling(window_size).mean()

# Calculate the deviation from the moving average
deviation = data['Close'] - moving_average

# Plot the deviation
plt.figure(figsize=(12, 6))
plt.plot(deviation)
plt.xlabel('Date')
plt.ylabel('Deviation')
plt.title('Deviation from Moving Average')
plt.xticks(rotation=45)
# Reference line at zero: points on it match the moving average exactly
plt.axhline(y=0, color='r', linestyle='--', label='Zero deviation')
plt.legend()
plt.grid(True)
plt.show()

The plot shows each data point's deviation from the moving average. A positive deviation means the value is above the expected behavior, and a negative deviation means it is below.
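The deviation by itself does not say which points are anomalous. One possible way to turn it into flags, sketched below, is to mark points whose deviation exceeds `k` standard deviations of all deviations. The function name, the window of 5, and `k = 3` are assumptions for illustration, not choices made in the article.

```python
from statistics import stdev


def deviation_anomalies(values, window=7, k=3.0):
    """Flag indices whose gap to the trailing moving average exceeds k sigma."""
    deviations = {}
    for i in range(window - 1, len(values)):
        avg = sum(values[i + 1 - window : i + 1]) / window
        deviations[i] = values[i] - avg
    sigma = stdev(list(deviations.values()))
    return [i for i, d in deviations.items() if abs(d) > k * sigma]


# Hypothetical series alternating between 100 and 101, with a spike at index 25.
values = [100.0 + (i % 2) for i in range(40)]
values[25] = 120.0

flagged = deviation_anomalies(values, window=5, k=3.0)
print(flagged)
```

Only the spike is flagged. The points right after it show moderate negative deviations because the spike still sits inside their window and raises the average, so a lower `k` would start flagging those as well.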

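Machine Learning Methods

Machine learning methods offer more advanced techniques for anomaly detection in time series data. We will explore two popular algorithms: Isolation Forest and the LSTM autoencoder.

Isolation Forest

Isolation Forest is an unsupervised machine learning algorithm that isolates anomalies by randomly partitioning the data into subsets. It measures the average number of partitions needed to isolate an observation, and anomalies are expected to need fewer partitions.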

from sklearn.ensemble import IsolationForest

# Prepare the data for Isolation Forest
X = data['Close'].values.reshape(-1, 1)

# Train the Isolation Forest model, assuming about 5% of points are anomalous
model = IsolationForest(contamination=0.05)
model.fit(X)

# Predict the anomalies: -1 marks an anomaly, 1 marks a normal point
anomalies = model.predict(X)
anomalous_points = data[anomalies == -1]

# Plot the anomalies
plt.figure(figsize=(12, 6))
plt.plot(data['Close'])
plt.scatter(anomalous_points.index, anomalous_points['Close'], color='red', label='Anomaly')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.title('AAPL Stock Price with Anomalies (Isolation Forest)')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)
plt.show()

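The plot shows the time series with the anomalies identified by the Isolation Forest algorithm. The anomalies are highlighted in a different color, indicating that they deviate from expected behavior.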
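The idea that anomalies need fewer partitions can be illustrated with a stripped-down, one-dimensional sketch. This is not scikit-learn's implementation: it skips subsampling and score normalization, and the names `isolation_depth` and `average_depth` as well as the data are assumptions made for this example.

```python
import random


def isolation_depth(point, sample, rng, depth=0, max_depth=20):
    """Count the random splits needed before `point` sits alone in its partition."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(sample), max(sample)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains the point.
    if point < split:
        side = [v for v in sample if v < split]
    else:
        side = [v for v in sample if v >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)


def average_depth(point, data, trees=200, seed=0):
    """Average isolation depth of `point` over many random partitionings."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(trees)) / trees


# Hypothetical prices: a tight cluster from 100 to 109.5 and one far-away value.
values = [100 + 0.5 * i for i in range(20)] + [200.0]

outlier_depth = average_depth(200.0, values)
inlier_depth = average_depth(105.0, values)
print("outlier:", outlier_depth, "inlier:", inlier_depth)
```

A random split between 100 and 200 almost always falls in the empty stretch between the cluster and the far-away value, so that value is isolated after about one split, while a point inside the cluster needs several more.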

LSTM Autoencoder

An LSTM (Long Short-Term Memory) autoencoder is a deep learning model that learns patterns in time series data and reconstructs the input sequence. Anomalies can be detected by comparing the reconstruction error against a predefined threshold.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Prepare the data for the LSTM autoencoder
X = data['Close'].values.reshape(-1, 1)

# Normalize the data to the [0, 1] range
X_normalized = (X - X.min()) / (X.max() - X.min())

# Keras LSTM layers expect 3-D input: (samples, timesteps, features)
X_sequences = X_normalized.reshape(-1, 1, 1)

# Train the LSTM autoencoder to reproduce its own input
model = Sequential([
    LSTM(64, activation='relu', input_shape=(1, 1)),
    Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_sequences, X_normalized, epochs=10, batch_size=32)

# Reconstruct the input sequence
X_reconstructed = model.predict(X_sequences)

# Calculate the reconstruction error for each data point
reconstruction_error = np.mean(np.abs(X_normalized - X_reconstructed), axis=1)

# The z-score threshold of 3 does not fit errors on [0, 1] data;
# as an assumed cutoff, flag the largest 5% of errors instead
error_threshold = np.percentile(reconstruction_error, 95)

# Plot the reconstruction error
plt.figure(figsize=(12, 6))
plt.plot(data.index, reconstruction_error)
plt.xlabel('Date')
plt.ylabel('Reconstruction Error')
plt.title('Reconstruction Error (LSTM Autoencoder)')
plt.xticks(rotation=45)
plt.axhline(y=error_threshold, color='r', linestyle='--', label='Threshold')
plt.legend()
plt.grid(True)
plt.show()

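The plot shows the reconstruction error of each data point. Observations whose reconstruction error is above the threshold (red dashed line) can be treated as anomalies.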
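Reconstruction errors on data scaled to [0, 1] are typically far smaller than the z-score threshold of 3 defined earlier, so a cutoff is usually derived from the errors themselves. The sketch below shows one common heuristic, mean plus `k` standard deviations. The function name is hypothetical, and the "reconstructed" values are made up rather than produced by a trained model.

```python
from statistics import mean, stdev


def reconstruction_anomalies(original, reconstructed, k=3.0):
    """Flag points whose absolute reconstruction error exceeds mean + k * std."""
    errors = [abs(o - r) for o, r in zip(original, reconstructed)]
    cutoff = mean(errors) + k * stdev(errors)
    return cutoff, [i for i, e in enumerate(errors) if e > cutoff]


# Hypothetical normalized series and an imagined reconstruction of it:
# every point is off by about 0.01, except index 30, which is off by 0.3.
original = [i / 49 for i in range(50)]
reconstructed = [v + 0.01 for v in original]
reconstructed[30] = original[30] - 0.3

cutoff, flagged = reconstruction_anomalies(original, reconstructed)
print("cutoff:", round(cutoff, 4), "flagged:", flagged)
```

The resulting cutoff is well below 1, which shows why a fixed value of 3 would never be crossed on this scale. Other choices, such as a high percentile of the errors, follow the same pattern.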

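Evaluating Anomaly Detection Models

To evaluate an anomaly detection model accurately, you need labeled data that records whether anomalies are present or absent. In real-world scenarios, however, labeled data with known anomalies is almost impossible to obtain, so alternative techniques can be used to assess how effective these models are.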
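When labels are available, a detector's output is commonly scored with precision, recall, and F1. The sketch below computes them from two lists of indices. The function name and the example labels are assumptions made for illustration.

```python
def precision_recall_f1(true_anomalies, flagged):
    """Score flagged indices against the indices of known anomalies."""
    true_set, flagged_set = set(true_anomalies), set(flagged)
    tp = len(true_set & flagged_set)  # correctly flagged anomalies
    precision = tp / len(flagged_set) if flagged_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Hypothetical labels: four known anomalies, three points flagged by a model.
known = [10, 20, 30, 40]
flagged = [10, 20, 25]

precision, recall, f1 = precision_recall_f1(known, flagged)
print(precision, recall, f1)
```

Here two of the three flagged points are real anomalies, and two of the four real anomalies were found. Because anomalies are rare, these metrics are usually more informative than plain accuracy.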

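One of the most common techniques is cross-validation, which splits the available labeled data into several subsets, or folds. The model is trained on one part of the data and evaluated on the rest. This process is repeated several times and the evaluation results are averaged to obtain a more reliable estimate of model performance.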
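The splitting step can be sketched as follows. The generator `k_fold_indices` is a hypothetical helper that produces contiguous folds without shuffling. For time series, splits that preserve temporal order are often preferred, so treat this as the plain k-fold scheme described above rather than a recommendation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold. A model would be fitted on each `train` set and scored on the matching `test` set, and the per-fold scores would then be averaged.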

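When labeled data is not readily available, unsupervised evaluation metrics can also be used. These metrics assess the performance of an anomaly detection model based on characteristics inherent in the data itself, such as clustering or density estimation. Examples include the silhouette score, the Dunn index, and the average nearest-neighbor distance.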
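As one possible reading of the average nearest-neighbor distance, the sketch below compares the flagged points with the rest: if the detector is picking up isolated values, the flagged group should sit much farther from its neighbours. The helper names, the data, and the flagged indices are all assumptions for illustration.

```python
def nearest_neighbor_distance(values, i):
    """Distance from values[i] to the closest other value."""
    return min(abs(values[i] - v) for j, v in enumerate(values) if j != i)


def mean_nn_distance(values, indices):
    """Average nearest-neighbour distance over the given indices."""
    return sum(nearest_neighbor_distance(values, i) for i in indices) / len(indices)


# Hypothetical prices and a hypothetical set of indices flagged by a detector.
values = [100.0, 100.5, 101.0, 99.5, 100.2, 130.0, 99.8, 100.7, 70.0, 100.1]
flagged = [5, 8]
normal = [i for i in range(len(values)) if i not in flagged]

flagged_mean = mean_nn_distance(values, flagged)
normal_mean = mean_nn_distance(values, normal)
print("flagged:", flagged_mean, "normal:", normal_mean)
```

A large gap between the two averages suggests that the flagged points are indeed isolated. It is a sanity check rather than proof, since it says nothing about anomalies that lie close to normal values.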

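Summary

This article explored various techniques for time series anomaly detection using machine learning. The data was first preprocessed to handle missing values, smooth the series, and remove outliers. Statistical methods for anomaly detection, such as the z-score and the moving average, were discussed next. Finally, machine learning methods including Isolation Forest and the LSTM autoencoder were covered.

Anomaly detection is a challenging task that requires a solid understanding of the time series data and suitable techniques for uncovering abnormal patterns and outliers. Remember to try different algorithms, tune their parameters, and evaluate model performance to get the best results.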

This article is reposted from the WeChat official account @算法进阶.