I. What Is the SGD Optimization Algorithm?
SGD (Stochastic Gradient Descent) is a widely used optimization method in machine learning. It is an iterative procedure for finding a minimum of a loss function. Unlike traditional full-batch gradient descent, SGD uses a single randomly chosen sample for each update, which makes every iteration far cheaper to compute. Common variants include plain (single-sample) SGD and Mini-Batch SGD.
The core idea of SGD is to iterate toward parameter values at which the gradient of the loss function is (approximately) zero, thereby optimizing the model. During training, SGD picks one sample at a time, computes the gradient on that sample, and adjusts the parameters by a step of a fixed size in the direction opposite to that gradient, gradually driving the loss down until a preset convergence tolerance is reached.
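In code, a single SGD update looks roughly like the minimal sketch below. The names sgd_step and grad_fn are illustrative placeholders rather than part of any particular library, and grad_fn is assumed to return the gradient of the per-sample loss.

import numpy as np

def sgd_step(theta, grad_fn, X, y, lr=0.01):
    """One SGD update: move theta a small step against the gradient on one random sample."""
    i = np.random.randint(0, X.shape[0])   # draw a single training example at random
    grad = grad_fn(theta, X[i], y[i])      # gradient of the per-sample loss at theta
    return theta - lr * grad               # step of size lr in the descent direction

Repeating this step many times traces a noisy but, on average, descending path on the loss surface; the concrete classes later in this article apply exactly this pattern.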
II. Advantages of the SGD Optimization Algorithm
Compared with traditional full-batch gradient descent, SGD has the following advantages:
1. Low memory usage: only one sample needs to be processed per update, which saves a large amount of memory;
2. Handles high-dimensional data well: SGD tends to remain effective when the number of features is large;
3. Fast convergence in practice: because each iteration processes only a single sample, updates are cheap, so the loss usually decreases quickly.
III. Applications of SGD in Machine Learning
SGD is widely used in machine learning, and especially in deep learning. Below are some common models trained with an SGD optimizer:
1. An SGD optimizer for a linear regression model:
import numpy as np

class LinearRegression:
    """Linear regression trained with single-sample SGD."""
    def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, verbose=False):
        self.lr = lr                          # learning rate (step size)
        self.num_iter = num_iter              # number of SGD updates
        self.fit_intercept = fit_intercept    # whether to add a bias column
        self.verbose = verbose

    def __add_intercept(self, X):
        # Prepend a column of ones so the bias is learned as theta[0].
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)

    def __loss(self, h, y):
        # Mean squared error with the conventional 1/(2n) factor.
        return 1 / (2 * len(y)) * np.sum((h - y) ** 2)

    def fit(self, X, y):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        self.theta = np.zeros(X.shape[1])
        for i in range(self.num_iter):
            # Pick one training example at random and update on its gradient alone.
            rand_idx = np.random.randint(0, X.shape[0])
            X_i = X[rand_idx, :]
            y_i = y[rand_idx]
            h = np.dot(X_i, self.theta)
            gradient = X_i.T.dot(h - y_i)     # gradient of the squared error on this sample
            self.theta -= self.lr * gradient
            if self.verbose and i % 10000 == 0:
                h = np.dot(X, self.theta)
                print(f'Iteration {i}, loss = {self.__loss(h, y)}')

    def predict(self, X):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        return np.dot(X, self.theta)
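A quick way to exercise the class above is on synthetic data; the coefficients and hyperparameters below are invented purely for illustration.

import numpy as np

np.random.seed(0)
X = np.random.randn(200, 3)                      # 200 samples, 3 features
y = X.dot(np.array([1.5, -2.0, 0.7])) + 0.5      # known weights plus an intercept of 0.5

model = LinearRegression(lr=0.01, num_iter=50000)
model.fit(X, y)
print(model.theta)   # should end up close to [0.5, 1.5, -2.0, 0.7]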
2. An SGD optimizer for a logistic regression model:
import numpy as np

class LogisticRegression:
    """Logistic regression trained with single-sample SGD."""
    def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, verbose=False):
        self.lr = lr                          # learning rate (step size)
        self.num_iter = num_iter              # number of SGD updates
        self.fit_intercept = fit_intercept    # whether to add a bias column
        self.verbose = verbose

    def __add_intercept(self, X):
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)

    def __sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def __loss(self, h, y):
        # Average binary cross-entropy.
        return (-1 / len(y)) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

    def fit(self, X, y):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        self.theta = np.zeros(X.shape[1])
        for i in range(self.num_iter):
            # Pick one training example at random and update on its gradient alone.
            rand_idx = np.random.randint(0, X.shape[0])
            X_i = X[rand_idx, :]
            y_i = y[rand_idx]
            z = np.dot(X_i, self.theta)
            h = self.__sigmoid(z)
            gradient = X_i.T.dot(h - y_i)     # gradient of the cross-entropy on this sample
            self.theta -= self.lr * gradient
            if self.verbose and i % 10000 == 0:
                z = np.dot(X, self.theta)
                h = self.__sigmoid(z)
                print(f'Iteration {i}, loss = {self.__loss(h, y)}')

    def predict_proba(self, X):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        return self.__sigmoid(np.dot(X, self.theta))

    def predict(self, X, threshold=0.5):
        return self.predict_proba(X) >= threshold
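A similarly hypothetical usage example for the logistic regression class, on a toy linearly separable problem made up for illustration:

import numpy as np

np.random.seed(0)
X = np.random.randn(300, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)        # toy binary labels

clf = LogisticRegression(lr=0.05, num_iter=100000)
clf.fit(X, y)
print((clf.predict(X) == y).mean())              # training accuracy, expected to be high here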
3. An SGD-style optimizer for a simple neural network.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

class NeuralNetwork:
    """One-hidden-layer network (tanh hidden units, sigmoid output) trained by gradient descent."""
    def __init__(self, lr=0.01, num_iter=100, hidden_size=4, fit_intercept=True, verbose=False):
        self.lr = lr
        self.num_iter = num_iter
        self.fit_intercept = fit_intercept
        self.verbose = verbose
        self.hidden_size = hidden_size

    def __add_intercept(self, X):
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)

    def __loss(self, y, y_hat):
        # Average binary cross-entropy.
        return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    def initialize_parameters(self, X):
        input_size = X.shape[1]
        output_size = 1
        # Small random weights break symmetry; biases start at zero.
        self.params = {
            'W1': np.random.randn(input_size, self.hidden_size) * 0.01,
            'b1': np.zeros((1, self.hidden_size)),
            'W2': np.random.randn(self.hidden_size, output_size) * 0.01,
            'b2': np.zeros((1, output_size))
        }

    def forward_propagation(self, X):
        Z1 = np.dot(X, self.params['W1']) + self.params['b1']
        A1 = np.tanh(Z1)
        Z2 = np.dot(A1, self.params['W2']) + self.params['b2']
        y_hat = sigmoid(Z2)
        cache = {'A1': A1, 'Z1': Z1, 'Z2': Z2}
        return y_hat, cache

    def backward_propagation(self, X, y, y_hat, cache):
        # Gradients of the cross-entropy loss, averaged over the batch.
        dZ2 = y_hat - y
        dW2 = np.dot(cache['A1'].T, dZ2) / X.shape[0]
        db2 = np.sum(dZ2, axis=0, keepdims=True) / X.shape[0]
        dZ1 = np.dot(dZ2, self.params['W2'].T) * (1 - np.power(cache['A1'], 2))
        dW1 = np.dot(X.T, dZ1) / X.shape[0]
        db1 = np.sum(dZ1, axis=0, keepdims=True) / X.shape[0]
        return {'dW2': dW2, 'db2': db2, 'dW1': dW1, 'db1': db1}

    def update_parameters(self, grads):
        self.params['W1'] -= self.lr * grads['dW1']
        self.params['b1'] -= self.lr * grads['db1']
        self.params['W2'] -= self.lr * grads['dW2']
        self.params['b2'] -= self.lr * grads['db2']

    def fit(self, X, y):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        y = np.asarray(y).reshape(-1, 1)      # column vector so shapes broadcast correctly
        self.initialize_parameters(X)
        # Note: this loop computes the gradient on the full batch each iteration;
        # a per-sample SGD variant would draw one random example per step,
        # exactly as in the two classes above.
        for i in range(self.num_iter):
            y_hat, cache = self.forward_propagation(X)
            loss = self.__loss(y, y_hat)
            grads = self.backward_propagation(X, y, y_hat, cache)
            self.update_parameters(grads)
            if self.verbose and i % 10 == 0:
                print(f'Iteration {i}, loss = {loss}')

    def predict(self, X):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        y_hat, _ = self.forward_propagation(X)
        return y_hat
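As with the previous classes, here is a hypothetical way to call the network. The data and hyperparameters are made up, and the exact loss and accuracy will vary from run to run.

import numpy as np

np.random.seed(0)
X = np.random.randn(400, 2)
y = (X[:, 0] - X[:, 1] > 0).astype(float)        # a toy binary target

nn = NeuralNetwork(lr=0.1, num_iter=2000, hidden_size=4)
nn.fit(X, y)
pred = (nn.predict(X) >= 0.5).ravel()
print((pred == y).mean())                        # training accuracy on the toy data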
IV. How to Choose SGD Hyperparameters
The main hyperparameters of SGD are the learning rate, the number of iterations, and the batch size. Choosing them well is critical to model performance; some common heuristics follow (a generic mini-batch training loop is sketched after the list):
1. Learning rate: the choice usually depends on the dataset and the model. If the learning rate is too large, the algorithm may fail to converge; if it is too small, convergence becomes very slow. A common starting point is 0.001, adjusted afterwards based on experiments;
2. Number of iterations: it must be large enough for the algorithm to converge to a good solution, yet not so large that training wastes time and resources. In practice it is chosen according to the dataset size and model complexity;
3. Batch size: the batch size is usually a fairly small number, but it should be neither too small nor too large. An overly large batch can exhaust memory, while an overly small one makes the gradient estimates noisy and hurts optimization quality. Choose it according to the dataset size and the hardware available.
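To make the interaction between these three hyperparameters concrete, the sketch below shows a minimal, generic mini-batch SGD loop. The function and argument names are illustrative, and grad_fn is assumed to return the gradient averaged over a batch.

import numpy as np

def minibatch_sgd(X, y, grad_fn, theta, lr=0.001, num_epochs=20, batch_size=32):
    """Generic mini-batch SGD: reshuffle each epoch, take one step per batch."""
    n = X.shape[0]
    for epoch in range(num_epochs):
        idx = np.random.permutation(n)            # reshuffle so batches differ across epochs
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            theta = theta - lr * grad_fn(theta, X[batch], y[batch])
    return theta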