model_selection in Detail

1. Cross-Validation

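A model's performance metrics and its ability to generalize are the main indicators of its quality. To measure them, we split the data into a training set and a test set, but one problem remains: how can we make sure the result is not distorted by noise in the dataset? Cross-validation addresses this problem.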

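Cross-validation splits the data into several subsets, trains the model on some of them, and validates it on the rest. The most common variant is K-fold cross-validation: the data is divided into K subsets, and each subset is used exactly once to validate the model while the other K-1 subsets are used to train it.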
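
To make the splitting concrete, here is a minimal sketch of what K-fold does to the row indices. The toy array `X_demo` and the counter `validation_counts` are made up for illustration and are not part of the article's data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, purely for illustration (not the article's data.csv)
X_demo = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

validation_counts = np.zeros(len(X_demo), dtype=int)
for fold, (train_idx, val_idx) in enumerate(kf.split(X_demo)):
    # Each fold trains on K-1 subsets and validates on the remaining one
    print(f"fold {fold}: train={train_idx}, validate={val_idx}")
    validation_counts[val_idx] += 1

# Every sample lands in a validation subset exactly once across the K folds
print(validation_counts)
```

With 10 samples and K=5, each validation subset holds 2 samples and each training subset holds the other 8.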


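import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

# Load the data
data = pd.read_csv('data.csv')

# Separate the features from the target column
X = data.drop('target', axis=1).values
y = data['target'].values

# Define the cross-validation splitter: 5 shuffled folds
kfolds = KFold(n_splits=5, shuffle=True, random_state=1234)

# Create the model
model = LinearRegression()

# Evaluate the model with cross-validation (a regressor's default score is R^2)
scores = cross_val_score(model, X, y, cv=kfolds)

# Print the mean cross-validation score
print("Cross-validation score: ", scores.mean())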

2. Grid Search Tuning

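Tuning parameters is an essential step in machine learning, and grid search helps us find the best combination of parameters. In a grid search we define a list of candidate values for each parameter, and the model is scored for every combination of those values.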
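
The idea can be sketched by hand: enumerate every combination, score each one with cross-validation, and keep the best. This is only an illustration of the principle, not how `GridSearchCV` is implemented internally. The grid `demo_grid` and the variable names are assumptions chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

# Illustrative grid: two kernels x two values of C = four candidates
demo_grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}

results = []
for params in ParameterGrid(demo_grid):
    # Mean 5-fold cross-validation accuracy for this parameter combination
    score = cross_val_score(SVC(**params), X_demo, y_demo, cv=5).mean()
    results.append((score, params))
    print(params, round(score, 4))

# Keep the combination with the highest mean score
best_score, best_params = max(results, key=lambda item: item[0])
print("best:", best_params, best_score)
```

The number of fits grows multiplicatively with each added parameter list, so a grid is best kept small.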

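A practical example is classifying the iris dataset with an SVM. We can use grid search to tune the SVM's kernel and its parameter C. First, define a parameter dictionary and a classifier object: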


from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

# Load the iris dataset
iris = datasets.load_iris()

# Prepare the data
X = iris.data
y = iris.target

# Define the parameter dictionary
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# Create the classifier object
svc = svm.SVC()

# Create the GridSearchCV object with the parameter grid and the number of folds
clf = GridSearchCV(svc, parameters, cv=5)

# Fit on the data
clf.fit(X, y)

# Print the best score and the best parameters
print("Best score:", clf.best_score_)
print("Best parameters:", clf.best_params_)

3. Pipeline

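In machine learning we usually need to chain several transformation and modeling steps. Pipeline is a tool that makes these steps easier to manage. It runs feature extraction and model training in sequence as a single estimator, which saves a lot of code.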

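A Pipeline object is a sequence of operations: it holds transformer objects that process the data, followed by a final estimator. The sequence defines the order in which the data operations and the learning task are executed.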
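
As a rough illustration of that ordering, the sketch below fits a two-step pipeline and then repeats the same steps by hand. The step names `"scale"` and `"clf"` and the choice of `StandardScaler` with `LogisticRegression(max_iter=1000)` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# Step names are arbitrary labels chosen for this sketch
pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_demo, y_demo)
pipe_pred = pipe.predict(X_demo)

# The same two steps performed manually, in the same order
scaler = StandardScaler().fit(X_demo)
manual_clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_demo), y_demo)
manual_pred = manual_clf.predict(scaler.transform(X_demo))

# Both routes produce identical predictions
print((pipe_pred == manual_pred).all())
```

Because the pipeline behaves like one estimator, it can be passed to cross-validation or grid search, and the scaler is then refit on each training fold only.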


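from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Assumption: the grid below goes up to 64 PCA components, which needs
# 64 features, so the 8x8 digits dataset is used here (iris has only 4)
X, y = load_digits(return_X_y=True)

# Define the PCA and RandomForestClassifier objects
pca = PCA()
rf = RandomForestClassifier()

# Define a Pipeline with multiple steps
pipeline = Pipeline(steps=[('pca', pca), ('randomforestclassifier', rf)])

# Parameters for the grid search, addressed as <step name>__<parameter>
param_grid = {
    'pca__n_components': [5, 15, 30, 45, 64],
    'randomforestclassifier__n_estimators': [10, 50, 100, 200],
    'randomforestclassifier__max_features': ['sqrt', 'log2']
}

# Run the cross-validated grid search to find the best parameters
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

# Print the best score and the best parameters
print("Best score:", search.best_score_)
print("Best parameters:", search.best_params_)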

4. Data Preprocessing

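Data preprocessing is an indispensable part of any machine learning workflow and has a decisive effect on the quality and usability of the data. Most scikit-learn estimators expect complete numeric input, so missing values have to be handled first, and non-numeric data such as categorical data usually has to be encoded, or the model cannot process it. There are many ways to deal with missing data, including deletion, replacement, and imputation.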

To make this easier, scikit-learn provides a number of preprocessing tools, for example standardization, normalization, binarization, and one-hot encoding.
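
Two of these tools can be sketched on a tiny array. This assumes that the normalization and binarization mentioned above correspond to `sklearn.preprocessing.Normalizer` and `Binarizer`. The array `X_demo` and the threshold of 1.5 are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, Normalizer

# Toy data, purely for illustration
X_demo = np.array([[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]])

# Normalizer rescales each row (sample) to unit L2 norm
X_normalized = Normalizer(norm="l2").fit_transform(X_demo)
print(X_normalized)
# [[0.6 0.8]
#  [0.  1. ]
#  [1.  0. ]]

# Binarizer maps values above the threshold to 1 and all others to 0
X_binary = Binarizer(threshold=1.5).fit_transform(X_demo)
print(X_binary)
# [[1. 1.]
#  [0. 1.]
#  [0. 0.]]
```

Note that `Normalizer` works per sample (row), whereas `StandardScaler` works per feature (column).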


import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Build the input data: features with missing values and categorical labels
X = np.array([[1, 2], [np.nan, 3], [7, 6], [4, np.nan], [5, 5]])
y = np.array(['a', 'b', 'a', 'b', 'c'])

# Handle missing data: replace each NaN with its column mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Handle categorical data: map each label to an integer
labelencoder = LabelEncoder()
y_encoded = labelencoder.fit_transform(y)

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

# One-hot encode the integer labels
ohe = OneHotEncoder()
y_ohe = ohe.fit_transform(y_encoded.reshape(-1, 1)).toarray()

5. Model Evaluation

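Evaluating a model is a very important part of machine learning practice: it tells us how well the model performs and how good its predictions are. Scikit-learn provides many evaluation tools, such as accuracy, recall, F1 score, and ROC curves.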


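from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the data
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Create the logistic regression model
clf = LogisticRegression()

# Train the logistic regression model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Accuracy, precision, recall, and F1 score (macro-averaged over the three classes)
print("Accuracy: {}".format(accuracy_score(y_test, y_pred)))
print("Precision: {}".format(precision_score(y_test, y_pred, average='macro')))
print("Recall: {}".format(recall_score(y_test, y_pred, average='macro')))
print("F1 score: {}".format(f1_score(y_test, y_pred, average='macro')))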
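
The ROC curve mentioned at the start of this section can be sketched the same way. Because iris has three classes, this example uses a one-vs-rest AUC and then draws the curve for a single class. The variable names, `max_iter=1000`, and the choice of class 2 are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)

# max_iter=1000 is an assumption made here to avoid convergence warnings
roc_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC analysis works on probabilities or scores, not on hard class labels
proba = roc_model.predict_proba(X_te)

# Multiclass AUC: average of the one-vs-rest AUC of each class
auc_ovr = roc_auc_score(y_te, proba, multi_class="ovr")
print("One-vs-rest ROC AUC:", auc_ovr)

# ROC curve for a single class (class 2) against the other two
fpr, tpr, thresholds = roc_curve(y_te == 2, proba[:, 2])
print("False positive rates:", fpr)
print("True positive rates:", tpr)
```

Plotting `tpr` against `fpr` gives the curve itself. An AUC close to 1.0 means the classifier ranks positive samples above negative ones almost every time.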

Original article by VLXO. If you repost it, please credit the source: https://www.506064.com/zh-hant/n/142028.html