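I. Introduction to CatBoost
CatBoost is a machine learning framework based on gradient-boosted decision trees. It was originally developed by engineers at Yandex, the Russian search engine company, and supports classification and regression tasks as well as categorical features.
Like XGBoost and LightGBM, CatBoost implements gradient boosting over decision trees. Its distinguishing features are ordered boosting and ordered target statistics for categorical features, both designed to reduce the target leakage that leads to overfitting, together with symmetric (oblivious) trees as the default base learner. For common loss functions it also chooses a default learning rate automatically when none is given.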
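II. CatBoost Compared with XGBoost and LightGBM
1. Training speed
Before CatBoost was released, XGBoost and LightGBM were the most widely used gradient boosting frameworks. CatBoost can be competitive in training speed, particularly when the data contains non-numeric (categorical) features, because it consumes them directly instead of requiring them to be expanded by one-hot encoding first. Note that the benchmark below uses purely numeric synthetic data, so it measures raw training time only, and the results depend on hardware, library versions, and parameters.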
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
import xgboost as xgb
import time

# Synthetic binary classification data: 100,000 rows, 200 numeric features
X, y = make_classification(n_samples=100000, n_features=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# CatBoost: 1000 trees of depth 6 (verbose=False silences the per-iteration log)
start_time = time.perf_counter()
model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=6, verbose=False)
model.fit(X_train, y_train)
print(f"Time taken by CatBoost : {time.perf_counter()-start_time:.2f} seconds")

# LightGBM: n_estimators is the scikit-learn name for the number of boosting rounds
start_time = time.perf_counter()
model = LGBMClassifier(n_estimators=1000, learning_rate=0.1, max_depth=6, num_leaves=31)
model.fit(X_train, y_train)
print(f"Time taken by LightGBM : {time.perf_counter()-start_time:.2f} seconds")

# XGBoost (native API): the measured time includes building the DMatrix
start_time = time.perf_counter()
dtrain = xgb.DMatrix(X_train, label=y_train)
param = {'max_depth': 6, 'eta': 0.1, 'objective': 'binary:logistic'}
num_round = 1000
model = xgb.train(param, dtrain, num_round)
print(f"Time taken by XGBoost : {time.perf_counter()-start_time:.2f} seconds")
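A single wall-clock measurement is noisy, so one option is to repeat each fit a few times and report the median. The helper below is a minimal sketch of that idea. The name median_fit_time and its repeats parameter are hypothetical and not part of any of the three libraries, and the toy workload only stands in for a real model.fit call.

```python
import statistics
import time


def median_fit_time(fit_fn, repeats=3):
    """Call fit_fn() `repeats` times and return the median elapsed seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit_fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def toy_workload():
    # Stand-in for model.fit(X_train, y_train); any zero-argument callable works
    return sum(i * i for i in range(100_000))


elapsed = median_fit_time(toy_workload, repeats=5)
print(f"Median time: {elapsed:.4f} seconds")
```

With a real model, the callable could be something like lambda: CatBoostClassifier(iterations=1000, verbose=False).fit(X_train, y_train), so that each repetition trains a fresh model.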
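2. Handling overfitting
Overfitting is a common problem in machine learning: a model that fits the training data too closely predicts poorly on new data. With XGBoost and LightGBM, controlling it usually means tuning early stopping and regularization by hand. CatBoost supports the same controls (early stopping and the l2_leaf_reg regularization parameter) and adds mechanisms of its own: ordered boosting, and the random_strength parameter, which adds randomness to the split scores when the tree structure is chosen so that the trees are less closely tuned to the training set.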
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier, early_stopping
import xgboost as xgb

X, y = make_classification(n_samples=100000, n_features=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# XGBoost: early stopping monitors the LAST entry in evals, so the held-out set goes last
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
param = {'max_depth': 6, 'eta': 0.1, 'objective': 'binary:logistic'}
num_round = 1000
evallist = [(dtrain, 'train'), (dtest, 'eval')]
model = xgb.train(param, dtrain, num_round, evals=evallist, early_stopping_rounds=10)

# LightGBM: L1/L2 regularization plus early stopping, passed as a callback (required in LightGBM 4.x)
model = LGBMClassifier(n_estimators=1000, learning_rate=0.1, max_depth=6, num_leaves=31, objective='binary', reg_alpha=1, reg_lambda=1)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], callbacks=[early_stopping(stopping_rounds=10)])

# CatBoost: randomized split scoring (random_strength) plus L2 leaf regularization (l2_leaf_reg).
# The labels are binary, so Logloss is used; use_best_model keeps the best iteration on eval_set.
# In a Jupyter notebook, plot=True can be added to fit() to draw the learning curves.
model = CatBoostClassifier(iterations=1000, loss_function='Logloss', eval_metric='Logloss', random_strength=0.1, l2_leaf_reg=4, verbose=False)
model.fit(X_train, y_train, eval_set=(X_test, y_test), use_best_model=True)
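Early stopping follows the same basic rule in all three libraries: stop once the validation metric has not improved for a fixed number of rounds, and keep the best round. The following is a simplified sketch of that rule applied to a made-up list of validation losses. The helper best_round_with_patience is hypothetical, and the real libraries apply the check during training rather than to a finished list.

```python
def best_round_with_patience(val_losses, patience):
    """Return (best_round, stopped_at) for a sequence of validation losses.

    Training stops at the first round where the loss has failed to improve
    for `patience` consecutive rounds; lower loss is better.
    """
    best_round = 0
    best_loss = float("inf")
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_round = round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    # Ran out of rounds before the patience was exhausted
    return best_round, len(val_losses) - 1


# Made-up validation losses: improvement stalls after round 3
losses = [0.69, 0.55, 0.48, 0.45, 0.46, 0.47, 0.46, 0.48, 0.50]
best, stopped = best_round_with_patience(losses, patience=3)
print(f"best round: {best}, stopped at round: {stopped}")
# prints: best round: 3, stopped at round: 6
```

A larger patience tolerates longer plateaus at the cost of extra training rounds; the value 10 used in the code above is a common starting point rather than a rule.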
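3. Handling categorical features
CatBoost handles categorical features conveniently. A plain gradient boosting algorithm works only on numeric inputs, so categorical features normally have to be converted first, for example with one-hot or label encoding, and that can hurt accuracy when a feature has many distinct values. CatBoost instead encodes the columns listed in cat_features internally, replacing each category with ordered target statistics. The CatBoostEncoder class, which applies a similar encoding as a standalone preprocessing step, belongs to the separate category_encoders package rather than to catboost, and it is not needed when training a CatBoost model.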
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

X, y = make_classification(n_samples=100000, n_features=200, random_state=42)

# make_classification returns only floats, and CatBoost rejects floating-point categorical columns,
# so the first 20 columns are binned into integer codes 0-9 to stand in for categorical features
cat_cols = list(range(0, 20))
X = pd.DataFrame(X)
for col in cat_cols:
    X[col] = pd.qcut(X[col], q=10, labels=False)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# No separate encoder is needed: columns listed in cat_features are encoded internally
model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=6, cat_features=cat_cols)
model.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=False)
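CatBoost's built-in handling of categorical features is based on ordered target statistics: each row's category is replaced by a smoothed mean of the target computed only from rows that come earlier in a random permutation, so a row never sees its own label. Below is a simplified sketch of that idea in plain Python, not CatBoost's actual implementation. The function name, the prior_weight parameter, and the toy data are assumptions made for illustration.

```python
import random


def ordered_target_stats(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each category value using only the targets of earlier rows.

    Rows are visited in a random order; the encoding of a row is
    (sum of earlier targets in its category + prior_weight * prior)
    / (count of earlier rows in its category + prior_weight).
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    sums = {}    # running target sum per category
    counts = {}  # running row count per category
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        encoded[idx] = (s + prior_weight * prior) / (n + prior_weight)
        # Update the running statistics only after encoding this row
        sums[cat] = s + targets[idx]
        counts[cat] = n + 1
    return encoded


colors = ["red", "blue", "red", "red", "blue", "green"]
labels = [1, 0, 1, 0, 0, 1]
prior = sum(labels) / len(labels)  # global target mean used for smoothing
for color, label, value in zip(colors, labels, ordered_target_stats(colors, labels, prior)):
    print(color, label, round(value, 3))
```

The first row visited in each category has no history and therefore falls back to the prior, which is why the single "green" row is always encoded as the global mean.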
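4. Improvements over AdaBoost
AdaBoost is a popular classification algorithm that combines many weak learners, typically decision stumps, by reweighting the training samples after each round so that later learners concentrate on earlier mistakes. CatBoost is not built on AdaBoost: it uses gradient boosting, in which each new tree is fitted to the gradient of a differentiable loss function and its contribution is scaled by a learning rate. This allows deeper trees, arbitrary loss functions, and explicit regularization, which often improves accuracy over AdaBoost, although the outcome depends on the data.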
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from catboost import CatBoostClassifier

X, y = make_classification(n_samples=100000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# AdaBoost baseline with scikit-learn defaults (50 depth-1 trees)
adaboost_clf = AdaBoostClassifier(random_state=42)
adaboost_clf.fit(X_train, y_train)
print(f"AdaBoost accuracy: {adaboost_clf.score(X_test, y_test):.4f}")

# CatBoost: 100 gradient-boosted trees of depth 2 with a learning rate of 0.1
model = CatBoostClassifier(loss_function='Logloss', iterations=100, random_strength=0.1, max_depth=2, learning_rate=0.1)
model.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=False)
print(f"CatBoost accuracy: {model.score(X_test, y_test):.4f}")
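The difference between the two boosting styles shows up in what each round updates. Classic discrete AdaBoost reweights the training samples so that the next weak learner focuses on the mistakes, while gradient boosting with squared loss fits the next learner to the residuals and adds it scaled by a learning rate. Below is a toy sketch of one round of each update, with hand-picked predictions standing in for a trained weak learner. The function names are hypothetical, and CatBoost's real training loop is considerably more involved.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost round for labels in {-1, +1}.

    Returns the learner weight alpha and the renormalized sample weights.
    Assumes the weighted error is strictly between 0 and 1.
    """
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones down
    new_weights = [w * math.exp(-alpha * t * p)
                   for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_weights)
    return alpha, [w / total for w in new_weights]


def gradient_boosting_round(current, learner_output, learning_rate):
    """One gradient boosting round: add the learner's output, shrunk by the learning rate."""
    return [f + learning_rate * h for f, h in zip(current, learner_output)]


# AdaBoost: four equally weighted samples, the weak learner gets the last one wrong
y = [1, 1, -1, -1]
pred = [1, 1, -1, 1]
alpha, w = adaboost_round([0.25] * 4, y, pred)
print("alpha:", round(alpha, 4), "weights:", [round(v, 4) for v in w])

# Gradient boosting (squared loss): the residuals are the negative gradient
targets = [3.0, 5.0]
current = [4.0, 4.0]  # initial prediction: the target mean
residuals = [t - f for t, f in zip(targets, current)]
# Assume an ideal weak learner that reproduces the residuals exactly
updated = gradient_boosting_round(current, residuals, learning_rate=0.1)
print("updated predictions:", [round(v, 4) for v in updated])
```

In the AdaBoost half, the one misclassified sample ends up carrying half of the total weight; in the gradient boosting half, each prediction moves only a tenth of the way toward its target, which is the shrinkage effect of the learning rate.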
Original article by 小藍. If you repost it, please cite the source: https://www.506064.com/zh-tw/n/186066.html