An Introduction to CatBoost and How It Compares with Other Algorithms

I. CatBoost Overview

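CatBoost is a machine learning framework based on gradient boosting over decision trees. It was originally developed by engineers at Yandex, the Russian search engine company. It supports classification and regression tasks and handles categorical features natively.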

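Like XGBoost and LightGBM, CatBoost implements gradient-boosted decision trees. What sets it apart is ordered boosting, the encoding of categorical features with target statistics, and symmetric (oblivious) trees. When no learning rate is given, it also selects a default automatically for the common loss functions.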

II. CatBoost Compared with XGBoost and LightGBM

1. Training speed

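Before CatBoost was released, XGBoost and LightGBM were the most widely used gradient-boosted tree frameworks. CatBoost can be competitive on training speed, particularly when the features are non-numeric, because it consumes categorical columns directly instead of requiring a separate encoding step. Actual timings depend on the data, the parameters, and the hardware, so the snippet below times all three libraries on the same synthetic dataset.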

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
import xgboost as xgb
import time

X, y = make_classification(n_samples=100000, n_features=200, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Time CatBoost (verbose=False suppresses the per-iteration training log)
start_time = time.time()
model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=6, verbose=False)
model.fit(X_train, y_train)
print(f"Time taken by CatBoost : {time.time()-start_time:.2f} seconds")

# Time LightGBM (n_estimators is the scikit-learn-style name for num_iterations)
start_time = time.time()
model = LGBMClassifier(n_estimators=1000, learning_rate=0.1, max_depth=6, num_leaves=31)
model.fit(X_train, y_train)
print(f"Time taken by LightGBM : {time.time()-start_time:.2f} seconds")

# Time XGBoost with its native API; building the DMatrix is included in the timing
start_time = time.time()
dtrain = xgb.DMatrix(X_train, label=y_train)
param = {'max_depth': 6, 'eta': 0.1, 'objective': 'binary:logistic'}
num_round = 1000
model = xgb.train(param, dtrain, num_round)
print(f"Time taken by XGBoost : {time.time()-start_time:.2f} seconds")
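Single-run timings taken with time.time() are easy to get wrong when the same start/stop lines are copied for every library. As a minimal sketch, the timing can be wrapped in a small context manager built on time.perf_counter, which is intended for measuring intervals. The helper name `timed` and the `results` dictionary are illustrative assumptions, not part of any of the three libraries; the workload inside the `with` block is a stand-in for a call such as model.fit(...).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}

# Stand-in workload; in the benchmark this would be a model.fit(...) call.
with timed("sum of squares", results):
    total = sum(i * i for i in range(100_000))

for label, seconds in results.items():
    print(f"Time taken by {label}: {seconds:.4f} seconds")
```

Collecting the durations in one dictionary makes it straightforward to repeat each measurement several times and compare medians rather than single runs.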

2. Handling overfitting

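Overfitting is a common problem in machine learning: a model that fits the training data too closely predicts poorly on new data. With XGBoost and LightGBM, overfitting is usually controlled by hand through early stopping and regularization. CatBoost supports the same controls and adds its own mechanisms: ordered boosting, which is designed to reduce the target leakage that arises when the same rows are used both to build a tree and to estimate its gradients, and the random_strength parameter, which adds randomness to split scoring.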

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier, early_stopping
import xgboost as xgb

X, y = make_classification(n_samples=100000, n_features=200, random_state=42)

# For brevity the test split doubles as the validation set; use a separate validation split in practice
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# XGBoost: early stopping monitors the LAST entry of the evaluation list, so the held-out set goes last
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
param = {'max_depth': 6, 'eta': 0.1, 'objective': 'binary:logistic'}
num_round = 1000
evallist = [(dtrain, 'train'), (dtest, 'eval')]
model = xgb.train(param, dtrain, num_round, evals=evallist, early_stopping_rounds=10)

# LightGBM: L1/L2 regularization plus early stopping through the callback API
# (the early_stopping_rounds argument of fit() was removed in LightGBM 4.0)
model = LGBMClassifier(n_estimators=1000, learning_rate=0.1, max_depth=6, num_leaves=31, objective='binary', reg_alpha=1, reg_lambda=1)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], callbacks=[early_stopping(stopping_rounds=10)])

# CatBoost: split-score randomness, L2 leaf regularization, and built-in early stopping
# (Logloss is the binary-classification loss; add plot=True only inside a Jupyter notebook)
model = CatBoostClassifier(iterations=1000, loss_function='Logloss', eval_metric='Logloss', random_strength=0.1, l2_leaf_reg=4, early_stopping_rounds=10)
model.fit(X_train, y_train, eval_set=(X_test, y_test), use_best_model=True, verbose=False)
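All three libraries offer early stopping on a validation set. The rule behind it is simple enough to sketch in plain Python: track the best validation loss seen so far and stop once a fixed number of iterations pass without improvement. The function name `best_iteration` and the loss values are made up for illustration; each library's real implementation has more options (multiple metrics, minimum improvement thresholds, and so on).

```python
def best_iteration(val_losses, patience):
    """Return (best_index, stop_index) for a sequence of validation losses.

    Training stops at the first iteration that is `patience` steps past the
    best loss seen so far without having improved on it.
    """
    best_idx = 0
    best_loss = float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i
    # No early stop was triggered: training ran to the last iteration.
    return best_idx, len(val_losses) - 1


# Hypothetical validation losses: improving up to index 3, then drifting upward.
losses = [0.69, 0.55, 0.48, 0.45, 0.46, 0.47, 0.46, 0.50, 0.52]
best, stopped = best_iteration(losses, patience=3)
print(f"best iteration: {best}, stopped at: {stopped}")
# prints: best iteration: 3, stopped at: 6
```

The gap between the two indices is why CatBoost's use_best_model=True matters: the trees built after the best iteration are discarded instead of being kept in the final model.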

3. Handling categorical features

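CatBoost handles categorical features conveniently. A plain gradient-boosted tree algorithm works only on numeric inputs, so categorical columns normally have to be encoded before training, and a poor encoding hurts accuracy. CatBoost instead replaces each category with target statistics that it computes internally during training; you only declare which columns are categorical through the cat_features parameter. A standalone encoder based on the same idea, CatBoostEncoder, is available in the separate category_encoders package, not in catboost itself.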

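import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

X, y = make_classification(n_samples=100000, n_features=200, random_state=42)

# make_classification returns only floats, and CatBoost rejects float categorical columns,
# so bin the first 20 columns into string labels to simulate categorical features
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
cat_cols = list(X.columns[:20])
for col in cat_cols:
    X[col] = pd.qcut(X[col], q=5, labels=False).astype(str)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Declare the categorical columns; CatBoost encodes them internally, no separate encoder is needed
model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=6, cat_features=cat_cols)

# Add plot=True only inside a Jupyter notebook
model.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=False)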
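To illustrate how a categorical column can be turned into numbers, here is a minimal pure-Python sketch of an ordered target statistic, the idea behind CatBoost's categorical encoding. Each row is encoded using only the targets of rows that come before it in some ordering, which keeps a row's own label out of its encoding. The function name, the smoothing formula with `prior` and `weight`, and the toy data are simplifying assumptions; CatBoost's actual implementation uses several permutations and more elaborate bookkeeping.

```python
import random
from collections import defaultdict


def ordered_target_statistic(categories, targets, prior, weight=1.0, order=None, seed=0):
    """Encode each category value from the targets of earlier rows only.

    `order` is the sequence in which rows are visited; if omitted, a random
    permutation is drawn with the given seed.
    """
    if order is None:
        order = list(range(len(categories)))
        random.Random(seed).shuffle(order)
    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running number of rows seen per category
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        # Smoothed mean of the targets seen so far for this category.
        encoded[idx] = (sums[cat] + weight * prior) / (counts[cat] + weight)
        # Only now does the row's own target enter the statistics.
        sums[cat] += targets[idx]
        counts[cat] += 1
    return encoded


colors = ["a", "b", "a", "a", "b"]
labels = [1, 0, 1, 0, 1]
# Visit rows in their natural order so the result is easy to check by hand.
encoded = ordered_target_statistic(colors, labels, prior=0.5, order=[0, 1, 2, 3, 4])
print([round(v, 3) for v in encoded])
# prints: [0.5, 0.5, 0.75, 0.833, 0.25]
```

The first occurrence of each category receives the prior, and later occurrences drift toward the category's observed target mean.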

4. Improvements over AdaBoost

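AdaBoost is a popular classification algorithm. It is itself an ensemble: it trains a sequence of weak learners (typically one-level decision trees) and, after each round, increases the weights of the samples that were misclassified. CatBoost belongs to the gradient boosting family instead: each new tree is fitted to the gradient of a differentiable loss function, and its contribution is scaled by a learning rate so that no single tree dominates the model. Combined with deeper trees and regularization, this often yields higher accuracy than AdaBoost on tabular data.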

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=100000, n_features=20, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

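# AdaBoost baseline (scikit-learn defaults: 50 one-level decision trees as weak learners)
adaboost_clf = AdaBoostClassifier(random_state=42)
adaboost_clf.fit(X_train, y_train)
print(f"AdaBoost test accuracy: {adaboost_clf.score(X_test, y_test):.4f}")

# CatBoost with shallow trees and a fixed learning rate (add plot=True only inside a Jupyter notebook)
model = CatBoostClassifier(loss_function='Logloss', iterations=100, random_strength=0.1, max_depth=2, learning_rate=0.1)
model.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=False)
print(f"CatBoost test accuracy: {model.score(X_test, y_test):.4f}")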
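To make the role of the learning rate concrete, the following is a minimal sketch of gradient boosting with one-split trees (stumps) on a tiny one-dimensional regression problem. It is plain gradient boosting for squared loss, written from scratch, and not CatBoost's implementation. The function names and the toy data are illustrative assumptions, and the stump search assumes the feature has at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds, learning_rate):
    """Fit n_rounds stumps, each to the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # For squared loss the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        # Shrink each stump's contribution by the learning rate.
        preds = [p + learning_rate * (left_value if x <= threshold else right_value)
                 for x, p in zip(xs, preds)]
    return preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

errors = {}
for rounds in (1, 10, 50):
    errors[rounds] = mse(ys, boost(xs, ys, rounds, learning_rate=0.1))
    print(f"rounds={rounds:2d}  training MSE={errors[rounds]:.4f}")
```

With a learning rate of 0.1 each stump removes only a fraction of the remaining error, so the training error falls gradually over many rounds. AdaBoost, by contrast, changes the sample weights between rounds rather than fitting residuals.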

Original article by 小藍. If you republish it, please credit the source: https://www.506064.com/zh-hant/n/186066.html