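A Comprehensive Guide to gplearn

1. gplearn Factors

gplearn is a Python genetic programming library that automatically searches for the function that best fits a dataset. The notion of a factor is central to gplearn, because every candidate function is assembled from factors: constants, variables, built-in operators, and user-defined functions.

Start with a simple example: using gplearn to fit the line y = 2x + 1. The factors needed are the variable x and numeric constants, while the operators + and × are supplied by gplearn under the names add and mul. Constants are not listed by hand; gplearn samples them from const_range.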

from gplearn.genetic import SymbolicRegressor
import numpy as np

# Training data sampled from the target line y = 2x + 1
x = np.arange(-1, 1, 0.1)
y = 2 * x + 1

# The four p_* probabilities must sum to at most 1
est_gp = SymbolicRegressor(population_size=5000, generations=50, stopping_criteria=0.01,
                           const_range=(-1, 1),
                           p_crossover=0.7, p_subtree_mutation=0.1, p_hoist_mutation=0.05,
                           p_point_mutation=0.1, max_samples=0.9, verbose=1,
                           parsimony_coefficient=0.01, random_state=0)

# gplearn expects a 2-D feature matrix, hence the reshape
est_gp.fit(x.reshape(-1, 1), y)

# Print the best evolved expression
print(est_gp._program)

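The output has the form add(mul(2, X0), 0.993), where X0 is the variable x, 0.993 is a constant, add is the + operator, and mul is the × operator. The exact expression depends on the gplearn version and the random seed. Because constants are sampled from const_range=(-1, 1), the coefficient 2 may appear in an equivalent form such as add(X0, X0) rather than as a literal.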
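To make the prefix notation concrete, the following sketch evaluates the expression add(mul(2, X0), 0.993) by hand and compares it with the target line. The add and mul functions here are illustrative stand-ins, not gplearn's own implementations, and the error measure assumes SymbolicRegressor's default metric, mean absolute error.

```python
# Hand-written stand-ins for the add and mul operators (illustrative only,
# not gplearn's own implementations).
def add(a, b):
    return a + b

def mul(a, b):
    return a * b

def program(x0):
    # The expression reported in the text: add(mul(2, X0), 0.993)
    return add(mul(2, x0), 0.993)

# Same grid as np.arange(-1, 1, 0.1), built without numpy
xs = [-1 + 0.1 * i for i in range(20)]
ys = [2 * x + 1 for x in xs]

# Mean absolute error between the evolved expression and the target line
mae = sum(abs(program(x) - y) for x, y in zip(xs, ys)) / len(xs)
print(round(mae, 3))
```

The error is a constant offset of about 0.007, which is below stopping_criteria=0.01, so a program like this would be enough to end the search early.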

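2. gplearn and deap Factors

deap is another widely used evolutionary computation library for Python, and its gp module offers a different way to describe the building blocks of an expression. gplearn cannot consume a deap PrimitiveSet or Toolbox directly: its function_set parameter accepts only the names of built-in functions or objects created with gplearn.functions.make_function. What the two libraries can share is the underlying numpy functions, so a factor library designed in deap can be mirrored in gplearn.

deap groups its building blocks into three kinds: terminals, primitives, and genetic operators. Terminals are the leaves of an expression tree (input arguments and constants), primitives are the internal nodes that compute intermediate results, and genetic operators such as crossover and mutation recombine individuals as the population evolves.

The example below first defines a basic deap factor library and then declares the same factors for gplearn:

from deap import base, creator, gp, tools
from gplearn.functions import make_function
from gplearn.genetic import SymbolicRegressor
import numpy as np

def protectedDiv(x1, x2):
    # Element-wise division that returns 1 wherever the denominator is 0
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.where(x2 == 0, 1.0, np.divide(x1, x2))

def protectedSqrt(x1):
    # Square root of the absolute value, defined for negative inputs too
    return np.sqrt(np.abs(x1))

# deap side: one input argument, matching the single-feature data below
pset = gp.PrimitiveSet("MAIN", 1)
pset.addPrimitive(np.add, 2)
pset.addPrimitive(np.subtract, 2)
pset.addPrimitive(np.multiply, 2)
pset.addPrimitive(protectedDiv, 2)
pset.addPrimitive(np.sin, 1)
pset.addPrimitive(np.cos, 1)
pset.addPrimitive(np.tan, 1)
pset.addPrimitive(np.square, 1)
pset.addPrimitive(protectedSqrt, 1)

creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
toolbox.register("expr", gp.genHalfAndHalf, pset=pset, min_=1, max_=2)
toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.expr)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

# gplearn side: function_set takes built-in names or make_function objects,
# not a deap toolbox. gplearn's built-in 'div' and 'sqrt' are already protected.
square = make_function(function=np.square, name='square', arity=1)
function_set = ['add', 'sub', 'mul', 'div', 'sin', 'cos', 'tan', 'sqrt', square]

x = np.arange(-1, 1, 0.1)
y = 2 * x + 1

est_gp = SymbolicRegressor(population_size=5000, generations=50, stopping_criteria=0.01,
                           const_range=(-1, 1),
                           p_crossover=0.7, p_subtree_mutation=0.1, p_hoist_mutation=0.05,
                           p_point_mutation=0.1, max_samples=0.9, verbose=1,
                           parsimony_coefficient=0.01, random_state=0,
                           function_set=function_set)

est_gp.fit(x.reshape(-1, 1), y)

print(est_gp._program)

The workflow is the same as in the previous example, except that SymbolicRegressor now uses a factor set that mirrors the deap primitive set: addition, subtraction, multiplication, division, sine, cosine, tangent, square, and square root.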
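Division and square root are undefined for some inputs, so genetic programming libraries usually rely on "protected" variants that stay finite everywhere. The sketch below illustrates the idea with numpy. The helper names protected_div, protected_sqrt, and is_closed are hypothetical, and the 1e-3 threshold is an assumption chosen for illustration rather than a documented value.

```python
import numpy as np

def protected_div(x1, x2):
    # Return 1 where the denominator is (near) zero, x1 / x2 elsewhere
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(np.abs(x2) > 1e-3, np.divide(x1, x2), 1.0)

def protected_sqrt(x1):
    # Square root of the absolute value, so negative inputs stay finite
    return np.sqrt(np.abs(x1))

def is_closed(func, arity, n=10):
    # Check that func returns finite values on all-zero and all-negative inputs
    for fill in (0.0, -1.0):
        args = [np.full(n, fill) for _ in range(arity)]
        if not np.all(np.isfinite(func(*args))):
            return False
    return True

print(is_closed(protected_div, 2))
print(is_closed(protected_sqrt, 1))
with np.errstate(invalid="ignore"):
    print(is_closed(np.sqrt, 1))  # plain sqrt yields NaN for negative inputs
```

A function that fails this kind of check can produce NaN or infinite values in the middle of an expression tree, which makes the fitness of the whole program meaningless.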

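3. gplearn and JoinQuant

gplearn can be combined with JoinQuant to develop quantitative stock-selection strategies. JoinQuant is a quantitative trading platform in China that covers stock selection, strategy backtesting, live trading, and risk management.

The example below sketches how gplearn can be combined with the JoinQuant platform for quick strategy prototyping. Taking the detection of a bearish market trend as the task, it uses gplearn's SymbolicClassifier to train a model and make predictions.

Using daily bars (unit='1d' in the code), it defines several features related to the market trend and lets gplearn optimize how they are combined.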

########## Define function factors ##########
import numpy as np
from gplearn.functions import make_function
from gplearn.genetic import SymbolicClassifier

# gplearn has no typed primitives like deap's PrimitiveSetTyped, so boolean
# values are encoded as floats here: 1.0 for True and 0.0 for False.
def _greater(x1, x2):
    return np.where(x1 > x2, 1.0, 0.0)
def _if_then_else(cond, output1, output2):
    return np.where(cond > 0.5, output1, output2)
def _and(x1, x2):
    return np.where((x1 > 0.5) & (x2 > 0.5), 1.0, 0.0)
def _or(x1, x2):
    return np.where((x1 > 0.5) | (x2 > 0.5), 1.0, 0.0)
def _not(x1):
    return np.where(x1 > 0.5, 0.0, 1.0)
def _square(x1):
    return x1 * x1

# Built-in functions are named by string; the built-in 'sqrt' is protected
function_set = ['add', 'sub', 'mul', 'sqrt',
                make_function(function=_greater, name='gt', arity=2),
                make_function(function=_if_then_else, name='if_then_else', arity=3),
                make_function(function=_and, name='and', arity=2),
                make_function(function=_or, name='or', arity=2),
                make_function(function=_not, name='not', arity=1),
                make_function(function=_square, name='square', arity=1)]

########## Define the SymbolicClassifier ##########
# Constants are sampled from const_range, which covers the 0.4 and 0.6 thresholds
clf = SymbolicClassifier(population_size=1000,
                         generations=20,
                         tournament_size=20,
                         stopping_criteria=0.01,
                         const_range=(0.0, 1.0),
                         function_set=function_set,
                         feature_names=['open', 'high', 'low'],
                         random_state=0,
                         verbose=1)

########## Select the stock pool ##########
def initialize(context):
    set_benchmark('000300.XSHG')
    g.stocks = get_index_stocks('000300.XSHG')
    g.stock_buyed = False  # whether the strategy currently holds positions

########## Define the trading logic ##########
def handle_data(context, data):
    # Close all existing positions before re-evaluating the signal
    if g.stock_buyed:
        for stock in list(context.portfolio.positions.keys()):
            order_target_value(stock, 0)
        g.stock_buyed = False

    # Daily bars of the benchmark index, assumed here as the market proxy
    X = get_bars('000300.XSHG',
                 count=300,
                 unit='1d',
                 fields=['open', 'close', 'high', 'low', 'volume', 'money'],
                 include_now=True,
                 df=True)
    X['diff'] = (X['close'] - X['open']) / X['open']
    X['avg_price'] = X['money'] / X['volume']
    X['vol_ave_3'] = X['volume'].rolling(window=3).mean()
    X['vol_ave_5'] = X['volume'].rolling(window=5).mean()
    X['vol_ave_10'] = X['volume'].rolling(window=10).mean()
    X = X.dropna()
    # Label: True when the bar closed above its open. Features and label come
    # from the same bar, so this is an in-sample illustration, not a forecast.
    y = (X['close'] > X['open']).values

    features = X[['open', 'high', 'low']].values
    clf.fit(features, y)
    predict = clf.predict(features[-1].reshape(1, -1))[0]

    if predict and not g.stock_buyed:
        # Rank the pool by market capitalization and keep the smaller half
        q = query(valuation.code, valuation.market_cap).filter(
            valuation.code.in_(g.stocks)).order_by(valuation.market_cap.asc())
        df = get_fundamentals(q)
        targets = list(df['code'])[:len(g.stocks) // 2]
        # Skip ST, suspended, and limit-up stocks (fields assumed from
        # JoinQuant's get_current_data())
        current = get_current_data()
        value = context.portfolio.total_value / len(targets)
        for stock in targets:
            info = current[stock]
            if info.is_st or info.paused or info.last_price >= info.high_limit:
                continue
            order_target_value(stock, value)
        g.stock_buyed = True

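Actual performance depends on the size of the chosen stock pool and on the chosen factors, both of which can be adjusted freely on the JoinQuant platform.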
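For intuition on how a classifier built on an evolved expression can produce a True/False signal, the sketch below passes the raw numeric output of an expression through a sigmoid and thresholds the resulting probability at 0.5. This assumes SymbolicClassifier's default transformer='sigmoid' setting; the classify helper and the raw values are hypothetical and only illustrate the idea.

```python
import math

def sigmoid(z):
    # Logistic function mapping any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def classify(raw_output, threshold=0.5):
    # Probability of the positive class, then a hard True/False decision
    prob = sigmoid(raw_output)
    return prob, prob > threshold

# Hypothetical raw outputs of an evolved expression for three bars
for raw in (-2.0, 0.5, 3.0):
    prob, label = classify(raw)
    print(f"raw={raw:+.1f}  p(up)={prob:.3f}  signal={label}")
```

Because the sigmoid is monotonic, the sign of the raw output decides the class: positive outputs map to True and negative outputs to False.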

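4. Factor Mining with gplearn

gplearn can also use genetic programming to mine new factors from a dataset. The example below performs feature extraction on the Iris dataset.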

########## Load the data ##########
from sklearn.datasets import load_iris
from gplearn.genetic import SymbolicTransformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_iris()

########## Define the gplearn object ##########
# The default fitness is the absolute Pearson correlation (higher is better),
# so stopping_criteria=1.0 lets the search run for all generations.
est_gp = SymbolicTransformer(population_size=5000,
                             hall_of_fame=100,
                             n_components=10,
                             generations=20,
                             tournament_size=20,
                             stopping_criteria=1.0,
                             p_crossover=0.7,
                             p_subtree_mutation=0.1,
                             p_hoist_mutation=0.05,
                             p_point_mutation=0.1,
                             max_samples=0.9,
                             verbose=1,
                             random_state=0,
                             # Built-in functions are referenced by name
                             function_set=('add', 'sub', 'mul', 'div',
                                           'sin', 'cos', 'tan', 'sqrt', 'log', 'abs',
                                           'max', 'min'))

X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=0.2, random_state=0)

# The transformer produces the new features; logistic regression classifies them
pipeline = make_pipeline(est_gp, LogisticRegression(max_iter=1000))

pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))

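The code above loads the Iris dataset with load_iris. The dataset contains 150 samples, each with four features.

It then defines a SymbolicTransformer, which uses genetic programming to evolve factor expressions that correlate strongly with the target. The n_components=10 parameter does not limit the running time; it sets how many evolved expressions are returned as new features. These are chosen from the hall_of_fame=100 best programs of the final generation, keeping the least correlated ones. The search runs for up to 20 generations of 5000 individuals each.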
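To illustrate how a candidate factor can be scored during this kind of search, the sketch below computes the absolute Pearson correlation between a few candidate expressions and a target. This is a plain-Python illustration of the idea behind a correlation-based fitness, not gplearn's implementation; the abs_pearson helper and the toy data are made up for the example.

```python
import math

def abs_pearson(feature, target):
    # Absolute Pearson correlation between a candidate feature and the target
    n = len(feature)
    mf = sum(feature) / n
    mt = sum(target) / n
    cov = sum((f - mf) * (t - mt) for f, t in zip(feature, target))
    var_f = sum((f - mf) ** 2 for f in feature)
    var_t = sum((t - mt) ** 2 for t in target)
    if var_f == 0 or var_t == 0:
        return 0.0  # a constant feature carries no information
    return abs(cov / math.sqrt(var_f * var_t))

# Made-up toy data: two raw inputs and a target equal to their product
x0 = [1.0, 2.0, 3.0, 4.0, 5.0]
x1 = [2.0, 1.0, 4.0, 3.0, 6.0]
target = [a * b for a, b in zip(x0, x1)]

candidates = {
    "X0": x0,
    "add(X0, X1)": [a + b for a, b in zip(x0, x1)],
    "mul(X0, X1)": [a * b for a, b in zip(x0, x1)],
}
scores = {name: abs_pearson(vals, target) for name, vals in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {score:.3f}")
```

The candidate that reproduces the structure of the target scores highest, which is the kind of expression a correlation-driven search tends to keep.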

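5. Custom Functions in gplearn

gplearn's function library can be extended with user-defined functions, which the genetic programming search then uses when looking for the best-fitting model. A custom function must be wrapped with gplearn.functions.make_function before it is added to function_set. After that it is used like any built-in function, which makes it quick to tailor the search to a particular problem.

As an example, the cube of x is used as a custom function to fit some sample data: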

from gplearn.functions import make_function
from gplearn.genetic import SymbolicRegressor
import numpy as np

def my_func(x):
    # Element-wise cube; finite for zero and negative inputs
    return x**3

# function_set accepts built-in names or objects created by make_function
cube = make_function(function=my_func, name='cube', arity=1)

est_gp = SymbolicRegressor(population_size=5000, generations=75, stopping_criteria=0.01,
                           tournament_size=20, const_range=(-1, 1),
                           p_crossover=0.7, p_subtree_mutation=0.1, p_hoist_mutation=0.05,
                           p_point_mutation=0.1, max_samples=0.9, verbose=1,
                           function_set=['add', 'sub', 'mul', 'div', 'sqrt', 'log', cube],
                           parsimony_coefficient=0.01, random_state=0)

# Target function: y = x^3 - 0.5
x = np.arange(-1, 1, 0.1)
y = x**3 - 0.5

est_gp.fit(x.reshape(-1, 1), y)

print(est_gp._program)

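The code defines a custom cube function, my_func, and adds it to function_set. During evolution, gplearn can then use it as a node in candidate expressions, just like a built-in function.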
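One reason a custom function helps is that it shortens the expression needed to describe the target, and the parsimony_coefficient setting penalizes long programs. The sketch below illustrates the idea with expressions written as nested tuples. The count_nodes and penalized_fitness helpers are hypothetical, and the penalty formula is a simplified assumption (error plus coefficient times program length), not gplearn's exact implementation.

```python
def count_nodes(expr):
    # An expression is either a leaf (name or number) or a tuple (name, *args)
    if not isinstance(expr, tuple):
        return 1
    return 1 + sum(count_nodes(arg) for arg in expr[1:])

def penalized_fitness(raw_error, expr, parsimony_coefficient=0.01):
    # For an error metric (lower is better), longer programs score worse
    return raw_error + parsimony_coefficient * count_nodes(expr)

# Two exact representations of y = x**3 - 0.5 (raw error 0 for both)
with_cube = ("sub", ("cube", "X0"), 0.5)
without_cube = ("sub", ("mul", "X0", ("mul", "X0", "X0")), 0.5)

for label, expr in (("with cube", with_cube), ("without cube", without_cube)):
    print(label, count_nodes(expr), penalized_fitness(0.0, expr))
```

Both expressions fit the data equally well, but the one using the cube node has fewer nodes and therefore the better penalized score.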

Original article by KCNJL. If you repost it, please credit the source: https://www.506064.com/zh-hk/n/333118.html