Financial and Business Algorithm Modeling: Based on Python and SAS

2.1.4 Python Case Study: Variable Selection for Multiple Linear Regression

This section demonstrates variable selection by forward selection. We first define a forward-selection function:


from statsmodels.formula.api import ols


def forward_select(data, response):
    """Forward selection: add one variable at a time while AIC keeps falling."""
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = float('inf'), float('inf')
    while remaining:
        aic_with_candidates = []
        # Fit one model per remaining candidate, added to the current selection
        for candidate in remaining:
            formula = "{} ~ {}".format(
                response, ' + '.join(selected + [candidate]))
            aic = ols(formula=formula, data=data).fit().aic
            aic_with_candidates.append((aic, candidate))
        # Sort in descending order so that pop() returns the lowest AIC
        aic_with_candidates.sort(reverse=True)
        best_new_score, best_candidate = aic_with_candidates.pop()
        if current_score > best_new_score:
            # The best candidate lowers the AIC: keep it and continue
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
            print('aic is {},continuing!'.format(current_score))
        else:
            # No candidate improves the AIC: stop
            print('forward selection over!')
            break
    formula = "{} ~ {}".format(response, ' + '.join(selected))
    print('final formula is {}'.format(formula))
    model = ols(formula=formula, data=data).fit()
    return model

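The code uses the Akaike information criterion (AIC) as the selection criterion; the smaller the value, the better. Using this function, we screen four independent variables: income, age, district average home value, and district average income: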


data_for_select = train[['avg_exp', 'Income', 'Age', 'dist_home_val', 
                         'dist_avg_income']]
forward_select_model = forward_select(data=data_for_select, response='avg_exp')
print(forward_select_model.rsquared)

The output is as follows:


aic is 1007.6801413968115,continuing!
aic is 1005.4969816306302,continuing!
aic is 1005.2487355956046,continuing!
forward selection over!
final formula is avg_exp ~ dist_avg_income + Income + dist_home_val
0.5411512928411949

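As the output shows, the AIC falls to 1005.25. The final formula contains dist_avg_income, Income and dist_home_val, so the variable the algorithm leaves out is age (Age), not district average income. The goodness of fit R² of the resulting model is 0.541.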
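
Why the search stops before a fourth variable is added can be seen from how AIC trades fit against model size. The following is a minimal sketch, assuming a least-squares fit with Gaussian errors and counting only the regression coefficients (intercept included) as parameters, which to our understanding is the convention behind the `aic` value of an OLS fit in statsmodels. The helper `gaussian_aic` and the numbers passed to it are hypothetical illustrations, not values from the book's data set.

```python
import math


def gaussian_aic(rss, n_obs, n_params):
    """AIC of a least-squares fit under Gaussian errors (illustrative helper).

    rss      -- residual sum of squares of the fitted model
    n_obs    -- number of observations
    n_params -- number of estimated coefficients, intercept included
    """
    # Maximized Gaussian log-likelihood written in terms of the RSS
    log_lik = -0.5 * n_obs * (math.log(2 * math.pi) + math.log(rss / n_obs) + 1)
    # AIC = 2k - 2 ln(L): each extra parameter costs 2 units
    return 2 * n_params - 2 * log_lik


# Hypothetical numbers: an extra variable lowers the RSS only slightly
aic_small = gaussian_aic(rss=120.0, n_obs=50, n_params=3)
aic_large = gaussian_aic(rss=119.0, n_obs=50, n_params=4)

# The small gain in fit does not offset the penalty of 2,
# so a forward search would reject the larger model.
print(aic_small < aic_large)  # True
```

In this sketch the larger model improves the fit, but its AIC is still higher, which is the same situation in which `forward_select` prints its stop message and leaves the remaining variable out.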