![金融商业算法建模:基于Python和SAS](https://wfqqreader-1252317822.image.myqcloud.com/cover/743/41426743/b_41426743.jpg)
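2.1.4 Python Case Study: Variable Selection for Multiple Linear Regression

This section demonstrates variable selection with the forward selection method. We first define a forward selection function:

    from statsmodels.formula.api import ols


    def forward_select(data, response):
        """Greedy forward selection: add one variable at a time while AIC falls."""
        # Candidates are all columns except the response variable.
        remaining = set(data.columns)
        remaining.remove(response)
        selected = []
        current_score, best_new_score = float('inf'), float('inf')
        while remaining:
            # Fit one model per remaining candidate and record its AIC.
            aic_with_candidates = []
            for candidate in remaining:
                formula = "{} ~ {}".format(
                    response, ' + '.join(selected + [candidate]))
                aic = ols(formula=formula, data=data).fit().aic
                aic_with_candidates.append((aic, candidate))
            # Sort in descending order so that pop() returns the smallest AIC.
            aic_with_candidates.sort(reverse=True)
            best_new_score, best_candidate = aic_with_candidates.pop()
            if current_score > best_new_score:
                # The best candidate lowers the AIC, so keep it and continue.
                remaining.remove(best_candidate)
                selected.append(best_candidate)
                current_score = best_new_score
                print('aic is {},continuing!'.format(current_score))
            else:
                # No candidate improves the AIC, so stop.
                print('forward selection over!')
                break
        # Refit the final model on the selected variables.
        formula = "{} ~ {} ".format(response, ' + '.join(selected))
        print('final formula is {}'.format(formula))
        model = ols(formula=formula, data=data).fit()
        return model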
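The code uses the Akaike information criterion (AIC) as the selection criterion; the smaller the value, the better. With this function we select among the independent variables income, age, district average home value, and district average income:

    data_for_select = train[['avg_exp', 'Income', 'Age',
                             'dist_home_val', 'dist_avg_income']]
    forward_select_model = forward_select(data=data_for_select,
                                          response='avg_exp')
    print(forward_select_model.rsquared)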
The output is as follows:

    aic is 1007.6801413968115,continuing!
    aic is 1005.4969816306302,continuing!
    aic is 1005.2487355956046,continuing!
    forward selection over!
    final formula is avg_exp ~ dist_avg_income + Income + dist_home_val
    0.5411512928411949
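As the output shows, the AIC falls to 1005.25 and the algorithm ends up excluding age (Age), the only candidate that does not appear in the final formula. The goodness of fit R² of the resulting model is 0.541.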
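
The stopping rule can be made concrete with a short sketch of how AIC trades fit against model size. It assumes the usual Gaussian form for a least-squares fit, AIC = 2k − 2 ln L, with the error variance estimated as RSS/n; we believe this matches the `aic` attribute of a statsmodels OLS fit, but treat that as an assumption. The function names `gaussian_aic` and `worth_adding` and all numbers below are hypothetical and are not taken from the book's data set.

```python
import math


def gaussian_aic(rss, n_obs, n_params):
    """AIC of a least-squares fit with Gaussian errors.

    rss: residual sum of squares
    n_obs: number of observations
    n_params: number of estimated coefficients, intercept included
    """
    if rss <= 0 or n_obs <= 0:
        raise ValueError("rss and n_obs must be positive")
    # Maximized Gaussian log-likelihood, variance estimated as rss / n_obs.
    log_lik = -0.5 * n_obs * (math.log(2 * math.pi) + math.log(rss / n_obs) + 1)
    return 2 * n_params - 2 * log_lik


def worth_adding(rss_before, rss_after, n_obs):
    """True if one extra coefficient lowers the AIC.

    Equivalent to n_obs * ln(rss_before / rss_after) > 2.
    """
    return n_obs * math.log(rss_before / rss_after) > 2


# Hypothetical numbers, for illustration only.
n = 70
base = gaussian_aic(rss=1000.0, n_obs=n, n_params=3)
small_gain = gaussian_aic(rss=990.0, n_obs=n, n_params=4)
large_gain = gaussian_aic(rss=900.0, n_obs=n, n_params=4)

print(small_gain < base)  # False: the fit improves too little to offset the penalty
print(large_gain < base)  # True: the drop in RSS outweighs the penalty of 2
```

Under this form, each extra coefficient costs a fixed penalty of 2, so a candidate is kept only when it reduces the residual sum of squares enough to offset that penalty. This is consistent with the run above, where the loop stops once no remaining variable lowers the AIC below 1005.25.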