使用scikit-learn消除随机森林上的递归特征

这是我的想法。这是一个非常简单的解决方案，并且由于我要对高度不平衡的数据集进行分类，因此它依赖于自定义精度指标（称为weightedAccuracy）。但是，如果需要，应该很容易将其扩展。

from sklearn import datasets
import pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
from sklearn.metrics import confusion_matrix


def get_enhanced_confusion_matrix(actuals, predictions, labels):
    """"enhances confusion_matrix by adding sensivity and specificity metrics"""
    cm = confusion_matrix(actuals, predictions, labels = labels)
    sensitivity = float(cm[1][1]) / float(cm[1][0]+cm[1][1])
    specificity = float(cm[0][0]) / float(cm[0][0]+cm[0][1])
    weightedAccuracy = (sensitivity * 0.9) + (specificity * 0.1)
    return cm, sensitivity, specificity, weightedAccuracy

iris = datasets.load_iris()
x=pandas.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=pandas.Series(iris.target, name='target')

response, _  = pandas.factorize(y)

xTrain, xTest, yTrain, yTest = cross_validation.train_test_split(x, response, test_size = .25, random_state = 36583)
print "building the first forest"
rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 2, n_jobs = -1, verbose = 1)
rf.fit(xTrain, yTrain)
importances = pandas.DataFrame({'name':x.columns,'imp':rf.feature_importances_
                                }).sort(['imp'], ascending = False).reset_index(drop = True)

cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0,1])
numFeatures = len(x.columns)

rfeMatrix = pandas.DataFrame({'numFeatures':[numFeatures], 
                              'weightedAccuracy':[weightedAccuracy], 
                              'sensitivity':[sensitivity], 
                              'specificity':[specificity]})

print "running RFE on  %d features"%numFeatures

for i in range(1,numFeatures,1):
    varsUsed = importances['name'][0:i]
    print "Now using %d of %s features"%(len(varsUsed), numFeatures)
    xTrain, xTest, yTrain, yTest = cross_validation.train_test_split(x[varsUsed], response, test_size = .25)
    rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 2,
                                n_jobs = -1, verbose = 1)
    rf.fit(xTrain, yTrain)
    cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0,1])
    print("\n"+str(cm))
    print('the sensitivity is %d percent'%(sensitivity * 100))
    print('the specificity is %d percent'%(specificity * 100))
    print('the weighted accuracy is %d percent'%(weightedAccuracy * 100))
    rfeMatrix = rfeMatrix.append(
                                pandas.DataFrame({'numFeatures':[len(varsUsed)], 
                                'weightedAccuracy':[weightedAccuracy], 
                                'sensitivity':[sensitivity], 
                                'specificity':[specificity]}), ignore_index = True)    
print("\n"+str(rfeMatrix))    
maxAccuracy = rfeMatrix.weightedAccuracy.max()
maxAccuracyFeatures = min(rfeMatrix.numFeatures[rfeMatrix.weightedAccuracy == maxAccuracy])
featuresUsed = importances['name'][0:maxAccuracyFeatures].tolist()

print "the final features used are %s"%featuresUsed

其他 2022/1/1 18:41:42 有388人围观

撰写回答

你尚未登录，登录后可以

和开发者交流问题的细节

关注并接收问题和回答的更新提醒

参与内容的编辑和改进，让解决方法与时俱进

请先登录

使用scikit-learn消除随机森林上的递归特征

撰写回答

推荐问题

Greasemonkey 1.0中的jQuery与使用jQuery的网站冲突

如何使用JSON-LD标记面包屑列表中的最后一个非链接项目

如何在Spring MVC中使用AJAX渲染视图

使用动态where子句休眠

如何使用jQuery访问父窗口对象？

使用Curl和PHP使会话保持活动状态

如何建立一个动态查询，该查询增加了迄今为止的天数，并使用标准API比较该日期与另一个日期？

使用LESS构建选择器列表

如何使用CSS将跨度更改为类似pre？

在mysql sproc中使用变量作为表名

如何使用C＃获取两个DateTime对象之间的时差？

我可以在php中的SESSION数组上使用array_push吗？

Django-如何使用South重命名模型字段？

使用Spring Functional Web Framework的REST端点的背压

使用GhostDriver时如何设置屏幕/窗口大小

如何使用最新版本的jQuery并在RichFaces中为jQuery取回“ $”？

我可以使用BeautifulSoup删除脚本标签吗？

多态对象的JSON使用者

我如何重新连接使用selenium的webdriver打开的浏览器？

如何使用Servlet和Ajax？

分类汇总

您的鼓励是对我最大的支持