[Day 28] Introduction to Rough Set Feature Selection - 6

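Here I use groupby from pandas.DataFrame to do the partitioning,
then use apply(list) to collect the sample IDs of each group into a list.
This gives the "equivalence classes" described at the start of the paper.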

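# equivalence relation
def eq_relation(f_list, data, item_name, subset = None):
    '''
    f_list : feature subset (column names to group by)
    data : observation set (a pandas.DataFrame)
    item_name : name of the column that holds each sample's ID
    subset : used later in the paper; pass sample IDs to look at only a subset of samples
    '''
    
    # default to every sample in the observation set
    if subset is None:
        subset = data[item_name]
        
    # keep only the rows whose sample ID is in the subset
    cut = (data[item_name].isin(subset))
    temp = data[cut]
    
    # samples with identical values on f_list fall into the same group
    res = temp.groupby(f_list)
    
    # one list of sample IDs per group = one equivalence class
    return list(res[item_name].apply(list))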
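
To make the grouping step concrete, here is a minimal plain-Python sketch of the same idea, without pandas. The toy rows, the `id` key, and the function name `equivalence_classes` are hypothetical and only for illustration; they are not from the paper.

```python
from collections import defaultdict

# Hypothetical toy observation set: each dict is one sample.
rows = [
    {"id": 1, "a": 0, "b": 0},
    {"id": 2, "a": 0, "b": 1},
    {"id": 3, "a": 1, "b": 0},
    {"id": 4, "a": 0, "b": 0},
]

def equivalence_classes(rows, features, id_key="id"):
    """Group sample IDs by their values on `features`."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in features)
        groups[key].append(row[id_key])
    # Sort by key so the output order is deterministic.
    return [groups[k] for k in sorted(groups)]

classes_a = equivalence_classes(rows, ["a"])
classes_ab = equivalence_classes(rows, ["a", "b"])
print(classes_a)   # [[1, 2, 4], [3]]
print(classes_ab)  # [[1, 4], [2], [3]]
```

Adding feature b splits the class [1, 2, 4] further, because sample 2 differs from samples 1 and 4 on b.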

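Since the rough set feature selection covered so far only uses POS (the positive region, i.e. the interior),
only the POS part is implemented here.
It checks, one by one, which equivalence classes of P are contained in an equivalence class of Q.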
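
For reference, the usual rough set definitions behind this can be written as follows. The symbols are assumptions about notation, not taken from this post: U is the set of samples, U/Q is the set of equivalence classes induced by Q, and [x]_P is the equivalence class of x under P.

```latex
\mathrm{POS}_P(Q) = \bigcup_{X \in U/Q} \{\, x \in U : [x]_P \subseteq X \,\},
\qquad
\gamma_P(Q) = \frac{|\mathrm{POS}_P(Q)|}{|U|}
```

The value \gamma_P(Q) is the fraction of samples that fall in the positive region; it is the quantity the next function returns.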

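def pos_dep(f_list, q_list, data, item_name, subset = None):
    '''
    Dependency of q_list on f_list: the share of samples that lie in POS.
    '''
    # use every sample unless a subset of sample IDs is given
    subset1 = data[item_name] if subset is None else subset
    
    # with no features on either side, the dependency is treated as 0
    if len(f_list)*len(q_list)==0:
        return 0
    
    modP = eq_relation(f_list, data, item_name, subset = subset1)
    modQ = [set(q) for q in eq_relation(q_list, data, item_name, subset = subset1)]
    
    # POS: union of the P classes that are fully contained in some Q class
    union_pos = set()
    for p in modP:
        if any(set(p) <= q for q in modQ):
            union_pos.update(p)
    
    # divide by the number of samples actually considered
    return len(union_pos)/int(data[item_name].isin(subset1).sum())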
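
As a worked example of the positive region, here is a small plain-Python sketch on a made-up decision table. The table, the feature names a, b, c, the decision column d, and the helper names are all hypothetical; the decision is constructed as d = (a OR b), with c as noise.

```python
from collections import defaultdict

# Hypothetical decision table: d is "yes" when a or b is 1; c is noise.
table = [
    {"id": 1, "a": 0, "b": 0, "c": 0, "d": "no"},
    {"id": 2, "a": 0, "b": 1, "c": 0, "d": "yes"},
    {"id": 3, "a": 1, "b": 0, "c": 1, "d": "yes"},
    {"id": 4, "a": 1, "b": 1, "c": 1, "d": "yes"},
    {"id": 5, "a": 0, "b": 0, "c": 1, "d": "no"},
    {"id": 6, "a": 1, "b": 1, "c": 0, "d": "yes"},
]

def partition(rows, features):
    """Equivalence classes (as sets of IDs) induced by `features`."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[f] for f in features)].add(row["id"])
    return list(groups.values())

def positive_region(rows, p_features, q_features):
    """Union of the P classes that fit entirely inside one Q class."""
    q_classes = partition(rows, q_features)
    pos = set()
    for p_class in partition(rows, p_features):
        if any(p_class <= q_class for q_class in q_classes):
            pos |= p_class
    return pos

def dependency(rows, p_features, q_features):
    """Share of samples in the positive region (0 if either side is empty)."""
    if not p_features or not q_features:
        return 0.0
    return len(positive_region(rows, p_features, q_features)) / len(rows)

print(sorted(positive_region(table, ["a"], ["d"])))  # [3, 4, 6]
print(dependency(table, ["a"], ["d"]))               # 0.5
print(dependency(table, ["a", "b"], ["d"]))          # 1.0
```

With a alone, only the class {3, 4, 6} (a = 1) sits inside the "yes" class, so half of the samples are in POS. With a and b together, every class is consistent with the decision.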

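The last step mimics forward feature selection:
pos_dep is treated as the model's performance score,
and each round adds only the single feature that improves that score the most.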

def rough_feature_selection(q_list, data, item_name, feature_list, subset = None):
    fs_list = []            # features selected so far
    best_performance = 0    # pos_dep of fs_list
    improved = True
    
    while improved:
        improved = False
        temp_fs = fs_list
        
        # try every feature that has not been selected yet
        for f in [feat for feat in feature_list if feat not in fs_list]:
            now_per = pos_dep(f_list = fs_list + [f],
                              q_list = q_list,
                              data = data,
                              item_name = item_name,
                              subset = subset)
            
            # keep the candidate that gives the largest strict improvement
            if now_per > best_performance:
                temp_fs = fs_list + [f]
                best_performance = now_per
                improved = True
                
        fs_list = temp_fs
                
    return fs_list, best_performance
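
To see the greedy loop run end to end, here is a self-contained plain-Python sketch on a made-up decision table. The data, the names `dependency` and `forward_select`, and the shortcut used to compute the dependency are assumptions for illustration, not the paper's implementation. The shortcut relies on the fact that a P class is contained in a single Q class exactly when all of its samples share the same Q values.

```python
from collections import defaultdict

# Hypothetical decision table: d is "yes" when a or b is 1; c is noise.
table = [
    {"id": 1, "a": 0, "b": 0, "c": 0, "d": "no"},
    {"id": 2, "a": 0, "b": 1, "c": 0, "d": "yes"},
    {"id": 3, "a": 1, "b": 0, "c": 1, "d": "yes"},
    {"id": 4, "a": 1, "b": 1, "c": 1, "d": "yes"},
    {"id": 5, "a": 0, "b": 0, "c": 1, "d": "no"},
    {"id": 6, "a": 1, "b": 1, "c": 0, "d": "yes"},
]

def dependency(rows, p_features, q_features):
    """Share of samples whose P class agrees on all Q values."""
    if not p_features or not q_features:
        return 0.0
    groups = defaultdict(list)
    for row in rows:
        p_key = tuple(row[f] for f in p_features)
        groups[p_key].append(tuple(row[q] for q in q_features))
    consistent = sum(len(v) for v in groups.values() if len(set(v)) == 1)
    return consistent / len(rows)

def forward_select(rows, candidates, q_features):
    """Greedy forward selection using the dependency as the score."""
    selected, best = [], 0.0
    while True:
        best_feature = None
        for f in candidates:
            if f in selected:
                continue
            score = dependency(rows, selected + [f], q_features)
            if score > best:  # strict improvement only
                best, best_feature = score, f
        if best_feature is None:  # no candidate helped: stop
            return selected, best
        selected.append(best_feature)

selected, score = forward_select(table, ["a", "b", "c"], ["d"])
print(selected, score)  # ['a', 'b'] 1.0
```

In the first round a and b both score 0.5 and a wins because it comes first; in the second round adding b reaches 1.0, and the noise feature c is never selected.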

The code is still rather verbose, so please bear with me.
I will keep practicing.
