Day 11 [Python ML, Feature Engineering] Feature Selection

Introduction

After feature encoding and feature generation, we often end up with a large number of features, which can cause overfitting or make training very slow. We therefore need methods for selecting features.

%matplotlib inline

import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics

ks = pd.read_csv('./ks-projects-201801.csv',
                 parse_dates=['deadline', 'launched'])

# Drop live projects
ks = ks.query('state != "live"')

# Add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))

# Timestamp features
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)

# Label encoding
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)

data_cols = ['goal', 'hour', 'day', 'month', 'year', 'outcome']
baseline_data = ks[data_cols].join(encoded)

cat_features = ['category', 'currency', 'country']
interactions = pd.DataFrame(index=ks.index)
for col1, col2 in itertools.combinations(cat_features, 2):
    new_col_name = '_'.join([col1, col2])
    # Convert to strings and combine
    new_values = ks[col1].map(str) + "_" + ks[col2].map(str)
    label_enc = LabelEncoder()
    interactions[new_col_name] = label_enc.fit_transform(new_values)
baseline_data = baseline_data.join(interactions)

# Rolling count of projects launched in the preceding 7 days (excluding the current one)
launched = pd.Series(ks.index, index=ks.launched, name="count_7_days").sort_index()
count_7_days = launched.rolling('7d').count() - 1
count_7_days.index = launched.values
count_7_days = count_7_days.reindex(ks.index)

baseline_data = baseline_data.join(count_7_days)

def time_since_last_project(series):
    # Return the time in hours
    return series.diff().dt.total_seconds() / 3600.

# Hours since the previous project launch within the same category
df = ks[['category', 'launched']].sort_values('launched')
timedeltas = df.groupby('category').transform(time_since_last_project)
timedeltas = timedeltas.fillna(timedeltas.max())

baseline_data = baseline_data.join(timedeltas.rename({'launched': 'time_since_last_project'}, axis=1))

def get_data_splits(dataframe, valid_fraction=0.1):
    valid_size = int(len(dataframe) * valid_fraction)

    train = dataframe[:-valid_size * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_size * 2:-valid_size]
    test = dataframe[-valid_size:]
    
    return train, valid, test

def train_model(train, valid):
    feature_cols = train.columns.drop('outcome')

    dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    print("Training model!")
    bst = lgb.train(param, dtrain, num_boost_round=1000, valid_sets=[dvalid], 
                    early_stopping_rounds=10, verbose_eval=False)

    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['outcome'], valid_pred)
    print(f"Validation AUC score: {valid_score:.4f}")
    return bst

Univariate Feature Selection

We can use a few statistical methods for this analysis:

  1. χ² (chi-squared) test
  2. ANOVA F-values (analysis of variance)
  3. Mutual information score

F-values measure the linear dependency between a feature and the target. This means that if the relationship is nonlinear, the score may underestimate the strength of the relationship between the feature and the target. The mutual information score is useful in that case, because it is nonparametric and can capture nonlinear relationships (a short sketch of using it appears at the end of this section).

With feature_selection.SelectKBest we can specify how many features to keep; calling .fit_transform(features, target) then returns the selected features.

baseline_data.columns.size
14

baseline_data currently has 14 columns: 13 features plus the outcome target.

from sklearn.feature_selection import SelectKBest, f_classif

feature_cols = baseline_data.columns.drop('outcome')

# Keep 5 features
selector = SelectKBest(f_classif, k=5)

X_new = selector.fit_transform(baseline_data[feature_cols], baseline_data['outcome'])
X_new
array([[2015.,    5.,    9.,   18., 1409.],
       [2017.,   13.,   22.,   31.,  957.],
       [2013.,   13.,   22.,   31.,  739.],
       ...,
       [2010.,   13.,   22.,   31.,  238.],
       [2016.,   13.,   22.,   31., 1100.],
       [2011.,   13.,   22.,   31.,  542.]])

The data is reduced to the 5 selected features.

However, the approach above has a problem: the data was not split into training, validation, and test sets before selection, so the target values of the validation and test rows leak into the feature selection. That leakage makes the evaluation of the trained model unreliable, so the data should be split before applying this method.

feature_cols = baseline_data.columns.drop('outcome')
train, valid, _ = get_data_splits(baseline_data)

# Keep 5 features
selector = SelectKBest(f_classif, k=5)

X_new = selector.fit_transform(train[feature_cols], train['outcome'])
X_new
array([[2.015e+03, 5.000e+00, 9.000e+00, 1.800e+01, 1.409e+03],
       [2.017e+03, 1.300e+01, 2.200e+01, 3.100e+01, 9.570e+02],
       [2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 7.390e+02],
       ...,
       [2.011e+03, 1.300e+01, 2.200e+01, 3.100e+01, 5.150e+02],
       [2.015e+03, 1.000e+00, 3.000e+00, 2.000e+00, 1.306e+03],
       [2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 1.084e+03]])

At this point the selected features come back as a plain array whose columns no longer line up with the original DataFrame, so we need to map the result back to the original column layout and then remove the all-zero columns.

We can use .inverse_transform to recover the data in its original column layout.

# Get back the features we've kept, zero out all other features
selected_features = pd.DataFrame(selector.inverse_transform(X_new), 
                                 index=train.index, 
                                 columns=feature_cols)
selected_features.head()

Then drop the all-zero columns.

# Dropped columns have values of all 0s, so var is 0, drop them
selected_columns = selected_features.columns[selected_features.var() != 0]

# Get the valid dataset with the selected features.
valid[selected_columns].join(valid['outcome']).head()

The validation AUC score with only the selected features is 0.6010.

train_model(train[selected_columns].join(train['outcome']), valid[selected_columns].join(valid['outcome']))
Training model!
[LightGBM] [Info] Number of positive: 107340, number of negative: 193350
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007036 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 335
[LightGBM] [Info] Number of data points in the train set: 300690, number of used features: 5
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.356979 -> initscore=-0.588501
[LightGBM] [Info] Start training from score -0.588501
Validation AUC score: 0.6010

<lightgbm.basic.Booster at 0x7fbb8b7685c0>

With all the original features, the validation AUC score is 0.7446.

train_model(train, valid)
Training model!
[LightGBM] [Info] Number of positive: 107340, number of negative: 193350
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007786 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1553
[LightGBM] [Info] Number of data points in the train set: 300690, number of used features: 13
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.356979 -> initscore=-0.588501
[LightGBM] [Info] Start training from score -0.588501
Validation AUC score: 0.7446

<lightgbm.basic.Booster at 0x7fbb8729c5f8>
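As noted earlier, the mutual information score can pick up nonlinear relationships that the F-statistic misses. Below is a minimal sketch of swapping it in as the scoring function of SelectKBest, reusing baseline_data and get_data_splits from above; note that mutual_info_classif can be noticeably slower than f_classif on large datasets.

from sklearn.feature_selection import SelectKBest, mutual_info_classif

train, valid, _ = get_data_splits(baseline_data)
feature_cols = train.columns.drop('outcome')

# Score each feature by its mutual information with the target instead of
# the ANOVA F-statistic; fit only on the training split to avoid leakage
mi_selector = SelectKBest(mutual_info_classif, k=5)
X_mi = mi_selector.fit_transform(train[feature_cols], train['outcome'])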

L1 Regularization (covered in Hung-yi Lee's Lecture 1 on Regression as a way to make the fitted curve smoother)

The method above is univariate: it considers each feature's effect on the target one at a time.

L1 regularization instead uses all the features together and judges their joint effect on the target.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

train, valid, _ = get_data_splits(baseline_data)

X, y = train[train.columns.drop("outcome")], train['outcome']

# Set the regularization parameter C=1
logistic = LogisticRegression(C=1, penalty="l1", solver='liblinear', random_state=7).fit(X, y)
model = SelectFromModel(logistic, prefit=True)

X_new = model.transform(X)
X_new
array([[1.000e+03, 1.200e+01, 1.100e+01, ..., 1.900e+03, 1.800e+01,
        1.409e+03],
       [3.000e+04, 4.000e+00, 2.000e+00, ..., 1.630e+03, 3.100e+01,
        9.570e+02],
       [4.500e+04, 0.000e+00, 1.200e+01, ..., 1.630e+03, 3.100e+01,
        7.390e+02],
       ...,
       [2.500e+03, 0.000e+00, 3.000e+00, ..., 1.830e+03, 3.100e+01,
        5.150e+02],
       [2.600e+03, 2.100e+01, 2.300e+01, ..., 1.036e+03, 2.000e+00,
        1.306e+03],
       [2.000e+04, 1.600e+01, 4.000e+00, ..., 9.200e+02, 3.100e+01,
        1.084e+03]])

As with the univariate method, the selected features are returned as an array without column names.

After dropping the all-zero columns, we are left with the selected columns.

# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(model.inverse_transform(X_new), 
                                 index=X.index,
                                 columns=X.columns)

# Dropped columns have values of all 0s, keep other columns 
selected_columns = selected_features.columns[selected_features.var() != 0]
train_model(train[selected_columns].join(train['outcome']), valid[selected_columns].join(valid['outcome']))
Training model!
[LightGBM] [Info] Number of positive: 107340, number of negative: 193350
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007739 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1298
[LightGBM] [Info] Number of data points in the train set: 300690, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.356979 -> initscore=-0.588501
[LightGBM] [Info] Start training from score -0.588501
Validation AUC score: 0.7462

<lightgbm.basic.Booster at 0x7fbb8729b128>

With features selected via L1 regularization, the validation AUC score is 0.7462.

In this case, the time_since_last_project column is the one that gets dropped.
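To check programmatically which column the L1 penalty removed, one option (a small sketch using the model and X already defined above) is to look at the selector's support mask:

# Boolean mask over X.columns: True for kept features, False for dropped ones
kept_mask = model.get_support()
dropped_columns = X.columns[~kept_mask]
print(dropped_columns)  # expected here: Index(['time_since_last_project'], dtype='object')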

In practice, L1 regularization usually works better than univariate tests, but it can become slow when there are many features.

Univariate tests run much faster on large datasets, although the selected features tend to perform less well.
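One practical knob here is the regularization strength: smaller C means a stronger L1 penalty, so more coefficients are driven to zero and fewer features survive. A rough sketch of comparing a few values follows (the C values are illustrative, not from the original experiment):

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Stronger regularization (smaller C) keeps fewer features
for c in [0.01, 0.1, 1]:
    logistic = LogisticRegression(C=c, penalty="l1", solver='liblinear',
                                  random_state=7).fit(X, y)
    selector = SelectFromModel(logistic, prefit=True)
    n_kept = int(selector.get_support().sum())
    print(f"C={c}: keeps {n_kept} of {X.shape[1]} features")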

