Day 12 [Python ML, Feature Engineering] Feature Engineering Notes

Categorical Encoding

Encoding Descriptions

  • One hot encoding: splits a column into one binary column per category. For example, a Sex column containing Male and Female becomes two columns, Sex_male and Sex_female, each holding a binary indicator. If a column has many distinct categories the dimensionality blows up and processing becomes very slow, so this is recommended when a column has fewer than about 4 categories.
  • Label Encoding: fits on the data to learn an integer label for each distinct category, then transforms the data by replacing each category with its label.
  • Count Encoding: replaces each categorical label with the number of times (frequency) it appears in the data. Rare values would otherwise be treated exactly like every other category; count encoding effectively weights categories by how common they are, which makes it useful for categorical features.
  • Target Encoding: for each label of a column, computes the proportion (mean) of the target for that category and replaces the original value with it. Recommended when a column has more than about 4 categories.
  • CatBoost Encoding: similar to target encoding.

:::warning
To avoid leaking information into the validation data, fit the encoder on the training split only; never fit on the validation data (a short sketch follows this note). This applies to the encodings below, and matters even more for encodings that use the target directly:

  • count encoding
  • catboost encoding
:::
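A minimal sketch of this fit-on-train-only pattern, assuming the train/valid splits returned by get_data_splits used later in this post (the column names are taken from the Kickstarter example below):

import category_encoders as ce

cat_features = ['category', 'currency', 'country']
count_enc = ce.CountEncoder(cols=cat_features)

# Fit on the training split only, then transform both splits with the
# statistics learned from train
count_enc.fit(train[cat_features])
train_enc = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))
valid_enc = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))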
:::warning
When using target encoding or CatBoost encoding, the target itself is used to build the encoding. If the ip column is encoded this way, almost every ip maps to its own target value, so the ip_target feature fits the training data far too well. If the test data contains ip values that never appeared in training, the model will not know how to predict those rows, so the ip column should be dropped (see the sketch after this note).
Try to remove the ip encoding:
Target encoding attempts to measure the population mean of the target for each level in a categorical feature. This means when there is less data per level, the estimated mean will be further away from the "true" mean, there will be more variance. There is little data per IP address so it's likely that the estimates are much noisier than for the other features. The model will rely heavily on this feature since it is extremely predictive. This causes it to make fewer splits on other features, and those features are fit on just the errors left over accounting for IP address. So, the model will perform very poorly when seeing new IP addresses that weren't in the training data (which is likely most new data). Going forward, we'll leave out the IP feature when trying different encodings.
:::
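A minimal sketch of leaving ip out before encoding, assuming the clicks dataset used later in this post; the target column name 'is_attributed' is an assumption here:

import category_encoders as ce

# Simply exclude 'ip' from the columns handed to the encoder
cat_features = ['app', 'device', 'os', 'channel']   # 'ip' removed
target_enc = ce.TargetEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['is_attributed'])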

One hot encoding

get_dummies automatically converts the selected DataFrame columns into one-hot encoded form, after which the data can be fed to a model.

import pandas as pd

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])

Original data (DataFrame preview omitted)

Data after one hot encoding (DataFrame preview omitted)
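Since the rendered tables are not shown here, below is a minimal sketch (with a made-up toy DataFrame) of what get_dummies produces for a single categorical column:

import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'male']})
print(pd.get_dummies(df))
# Each category becomes its own binary column (0/1 or True/False depending on
# the pandas version):
#    Sex_female  Sex_male
# 0           0         1
# 1           1         0
# 2           0         1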

LabelEncoder (convert categorical data to numeric values)

The goal is to turn categories into numbers so that models which only accept numeric input can be trained.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

When fit is called, each distinct value in the data is assigned an integer label; when transform is called, the data is converted according to those labels.

Notes on LabelEncoder and DataFrame.apply:

  • axis=0 -> the function is applied to each column (the default)
  • axis=1 -> the function is applied to each row
  • apply runs the given function over every column (or row) of the data
  • fit_transform first fits on the data, then transforms it
  • fit learns the label assigned to each category (it does not normalize the data)

from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# Apply the label encoder to each column
encoded = ks[cat_features].apply(encoder.fit_transform)

If you want to handle each column explicitly, you can loop over the columns and pass each one to fit_transform:

from sklearn.preprocessing import LabelEncoder

cat_features = ['ip', 'app', 'device', 'os', 'channel']

# Create new columns in clicks using preprocessing.LabelEncoder()
encoder = LabelEncoder()
for feature in cat_features:
    encoded = encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded

Count Encoding

1. Import category_encoders.
2. Create the encoder with ce.CountEncoder().
3. Pass the columns to be transformed into the encoder.
4. Use add_suffix to append _count to the new column names.
5. Join the encoded result back onto the original data with join.

import category_encoders as ce
cat_features = ['category', 'currency', 'country']

# Create the encoder
count_enc = ce.CountEncoder()

# Transform the features, rename the columns with the _count suffix, and join to dataframe
count_encoded = count_enc.fit_transform(ks[cat_features])
data = data.join(count_encoded.add_suffix("_count"))

# Train a model 
train, valid, test = get_data_splits(data)
train_model(train, valid)
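For a single column, the idea behind CountEncoder can be reproduced with plain pandas; a rough sketch (the 'country' column and the train/valid names are assumptions):

# Replace each category by how often it appears in the training data
counts = train['country'].value_counts()
train['country_count'] = train['country'].map(counts)
valid['country_count'] = valid['country'].map(counts).fillna(0)  # unseen categories get 0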

Target Encoding

# Create the encoder
target_enc = ce.TargetEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['outcome'])

# Transform the features, rename the columns with _target suffix, and join to dataframe
train_TE = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_TE = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

# Train a model
train_model(train_TE, valid_TE)
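The core idea behind target encoding, without the smoothing that ce.TargetEncoder adds, is just a per-category mean of the target computed on the training rows; a rough sketch using the 'category' column as an example:

# Mean of the target for each category, estimated from train only
means = train.groupby('category')['outcome'].mean()
train['category_target'] = train['category'].map(means)
# Unseen categories fall back to the global mean of the training target
valid['category_target'] = valid['category'].map(means).fillna(train['outcome'].mean())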

CatBoost Encoding

# Create the encoder
target_enc = ce.CatBoostEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['outcome'])

# Transform the features, rename columns with _cb suffix, and join to dataframe
train_CBE = train.join(target_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_CBE = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_cb'))

# Train a model
train_model(train_CBE, valid_CBE)
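CatBoost encoding is an "ordered" variant of target encoding: each row is encoded using only the target values of earlier rows of the same category, which reduces leakage. A rough sketch of that idea (not the actual library implementation; 'category' and 'outcome' are the example names used above):

# For every row: (sum of target over previous rows of the same category + prior) / (count + 1)
prior = train['outcome'].mean()
cumsum = train.groupby('category')['outcome'].cumsum() - train['outcome']
cumcount = train.groupby('category').cumcount()
train['category_cb'] = (cumsum + prior) / (cumcount + 1)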

Feature Generation

This section collects some methods for generating new features.

Feature Selection

Introduction

After feature encoding and feature generation we often end up with too many features, which can lead to overfitting or very long training times, so we need methods for selecting features.

Univariate Feature Selection

baseline_data.columns.size
14

The original data has 14 columns; we use this method to keep 5 of them.

:::danger
Remember to split the data into train, test, and validation sets before doing the selection.
:::

from sklearn.feature_selection import SelectKBest, f_classif

feature_cols = baseline_data.columns.drop('outcome')
train, valid, _ = get_data_splits(baseline_data)

# Keep 5 features
selector = SelectKBest(f_classif, k=5)

X_new = selector.fit_transform(train[feature_cols], train['outcome'])
X_new
array([[2.015e+03, 5.000e+00, 9.000e+00, 1.800e+01, 1.409e+03],
       [2.017e+03, 1.300e+01, 2.200e+01, 3.100e+01, 9.570e+02],
       [2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 7.390e+02],
       ...,
       [2.011e+03, 1.300e+01, 2.200e+01, 3.100e+01, 5.150e+02],
       [2.015e+03, 1.000e+00, 3.000e+00, 2.000e+00, 1.306e+03],
       [2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 1.084e+03]])
       

At this point the selected features come back as a plain array whose columns no longer match the original ones, so we need to convert the result back to the original layout and then drop the all-zero columns.

.inverse_transform can be used here to recover the data in its original column layout.

# Get back the features we've kept, zero out all other features
selected_features = pd.DataFrame(selector.inverse_transform(X_new), 
                                 index=train.index, 
                                 columns=feature_cols)
selected_features.head()

Then drop the all-zero columns:

# Dropped columns have values of all 0s, so var is 0, drop them
selected_columns = selected_features.columns[selected_features.var() != 0]

# Get the valid dataset with the selected features.
valid[selected_columns].join(valid['outcome']).head()
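As an alternative to the inverse_transform plus variance trick, SelectKBest can report the chosen columns directly; a small sketch reusing the selector and feature_cols defined above:

# get_support() returns a boolean mask over the input features
selected_columns = feature_cols[selector.get_support()]
valid[selected_columns].join(valid['outcome']).head()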

L1 regularization (mentioned in Hung-yi Lee's Regression lecture as a way to make the fitted line smoother)

The method above is univariate: it evaluates each feature's relationship to the target on its own.

L1 regularization instead judges the features jointly, using all of them in a single model and keeping those the model assigns non-zero weight.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

train, valid, _ = get_data_splits(baseline_data)

X, y = train[train.columns.drop("outcome")], train['outcome']

# Set the regularization parameter C=1
logistic = LogisticRegression(C=1, penalty="l1", solver='liblinear', random_state=7).fit(X, y)
model = SelectFromModel(logistic, prefit=True)

X_new = model.transform(X)
X_new
array([[1.000e+03, 1.200e+01, 1.100e+01, ..., 1.900e+03, 1.800e+01,
        1.409e+03],
       [3.000e+04, 4.000e+00, 2.000e+00, ..., 1.630e+03, 3.100e+01,
        9.570e+02],
       [4.500e+04, 0.000e+00, 1.200e+01, ..., 1.630e+03, 3.100e+01,
        7.390e+02],
       ...,
       [2.500e+03, 0.000e+00, 3.000e+00, ..., 1.830e+03, 3.100e+01,
        5.150e+02],
       [2.600e+03, 2.100e+01, 2.300e+01, ..., 1.036e+03, 2.000e+00,
        1.306e+03],
       [2.000e+04, 1.600e+01, 4.000e+00, ..., 9.200e+02, 3.100e+01,
        1.084e+03]])

As with the univariate case, this returns only the selected columns (as an array).

After dropping the columns whose values are all zero, we are left with the names of the selected columns.

# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(model.inverse_transform(X_new), 
                                 index=X.index,
                                 columns=X.columns)

# Dropped columns have values of all 0s, keep other columns 
selected_columns = selected_features.columns[selected_features.var() != 0]
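In LogisticRegression, C is the inverse of the regularization strength, so a smaller C gives a stronger L1 penalty and fewer surviving features; a quick sketch of tightening it:

# Smaller C -> more coefficients pushed to exactly zero -> fewer selected features
logistic = LogisticRegression(C=0.1, penalty="l1", solver="liblinear", random_state=7).fit(X, y)
model = SelectFromModel(logistic, prefit=True)
print(model.transform(X).shape)  # typically fewer columns than with C=1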

Summary

| Feature selection | AUC score | Suited to | Effect |
| --- | --- | --- | --- |
| None | 0.7446 | | |
| Univariate Feature Selection | 0.6010 | Large datasets with many features | Worse |
| L1 regularization | 0.7462 | Small datasets with fewer features | Better |

Other Feature Engineering Methods

Get the unique values (states) in a column

# Get all unique values of a column
print('Unique values in `state` column:', list(ks.state.unique()))
Unique values in `state` column: ['failed', 'canceled', 'successful', 'live', 'undefined', 'suspended']

Drop rows you do not need

Once you can see all the values, you can use domain knowledge to decide which rows are not needed and remove them.
ks is a DataFrame.
query accepts a boolean expression in an SQL-like syntax (more detailed usage to be filled in later).

# Drop live projects
ks = ks.query('state != "live"')
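query also accepts compound boolean expressions, and local Python variables can be referenced with @; a small sketch (the 'goal' column and the threshold are made up for illustration):

min_goal = 1000
subset = ks.query('state != "live" and goal >= @min_goal')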

Convert the target to a number (one of the states becomes the positive class)

The original values are:

Unique values in `state` column: ['failed', 'canceled', 'successful', 'undefined', 'suspended']

We want successful to become 1 and all other states 0.
assign adds a new column to the DataFrame.

feature = ['state', 'outcome']
# Add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))
ks[feature].head(6)

Drop missing values

# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)

Convert timestamps (expand a datetime column into separate columns)

ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)
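The .dt accessor only works on datetime columns, so if launched was read in as plain strings it has to be converted first; a minimal sketch (alternatively, pass parse_dates=['launched'] to pd.read_csv when loading):

ks = ks.assign(launched=pd.to_datetime(ks['launched']))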
