Encoding | Description |
---|---|
One hot encoding | Splits a column into one indicator column per category: for example, a Sex column containing Male and Female becomes two binary columns, Sex_male and Sex_female. If the data has too many distinct categories, the dimensionality blows up and processing takes very long; recommended when the number of categories is < 4. |
Label Encoding | Fits on the dataset to learn a mapping from each category to an integer label, then applies that mapping with transform. |
Count Encoding | Replaces each categorical label with its frequency in the data. The point is that rare values would otherwise be treated exactly like every other category; count encoding instead weights categories by how common they are, which makes it effective for categorical features. |
Target Encoding | Takes the labels of a column, computes the proportion of the target for each category, and replaces the original value with that proportion; recommended when the number of categories is > 4. |
CatBoost Encoding | Similar to target encoding, but each row's statistic is computed only from the rows that precede it, which reduces target leakage. |
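To make the table above concrete, here is a minimal hand-rolled sketch, on a made-up toy frame rather than the post's dataset, of what count encoding and target encoding actually compute:

```python
import pandas as pd

# Hypothetical toy data just to illustrate the encodings above
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green', 'red'],
                   'target': [1, 0, 1, 0, 0]})

# Count encoding: replace each category with how often it appears
df['color_count'] = df['color'].map(df['color'].value_counts())

# Target encoding: replace each category with its mean target value
df['color_target'] = df['color'].map(df.groupby('color')['target'].mean())

print(df)
```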
:::warning
Target encoding (and CatBoost encoding) uses the target itself, so to avoid leaking information into the validation data, fit the encoder only on the train split; never fit on the valid split.
:::
get_dummies can automatically convert a dataframe into one-hot-encoded form, after which the data is ready for training:

```python
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
```
Original data: (DataFrame preview omitted)
Processed data: (DataFrame preview omitted)
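As a rough illustration of what get_dummies produces, here is a tiny made-up frame standing in for the Titanic columns above:

```python
import pandas as pd

# Hypothetical two-row frame standing in for the Titanic data
toy = pd.DataFrame({'Pclass': [1, 3], 'Sex': ['male', 'female']})

# 'Sex' is split into Sex_female / Sex_male indicator columns;
# the already-numeric 'Pclass' passes through unchanged
print(pd.get_dummies(toy))
```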
The goal is to turn categorical columns into numbers so that certain models can actually train on them.
```python
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
```
During fit, every category in the data is assigned an integer label; during transform, the data is converted according to those labels.
Notes on LabelEncoder and the surrounding pandas/sklearn vocabulary:

- `axis=0` -> operate column by column
- `axis=1` -> operate row by row
- `apply` -> applies a function to each column (or row) of the data
- `fit_transform` -> fits on the data first, then transforms it
- `fit` -> learns the category-to-label mapping from the data
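A quick sketch of the axis behaviour of apply, using a throwaway two-column frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# axis=0 (the default): the function receives one column at a time
print(df.apply(lambda col: col.max(), axis=0))  # a -> 2, b -> 20

# axis=1: the function receives one row at a time
print(df.apply(lambda row: row.sum(), axis=1))  # 11, 22
```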
```python
from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# Apply the label encoder to each column
encoded = ks[cat_features].apply(encoder.fit_transform)
```
If you want each column handled independently, you can use the approach below, passing one column at a time into fit_transform:
```python
from sklearn.preprocessing import LabelEncoder

cat_features = ['ip', 'app', 'device', 'os', 'channel']

# Create new columns in clicks using preprocessing.LabelEncoder()
encoder = LabelEncoder()
for feature in cat_features:
    encoded = encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded
```
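One caveat with the loop above: it reuses a single encoder, so after the loop `encoder.classes_` only holds the mapping of the last column, and earlier columns can no longer be inverse-transformed. A sketch that keeps one encoder per column (the `encoders` dict is my own addition):

```python
from sklearn.preprocessing import LabelEncoder

# Keep one fitted encoder per column so each mapping can be inverted later;
# `clicks` and `cat_features` are assumed to be defined as above
encoders = {}
for feature in cat_features:
    encoders[feature] = LabelEncoder()
    clicks[feature + '_labels'] = encoders[feature].fit_transform(clicks[feature])

# e.g. recover the original values of one column:
# original = encoders['app'].inverse_transform(clicks['app_labels'])
```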
Count encoding, step by step (code below):

1. Import category_encoders.
2. Create the encoder with ce.CountEncoder().
3. Pass the data to be transformed into the encoder.
4. Use add_suffix to append `_count` to the new column names.
5. Join the processed columns back into the original data.
```python
import category_encoders as ce

cat_features = ['category', 'currency', 'country']

# Create the encoder
count_enc = ce.CountEncoder()

# Transform the features, rename the columns with the _count suffix, and join to dataframe
count_encoded = count_enc.fit_transform(ks[cat_features])
data = data.join(count_encoded.add_suffix("_count"))

# Train a model
train, valid, test = get_data_splits(data)
train_model(train, valid)
```
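For count encoding leakage is less of a concern (it never sees the target), but you can still fit on the train split only so the valid rows never influence the counts; a sketch assuming the `train` / `valid` frames from get_data_splits:

```python
import category_encoders as ce

# Fit the counts on train only, then reuse the same mapping everywhere
count_enc = ce.CountEncoder(cols=cat_features)
count_enc.fit(train[cat_features])

train_CE = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))
valid_CE = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))
```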
Target encoding works the same way; note that the encoder is fit on the train split only:

```python
# Create the encoder
target_enc = ce.TargetEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['outcome'])

# Transform the features, rename the columns with _target suffix, and join to dataframe
train_TE = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_TE = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

# Train a model
train_model(train_TE, valid_TE)
```
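Note that ce.TargetEncoder does not use the raw per-category mean directly; it blends the category mean with the global prior so rare categories shrink toward the overall rate. A sketch of the idea, not the library's exact formula:

```python
import numpy as np

# Smoothed target mean: rare categories fall back toward the global prior
prior = train['outcome'].mean()
stats = train.groupby('category')['outcome'].agg(['count', 'mean'])
weight = 1 / (1 + np.exp(-(stats['count'] - 1)))   # more samples -> more trust
smoothed = prior * (1 - weight) + stats['mean'] * weight
```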
CatBoost encoding uses the same API:

```python
# Create the encoder
target_enc = ce.CatBoostEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['outcome'])

# Transform the features, rename columns with _cb suffix, and join to dataframe
train_CBE = train.join(target_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_CBE = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_cb'))

# Train a model
train_model(train_CBE, valid_CBE)
```
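The difference from plain target encoding is that CatBoost encoding computes each row's statistic from the rows that come before it, so a row never sees its own target. A toy sketch of that ordered idea (the library's exact formula may differ):

```python
import pandas as pd

s = pd.DataFrame({'cat': ['a', 'a', 'b', 'a'], 'y': [1, 0, 1, 1]})
prior = s['y'].mean()

# Running sum / count of y within each category, excluding the current row
cumsum = s.groupby('cat')['y'].cumsum() - s['y']
cumcnt = s.groupby('cat').cumcount()

# Each row is encoded from earlier rows only, with the prior as a pseudo-count
s['cat_cb'] = (cumsum + prior) / (cumcnt + 1)
print(s)
```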
The sections above provided some ways to produce features. After feature encoding and feature generation, we often find we have too many features, which can cause overfitting or make training take very long, so we need methods to select features.
```python
>>> baseline_data.columns.size
14
```
The original data has 14 features; we use this method to pick 5 columns out.
:::danger
Remember to split the data into train, test, and valid sets first, and only then do this processing.
:::
```python
from sklearn.feature_selection import SelectKBest, f_classif

feature_cols = baseline_data.columns.drop('outcome')
train, valid, _ = get_data_splits(baseline_data)

# Keep 5 features
selector = SelectKBest(f_classif, k=5)
X_new = selector.fit_transform(train[feature_cols], train['outcome'])
X_new
```
```
array([[2.015e+03, 5.000e+00, 9.000e+00, 1.800e+01, 1.409e+03],
       [2.017e+03, 1.300e+01, 2.200e+01, 3.100e+01, 9.570e+02],
       [2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 7.390e+02],
       ...,
       [2.011e+03, 1.300e+01, 2.200e+01, 3.100e+01, 5.150e+02],
       [2.015e+03, 1.000e+00, 3.000e+00, 2.000e+00, 1.306e+03],
       [2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 1.084e+03]])
```
At this point the selected features no longer line up with the original columns, so we need to map the result back to the original shape and then drop the all-zero parts. .inverse_transform recovers the pre-transform layout:
```python
# Get back the features we've kept, zero out all other features
selected_features = pd.DataFrame(selector.inverse_transform(X_new),
                                 index=train.index,
                                 columns=feature_cols)
selected_features.head()
```
Then drop the all-zero columns:
```python
# Dropped columns have values of all 0s, so var is 0, drop them
selected_columns = selected_features.columns[selected_features.var() != 0]

# Get the valid dataset with the selected features.
valid[selected_columns].join(valid['outcome']).head()
```
The method above is univariate: it scores each feature's effect on the target one at a time. L1 regularization instead makes the judgment from all of the features' joint effect on the target.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

train, valid, _ = get_data_splits(baseline_data)
X, y = train[train.columns.drop("outcome")], train['outcome']

# Set the regularization parameter C=1
logistic = LogisticRegression(C=1, penalty="l1", solver='liblinear', random_state=7).fit(X, y)
model = SelectFromModel(logistic, prefit=True)
X_new = model.transform(X)
X_new
```
```
array([[1.000e+03, 1.200e+01, 1.100e+01, ..., 1.900e+03, 1.800e+01,
        1.409e+03],
       [3.000e+04, 4.000e+00, 2.000e+00, ..., 1.630e+03, 3.100e+01,
        9.570e+02],
       [4.500e+04, 0.000e+00, 1.200e+01, ..., 1.630e+03, 3.100e+01,
        7.390e+02],
       ...,
       [2.500e+03, 0.000e+00, 3.000e+00, ..., 1.830e+03, 3.100e+01,
        5.150e+02],
       [2.600e+03, 2.100e+01, 2.300e+01, ..., 1.036e+03, 2.000e+00,
        1.306e+03],
       [2.000e+04, 1.600e+01, 4.000e+00, ..., 9.200e+02, 3.100e+01,
        1.084e+03]])
```
As in the univariate case, this returns the selected columns; after dropping the all-zero columns we are left with the chosen features:
```python
# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(model.inverse_transform(X_new),
                                 index=X.index,
                                 columns=X.columns)

# Dropped columns have values of all 0s, keep other columns
selected_columns = selected_features.columns[selected_features.var() != 0]
```
Feature Selection | AUC score | Suited for | Result |
---|---|---|---|
None | 0.7446 | | |
Univariate Feature Selection | 0.6010 | large data, many features | worse |
L1 regularization | 0.7462 | small data, few features | better |
```python
# Get every distinct state that appears in a column
print('Unique values in `state` column:', list(ks.state.unique()))
```

```
Unique values in `state` column: ['failed', 'canceled', 'successful', 'live', 'undefined', 'suspended']
```
Once we have all the states, domain knowledge can tell us which ones are not needed, and we can drop them. ks is a DataFrame; query accepts a boolean expression or SQL-like syntax (more details to be added when I run into them).
```python
# Drop live projects
ks = ks.query('state != "live"')
```
After dropping, the unique values become:

```
Unique values in `state` column: ['failed', 'canceled', 'successful', 'undefined', 'suspended']
```
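Before moving on, a couple more query() patterns for reference; the column names here are hypothetical:

```python
# Boolean expressions written as strings
successful = ks.query('state == "successful"')
both = ks.query('state == "successful" and goal > 1000')

# Reference a Python variable with @
threshold = 1000
big = ks.query('goal > @threshold')
```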
We want to map successful to 1 and everything else to 0; assign adds a new column to the DataFrame:
```python
feature = ['state', 'outcome']

# Add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))
ks[feature].head(6)
```
Rows with missing values can be dropped with dropna:

```python
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
```
Date and time parts can be pulled out of the launched column with the .dt accessor:

```python
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)
```
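The .dt accessor only works on a datetime64 column; if launched was read in as plain strings, convert it first. A sketch, assuming the raw column is in a format pandas can parse:

```python
import pandas as pd

# Parse the launched column into datetimes before using .dt
ks = ks.assign(launched=pd.to_datetime(ks['launched']))
```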