Day 8 [Python ML, Feature Engineering] Baseline Model

Introduction

Today starts a new chapter, which also means a new dataset:
Kickstarter Projects

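Before starting, download the dataset and place it in the Dataset folder. The code below reads the CSV from the current directory, so adjust the path passed to read_csv if your file lives elsewhere.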

Load the data

import pandas as pd
ks = pd.read_csv('./ks-projects-201801.csv',
                 parse_dates=['deadline', 'launched'])
ks.head(6)

List the values that appear in the state column

# Print every distinct value found in the state column
print('Unique values in `state` column:', list(ks.state.unique()))
Unique values in `state` column: ['failed', 'canceled', 'successful', 'live', 'undefined', 'suspended']

Use query to drop the rows whose state is live, since those projects have no final result yet

ks = ks.query('state!="live"')
print(list(ks.state.unique()))
['failed', 'canceled', 'successful', 'undefined', 'suspended']

Label successful as 1 and every other state (failed, canceled, undefined, suspended) as 0

feature = ['state', 'outcome']
# Add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))
ks[feature].head(6)
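
To see the filtering and labeling steps in isolation, here is a minimal sketch on a made-up four-row frame. The `toy` frame and its values are assumptions for illustration only; they are not part of the Kickstarter data.

```python
import pandas as pd

# Made-up rows standing in for the real `state` column
toy = pd.DataFrame({'state': ['failed', 'live', 'successful', 'canceled']})

# Drop unfinished projects, then turn the comparison into a 0/1 label
toy = toy.query('state != "live"')
toy = toy.assign(outcome=(toy['state'] == 'successful').astype(int))

print(toy)
# The mean of a 0/1 column is the share of positive examples
print('Positive rate:', toy['outcome'].mean())
```

Checking the positive rate on the real data is a quick way to see how imbalanced the two classes are before training.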

Split the launched timestamp into hour, day, month, and year, and add them to the data as new columns

ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)
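
The same pattern can be tried on a tiny frame to see what the `.dt` accessor returns. This is only a sketch; the two timestamps in `toy` are invented for illustration.

```python
import pandas as pd

# Two invented launch timestamps standing in for the `launched` column
toy = pd.DataFrame({
    'launched': pd.to_datetime(['2015-08-11 12:12:28', '2017-09-02 04:43:57'])
})

# Each .dt attribute returns one integer component per row
toy = toy.assign(hour=toy.launched.dt.hour,
                 day=toy.launched.dt.day,
                 month=toy.launched.dt.month,
                 year=toy.launched.dt.year)
print(toy)
```

Note that `.dt` only works on datetime columns, which is why `parse_dates` was passed to `read_csv` when loading the file.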

Encode the categorical features with LabelEncoder

from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# Apply the label encoder to each column separately
encoded = ks[cat_features].apply(encoder.fit_transform)
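
A small sketch of what `apply(encoder.fit_transform)` does, using an invented two-column frame. The point it illustrates is that the encoder is refit for every column, so each column gets its own integer mapping and the encoder object only remembers the last one.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented categorical values, for illustration only
toy = pd.DataFrame({
    'currency': ['USD', 'GBP', 'USD', 'EUR'],
    'country': ['US', 'GB', 'US', 'DE'],
})

encoder = LabelEncoder()
# fit_transform runs once per column; labels are assigned in sorted order
encoded_toy = toy.apply(encoder.fit_transform)
print(encoded_toy)

# After apply(), classes_ holds only the classes of the last column seen
print(list(encoder.classes_))
```

If the mapping back to the original labels is needed later, one option is to keep a separate fitted encoder per column instead of reusing a single instance.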

Join the encoded categorical columns to the numeric columns listed below

data = ks[['goal', 'hour', 'day', 'month', 'year', 'outcome']].join(encoded)
data.head()

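Split the data into training (train), validation (valid), and test (test) sets. The split is positional: the last 10% of the rows become the test set, the 10% before that the validation set, and the rest the training set.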

valid_fraction = 0.1
valid_size = int(len(data) * valid_fraction)

train = data[:-2 * valid_size]
valid = data[-2 * valid_size:-valid_size]
test = data[-valid_size:]
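
Because the slicing above takes rows in file order without shuffling, it can be worth checking that each part has a similar share of positive labels. The sketch below wraps the same slicing in a helper and runs it on invented data; `split_by_position` and the `toy` frame are hypothetical names introduced here, not part of the original code.

```python
import pandas as pd

def split_by_position(data, valid_fraction=0.1):
    """Slice off the last two blocks of rows as validation and test sets."""
    valid_size = int(len(data) * valid_fraction)
    # Negative slicing breaks down when valid_size is 0, so guard against it
    if valid_size == 0:
        raise ValueError('data is too small for this valid_fraction')
    train = data[:-2 * valid_size]
    valid = data[-2 * valid_size:-valid_size]
    test = data[-valid_size:]
    return train, valid, test

# Invented 20-row frame with alternating labels
toy = pd.DataFrame({'goal': range(20), 'outcome': [0, 1] * 10})
train, valid, test = split_by_position(toy)

for name, part in [('train', train), ('valid', valid), ('test', test)]:
    print(f"{name}: {len(part)} rows, outcome rate {part['outcome'].mean():.2f}")
```

If the outcome rates differ a lot between the parts on the real data, a shuffled or stratified split may be a better fit.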

LightGBM model

import lightgbm as lgb

feature_cols = train.columns.drop('outcome')

dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

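param = {'num_leaves': 64, 'objective': 'binary', 'metric': 'auc'}
num_round = 1000
# Stop once the validation AUC has not improved for 10 rounds, with per-round logging off.
# Callbacks are used here because LightGBM 4.0 removed the early_stopping_rounds
# and verbose_eval arguments of lgb.train.
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False),
                           lgb.log_evaluation(period=0)])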
[LightGBM] [Info] Number of positive: 107340, number of negative: 193350
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008400 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 528
[LightGBM] [Info] Number of data points in the train set: 300690, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.356979 -> initscore=-0.588501
[LightGBM] [Info] Start training from score -0.588501

Predict on the test set and evaluate the model

from sklearn import metrics
ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['outcome'], ypred)

print(f"Test AUC score: {score}")
Test AUC score: 0.7472160532987071
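
To help read the score, here is a minimal sketch of how `roc_auc_score` behaves on a handful of invented labels and scores. AUC measures how often a positive example is ranked above a negative one, so a model that gives every row the same score lands at 0.5.

```python
from sklearn import metrics

# Invented labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly
auc = metrics.roc_auc_score(y_true, y_score)
print(f"Toy AUC: {auc}")

# A constant prediction carries no ranking information
constant_auc = metrics.roc_auc_score(y_true, [0.5] * 4)
print(f"Constant-score AUC: {constant_auc}")
```

Against that 0.5 reference point, the test AUC of about 0.747 shows the baseline model has learned a useful ranking, and later feature engineering can be compared to it.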
