Day-09 Logistic Regression Implementation (Revised)

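Today we will use the Iris dataset provided by sklearn and write logistic regression by hand to classify it. Along the way, this exercise walks through the basic workflow of training and testing a model from scratch, start to finish.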

Getting the Data, Preprocessing, and Defining the Goal

First, of course, we need to get the data and decide what our goal is.
So we load the Iris dataset here and take a first look at it.

from sklearn import datasets

iris = datasets.load_iris()
# print(iris.DESCR)

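Printing iris.DESCR shows that the dataset has three classes and four features.
To keep the demonstration simple, we keep only two classes and only two features.
So the operations below select two of the four features and two of the three classes.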

import pandas as pd
import numpy as np

# use pandas as dataframe and merge features and targets
feature = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.DataFrame(iris.target, columns=['target'])
iris_data = pd.concat([feature, target], axis=1)

# keep only sepal length in cm, sepal width in cm and target
iris_data = iris_data[['sepal length (cm)', 'sepal width (cm)', 'target']]

# keep only Iris-Setosa and Iris-Versicolour classes
iris_data = iris_data[iris_data.target <= 1]
# print(iris_data.head(5))

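Our goal, then, is to predict the species from these two features.
We first split the data into a training set and a test set. The train_test_split function from sklearn's model_selection module does exactly this.
Note that for Logistic Regression we also scale the features first. If the features differ greatly in magnitude, gradient descent converges inefficiently, so scaling avoids that problem (you can think of it as normalization). For this we use sklearn's StandardScaler class.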

from sklearn.model_selection import train_test_split

train_feature, test_feature, train_target, test_target = train_test_split(
    iris_data[['sepal length (cm)', 'sepal width (cm)']], iris_data[['target']], test_size=0.3, random_state=4
)

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# fit the scaler on the training data only, then reuse its mean/std on the test data
train_feature = sc.fit_transform(train_feature)
test_feature = sc.transform(test_feature)
train_target = np.array(train_target)
test_target = np.array(test_target)
# print(train_feature, test_feature)
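Standardization subtracts each feature's mean and divides by its standard deviation. These statistics are normally computed from the training split and then reused on the test split, so the test data never influences the preprocessing. Here is a minimal NumPy sketch of that idea; the numbers are made up for illustration and are not taken from the Iris data:

```python
import numpy as np

# made-up measurements, for illustration only
train = np.array([[5.0, 3.0], [6.0, 2.5], [7.0, 3.5]])
test = np.array([[6.0, 3.0]])

# statistics come from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), the same convention StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training mean/std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1. The test column generally will not, which is expected.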

Logistic Regression

Now that we have the data, let's start writing our Logistic Regression.
As mentioned earlier, the only difference between Logistic Regression and Linear Regression is the sigmoid function, so let's see what the code looks like.

# 1) model
# f = wx + b, sigmoid at the end
class LogisticRegression():
    def linear(self, x, w, b):
        # weighted sum of the features plus the bias
        return np.dot(x, w) + b

    def sigmoid(self, x):
        # squash any real value into the range (0, 1)
        return 1/(1 + np.exp(-x))

    def forward(self, x, w, b):
        # predicted probability of class 1, returned as a column vector
        y_pred = self.sigmoid(self.linear(x, w, b)).reshape(-1, 1)

        return y_pred


model = LogisticRegression()
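The sigmoid in the model computes np.exp(-x) directly. That works for standardized inputs like ours, but for very large negative x the exponential overflows and NumPy emits a warning. One common way around this is to branch on the sign of x. The sketch below is only an illustration, and stable_sigmoid is a hypothetical helper name that is not part of the original code:

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that never exponentiates a large positive number."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0
    # for x >= 0, exp(-x) is at most 1, so this form is safe
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # for x < 0, rewrite as exp(x) / (1 + exp(x)); exp(x) is below 1 here
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

print(stable_sigmoid([-1000.0, 0.0, 1000.0]))
```

Both branches are algebraically the same function, so swapping this in would not change the model's predictions.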

Loss Function & Parameter Update

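As mentioned earlier, MSE is not a suitable loss for Logistic Regression, so here we use cross entropy as the loss function. I'm not confident I can explain the math well myself, so readers who want the detailed derivation may need to look it up on their own.
For the parameter update (from here on we call it optimization, since the point really is to make the parameters better and better), we use Gradient Descent, just as we did for Linear Regression.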

# 2) loss and optimizer
learning_rate = 0.01

# CrossEntropy
class BinaryCrossEntropy():
    def cross_entropy(self, y_pred, target):
        # clip predictions away from 0 and 1 so np.log never receives 0
        y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
        x = target*np.log(y_pred) + (1-target)*np.log(1-y_pred)

        return -(np.mean(x))

    def forward(self, y_pred, target):
        return self.cross_entropy(y_pred, target)

# GradientDescent
class GradientDescent():
    def __init__(self, lr=0.1):
        self.lr = lr

    def forward(self, w, b, y_pred, target, data):
        # with sigmoid + cross entropy, the gradient is (y_pred - target) times the input
        w = w - self.lr * np.mean(data * (y_pred - target), axis=0)
        b = b - self.lr * np.mean((y_pred - target), axis=0)

        return w, b


criterion = BinaryCrossEntropy()
optimizer = GradientDescent(lr=learning_rate)
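The update rule in GradientDescent relies on the fact that, for a sigmoid output with cross entropy loss, the gradient with respect to w is the mean of x * (y_pred - target), and the gradient with respect to b is the mean of (y_pred - target). For readers who would rather check this than take it on faith, here is a small finite-difference sketch. It runs on random synthetic data, and the helper names (bce_loss, analytic_grad, numeric_grad) are made up for this illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, X, t):
    # mean binary cross entropy over all samples
    p = sigmoid(X @ w + b)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

def analytic_grad(w, b, X, t):
    # closed-form gradient: mean of x * (y_pred - target)
    err = sigmoid(X @ w + b) - t
    return X.T @ err / len(t), np.mean(err)

def numeric_grad(w, b, X, t, eps=1e-6):
    # central finite differences, one parameter at a time
    grad_w = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad_w[j] = (bce_loss(w + step, b, X, t) - bce_loss(w - step, b, X, t)) / (2 * eps)
    grad_b = (bce_loss(w, b + eps, X, t) - bce_loss(w, b - eps, X, t)) / (2 * eps)
    return grad_w, grad_b

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                  # synthetic features
t = (rng.random(20) > 0.5).astype(float)      # synthetic 0/1 labels
w = np.array([0.3, -0.2])
b = 0.1

gw_a, gb_a = analytic_grad(w, b, X, t)
gw_n, gb_n = numeric_grad(w, b, X, t)
print(gw_a, gw_n)
print(gb_a, gb_n)
```

The two pairs of numbers should agree to several decimal places, which is a useful sanity check whenever a gradient is written by hand.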

Start Training

As before, we follow the cycle of computing the loss -> updating the parameters, and repeat it for as many epochs as we want.

# 3) training loop
w = np.array([0.0, 0.0])
b = np.array([0.0])
num_epochs = 100
for epoch in range(num_epochs):
    epoch_loss = 0.0
    for i, data in enumerate(train_feature):
        # forward pass and loss (one sample at a time)
        y_pred = model.forward(data, w, b)
        loss = criterion.forward(y_pred, train_target[i])
        epoch_loss += loss

        # update
        w, b = optimizer.forward(w, b, y_pred, train_target[i], data)

    if (epoch+1) % 10 == 0:
        # report the average loss over the epoch, not just the last sample's loss
        print(f'epoch {epoch + 1}: loss = {epoch_loss / len(train_feature):.4f}')
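The training loop above updates the parameters after every single sample, which is stochastic gradient descent. A common alternative is to compute the gradient over the whole training set and update once per epoch. The following is a rough sketch of that full-batch variant. It assumes synthetic two-cluster data rather than the Iris features, and train_full_batch is a hypothetical function name:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_full_batch(X, t, lr=0.1, num_epochs=200):
    """Full-batch gradient descent for logistic regression."""
    w = np.zeros(X.shape[1])
    b = 0.0
    losses = []
    for _ in range(num_epochs):
        p = sigmoid(X @ w + b)
        # clip only for the loss so np.log never receives 0
        p_safe = np.clip(p, 1e-12, 1 - 1e-12)
        losses.append(-np.mean(t * np.log(p_safe) + (1 - t) * np.log(1 - p_safe)))
        # one update per epoch, using the gradient averaged over all samples
        err = p - t
        w = w - lr * (X.T @ err) / len(t)
        b = b - lr * np.mean(err)
    return w, b, losses

# synthetic data: two Gaussian clusters, one per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
t = np.concatenate([np.zeros(50), np.ones(50)])

w, b, losses = train_full_batch(X, t)
train_acc = np.mean((sigmoid(X @ w + b) >= 0.5) == t)
print(f'first loss = {losses[0]:.4f}, last loss = {losses[-1]:.4f}, accuracy = {train_acc:.4f}')
```

Per-sample updates tend to be noisier but make progress with every example, while full-batch updates are smoother and easier to vectorize. For a dataset this small, either approach is workable.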

Finally, let's look at the accuracy on the test set.

# checking testing accuracy
y_pred = model.forward(test_feature, w, b)
y_pred_cls = y_pred.round()
acc = np.equal(y_pred_cls, test_target).sum() / float(test_target.shape[0])
print(f'accuracy = {acc:.4f}')
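One small detail about the evaluation: NumPy's round uses round-half-to-even, so a predicted probability of exactly 0.5 is rounded down to class 0. If you would rather control the cut-off explicitly, a threshold comparison is one option. Here is a minimal sketch with made-up probabilities; predict_classes and accuracy are hypothetical helper names:

```python
import numpy as np

def predict_classes(y_prob, threshold=0.5):
    # probabilities at or above the threshold are assigned to class 1
    return (np.asarray(y_prob) >= threshold).astype(int)

def accuracy(y_pred_cls, target):
    # fraction of predictions that match the labels
    return float(np.mean(np.asarray(y_pred_cls) == np.asarray(target)))

# made-up probabilities and labels, for illustration only
y_prob = np.array([[0.1], [0.5], [0.8], [0.3]])
target = np.array([[0], [1], [1], [1]])

y_cls = predict_classes(y_prob)
acc = accuracy(y_cls, target)
print(f'accuracy = {acc:.4f}')
```

In practice, exact ties at 0.5 are rare with floating-point outputs, so both approaches usually give the same accuracy.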