Day 26 [Python ML, Data Cleaning] Scaling and Normalization

In this post we'll learn how to normalize and scale data.

Setting up the environment

# modules we'll use
import pandas as pd
import numpy as np

# for Box-Cox Transformation
from scipy import stats

# for min_max scaling
from mlxtend.preprocessing import minmax_scaling

# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

# set seed for reproducibility
np.random.seed(0)

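The difference between scaling and normalization

Scaling and normalization are easy to confuse.

The main difference between the two:

Scaling - changes the range of your data
Normalization - changes the shape of the distribution of your data

Let's look at each of them in more depth.

Scaling

Scaling means transforming your data so that it fits within a specific range, such as 0-100 or 0-1.

You want to scale data when you use methods based on the distance between data points, such as support vector machines (SVM) or k-nearest neighbors (KNN).

With these algorithms, a change of 1 in any numeric feature is given the same importance.

For example, suppose a dataset holds prices in both yen and US dollars. One US dollar is worth roughly 100 yen, but if we don't scale the data, SVM or KNN will treat a difference of 1 yen as just as important as a difference of 1 US dollar.

Scaling your variables lets you compare different variables on an equal footing.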
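To make the currency example concrete, here is a minimal sketch with made-up prices (the three products and all the numbers are hypothetical, not taken from any dataset). It compares Euclidean distances before and after min-max scaling each column with plain NumPy.

```python
import numpy as np

# Hypothetical prices for three products: column 0 in yen, column 1 in US dollars.
prices = np.array([
    [1000.0, 10.0],
    [1100.0, 10.0],   # 100 yen (roughly 1 USD) more than product 0
    [1000.0, 20.0],   # 10 USD more than product 0
])

def euclidean(a, b):
    """Straight-line distance between two points."""
    return np.sqrt(np.sum((a - b) ** 2))

# Unscaled: the 100-yen gap looks ten times larger than the 10-dollar gap,
# only because yen amounts are written with bigger numbers.
d_yen_raw = euclidean(prices[0], prices[1])
d_usd_raw = euclidean(prices[0], prices[2])

# Min-max scale each column to [0, 1] so both features share one range.
col_min = prices.min(axis=0)
col_max = prices.max(axis=0)
scaled = (prices - col_min) / (col_max - col_min)

d_yen_scaled = euclidean(scaled[0], scaled[1])
d_usd_scaled = euclidean(scaled[0], scaled[2])

print("raw distances:   ", d_yen_raw, d_usd_raw)
print("scaled distances:", d_yen_scaled, d_usd_scaled)
```

After scaling, neither column dominates the distance just because of the units it is measured in. Note that this puts the features on a common range; it does not convert one currency into the other.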

# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size=1000)

# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns=[0])

# plot both together to compare
fig, ax = plt.subplots(1,2)
sns.histplot(original_data, ax=ax[0], kde=True)
ax[0].set_title("Original Data")
sns.histplot(scaled_data, ax=ax[1], kde=True, legend=False)
ax[1].set_title("Scaled data")

[Figure: two side-by-side distribution plots, "Original Data" (left) and "Scaled data" (right)]

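fig, ax = plt.subplots(1,2) creates two plots side by side: one row, two columns.
minmax_scaling(original_data, columns=[0]) rescales the data to lie between 0 and 1.

The figure above shows that the data, which originally ranged from about 0 to 8, now ranges from 0 to 1. The shape of the distribution is unchanged.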
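As a rough sketch of what min-max scaling does under the hood, each value has the minimum subtracted and is then divided by the range. The code below applies that standard formula by hand with NumPy; it is an illustration of the formula, not mlxtend's actual source, and the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(size=1000)

# Min-max scaling: subtract the minimum, then divide by the range.
scaled = (data - data.min()) / (data.max() - data.min())

# The transform is monotonic, so sorting by the original values
# also sorts the scaled values: only the range changes, not the shape.
order_preserved = bool(np.all(np.diff(scaled[np.argsort(data)]) >= 0))

print("min:", scaled.min(), "max:", scaled.max())
print("order preserved:", order_preserved)
```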

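Normalization

Scaling only changes the range of your data; normalization is a more radical transformation.

The point of normalization is to change your observations so that they can be described by a normal distribution.

The normal distribution, also known as the bell curve, is a specific statistical distribution in which roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and more observations lie close to the mean. The normal distribution is also called the Gaussian distribution.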
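These properties are easy to check numerically. The sketch below draws a large sample from a normal distribution and, for contrast, from an exponential one, then compares mean and median. The sample size and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=100_000)
skewed_sample = rng.exponential(size=100_000)

# For normal data the mean and the median nearly coincide.
normal_gap = abs(normal_sample.mean() - np.median(normal_sample))

# For skewed (exponential) data the mean is pulled toward the long tail.
skewed_gap = abs(skewed_sample.mean() - np.median(skewed_sample))

# Share of normal observations within one standard deviation of the mean.
within_one_sd = np.mean(
    np.abs(normal_sample - normal_sample.mean()) < normal_sample.std()
)

print("normal mean-median gap:", normal_gap)
print("skewed mean-median gap:", skewed_gap)
print("share within one SD:   ", within_one_sd)
```

For the normal sample, roughly two thirds of the observations should land within one standard deviation of the mean, which is the "more observations close to the mean" property in practice.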

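In general, you normalize your data when you plan to use a machine learning or statistics technique that assumes the data is normally distributed.

Examples include linear discriminant analysis (LDA) and Gaussian naive Bayes.

Pro tip: any method with "Gaussian" in its name usually assumes normality, so the data should be normalized first.

The transformation we use here is called the Box-Cox Transformation.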

# normalize the exponential data with boxcox
# (boxcox returns a tuple: the transformed data and the fitted lambda)
normalized_data = stats.boxcox(original_data)

# plot both together to compare
fig, ax = plt.subplots(1,2)
sns.histplot(original_data, ax=ax[0], kde=True)
ax[0].set_title("Original Data")
sns.histplot(normalized_data[0], ax=ax[1], kde=True)
ax[1].set_title("Normalized data")

[Figure: two side-by-side distribution plots, "Original Data" (left) and "Normalized data" (right)]
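To see what stats.boxcox actually returns, here is a small sketch under the same setup (exponential data, which is strictly positive as Box-Cox requires). Called without a lambda, it returns both the transformed values and the lambda it fitted, which is why the plotting code indexes normalized_data[0]. The manual formula and the skewness check are our own additions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(size=1000)  # Box-Cox needs strictly positive values

# With no lmbda argument, boxcox returns (transformed data, fitted lambda).
transformed, fitted_lambda = stats.boxcox(data)

# Box-Cox formula for lambda != 0: (x**lambda - 1) / lambda.
# (For lambda == 0 the transform is log(x) instead.)
manual = (data ** fitted_lambda - 1) / fitted_lambda

skew_before = stats.skew(data)
skew_after = stats.skew(transformed)

print("fitted lambda:", fitted_lambda)
print("matches manual formula:", np.allclose(transformed, manual))
print("skewness before:", skew_before)
print("skewness after: ", skew_after)
```

Skewness measures how lopsided a distribution is; a value near 0 means it is close to symmetric, so a drop in its magnitude after the transform is a quick sanity check that the data now looks more like a bell curve.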