Day 26 [Python ML, Data Cleaning] Scaling and Normalization

In this post we will learn how to normalize and scale data.

Setting up the environment

# modules we'll use
import pandas as pd
import numpy as np

# for Box-Cox Transformation
from scipy import stats

# for min_max scaling
from mlxtend.preprocessing import minmax_scaling

# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

# set seed for reproducibility
np.random.seed(0)

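The difference between scaling and normalization

Scaling and normalization are often confused with each other.

The main difference between the two is:

  • Scaling - changes the range of the data
  • Normalization - changes the shape of the distribution of the data

Now let's look at each of them in more depth.

Scaling data

Scaling means transforming the data so that it fits within a specific range, such as 0-100 or 0-1.

You want to scale data when the methods you use are based on the distance between data points, such as support vector machines (SVM) or k-nearest neighbors (KNN).

In these algorithms, a change of 1 in any numeric feature is given the same importance.

For example, suppose a dataset contains prices in both Yen and US Dollars. 1 US Dollar is worth about 100 Yen, but if we don't scale the data, SVM or KNN will treat a difference of 1 US Dollar as just as important as a difference of 1 Yen.

Scaling the variables lets us compare different variables on an equal footing.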
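
To make the Yen and US Dollar point concrete, here is a minimal sketch using plain NumPy. The toy prices and the helper name `euclidean` are made up for illustration; the column ranges (9 Dollars and 900 Yen) were chosen on purpose so that they mirror the rough 100:1 exchange rate mentioned above.

```python
import numpy as np

# Hypothetical toy data: each row is one item, columns are [price in USD, price in Yen].
prices = np.array([
    [1.0, 100.0],    # a: reference item
    [2.0, 100.0],    # b: differs from a by 1 US Dollar
    [1.0, 101.0],    # c: differs from a by 1 Yen
    [10.0, 100.0],   # d: sets the top of the USD range
    [1.0, 1000.0],   # e: sets the top of the Yen range
])

def euclidean(p, q):
    """Straight-line distance between two points (illustrative helper)."""
    return float(np.linalg.norm(p - q))

# Unscaled: a 1-Dollar gap and a 1-Yen gap look exactly the same to a distance-based method.
raw_usd_gap = euclidean(prices[0], prices[1])
raw_yen_gap = euclidean(prices[0], prices[2])

# Min-max scale each column to 0-1: (x - min) / (max - min).
col_min = prices.min(axis=0)
col_max = prices.max(axis=0)
scaled = (prices - col_min) / (col_max - col_min)

# Scaled: the 1-Dollar gap is now much larger than the 1-Yen gap.
scaled_usd_gap = euclidean(scaled[0], scaled[1])
scaled_yen_gap = euclidean(scaled[0], scaled[2])

print("raw gaps (USD, Yen):   ", raw_usd_gap, raw_yen_gap)
print("scaled gaps (USD, Yen):", scaled_usd_gap, scaled_yen_gap)
```

How much each feature counts after min-max scaling depends on the range each column happens to cover in the data, so treat this as an illustration of the idea rather than a currency conversion.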

# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size=1000)

# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns=[0])

# plot both together to compare
fig, ax = plt.subplots(1,2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")

[Figure: two side-by-side plots titled "Original Data" and "Scaled data"]

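  • fig, ax = plt.subplots(1,2) creates two plots side by side: one row with two columns
  • minmax_scaling(original_data, columns=[0]) scales the data to the range 0 to 1

From the figure above, we can see that the data has been scaled from a range of roughly 0-8 to a range of 0-1. The shape of the distribution is unchanged; only the range differs.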
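
The arithmetic behind min-max scaling is short enough to write out by hand. The sketch below is not how mlxtend implements it; `min_max_scale` is a hypothetical helper written only to show the formula and how the same idea reaches other ranges such as 0-100.

```python
import numpy as np

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they span [new_min, new_max] (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    if old_max == old_min:
        raise ValueError("cannot min-max scale constant data")
    unit = (values - old_min) / (old_max - old_min)  # now spans 0-1
    return unit * (new_max - new_min) + new_min

# Exponential sample, similar in spirit to the data used in this post.
rng = np.random.default_rng(0)
sample = rng.exponential(size=1000)

scaled_0_1 = min_max_scale(sample)
scaled_0_100 = min_max_scale(sample, 0, 100)

print("0-1 range:  ", scaled_0_1.min(), scaled_0_1.max())
print("0-100 range:", scaled_0_100.min(), scaled_0_100.max())

# The transformation is linear, so the ordering of the points never flips:
# walking through the points in their original sorted order, the scaled
# values never decrease.
order = np.argsort(sample)
print("order preserved:", bool(np.all(np.diff(scaled_0_1[order]) >= 0)))
```

Because the mapping is a straight line, a histogram of the scaled values has the same shape as the original one, only with different numbers on the x-axis.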

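Normalization

Scaling only changes the range of the data; normalization is a more radical transformation.

The point of normalization is to change the observations so that they can be described by a normal distribution.

The normal distribution, also known as the bell curve, is a specific statistical distribution in which roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and more observations lie close to the mean. The normal distribution is also called the Gaussian distribution.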
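
These properties are easy to check on simulated data. The following is a minimal sketch; the sample sizes and distribution parameters are arbitrary choices for illustration, and the exponential sample is included only as a skewed contrast.

```python
import numpy as np

rng = np.random.default_rng(0)

# One sample from a normal distribution and one from a right-skewed distribution.
normal_sample = rng.normal(loc=0.0, scale=1.0, size=100_000)
skewed_sample = rng.exponential(scale=1.0, size=100_000)

# For normal data the mean and the median nearly coincide.
normal_gap = abs(normal_sample.mean() - np.median(normal_sample))

# For skewed data the long tail pulls the mean away from the median.
skewed_gap = abs(skewed_sample.mean() - np.median(skewed_sample))

# Share of normal observations that fall above the mean (should be close to one half).
share_above_mean = float((normal_sample > normal_sample.mean()).mean())

print("normal: |mean - median| =", normal_gap)
print("skewed: |mean - median| =", skewed_gap)
print("normal: share above mean =", share_above_mean)
```

A large gap between mean and median is a quick hint that the data is skewed and may benefit from normalization before using a method that assumes normality.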

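In general, you should normalize your data if you are going to use a machine learning or statistics technique that assumes the data is normally distributed.

Examples include linear discriminant analysis (LDA) and Gaussian naive Bayes.

Pro tip: any method with "Gaussian" in its name usually assumes normally distributed data, so the data should be normalized first.

The transformation we use here is called the Box-Cox transformation.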

# normalize the exponential data with boxcox
normalized_data = stats.boxcox(original_data)

# plot both together to compare
fig, ax = plt.subplots(1,2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_data[0], ax=ax[1])
ax[1].set_title("Normalized data")

[Figure: two side-by-side plots titled "Original Data" and "Normalized data"]
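
The code above plots normalized_data[0] because stats.boxcox, when called without an explicit lambda, returns two things: the transformed values and the lambda it fitted. The sketch below unpacks both and compares the result with the Box-Cox formula written out by hand. The variable names are my own, and the check assumes the fitted lambda is not zero (for lambda equal to zero the transformation is the natural logarithm instead). Box-Cox also requires strictly positive input.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(size=1000)  # strictly positive, right-skewed

# Without an explicit lmbda, boxcox returns (transformed values, fitted lambda).
transformed, fitted_lambda = stats.boxcox(data)

# Box-Cox formula for lambda != 0: (x**lambda - 1) / lambda
manual = (data ** fitted_lambda - 1.0) / fitted_lambda

skew_before = float(stats.skew(data))
skew_after = float(stats.skew(transformed))

print("fitted lambda:", fitted_lambda)
print("matches hand-written formula:", bool(np.allclose(transformed, manual)))
print("skewness before:", skew_before)
print("skewness after: ", skew_after)
```

Skewness closer to zero after the transformation suggests a more symmetric, more bell-like shape, which is what the "Normalized data" plot shows visually.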

