In this post we'll learn how to normalize and scale data.
# modules we'll use
import pandas as pd
import numpy as np
# for Box-Cox Transformation
from scipy import stats
# for min_max scaling
from mlxtend.preprocessing import minmax_scaling
# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt
# set seed for reproducibility
np.random.seed(0)
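Scaling and normalization are easy to confuse.
The main difference is that scaling changes the range of your data, while normalization changes the shape of its distribution.
Let's look at each of them in more detail.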
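Scaling means transforming your data so that it fits within a specific range, such as 0-100 or 0-1.
You want to scale data when the methods you use are based on distances between data points, such as support vector machines (SVM) or k-nearest neighbors (KNN).
In these algorithms, a change of 1 in any numeric feature is given the same importance.
For example, suppose a dataset has prices in both yen and US dollars. One US dollar is worth roughly 100 yen, but if we don't scale the data, SVM or KNN will treat a difference of 1 US dollar as just as important as a difference of 1 yen.
Scaling puts the variables on an equal footing so that they can be compared.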
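To make the currency example concrete, here is a minimal sketch using NumPy only. The prices are made up for illustration, and `min_max` and `distance` are hypothetical helpers written for this post, not library functions.

```python
import numpy as np

# Hypothetical prices of three items: column 0 in yen, column 1 in US dollars.
# The values are invented purely for illustration.
prices = np.array([
    [1000.0, 10.0],
    [1100.0, 10.0],   # 100 yen (about 1 USD) more than item 0
    [1000.0, 11.0],   # 1 USD more than item 0
])

def min_max(x):
    """Rescale each column of x to the range [0, 1]."""
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def distance(a, b):
    """Euclidean distance, the kind of measure KNN relies on."""
    return np.sqrt(np.sum((a - b) ** 2))

# Unscaled: the yen column dominates the distance.
raw_yen_gap = distance(prices[0], prices[1])
raw_usd_gap = distance(prices[0], prices[2])

# Scaled: both differences now carry the same weight.
scaled = min_max(prices)
scaled_yen_gap = distance(scaled[0], scaled[1])
scaled_usd_gap = distance(scaled[0], scaled[2])

print("raw:   ", raw_yen_gap, raw_usd_gap)
print("scaled:", scaled_yen_gap, scaled_usd_gap)
```

Before scaling, the 100-yen gap looks a hundred times larger than the 1-dollar gap even though the two are worth about the same. After scaling, a distance-based method sees them as equally far apart.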
# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size=1000)
# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns=[0])
# plot both together to compare
fig, ax = plt.subplots(1,2)
# histplot replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(original_data, ax=ax[0], kde=True)
ax[0].set_title("Original Data")
# minmax_scaling returns a 2-D array with one column, so flatten it for plotting
sns.histplot(scaled_data.ravel(), ax=ax[1], kde=True)
ax[1].set_title("Scaled data")
plt.show()
[Figure: two side-by-side distribution plots, "Original Data" (left) and "Scaled data" (right)]
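fig, ax = plt.subplots(1,2) creates a figure with two plots side by side: one row and two columns.
minmax_scaling(original_data, columns=[0]) rescales the data so that it lies between 0 and 1.
In the figure we can see that the data, which originally ranged from about 0 to 8, now ranges from 0 to 1. The shape of the distribution has not changed.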
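For reference, min-max scaling can be written by hand as (x - min) / (max - min). The sketch below uses NumPy only. `manual_minmax` is a hypothetical helper written for this post, not part of mlxtend, and it is only assumed to mirror what `minmax_scaling` does with its default range of 0 to 1.

```python
import numpy as np

def manual_minmax(x, new_min=0.0, new_max=1.0):
    """Linearly rescale x so its smallest value maps to new_min and its largest to new_max."""
    x = np.asarray(x, dtype=float)
    unit = (x - x.min()) / (x.max() - x.min())   # values now lie in [0, 1]
    return unit * (new_max - new_min) + new_min

rng = np.random.default_rng(0)
sample = rng.exponential(size=1000)

scaled_01 = manual_minmax(sample)              # range 0-1
scaled_0_100 = manual_minmax(sample, 0, 100)   # range 0-100

print(scaled_01.min(), scaled_01.max())
print(scaled_0_100.min(), scaled_0_100.max())
```

Because the mapping is linear, only the axis changes. The relative spacing of the points, and therefore the shape of the histogram, stays the same.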
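Scaling only changes the range of your data. Normalization is a more radical transformation.
The point of normalization is to change your observations so that they can be described as a normal distribution.
The normal distribution, also known as the bell curve, is a specific statistical distribution in which roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and more observations lie close to the mean than far from it. The normal distribution is also known as the Gaussian distribution.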
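These properties are easy to check numerically. The following is a small sketch using NumPy only; the sample size and seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=10_000)
skewed_sample = rng.exponential(size=10_000)

# For a normal distribution the mean and the median nearly coincide.
normal_gap = abs(normal_sample.mean() - np.median(normal_sample))
# For a skewed distribution such as the exponential, they drift apart.
skewed_gap = abs(skewed_sample.mean() - np.median(skewed_sample))

print(f"normal:      |mean - median| = {normal_gap:.3f}")
print(f"exponential: |mean - median| = {skewed_gap:.3f}")
```

A large gap between mean and median is one quick hint that the data is not normally distributed.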
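In general, you should normalize your data if you are going to use a machine learning or statistics technique that assumes the data is normally distributed.
Examples include linear discriminant analysis (LDA) and Gaussian naive Bayes.
Pro tip: any method with "Gaussian" in its name probably assumes normally distributed data.
The transformation we use here to normalize the data is called the Box-Cox transformation.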
# normalize the exponential data with boxcox
normalized_data = stats.boxcox(original_data)
# plot both together to compare
fig, ax = plt.subplots(1,2)
# histplot replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(original_data, ax=ax[0], kde=True)
ax[0].set_title("Original Data")
# stats.boxcox returned (transformed values, fitted lambda); index 0 is the transformed data
sns.histplot(normalized_data[0], ax=ax[1], kde=True)
ax[1].set_title("Normalized data")
plt.show()
[Figure: two side-by-side distribution plots, "Original Data" (left) and "Normalized data" (right)]
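To see what the transformation does under the hood, here is a minimal sketch of the Box-Cox formula for a fixed lambda, using NumPy only. `box_cox` and `skewness` are hypothetical helpers written for this post, and the value 0.25 is an arbitrary choice for illustration. When `stats.boxcox` is called without a lambda, it estimates one by maximum likelihood and returns it together with the transformed data.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of strictly positive data for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)             # limiting case as lambda approaches 0
    return (x ** lam - 1.0) / lam

def skewness(x):
    """Sample skewness: close to 0 for a symmetric distribution."""
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

rng = np.random.default_rng(0)
sample = rng.exponential(size=1000)
transformed = box_cox(sample, 0.25)  # 0.25 is an illustrative, hand-picked lambda

print("skewness before:", skewness(sample))
print("skewness after: ", skewness(transformed))
```

Unlike min-max scaling, this mapping is not linear, which is why it changes the shape of the distribution and pulls in the long right tail.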