Python Algorithms Day 1 - Programming Basics & Introduction

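Chap.O Programming Basics & Introduction:

Part 1. Tools commonly used for algorithm development include:

1-1. Python (free, many packages, good system integration)

1-2. R (free, many packages, poor system integration)

1-3. Matlab (expensive, fewer packages but full-featured, good system integration)

Part 2. What can Python do?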

  1. Program development
  2. Website development, crawlers
  3. Statistics, Mathematics
  4. Introductory programming language
  5. System management scripts
  6. Data Science (focus on analyzing data)
  7. Data Mining Algorithms (focus on analyzing data)
  8. Deep Learning: Neural Network, CNN/RNN (focus on predicting data)

Part 3. So which fields is AI applied in?

  1. Natural Language Understanding
  2. Computer Vision
  3. Speech Understanding (speech recognition)
  4. Robotic Application
  5. Intelligent Agent: chatbots, AlphaGo, etc.
  6. Self-driving Car
  7. Healthcare: MRI image processing, diagnosis, new drug development, etc.
  8. Smart manufacturing, smart agriculture, smart finance, etc.


Chap.I Theoretical Foundations:


Part 1: Linear algebra

Part 2: Differential & Integral calculus

Part 3: Vector

Part 4: Statistics & Probability

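Chap.II Deep Learning and Model Optimization:

Every predictive model follows the 10 major steps shown in the figure below. This chapter explains how each step is applied, in order.

Introduction to sklearn: how to choose a suitable algorithm


Part 1. Supervised learning:

The data has been labeled (Labeling), i.e. every sample comes with a correct answer.

Classification: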



import pandas as pd
import numpy as np
from sklearn import datasets     # import the datasets module from Scikit-Learn

# 1. Data Set
ds = datasets.load_iris()        # dataset: call load_iris from datasets
print(ds.DESCR)                  # DESCR: description of the loaded dataset

X = pd.DataFrame(ds.data, columns=ds.feature_names)
y = ds.target

# 2. Data clean (missing value check)
print(X.isnull().sum())

>>  sepal length (cm)    0
    sepal width (cm)     0
    petal length (cm)    0
    petal width (cm)     0
    dtype: int64

# 3. Feature Engineering
# No need

# 4. Data Split (Training data & Test data)
from sklearn.model_selection import train_test_split    

# test_size=0.2: 20% of the data is held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

print(X_train.shape, y_train.shape)

>>  (120, 4) (120,)

# 5. Define and train the KNN model
from sklearn.neighbors import KNeighborsClassifier

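# n_neighbors: a hyperparameter
clf = KNeighborsClassifier(n_neighbors = 3)

# Fit (train): regression, classification, dimensionality reduction, etc. all use fit(X_train, y_train)
clf.fit(X_train, y_train)

# algorithm.score: feed in the test data and score the predictions (accuracy for a classifier)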
print(f'score={clf.score(X_test, y_test)}')

>>  score=0.9

# Verify the answers: true labels first, then the predictions
print(' '.join(y_test.astype(str)))
print(' '.join(clf.predict(X_test).astype(str)))

>>  1 2 0 0 0 2 1 1 1 0 1 2 2 2 0 2 1 1 1 0 1 1 2 2 1 1 0 2 2 2
    1 2 0 0 0 2 1 1 1 0 1 1 2 2 0 2 1 1 1 0 1 1 2 2 1 2 0 2 1 2

# Inspect the predicted probabilities
print(clf.predict_proba(X_test.head()))  # class probabilities for the first 5 rows of X_test

>>  [[0. 1. 0.]
     [0. 0. 1.]
     [1. 0. 0.]
     [1. 0. 0.]
     [1. 0. 0.]]
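
The listing above fixes n_neighbors at 3. Since n_neighbors is a hyperparameter, one way to see how sensitive the result is to that choice is to compare a few values with cross-validation. The following is only a minimal sketch: the candidate values, the 5-fold setting, and the names cv_scores and best_k are assumptions made for illustration and are not part of the original walkthrough.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

ds = datasets.load_iris()
X, y = ds.data, ds.target

# Mean 5-fold cross-validated accuracy for a few candidate values of k.
# The candidate list and cv=5 are illustrative assumptions.
cv_scores = {}
for k in (1, 3, 5, 7, 9):
    clf = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(clf, X, y, cv=5).mean()

for k, s in cv_scores.items():
    print(f'n_neighbors={k}: mean accuracy={s:.3f}')

# Pick the candidate with the highest mean accuracy.
best_k = max(cv_scores, key=cv_scores.get)
print('best n_neighbors:', best_k)
```

Cross-validation averages over several train/test splits, so the comparison depends less on one lucky or unlucky split than a single call to train_test_split does.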


import pandas as pd
import numpy as np
from sklearn import datasets

# 1. Dataset
ds = datasets.load_breast_cancer()
X = pd.DataFrame(ds.data, columns=ds.feature_names)
y = ds.target

# 2. Data clean
# no need

# 3. Feature Engineering
# no need

# 4. Split
from sklearn.model_selection import train_test_split    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# 5. Define and train the KNN model
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors = 3)

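# Fit (train): regression, classification, dimensionality reduction, etc. all use fit(X_train, y_train)
clf.fit(X_train, y_train)

# algorithm.score: feed in the test data and score the predictions (accuracy for a classifier)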
print(f'score={clf.score(X_test, y_test)}')

>>  score=0.9210526315789473

# Verify the answers: true labels first, then the predictions
print(' '.join(y_test.astype(str)))
print(' '.join(clf.predict(X_test).astype(str)))

>>  1 1 0 0 0 ... 0
    1 1 0 0 0 ... 0

# Inspect the predicted probabilities
print(clf.predict_proba(X_test.head()))

>>  [[0. 1.]
     [0. 1.]
     [1. 0.]
     [1. 0.]
     [1. 0.]]
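
Comparing the two printed label rows by eye gets tedious with more than a hundred test samples. A confusion matrix summarizes the same comparison in a small table. This is a sketch under assumptions: random_state=0 is added only so the split is repeatable, and the names y_pred and cm are illustrative.

```python
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

ds = datasets.load_breast_cancer()
# random_state=0 is an assumption, used here only for a repeatable split.
X_train, X_test, y_train, y_test = train_test_split(
    ds.data, ds.target, test_size=0.2, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
# (0 = malignant, 1 = benign in this dataset).
cm = confusion_matrix(y_test, y_pred)
print(cm)

# The diagonal holds the correct predictions, so this equals clf.score.
print('accuracy from matrix:', cm.trace() / cm.sum())
```

The off-diagonal cells show which kind of mistake the model makes, which a single accuracy number hides.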

Regression:





import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. Dataset
year=[1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027, 2028, 2029, 2030, 2031, 2032, 2033, 2034, 2035, 2036, 2037, 2038, 2039, 2040, 2041, 2042, 2043, 2044, 2045, 2046, 2047, 2048, 2049, 2050, 2051, 2052, 2053, 2054, 2055, 2056, 2057, 2058, 2059, 2060, 2061, 2062, 2063, 2064, 2065, 2066, 2067, 2068, 2069, 2070, 2071, 2072, 2073, 2074, 2075, 2076, 2077, 2078, 2079, 2080, 2081, 2082, 2083, 2084, 2085, 2086, 2087, 2088, 2089, 2090, 2091, 2092, 2093, 2094, 2095, 2096, 2097, 2098, 2099, 2100]
pop=[2.53, 2.57, 2.62, 2.67, 2.71, 2.76, 2.81, 2.86, 2.92, 2.97, 3.03, 3.08, 3.14, 3.2, 3.26, 3.33, 3.4, 3.47, 3.54, 3.62, 3.69, 3.77, 3.84, 3.92, 4.0, 4.07, 4.15, 4.22, 4.3, 4.37, 4.45, 4.53, 4.61, 4.69, 4.78, 4.86, 4.95, 5.05, 5.14, 5.23, 5.32, 5.41, 5.49, 5.58, 5.66, 5.74, 5.82, 5.9, 5.98, 6.05, 6.13, 6.2, 6.28, 6.36, 6.44, 6.51, 6.59, 6.67, 6.75, 6.83, 6.92, 7.0, 7.08, 7.16, 7.24, 7.32, 7.4, 7.48, 7.56, 7.64, 7.72, 7.79, 7.87, 7.94, 8.01, 8.08, 8.15, 8.22, 8.29, 8.36, 8.42, 8.49, 8.56, 8.62, 8.68, 8.74, 8.8, 8.86, 8.92, 8.98, 9.04, 9.09, 9.15, 9.2, 9.26, 9.31, 9.36, 9.41, 9.46, 9.5, 9.55, 9.6, 9.64, 9.68, 9.73, 9.77, 9.81, 9.85, 9.88, 9.92, 9.96, 9.99, 10.03, 10.06, 10.09, 10.13, 10.16, 10.19, 10.22, 10.25, 10.28, 10.31, 10.33, 10.36, 10.38, 10.41, 10.43, 10.46, 10.48, 10.5, 10.52, 10.55, 10.57, 10.59, 10.61, 10.63, 10.65, 10.66, 10.68, 10.7, 10.72, 10.73, 10.75, 10.77, 10.78, 10.79, 10.81, 10.82, 10.83, 10.84, 10.85]
df = pd.DataFrame({'year' : year, 'pop' : pop})

# 2. Compute the mean squared error (MSE) of the degree-1 fit
x, y = np.array(year), np.array(pop)
in_year = int(input('Please input a year from 1950 to 2100: '))
fit1 = np.polyfit(x, y, 1)

if 2100 >= in_year >= 1950:
    print('The actual pop is:', y[in_year-1950])
    print('Predict pop is:', f'{(np.poly1d(fit1)(in_year)):.2f}')
    y1 = fit1[0]*x + fit1[1]
    print('MSE is:', f'{((y - y1)**2).mean():.2}')
else:
    print('Wrong year!')

# 3. Plot
def ppf(x, y, order):
    fit = np.polyfit(x, y, order)      # polynomial regression: solve for the coefficients of y = a + bx + cx^2 + ...
    p = np.poly1d(fit)                 # build a polynomial function from the polyfit coefficients
    t = np.linspace(1950, 2100, 2000)
    plt.plot(x, y, 'ro', t, p(t), 'b--')

plt.figure(figsize=(18, 4))
titles = ['fitting with 1', 'fitting with 3', 'fitting with 50']
for i, o in enumerate([1, 3, 50]):
    plt.subplot(1, 3, i+1)
    ppf(year, pop, o)
    plt.title(titles[i], fontsize=20)
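
The three panels differ only in the polynomial order. A quick numeric check of the same idea is to compute the training MSE for several orders. The sketch below uses every tenth value from the year/pop lists above (1950 to 2050) to stay short; the helper fit_mse, the centring of the years, and the choice of orders 1 to 3 are assumptions made for illustration.

```python
import numpy as np

# Every tenth value from the year/pop lists above (1950 to 2050).
year = np.arange(1950, 2051, 10)
pop = np.array([2.53, 3.03, 3.69, 4.45, 5.32, 6.13,
                6.92, 7.72, 8.42, 9.04, 9.55])

# Centre the years so higher-order fits stay numerically well conditioned.
x = year - year.mean()

def fit_mse(x, y, order):
    """Fit a polynomial of the given order and return its training MSE."""
    coeffs = np.polyfit(x, y, order)
    residuals = y - np.polyval(coeffs, x)
    return float(np.mean(residuals ** 2))

mse_by_order = {order: fit_mse(x, pop, order) for order in (1, 2, 3)}
for order, mse in mse_by_order.items():
    print(f'order={order}: MSE={mse:.5f}')
```

Training MSE can only stay level or shrink as the order rises, because each higher-order polynomial contains the lower-order ones. A very low training MSE at a very high order is therefore not proof of a better model and is a typical sign of overfitting.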


import pandas as pd
import numpy as np
from sklearn import datasets

# 1. Dataset
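# Note: load_boston was removed in scikit-learn 1.2, so this line needs an older version
ds = datasets.load_boston()
X = pd.DataFrame(ds.data, columns=ds.feature_names)
y = ds.target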

# 2. Data clean

# 3. Feature Engineering

# 4. Split
from sklearn.model_selection import train_test_split    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
print(X_train.shape, y_train.shape)

>> (404, 13) (404,)

# 5. Define and train the LinearRegression model
from sklearn.linear_model import LinearRegression

clf = LinearRegression()

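# Fit (train): regression, classification, dimensionality reduction, etc. all use fit(X_train, y_train)
clf.fit(X_train, y_train)

# algorithm.score: feed in the test data and score the predictions (R^2 for a regressor)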
print(f'score={clf.score(X_test, y_test)}')
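
For a regressor, score does not report accuracy; it reports the coefficient of determination R^2. The sketch below checks this against r2_score. It uses load_diabetes as a stand-in dataset, since load_boston is not available in current scikit-learn releases; the name reg and random_state=0 are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Stand-in dataset (assumption): diabetes instead of Boston housing.
ds = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    ds.data, ds.target, test_size=0.2, random_state=0)

reg = LinearRegression().fit(X_train, y_train)

# score() on a regressor and r2_score on its predictions give the same value.
score = reg.score(X_test, y_test)
r2 = r2_score(y_test, reg.predict(X_test))
print(f'score={score:.4f}, r2_score={r2:.4f}')
```

R^2 is at most 1 and can be negative when the model predicts worse than always returning the mean of y_test.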

Part 2. Unsupervised learning:

Some or all of the data is unlabeled (Unlabeling), i.e. there are no correct answers.

2-1. Clustering

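Clustering groups points with similar features together. Conceptually it resembles Classification, except that no labels are given. See the figure below.

The following is a CLV clustering example: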

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

ds = pd.read_csv('CLV.csv')

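A. Manual clustering

Split into 1 to 10 clusters and compute the within-cluster sum of squares (WCSS) for each. WCSS keeps shrinking as clusters are added, so the elbow method looks for the point where the curve flattens rather than the absolute minimum.

# No y (unsupervised)
X = ds.iloc[:, [0, 1]].values   # assumption: the first two columns are annual income and annual spend

from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    km=KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)    # inertia_: within-cluster sum of squares
plt.plot(range(1,11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('wcss')
plt.show()

2, 4, or 10 clusters can be used.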

B. Automatic clustering

Use sklearn's built-in silhouette coefficient (Silhouette Coefficient) calculation.

from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

for n_cluster in range(2, 11):
    kmeans = KMeans(n_clusters=n_cluster).fit(X)
    label = kmeans.labels_
    sil_coeff = silhouette_score(X, label, metric='euclidean')
    print(f"n_clusters={n_cluster}, Silhouette Coefficient is {sil_coeff:.4}")
>>  n_clusters=2, Silhouette Coefficient is 0.4401
    n_clusters=3, Silhouette Coefficient is 0.3596
    n_clusters=4, Silhouette Coefficient is 0.3721
    n_clusters=5, Silhouette Coefficient is 0.3617
    n_clusters=6, Silhouette Coefficient is 0.3632
    n_clusters=7, Silhouette Coefficient is 0.3629
    n_clusters=8, Silhouette Coefficient is 0.3538
    n_clusters=9, Silhouette Coefficient is 0.3441
    n_clusters=10, Silhouette Coefficient is 0.3477

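Judging by the output above, 2 clusters gives the highest silhouette coefficient (0.4401); higher values mean better-separated clusters. The KMeans visualization later in this section uses n_clusters=8.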
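
Choosing the cluster count with the highest silhouette coefficient can itself be automated. The sketch below does this on synthetic data, because CLV.csv is not reproduced here: make_blobs, its parameters, and the names X_demo, sil_by_k and best_k are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the customer data (assumption):
# 300 points generated around 4 centres.
X_demo, _ = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.6, random_state=0)

# Silhouette coefficient for each candidate number of clusters.
sil_by_k = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    sil_by_k[k] = silhouette_score(X_demo, labels, metric='euclidean')

# The candidate with the highest coefficient is the suggested cluster count.
best_k = max(sil_by_k, key=sil_by_k.get)
print('best k by silhouette:', best_k)
```

The silhouette coefficient needs at least 2 clusters, which is why the loop starts at 2 while the elbow loop starts at 1.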


# Fitting kmeans to the dataset
km4=KMeans(n_clusters=8,init='k-means++', max_iter=300, n_init=10, random_state=0)
y_means = km4.fit_predict(X)

# Visualising the clusters for k=8
plt.scatter(X[y_means==0,0],X[y_means==0,1],s=50, c='purple',label='Cluster1')
plt.scatter(X[y_means==1,0],X[y_means==1,1],s=50, c='blue',label='Cluster2')
plt.scatter(X[y_means==2,0],X[y_means==2,1],s=50, c='green',label='Cluster3')
plt.scatter(X[y_means==3,0],X[y_means==3,1],s=50, c='cyan',label='Cluster4')
plt.scatter(X[y_means==4,0],X[y_means==4,1],s=50, c='yellow',label='Cluster5')
plt.scatter(X[y_means==5,0],X[y_means==5,1],s=50, c='black',label='Cluster6')
plt.scatter(X[y_means==6,0],X[y_means==6,1],s=50, c='brown',label='Cluster7')
plt.scatter(X[y_means==7,0],X[y_means==7,1],s=50, c='red',label='Cluster8')

plt.scatter(km4.cluster_centers_[:,0], km4.cluster_centers_[:,1],s=200,marker='s', c='red', alpha=0.7, label='Centroids')
plt.title('Customer segments')
plt.xlabel('Annual income of customer')
plt.ylabel('Annual spend from customer on site')
plt.legend()
plt.show()

Note: Customer analysis typically uses RFM (Recency-Frequency-Monetary) analysis.
This belongs to step 3 of machine learning: Feature Engineering.
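
As a rough illustration of RFM as a feature engineering step, the sketch below derives the three features from a transaction log with pandas. The table tx, its column names, the values, and the snapshot date are all hypothetical; they are not taken from CLV.csv.

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase.
tx = pd.DataFrame({
    'customer_id': ['A', 'A', 'B', 'C', 'C', 'C'],
    'date': pd.to_datetime(['2024-01-05', '2024-03-01', '2024-02-10',
                            '2024-01-20', '2024-02-15', '2024-03-10']),
    'amount': [120.0, 80.0, 200.0, 50.0, 60.0, 70.0],
})
snapshot = pd.Timestamp('2024-03-31')  # hypothetical analysis date

grouped = tx.groupby('customer_id')
rfm = pd.DataFrame({
    # Recency: days since the customer's most recent purchase.
    'recency': (snapshot - grouped['date'].max()).dt.days,
    # Frequency: number of purchases.
    'frequency': grouped.size(),
    # Monetary: total amount spent.
    'monetary': grouped['amount'].sum(),
})
print(rfm)
```

The resulting three columns could then be passed to KMeans in place of the raw income and spend columns, usually after scaling them to comparable ranges.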

Part 3. Reinforcement learning:




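Homework (tips regression):

Using sklearn's built-in datasets, follow the steps above to complete regression or classification on the following data:

1. Wine classification

Hint: ds = datasets.load_wine()

2. Diabetes regression

Hint: ds = datasets.load_diabetes()

3. Tips regression

Hint: sklearn.datasets has no load_tips(); the tips dataset ships with seaborn, e.g. ds = seaborn.load_dataset('tips')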


1. Introducing Python (Bill Lubanovic) + github

2. Python Data Science Handbook (Jake VanderPlas) + github

3. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow

4. Deep Learning (Ian Goodfellow): very hard going for readers outside academia, not recommended...

