Day 22 [Python ML, Data Visualization] Scatter Plots

Setting up the Jupyter notebook

import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

Setup Complete

Loading and displaying the data

# Path of the file to read
insurance_filepath = "./insurance.csv"

# Read the file into a variable insurance_data
insurance_data = pd.read_csv(insurance_filepath)

Once the data is loaded, print its first 5 rows

insurance_data.head()

Scatter plots

To create a simple scatter plot, first specify the data for the x-axis and the y-axis

sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'])

<AxesSubplot:xlabel='bmi', ylabel='charges'>

Judging from the plot above, people with a higher BMI generally tend to be charged more

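Adding a regression line to the plot helps check whether this guess holds: an upward-sloping line indicates a positive relationship between BMI and charges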

sns.regplot(x=insurance_data['bmi'], y=insurance_data['charges'])

<AxesSubplot:xlabel='bmi', ylabel='charges'>
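
As a numeric companion to the regression line, the Pearson correlation between the two columns can be computed with pandas. This is only a minimal sketch: the small `toy` DataFrame is made-up illustrative data, not the insurance dataset, and in the notebook you would use `insurance_data` in its place.

```python
import pandas as pd

# Hypothetical toy data standing in for insurance_data (values are made up)
toy = pd.DataFrame({
    "bmi": [20.0, 25.0, 30.0, 35.0, 40.0],
    "charges": [3000.0, 5000.0, 4500.0, 9000.0, 11000.0],
})

# Pearson correlation: a value near +1 means a strong positive linear relationship
r = toy["bmi"].corr(toy["charges"])
print(f"correlation between bmi and charges: {r:.2f}")
```

A positive value agrees with an upward-sloping regression line. It says nothing about causation.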

Color-coded scatter plots

To see how smoking (smoker) relates to BMI and charges, we can color-code the points in the plot

sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'], hue=insurance_data['smoker'])

<AxesSubplot:xlabel='bmi', ylabel='charges'>

Now we can use sns.lmplot to see how the regression lines of the two groups differ

sns.lmplot(x="bmi", y="charges", hue="smoker", data=insurance_data)

<seaborn.axisgrid.FacetGrid at 0x7f894623c080>

The regression line for smokers is much steeper than the one for non-smokers
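
The difference in steepness can also be checked numerically by fitting one straight line per group. The sketch below uses `numpy.polyfit` on a hypothetical `toy` DataFrame with made-up values. It is not how `sns.lmplot` is implemented internally, just an ordinary least-squares line fitted to each group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for insurance_data (values are made up)
toy = pd.DataFrame({
    "bmi":     [20.0, 30.0, 40.0, 20.0, 30.0, 40.0],
    "charges": [15000.0, 30000.0, 45000.0, 4000.0, 5000.0, 6000.0],
    "smoker":  ["yes", "yes", "yes", "no", "no", "no"],
})

# Fit a degree-1 polynomial (a straight line) per group.
# np.polyfit returns [slope, intercept], so index 0 is the slope.
slopes = {
    label: np.polyfit(group["bmi"], group["charges"], deg=1)[0]
    for label, group in toy.groupby("smoker")
}

for label, slope in slopes.items():
    print(f"smoker={label}: slope = {slope:.1f} charges per BMI unit")
```

A larger slope for one group means charges rise faster with BMI in that group, which is what a steeper line in the plot shows.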

sns.lmplot works slightly differently from the plotting functions used so far

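Previously the x-axis was specified as x=insurance_data['bmi']; with sns.lmplot only the column name is needed, x="bmi"
The same applies to the y-axis and hue
data=insurance_data tells the function which DataFrame to look those column names up in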
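
The column-name style is not limited to sns.lmplot: sns.scatterplot also accepts `data=` together with column names. A minimal sketch on a hypothetical `toy` DataFrame (made-up values). The `Agg` backend line is only there so the snippet runs outside a notebook and can be dropped in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for insurance_data (values are made up)
toy = pd.DataFrame({
    "bmi":     [22.0, 27.5, 31.0, 35.5, 24.0, 38.0],
    "charges": [4000.0, 21000.0, 6500.0, 40000.0, 3500.0, 9000.0],
    "smoker":  ["no", "yes", "no", "yes", "no", "no"],
})

# Column names as strings, with the DataFrame passed through data=
ax = sns.scatterplot(x="bmi", y="charges", hue="smoker", data=toy)
print(ax.get_xlabel(), ax.get_ylabel())
```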

To draw a categorical scatter plot, use swarmplot

sns.swarmplot(x=insurance_data['smoker'],
              y=insurance_data['charges'])

/opt/conda/lib/python3.6/site-packages/seaborn/categorical.py:1296: UserWarning: 67.3% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)





<AxesSubplot:xlabel='smoker', ylabel='charges'>
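
The UserWarning above suggests two remedies: smaller markers or a strip plot. The sketch below shows both side by side on a hypothetical `toy` DataFrame with made-up values. The marker size of 3 is an arbitrary choice for illustration, and the `Agg` backend line can be dropped in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for insurance_data (values are made up)
toy = pd.DataFrame({
    "smoker":  ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    "charges": [3000.0, 3200.0, 3100.0, 9000.0, 20000.0, 21000.0, 40000.0, 41000.0],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Option 1: keep the swarm plot but shrink the markers so more points fit
sns.swarmplot(x="smoker", y="charges", data=toy, size=3, ax=ax1)

# Option 2: a strip plot jitters points randomly and allows them to overlap
sns.stripplot(x="smoker", y="charges", data=toy, jitter=True, ax=ax2)
```

The swarm plot avoids overlapping points, which is why it can run out of room on a large dataset. The strip plot gives up that guarantee and always places every point.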