Day 25 [Python ML, Data Cleaning] Handling Missing Values

Start by taking a look at the data

# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
nfl_data = pd.read_csv("./NFL Play by Play 2009-2017 (v4).csv")

# set seed for reproducibility
np.random.seed(0)

/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3072: DtypeWarning: Columns (25,51) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Once the data is loaded, check whether it contains missing values, which show up as NaN or None.

# look at the first five rows of the nfl_data file.
# I can see a handful of missing data already!
nfl_data.head()

Sure enough, there are missing values.
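head() only shows five rows, so it is easy to miss gaps further down. The following is a minimal sketch on a made-up toy frame (the column names and values are illustrative, not taken from the NFL file) showing that pandas treats both NaN and None as missing and can report that without eyeballing the table:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) containing one NaN and one None
toy = pd.DataFrame({
    "down": [1.0, np.nan, 3.0],
    "team": ["DEN", None, "KC"],
    "qtr": [1, 1, 2],
})

# isnull() marks both NaN and None as missing
has_missing = toy.isnull().any()          # per column: is anything missing?
any_missing = toy.isnull().values.any()   # whole frame: is anything missing?

print(has_missing)
print(any_missing)
```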

How much data is missing?

Now let's count the missing values across the whole dataset.

# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

Date                0
GameID              0
Drive               0
qtr                 0
down            61154
time              224
TimeUnder           0
TimeSecs          224
PlayTimeDiff      444
SideofField       528
dtype: int64
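Raw counts are hard to compare without knowing how many rows there are. As a rough sketch (toy data with made-up values, not the NFL file), the share of missing values per column can be computed by taking the mean of the boolean mask:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "down": [1.0, np.nan, np.nan, 4.0],
    "time": ["15:00", "14:53", None, "14:16"],
    "qtr": [1, 1, 1, 1],
})

# The mean of a boolean column is the fraction of True values,
# so this is the percentage of missing entries in each column
missing_pct = toy.isnull().mean() * 100

# Show the worst columns first
missing_pct = missing_pct.sort_values(ascending=False)
print(missing_pct)
```

The same two lines should work on nfl_data to rank its columns by how incomplete they are.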

Next, let's see what percentage of the whole dataset is missing.

# how many total missing values do we have?
# np.prod replaces np.product, which was removed in NumPy 2.0
total_cells = np.prod(nfl_data.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells)*100
print(percent_missing)

24.87214126835169

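np.prod (np.product is an older alias that NumPy 2.0 removed) multiplies all the values together.
nfl_data.shape returns the dimensions of the data as a tuple (number of rows, number of columns), so the product is the total number of cells.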
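To make the cell count concrete, here is a small sketch on a 3 x 4 toy frame (made-up data). It also shows that DataFrame.size gives the same number directly:

```python
import numpy as np
import pandas as pd

# A toy frame with 3 rows and 4 columns
toy = pd.DataFrame(np.arange(12).reshape(3, 4))

rows, cols = toy.shape              # shape is a (rows, columns) tuple
total_cells = np.prod(toy.shape)    # 3 * 4

print(toy.shape)     # (3, 4)
print(total_cells)   # 12
print(toy.size)      # 12, the same count without the multiplication
```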

Figure out why the data is missing

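This step is often called data intuition: looking closely at the data and working out why values are missing.

If you are new to this, start by asking yourself the following question: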

Is this value missing because it wasn't recorded or because it doesn't exist?

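If the value does not exist in the first place (for example, the height of the oldest child of someone who has no children), leave it as NaN.

If the value was simply not recorded, try to estimate what it should have been.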

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

Date                0
GameID              0
Drive               0
qtr                 0
down            61154
time              224
TimeUnder           0
TimeSecs          224
PlayTimeDiff      444
SideofField       528
dtype: int64

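Judging from the data, these values look like they were not recorded rather than nonexistent.

So we need a way to estimate what the NA values should be.

However, one field has many missing values because it records which team was penalized.

When there was no penalty, there is no team to record, so those values should stay empty.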
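The distinction above can be sketched in code: fill only the column whose gaps look unrecorded and leave the column whose gaps mean "nothing happened". The column names, the values, and the choice of the median as the fill value are all assumptions made for illustration, not something taken from the NFL file:

```python
import numpy as np
import pandas as pd

# Hypothetical column names and values, for illustration only
plays = pd.DataFrame({
    "seconds_left": [3600.0, np.nan, 3580.0],   # gap: probably not recorded
    "penalized_team": [None, "DEN", None],      # gap: no penalty happened
})

# Passing a dict to fillna() fills only the listed columns.
# Using the column median is an assumption chosen for this sketch.
filled = plays.fillna({"seconds_left": plays["seconds_left"].median()})

print(filled)
```

The penalized_team column keeps its missing entries, because a made-up team name there would be wrong rather than approximate.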

Drop missing values

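If you have no reason to investigate why values are missing, one option is to simply remove the rows or columns that contain them.

If you are sure you want to do that, pandas has a handy function for it: dropna().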

# remove all the rows that contain a missing value
nfl_data.dropna()

Here dropna() removes all of the data, because every row contains at least one missing value.

So instead we drop only the columns that contain missing values.

# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()

# just how much data did we lose?
print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

Columns in original dataset: 102 

Columns with na's dropped: 41

We lost some data (61 of the 102 columns), but there are no missing values left.
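Dropping every column that has even one gap can be too aggressive. As a hedged sketch on toy data (made-up values), dropna() also accepts thresh, which keeps columns with at least that many non-missing values, and subset, which limits the check to chosen columns:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],           # complete
    "b": [1.0, np.nan, 3.0, 4.0],        # 1 of 4 missing
    "c": [np.nan, np.nan, np.nan, 4.0],  # 3 of 4 missing
})

# Keep only the columns that have at least 3 non-missing values
mostly_complete = toy.dropna(axis=1, thresh=3)

# Drop a row only when column "b" is missing in that row
rows_with_b = toy.dropna(subset=["b"])

print(mostly_complete.columns.tolist())
print(rows_with_b.index.tolist())
```

The threshold of 3 is arbitrary here; what counts as "complete enough" depends on the dataset.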

Fill in missing values automatically

First, take a small subset of the data.

# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

We can use pandas' fillna() function to fill in the missing values.

We choose what value replaces NaN; here we fill in 0.

# replace all NA's with 0
subset_nfl_data.fillna(0)
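Two details of fillna() are easy to overlook: it returns a new DataFrame instead of modifying the original, and it accepts a dict so that each column gets its own fill value. A small sketch on toy data (the column names and fill values are made up):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "yards": [5.0, np.nan, 12.0],
    "team": ["DEN", None, "KC"],
})

# One fill value per column; the result is a new DataFrame
filled = toy.fillna({"yards": 0, "team": "unknown"})

print(filled)
print(toy.isnull().sum().sum())   # the original still has its gaps
```

To keep the result, assign it to a name, as done with filled above.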

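We can also replace each missing value with the value that comes directly after it in the same column.

(This makes a lot of sense for datasets whose observations follow some logical order.)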

# replace all NA's with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0
# (bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas)
subset_nfl_data.bfill(axis=0).fillna(0)
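To see why the second fill with 0 is still needed after back-filling, here is a minimal sketch on toy data (made-up values). A gap at the end of a column has no later value to copy, so back-filling leaves it as NaN:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "x": [np.nan, 2.0, np.nan, 4.0],       # every gap has a later value
    "y": [1.0, np.nan, np.nan, np.nan],    # trailing gaps, nothing to copy
})

# Step 1: copy the next valid value upward within each column
back_filled = toy.bfill(axis=0)    # x: 2, 2, 4, 4   y: 1, NaN, NaN, NaN

# Step 2: whatever is still missing becomes 0
result = back_filled.fillna(0)     # y: 1, 0, 0, 0

print(result)
```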