Day 3 [Python ML] Choosing the Data for Modeling (DecisionTree)

Introduction

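We start by picking up where yesterday's data-loading step left off: read the data with pd.read_csv,
then use the DataFrame's columns attribute to see which columns are available.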

import pandas as pd

melbourne_file_path = './Dataset/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

Dropping missing values

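This dataset contains some missing values. Later posts will cover how to handle them properly;
for now we take the simplest approach and drop every row that contains a missing value.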

# Drop every row that contains a missing value
melbourne_data = melbourne_data.dropna(axis=0)

As the figure shows, dropping the missing values reduces the number of rows from 13580 to 6196.
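To check this kind of row count yourself, you can compare the length of the DataFrame before and after dropna. The sketch below uses a small made-up DataFrame (the `toy` data and its values are assumptions for illustration, not rows from melb_data.csv); the same calls should work on melbourne_data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for melb_data.csv (values are made up)
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.nan],
    "YearBuilt": [1900.0, 1900.0, np.nan, 2014.0],
})

# Count the rows before dropping anything
rows_before = len(toy)

# Count missing values per column to see where the gaps are
missing_per_column = toy.isna().sum()

# axis=0 drops rows (not columns) that contain at least one missing value
toy_clean = toy.dropna(axis=0)
rows_after = len(toy_clean)

print(missing_per_column)
print(rows_before, "->", rows_after)
```

Because a row is removed as soon as any single column is missing, a few sparse columns can discard a large share of the data, which is why more careful handling is left for later.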

Choosing the prediction target

# Store the Price column in the variable y
y = melbourne_data.Price
# Choose the features we need
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
# Store the selected features in the variable X
X = melbourne_data[melbourne_features]
# The data is numeric; describe() summarizes each column as count, mean, std, min, ...
X.describe()

# head() shows the first few rows of the data
X.head()
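The two selection styles used above return different kinds of objects: a single column gives a Series, while a list of column names gives a DataFrame. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration)
toy = pd.DataFrame({
    "Rooms": [2, 3],
    "Bathroom": [1.0, 2.0],
    "Price": [1035000.0, 1465000.0],
})

# Attribute access selects one column; it is equivalent to toy["Price"]
y_toy = toy.Price

# Indexing with a list of names selects several columns at once
X_toy = toy[["Rooms", "Bathroom"]]

print(type(y_toy).__name__, y_toy.shape)
print(type(X_toy).__name__, X_toy.shape)
```

The target y is one-dimensional and the feature table X is two-dimensional, which is the shape scikit-learn's fit(X, y) expects.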

Building the model

# Import the DecisionTree model
from sklearn.tree import DecisionTreeRegressor

# Define the model; set random_state so that every run produces the same result
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit the model
melbourne_model.fit(X, y)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=1, splitter='best')

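Many ML models allow some randomness during training. Passing a fixed number to random_state ensures that every training run gives the same result. Which number you pick does not meaningfully affect the quality of the trained model.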
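One way to see the effect of random_state is to train twice with the same seed and compare the predictions. This is a minimal sketch on synthetic data, not the Melbourne dataset; the names `X_demo`, `y_demo`, `X_new` and `fit_and_predict` are hypothetical, and max_features=2 is set only so that the tree actually draws random feature subsets at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data, not the Melbourne dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(size=200)
X_new = rng.normal(size=(10, 5))  # unseen points to predict on


def fit_and_predict(seed):
    """Train a fresh tree with the given seed and predict on X_new."""
    # max_features=2 makes each split consider a random subset of features,
    # so the seed has a visible role in training
    model = DecisionTreeRegressor(random_state=seed, max_features=2)
    model.fit(X_demo, y_demo)
    return model.predict(X_new)


same_seed_a = fit_and_predict(1)
same_seed_b = fit_and_predict(1)

# Two runs with the same seed give identical predictions
print(np.array_equal(same_seed_a, same_seed_b))
```

A different seed may produce a slightly different tree, but the choice of seed is not something to tune; it only makes the run repeatable.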

# Print the results
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]