Day 14 [Python ML, Pandas] Indexing, Selecting, and Assigning

Introduction

To make the data easier to work with, this post covers how to slice and select parts of it.

import pandas as pd
reviews = pd.read_csv("./winemag-data-130k-v2.csv", index_col=0)
pd.set_option('display.max_rows', 5)

reviews

Native accessors

pandas provides a simple way to pull out a specific column.

To get the country data, all we need is reviews.country:

reviews.country

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

The same column can also be retrieved another way, using square brackets:

reviews['country']

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

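If a column name contains a space, the attribute syntax above no longer works: reviews.country providence is not valid Python.

In that case, use reviews['country providence'] to get the data.

Once a column has been retrieved, a second [] picks out a single value from it: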

reviews['country'][0]

'Italy'
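
To make the difference between the two accessors concrete, here is a minimal sketch. It does not read the wine CSV; `toy` is a hypothetical three-row stand-in, and the `country providence` column exists only to illustrate a name with a space in it.

```python
import pandas as pd

# Hypothetical miniature stand-in for the reviews table (not the real data).
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "country providence": ["Sicily & Sardinia", "Douro", "Oregon"],
})

by_attribute = toy.country      # attribute access: only for valid identifiers
by_brackets = toy["country"]    # bracket access: works for any column name

# A column name with a space can only be reached with brackets.
with_space = toy["country providence"]

# A second [] looks up the row labelled 0 inside the column.
first_country = toy["country"][0]
print(first_country)
```

Both accessors return the same Series, so the bracket form is the safer habit when column names are not under your control.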

Indexing in pandas

pandas has its own accessors for reading data: loc and iloc.

Index-based selection

Index-based selection picks data by its numerical position.

To select a single row, use iloc:

reviews.iloc[0]

country                                                    Italy
description    Aromas include tropical fruit, broom, brimston...
                                     ...                        
variety                                              White Blend
winery                                                   Nicosia
Name: 0, Length: 13, dtype: object

Both loc and iloc are row-first, column-second.

To get the first column, use the following:

reviews.iloc[:, 0]

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

In Python, the : operator on its own means "everything".

To get the first 3 values, use the following:

reviews.iloc[:3, 0]

0       Italy
1    Portugal
2          US
Name: country, dtype: object

To select only the entries at positions 1 and 2:

reviews.iloc[1:3, 0]

1    Portugal
2          US
Name: country, dtype: object

A list can also be passed as the first argument:

reviews.iloc[[0, 1, 2], 0]

0       Italy
1    Portugal
2          US
Name: country, dtype: object

To get the last 5 rows:

reviews.iloc[-5:]
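
The iloc patterns in this section can be tried without the CSV. The sketch below uses a hypothetical five-row `toy` table (the values are made up for illustration) and repeats each pattern once.

```python
import pandas as pd

# Hypothetical five-row table standing in for reviews.
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "US", "France"],
    "points": [87, 87, 87, 87, 90],
})

first_row = toy.iloc[0]           # one row, returned as a Series
first_col = toy.iloc[:, 0]        # every row of the first column
head3 = toy.iloc[:3, 0]           # positions 0, 1, 2
middle = toy.iloc[1:3, 0]         # positions 1 and 2; the end (3) is excluded
picked = toy.iloc[[0, 1, 2], 0]   # the same rows as head3, given as a list
tail2 = toy.iloc[-2:]             # negative positions count from the end
```

Every argument here is a position, so the results stay the same no matter what labels the index carries.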

Label-based selection

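Label-based selection picks data by its index label rather than by its position.

It works like iloc, except that the values you pass in are labels.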

reviews.loc[0, 'country']

'Italy'

The call above gets the country column of the row labelled 0.
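
With the default integer index, labels and positions happen to coincide, which hides the difference between loc and iloc. The sketch below uses a hypothetical `toy` table whose index is made of strings, so the two can be told apart.

```python
import pandas as pd

# Hypothetical table with string labels instead of the default 0, 1, 2.
toy = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US"], "points": [87, 88, 90]},
    index=["a", "b", "c"],
)

one_value = toy.loc["a", "country"]   # row label, then column label
by_position = toy.iloc[0, 0]          # the same cell, addressed by position

# There is no row labelled 0 here, so loc cannot find it.
try:
    toy.loc[0, "country"]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(one_value, by_position, label_zero_found)
```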

reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

Specific columns can also be retrieved with the syntax below.

Plain [] indexing is column-first, row-second.

With iloc and loc, it is row-first, column-second.

reviews[['taster_name', 'taster_twitter_handle', 'points']]

Choosing between `loc` and `iloc`

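The two differ slightly in a few places.

On a DataFrame with the default integer index, df.iloc[0:1000] returns 1,000 rows, because iloc excludes the end of the range.

df.loc[0:1000] returns 1,001 rows, because loc includes the end of the range.

Which of the two to use depends on the situation.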
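
The row counts are easy to check. This is a small sketch on a hypothetical 2,000-row `df` with the default integer index; the `value` column is only filler.

```python
import pandas as pd

# Hypothetical frame with the default index 0..1999.
df = pd.DataFrame({"value": range(2000)})

by_position = df.iloc[0:1000]    # positions 0..999, end excluded
by_label = df.loc[0:1000]        # labels 0..1000, end included
same_as_iloc = df.loc[0:999]     # loc needs 999 to match iloc[0:1000]

print(len(by_position), len(by_label), len(same_as_iloc))
```

The inclusive end is what makes label slices such as df.loc['Apples':'Potatoes'] behave as most people expect, since there is no obvious "label after Potatoes" to write.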

Manipulating the index

set_index replaces the index with a column that suits the data better:

reviews.set_index("title")
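
One detail worth seeing in code: set_index returns a new DataFrame and leaves the original alone unless you assign the result. A minimal sketch, with a hypothetical two-row `toy` table and made-up titles:

```python
import pandas as pd

# Hypothetical table; the titles are invented for illustration.
toy = pd.DataFrame({
    "title": ["Wine A 2013", "Wine B 2011"],
    "points": [87, 90],
})

by_title = toy.set_index("title")   # new DataFrame; toy itself is unchanged

# The title is now the row label, so loc can look a row up by it.
points_b = by_title.loc["Wine B 2011", "points"]
print(points_b)
```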

Conditional selection

A comparison tells us, row by row, whether country is Italy:

reviews.country == 'Italy'

0          True
1         False
          ...  
129969    False
129970    False
Name: country, Length: 129971, dtype: bool

The rows whose country is Italy can then be pulled out:

reviews.loc[reviews.country == 'Italy']

If that is still too much data, the selection can be narrowed further.

Conditions can be combined inside loc with &:

reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

To get the rows where country is Italy or points is at least 90, use | instead:

reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
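
The & and | filters are easier to follow when each mask gets a name. The sketch below uses a hypothetical four-row `toy` table with made-up scores.

```python
import pandas as pd

# Hypothetical table with invented scores.
toy = pd.DataFrame({
    "country": ["Italy", "Italy", "US", "France"],
    "points": [87, 92, 95, 85],
})

is_italy = toy.country == "Italy"   # boolean Series, one value per row
is_high = toy.points >= 90

both = toy.loc[is_italy & is_high]     # Italian AND at least 90 points
either = toy.loc[is_italy | is_high]   # Italian OR at least 90 points

print(list(both.index), list(either.index))
```

When the comparisons are written inline, each one needs its own parentheses, because & and | bind more tightly than == and >= in Python.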

The isin method selects rows whose value appears in a given list:

reviews.loc[reviews.country.isin(['Italy', 'France'])]

The notnull method finds the rows whose value is not NaN:

reviews.loc[reviews.price.notnull()]
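
Both helpers return a boolean Series, just like a comparison does. A minimal sketch on a hypothetical `toy` table, where one price is deliberately missing:

```python
import pandas as pd

# Hypothetical table; the first price is missing on purpose.
toy = pd.DataFrame({
    "country": ["Italy", "US", "France", "Portugal"],
    "price": [float("nan"), 14.0, 24.0, 15.0],
})

# Rows whose country is one of the listed values.
italy_or_france = toy.loc[toy.country.isin(["Italy", "France"])]

# Rows that actually have a price.
priced = toy.loc[toy.price.notnull()]

print(list(italy_or_france.index), list(priced.index))
```

isnull is the mirror image of notnull and selects the rows where the value is missing.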

Assigning data

Assigning values in pandas is straightforward. A constant is copied to every row:

reviews['critic'] = 'everyone'
reviews['critic']

0         everyone
1         everyone
            ...   
129969    everyone
129970    everyone
Name: critic, Length: 129971, dtype: object

Or assign a sequence of values, one per row:

reviews['index_backwards'] = range(len(reviews), 0, -1)
reviews['index_backwards']

0         129971
1         129970
           ...  
129969         2
129970         1
Name: index_backwards, Length: 129971, dtype: int64
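
The two kinds of assignment can be reproduced on a small scale. In this sketch `toy` is a hypothetical three-row table; the column names match the ones used in this section.

```python
import pandas as pd

# Hypothetical three-row table.
toy = pd.DataFrame({"country": ["Italy", "Portugal", "US"]})

# A single constant is broadcast to every row.
toy["critic"] = "everyone"

# A sequence must supply exactly one value per row.
toy["index_backwards"] = range(len(toy), 0, -1)

print(toy)
```

If the sequence has a different length from the DataFrame, pandas raises a ValueError instead of guessing how to fill the column.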

tags: `IT Ironman Contest` `Learning Machine Learning with Python`
