Day 28 [Python ML, Data Cleaning] Handling Character Encodings

Get our environment set up

# modules we'll use
import pandas as pd
import numpy as np

# helpful character encoding module
import chardet

# set seed for reproducibility
np.random.seed(0)

What are encodings?

Character encodings are the rules for mapping raw binary byte strings (such as "0110100001101001") to the characters that make up human-readable text (such as "hi").
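As a small illustration (a sketch added here, not part of the original notebook), the two bytes below are the bit patterns for "h" and "i", and decoding them as ASCII gives back the text. The variable names are only for illustration.

```python
# the bit patterns 01101000 and 01101001 are the bytes 0x68 and 0x69
raw = bytes([0b01101000, 0b01101001])

# decoding maps each byte to a character according to the chosen encoding
text = raw.decode("ascii")
print(text)
```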

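If you try to read text with an encoding different from the one it was originally written in, you get scrambled text called mojibake, for example: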

æ–‡å—化ã??
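Mojibake like this is easy to reproduce. The sketch below assumes a simple case: a single "é" encoded as UTF-8 and then wrongly read as Latin-1 (chosen here only because it accepts every byte value). It also shows that the damage is reversible as long as no bytes were lost.

```python
# "é" is stored as two bytes in UTF-8
utf8_bytes = "é".encode("utf-8")

# reading those bytes as Latin-1 turns each byte into its own character
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# undoing the wrong step recovers the original text
recovered = garbled.encode("latin-1").decode("utf-8")
print(recovered)
```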

When a particular byte has no mapping in the encoding used to read it, you get "unknown" characters, which are printed like this:

����������
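A minimal way to see where these replacement characters come from, assuming we decode with errors="replace" (the byte string here is made up for illustration):

```python
# 0xff can never appear in valid UTF-8
bad_bytes = b"caf\xff"

# errors="replace" substitutes U+FFFD (the replacement character)
# for any byte that cannot be decoded
shown = bad_bytes.decode("utf-8", errors="replace")
print(shown)
```

Unlike the mojibake case, this substitution is lossy: once a byte has been replaced with U+FFFD, the original byte value is gone.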

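Character encoding mismatches are less common today than they used to be, but they are still a problem. There are many different character encodings, and the main one you need to know is UTF-8.

UTF-8 is the standard text encoding. Python 3 source code is UTF-8 by default and, ideally, all your data should be too. It's when data isn't in UTF-8 that you run into trouble.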

# start with a string
before = "This is the euro symbol: €"

# check to see what datatype it is
type(before)
str

We can convert a str to bytes by encoding it.

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("utf-8", errors="replace")

# check the type
type(after)
bytes

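If you look at the contents of after, you'll see a b in front of it.

That's because Python displays bytes as if they were ASCII characters. Here you can see that the euro symbol has been replaced with something that looks like mojibake, "\xe2\x82\xac": these are the three bytes that encode € in UTF-8.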

# take a look at what the bytes look like
after
b'This is the euro symbol: \xe2\x82\xac'
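To make the b'...' display a little less mysterious, this short sketch (an addition, using the same example string) counts the bytes: ASCII characters take one byte each in UTF-8, while the euro sign takes three.

```python
text = "This is the euro symbol: €"
encoded = text.encode("utf-8")

# the euro sign alone is three bytes in UTF-8
euro_bytes = "€".encode("utf-8")
print(len(euro_bytes))
print([hex(b) for b in euro_bytes])

# so the encoded form is two bytes longer than the number of characters
print(len(text), len(encoded))
```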

But when we decode the bytes as UTF-8 again, the text comes back correctly.

# convert it back to utf-8
print(after.decode("utf-8"))
This is the euro symbol: €

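However, if we try to decode the bytes with a different encoding, such as ASCII, we get an error.

You can think of different encodings as different ways of recording music. You can record the same music on a CD or on a cassette tape. The music sounds more or less the same, but each format needs the right equipment to play it. The correct decoder is like using a CD player for a CD; a cassette player cannot play a CD.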

# try to decode our bytes with the ascii encoding
print(after.decode("ascii"))
UnicodeDecodeError                        Traceback (most recent call last)

<ipython-input-6-50fd8662e3ae> in <module>
      1 # try to decode our bytes with the ascii encoding
----> 2 print(after.decode("ascii"))


UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)
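If you'd rather handle the failure than let it stop the notebook, one option is to catch the exception and inspect it. This is a sketch, not part of the original notebook; start, object, and reason are standard attributes of UnicodeDecodeError, and the other names are illustrative.

```python
encoded = "This is the euro symbol: €".encode("utf-8")

try:
    encoded.decode("ascii")
except UnicodeDecodeError as err:
    # err.start is the index of the first byte that could not be decoded
    bad_position = err.start
    bad_byte = err.object[err.start]
    print(f"byte {bad_byte:#x} at position {bad_position}: {err.reason}")
```

The position and byte value match the traceback above: 0xe2 is the first of the three bytes that encode the euro sign.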

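We can also run into trouble if we encode a string with the wrong encoding. If we encode to ASCII, any character that ASCII cannot represent is lost, and decoding the ASCII bytes afterward does not bring it back.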

# start with a string
before = "This is the euro symbol: €"

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors = "replace")

# decode the ASCII bytes back into a string
print(after.decode("ascii"))

# We've lost the original underlying byte string! It's been 
# replaced with the underlying byte string for the unknown character :(
This is the euro symbol: ?
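For comparison, here is a sketch of two alternatives to errors="replace" that do not silently lose information. Both error handlers ("strict", which is the default, and "backslashreplace") are part of Python's standard codecs machinery; the variable names are illustrative.

```python
text = "This is the euro symbol: €"

# the default, errors="strict", refuses to encode rather than lose data
try:
    text.encode("ascii")
    lossless = True
except UnicodeEncodeError:
    lossless = False
print(lossless)

# "backslashreplace" keeps the information as an escape sequence
escaped = text.encode("ascii", errors="backslashreplace")
print(escaped)
```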

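This is bad, and we want to avoid doing it. In Python, keep your text in UTF-8 whenever possible.

The best time to make sure of this is when you read in files, as described below.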

Reading in files with encoding problems

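Most files you encounter will probably be encoded in UTF-8. This is what Python expects by default, so most of the time you won't run into problems. Sometimes, however, you'll get an error like this: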

# try to read in a file not in UTF-8
kickstarter_2016 = pd.read_csv("./ks-projects-201612.csv")
UnicodeDecodeError                        Traceback (most recent call last)

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()


pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()


pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()


pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()


UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 11: invalid start byte


During handling of the above exception, another exception occurred:


UnicodeDecodeError                        Traceback (most recent call last)

<ipython-input-8-a0f34aff1a4b> in <module>
      1 # try to read in a file not in UTF-8
----> 2 kickstarter_2016 = pd.read_csv("./ks-projects-201612.csv")


/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    700                     skip_blank_lines=skip_blank_lines)
    701 
--> 702         return _read(filepath_or_buffer, kwds)
    703 
    704     parser_f.__name__ = name


/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    433 
    434     try:
--> 435         data = parser.read(nrows)
    436     finally:
    437         parser.close()


/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1137     def read(self, nrows=None):
   1138         nrows = _validate_integer('nrows', nrows)
-> 1139         ret = self._engine.read(nrows)
   1140 
   1141         # May alter columns / col_dict


/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1993     def read(self, nrows=None):
   1994         try:
-> 1995             data = self._reader.read(nrows)
   1996         except StopIteration:
   1997             if self._first_chunk:


pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()


pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()


pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()


pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()


pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()


pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()


pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()


pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()


UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 11: invalid start byte

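This is the same kind of UnicodeDecodeError we got when we tried to decode UTF-8 bytes as ASCII. It tells us that the file is not actually UTF-8: the byte 0x99 cannot start a UTF-8 character, so the file must use some other encoding.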
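To see why a byte like 0x99 trips up the UTF-8 decoder, here is a small sketch. Windows-1252 is used purely as an example of a single-byte encoding in which 0x99 is a valid character; nothing at this point proves the file uses it.

```python
# 0x99 has the bit pattern of a continuation byte,
# so UTF-8 rejects it at the start of a character
try:
    b"\x99".decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print(err.reason)

# in Windows-1252 the same byte is the trademark sign
as_cp1252 = b"\x99".decode("cp1252")
print(as_cp1252)
```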

We can use chardet to guess the encoding from the beginning of the file, without having to read the whole file.

# look at the first ten thousand bytes to guess the character encoding
with open("./ks-projects-201612.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

Based on the first 10,000 bytes, chardet is 73% confident that the encoding is Windows-1252, so let's try reading the file with that encoding.
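A 73% confidence is still a guess. One rough way to sanity-check it using only the standard library is to try decoding a sample strictly under each candidate encoding. The helper below, first_working_encoding, is hypothetical (it is not part of chardet or pandas), and the sample bytes are made up.

```python
def first_working_encoding(sample, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes `sample` without errors, else None."""
    for name in candidates:
        try:
            sample.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None

# a made-up sample: text containing the trademark sign, encoded as Windows-1252
sample = "Kickstarter™ project".encode("cp1252")
print(first_working_encoding(sample))
```

A successful decode does not prove the guess is right, since single-byte encodings accept almost any byte string, so it is still worth looking at the decoded text. Also note that a sample cut off in the middle of a multi-byte character can fail under UTF-8 even when the file really is UTF-8.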

# read in the file with the encoding detected by chardet
kickstarter_2016 = pd.read_csv("./ks-projects-201612.csv", encoding='Windows-1252')

# look at the first few lines
kickstarter_2016.head()
/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3072: DtypeWarning: Columns (13,14,15) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Saving your files with UTF-8

Now that we've gone to the trouble of getting the data into UTF-8, we'll save it as a CSV file.

# save our file (will be saved as UTF-8 by default!)
kickstarter_2016.to_csv("ks-projects-201801-utf8.csv")
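The same idea applies when writing files without pandas. Python's built-in open() can fall back to a platform-dependent default encoding, so it is safer to pass encoding="utf-8" explicitly. The sketch below uses a hypothetical file name and made-up contents, and reads the raw bytes back to confirm they are UTF-8.

```python
import os
import tempfile

# hypothetical file name, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "example-utf8.csv")

# being explicit avoids depending on the platform's default text encoding
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("name,price\nwidget,€5\n")

# read the raw bytes back and confirm they decode as UTF-8
with open(path, "rb") as f:
    raw = f.read()
print(raw.decode("utf-8"))
```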
