AI ninja project [day 15] Text processing: BERT classification

Reference page:
https://www.tensorflow.org/text/tutorials/classify_text_with_bert?hl=zh_tw

And the Colab:
https://www.tensorflow.org/tutorials/images/segmentation?hl=zh_tw

First, I recommend running this in Colab with a GPU runtime enabled;
on a local machine with only a CPU, training is very slow.
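
Before starting a long training run, it can help to confirm that TensorFlow actually sees a GPU. The following is a minimal sketch; the helper name gpu_available is hypothetical, and the only TensorFlow call it relies on is tf.config.list_physical_devices.

```python
def gpu_available():
    """Return True if TensorFlow can see at least one GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed, so no GPU can be used from it.
        return False
    return len(tf.config.list_physical_devices("GPU")) > 0


if __name__ == "__main__":
    print("GPU available:", gpu_available())
```

In Colab, if this reports False, the runtime type probably still needs to be switched to GPU.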

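The dataset is the IMDB movie review dataset, in which each review is labeled as positive or negative.
The goal is to predict whether a previously unseen review is positive or negative.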

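When using tf.keras.preprocessing.text_dataset_from_directory(), remember to remove any folders that are unrelated to the training and test sets. Every subfolder is treated as a class, so a stray folder leads to incorrect training.
The dataset has two folders, train and test, which separate the training set from the test set;
each of them contains two class folders, pos and neg, holding the positive and negative reviews.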

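Because the text input has to be preprocessed, install the tensorflow-text package (for example with pip install tensorflow-text, using a version that matches your TensorFlow version).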

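Previously we only used the Adam optimizer.
This time we want to use the AdamW optimizer, so we install an extra package for it (the referenced tutorial uses tf-models-official). Switching back to Adam also works.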
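
The choice between the two optimizers can be sketched as follows. This is an illustration under stated assumptions, not the exact code of this post: the helper names are hypothetical, and the initial learning rate of 3e-5 and the 10% warmup fraction are taken from the referenced TensorFlow tutorial. The AdamW path assumes tf-models-official is installed and uses its optimization.create_optimizer function.

```python
def warmup_schedule_steps(steps_per_epoch, epochs, warmup_fraction=0.1):
    """Return (num_train_steps, num_warmup_steps) for the learning-rate schedule."""
    num_train_steps = steps_per_epoch * epochs
    num_warmup_steps = int(warmup_fraction * num_train_steps)
    return num_train_steps, num_warmup_steps


def build_optimizer(steps_per_epoch, epochs, init_lr=3e-5, use_adamw=True):
    """Build AdamW (needs tf-models-official) or fall back to plain Adam."""
    if use_adamw:
        # Imported lazily so the Adam path works without tf-models-official.
        from official.nlp import optimization

        num_train_steps, num_warmup_steps = warmup_schedule_steps(
            steps_per_epoch, epochs
        )
        return optimization.create_optimizer(
            init_lr=init_lr,
            num_train_steps=num_train_steps,
            num_warmup_steps=num_warmup_steps,
            optimizer_type="adamw",
        )

    # Plain Adam with a fixed learning rate (assumed here for simplicity).
    import tensorflow as tf

    return tf.keras.optimizers.Adam(learning_rate=init_lr)
```

For example, with 625 batches per epoch and 5 epochs, warmup_schedule_steps(625, 5) gives 3125 training steps, of which the first 312 are warmup.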

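Import the packages. At minimum this means tensorflow, tensorflow_hub, and tensorflow_text, plus os and shutil for handling the dataset folders.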

Download the dataset and remove the unrelated unsup folder.
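
The download and cleanup step might look like the sketch below. The helper names download_imdb and keep_only_class_dirs are hypothetical. The archive URL is the one used by the referenced TensorFlow tutorial, and the extracted folder layout assumes the TF 2.x behavior of tf.keras.utils.get_file; newer Keras versions may extract to a different location.

```python
import os
import shutil

IMDB_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def download_imdb(cache_dir="."):
    """Download and extract the IMDB archive, returning the dataset folder."""
    # Imported lazily so the cleanup helper below works without TensorFlow.
    import tensorflow as tf

    archive = tf.keras.utils.get_file(
        "aclImdb_v1.tar.gz", IMDB_URL, untar=True,
        cache_dir=cache_dir, cache_subdir="",
    )
    # Assumption: the archive unpacks into an "aclImdb" folder next to it.
    return os.path.join(os.path.dirname(archive), "aclImdb")


def keep_only_class_dirs(split_dir, class_names=("pos", "neg")):
    """Delete every subfolder of split_dir that is not a class folder.

    Plain files are left alone. Returns the names of the removed folders.
    """
    removed = []
    for name in sorted(os.listdir(split_dir)):
        path = os.path.join(split_dir, name)
        if os.path.isdir(path) and name not in class_names:
            shutil.rmtree(path)
            removed.append(name)
    return removed


if __name__ == "__main__":
    dataset_dir = download_imdb()
    print(keep_only_class_dirs(os.path.join(dataset_dir, "train")))
```

Removing by "everything that is not pos or neg" rather than hard-coding unsup is a design choice of this sketch; it also catches other stray folders that would otherwise become extra classes.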

Split the data into training, validation, and test sets, using caching to speed up training.
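
A sketch of the split is shown below. The function names are hypothetical, and the batch size of 32, the 20% validation split, and the seed of 42 are assumptions borrowed from the referenced tutorial. The validation set is carved out of the train folder, while the test folder is used as-is.

```python
import os


def expected_split_sizes(num_files, validation_split):
    """Return (num_train, num_val) for a given validation fraction."""
    num_val = int(validation_split * num_files)
    return num_files - num_val, num_val


def load_splits(dataset_dir, batch_size=32, validation_split=0.2, seed=42):
    """Build cached train, validation, and test datasets from dataset_dir."""
    import tensorflow as tf  # lazy import: only needed when data is loaded

    load = tf.keras.preprocessing.text_dataset_from_directory
    train_dir = os.path.join(dataset_dir, "train")
    test_dir = os.path.join(dataset_dir, "test")

    # The same seed must be used for both subsets so they do not overlap.
    common = dict(batch_size=batch_size,
                  validation_split=validation_split, seed=seed)
    raw_train = load(train_dir, subset="training", **common)
    raw_val = load(train_dir, subset="validation", **common)
    raw_test = load(test_dir, batch_size=batch_size)
    class_names = raw_train.class_names

    # cache() keeps the examples after the first pass over the files;
    # prefetch() overlaps input loading with training.
    autotune = tf.data.AUTOTUNE
    train_ds = raw_train.cache().prefetch(buffer_size=autotune)
    val_ds = raw_val.cache().prefetch(buffer_size=autotune)
    test_ds = raw_test.cache().prefetch(buffer_size=autotune)
    return train_ds, val_ds, test_ds, class_names
```

With the 25,000 labeled training reviews of IMDB and a 20% split, expected_split_sizes(25000, 0.2) gives 20,000 reviews for training and 5,000 for validation.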

Later we will use the tensorflow_hub package to load the preprocessing layer (tfhub_handle_preprocess) and the BERT model (tfhub_handle_encoder).
For now, set the URLs to load:

tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1'
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'

To use a Chinese model instead:

tfhub_handle_encoder = "https://hub.tensorflow.google.cn/tensorflow/bert_zh_L-12_H-768_A-12/4"

Chinese preprocessing:

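import tensorflow_hub as hub
import tensorflow_text as text  # not called directly, but registers the ops the preprocessing model needs

tfhub_handle_preprocess = "https://hub.tensorflow.google.cn/tensorflow/bert_zh_preprocess/3"
# Wrap the TF Hub preprocessing model as a Keras layer
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)

# Sample Chinese sentence: "I come from Chang Gung University"
text_test = ['我来自长庚大学']
text_preprocessed = bert_preprocess_model(text_test)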

print(f'Keys       : {list(text_preprocessed.keys())}')
print(f'Shape      : {text_preprocessed["input_word_ids"].shape}')
print(f'Word Ids   : {text_preprocessed["input_word_ids"][0, :12]}')
print(f'Input Mask : {text_preprocessed["input_mask"][0, :12]}')
print(f'Type Ids   : {text_preprocessed["input_type_ids"][0, :12]}')
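
To make the three printed tensors easier to read, here is a toy pure-Python version of the packing step for a single sentence. It is only an illustration, not the Hub model's actual implementation: the function name pack_single_segment and the token ids in the usage example are hypothetical, the sequence length of 128 is the preprocessing model's default, and the [CLS] and [SEP] ids of 101 and 102 are assumed from the standard BERT vocabularies.

```python
def pack_single_segment(token_ids, seq_length=128,
                        cls_id=101, sep_id=102, pad_id=0):
    """Mimic how one tokenized sentence is packed into fixed-length inputs."""
    # Leave room for [CLS] and [SEP], truncating inputs that are too long.
    body = list(token_ids)[: seq_length - 2]
    word_ids = [cls_id] + body + [sep_id]
    num_real = len(word_ids)
    padding = seq_length - num_real
    return {
        # Token ids, padded with pad_id up to seq_length.
        "input_word_ids": word_ids + [pad_id] * padding,
        # 1 marks a real token, 0 marks padding.
        "input_mask": [1] * num_real + [0] * padding,
        # All zeros because there is only one segment (one sentence).
        "input_type_ids": [0] * seq_length,
    }


if __name__ == "__main__":
    packed = pack_single_segment([11, 12, 13], seq_length=8)
    for key, value in packed.items():
        print(key, value)
```

This is why the printed shape has a fixed second dimension regardless of sentence length, and why the mask switches from 1 to 0 where the sentence ends.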