Day 20: Building a translation program yourself with a transformer (2): setting up the environment and downloading the dataset

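Introduction

I'll start by implementing a Portuguese-to-English translation model. Once I've settled on which Chinese-to-English dataset works best, I'll write a separate tutorial for that.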

Setting up the environment

!pip install tensorflow_datasets
!pip install -U tensorflow-text

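This program needs tensorflow_datasets installed and tensorflow-text upgraded to the latest version (-U is short for --upgrade; it also installs the package if it is not there yet). The leading ! runs the command in a shell from inside the notebook.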

import collections
import logging
import os
import pathlib
import re
import string
import sys
import time

import numpy as np
import matplotlib.pyplot as plt

import tensorflow_datasets as tfds
import tensorflow_text as text
import tensorflow as tf

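Import these packages. If an import fails, the error tells you which package was not installed correctly. Day 16, "Preparing to implement self-attention (2): setting up the tensorflow and keras environment", covers how to install tensorflow and check its version.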
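If you would rather check everything in one pass before running the imports, here is a minimal sketch using only the standard library. The helper name find_missing is made up for this example, and it only checks whether each top-level package can be found; it does not prove the package imports cleanly.

```python
import importlib.util


def find_missing(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        # find_spec returns None when no installed module has this top-level name
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Top-level names of the third-party packages imported above
required = ["numpy", "matplotlib", "tensorflow_datasets", "tensorflow_text", "tensorflow"]
print("Missing packages:", find_missing(required))
```

An empty list means every package was found; anything listed still needs a pip install.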

logging.getLogger('tensorflow').setLevel(logging.ERROR)  # suppress warnings

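As the documentation explains, logging.getLogger('tensorflow') returns the logger that the tensorflow package writes to, and setLevel sets a threshold: only messages at or above that level are shown.

So setLevel(logging.ERROR) means only messages at ERROR level or higher (ERROR and CRITICAL) are shown; lower-level logs such as warnings and info messages are hidden.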
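To see the threshold in action without TensorFlow, here is a small sketch using only the standard logging module. The logger name demo_library and the ListHandler class are made up for this example; the point is just that messages below ERROR are dropped.

```python
import logging


class ListHandler(logging.Handler):
    """Stores every message it receives so we can inspect what got through."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


demo_logger = logging.getLogger("demo_library")  # hypothetical logger name
demo_logger.propagate = False                    # keep the demo away from the root logger
handler = ListHandler()
demo_logger.addHandler(handler)

# Same call as in the article, applied to the demo logger
demo_logger.setLevel(logging.ERROR)

demo_logger.info("loading data")          # below ERROR: dropped
demo_logger.warning("deprecated option")  # below ERROR: dropped
demo_logger.error("file not found")       # ERROR: kept
demo_logger.critical("out of memory")     # above ERROR: kept

print(handler.messages)
# ['ERROR: file not found', 'CRITICAL: out of memory']
```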

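Downloading the dataset

Use TensorFlow Datasets to download the Portuguese-to-English translation dataset.

This dataset has roughly 50,000 training examples, 1,100 validation examples, and 2,000 test examples.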

examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

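with_info=True makes tfds.load return both the dataset and a DatasetInfo object, which is why the result is unpacked into examples and metadata.
as_supervised=True makes each example come back already split into an (input, label) pair, here (Portuguese sentence, English sentence).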

for pt_examples, en_examples in train_examples.batch(3).take(1):
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))

  print()

  for en in en_examples.numpy():
    print(en.decode('utf-8'))

e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .
mas e se estes fatores fossem ativos ?
mas eles não tinham a curiosidade de me testar .

and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .

These lines print three examples from the dataset: first the three Portuguese sentences, then their English translations.
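The .decode('utf-8') call is needed because the strings come back from .numpy() as raw bytes. Here is a pure-Python analogy of that loop, with no TensorFlow involved. The pairs list and the first_batch helper are stand-ins invented for this sketch, not part of tfds, and the batch size is 2 here (the article uses 3) so you can see that only the first batch is taken.

```python
from itertools import islice

# Stand-in for the dataset: (Portuguese, English) pairs stored as UTF-8 bytes,
# reusing the sentences printed above
pairs = [
    ("mas e se estes fatores fossem ativos ?".encode("utf-8"),
     "but what if it were active ?".encode("utf-8")),
    ("mas eles não tinham a curiosidade de me testar .".encode("utf-8"),
     "but they did n't test for curiosity .".encode("utf-8")),
    ("e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .".encode("utf-8"),
     "and when you improve searchability , you actually take away the one advantage of print , which is serendipity .".encode("utf-8")),
]


def first_batch(dataset, batch_size):
    """Rough analogy of dataset.batch(batch_size).take(1): only the first batch."""
    batch = list(islice(dataset, batch_size))
    pt_batch = [pt for pt, _ in batch]
    en_batch = [en for _, en in batch]
    return pt_batch, en_batch


pt_examples, en_examples = first_batch(pairs, 2)

print(pt_examples[1])  # raw bytes: non-ASCII characters appear as escape sequences

for pt in pt_examples:
    print(pt.decode("utf-8"))  # decoded back into readable text

print()

for en in en_examples:
    print(en.decode("utf-8"))
```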
