Day 21 利用transformer自己实作一个翻译程序(三) 文字标签化和去标签化

前言

昨天讲到要怎麽建立环境和下载资料集，今天要来讲文字的处理

文字标签化和去标签化

由於模型没有办法直接训练文字，因此要对文字做一些处理

这些文字要先转换成一些数字，有一个处理方法在这个网站

这边要先下载并且import储存的资料集

model_name = "ted_hrlr_translate_pt_en_converter"
tf.keras.utils.get_file(
    f"{model_name}.zip",
    f"https://storage.googleapis.com/download.tensorflow.org/models/{model_name}.zip",
    cache_dir='.', cache_subdir='', extract=True
)

Downloading data from https://storage.googleapis.com/download.tensorflow.org/models/ted_hrlr_translate_pt_en_converter.zip
188416/184801 [==============================] - 0s 0us/step
196608/184801 [===============================] - 0s 0us/step
'./ted_hrlr_translate_pt_en_converter.zip'

接着要让tensorflow去读取下载的model

tokenizers = tf.saved_model.load(model_name)

透过dir()可以知道tokenizers有哪些method可以使用

[item for item in dir(tokenizers.en) if not item.startswith('_')]

['detokenize',
 'get_reserved_tokens',
 'get_vocab_path',
 'get_vocab_size',
 'lookup',
 'tokenize',
 'tokenizer',
 'vocab']

tokenize就是将string转成ID的方法

for en in en_examples.numpy():
  print(en.decode('utf-8'))

and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .

en_examples是昨天切出来的资料集中的前三笔资料

encoded = tokenizers.en.tokenize(en_examples)

for row in encoded.to_list():
  print(row)

[2, 72, 117, 79, 1259, 1491, 2362, 13, 79, 150, 184, 311, 71, 103, 2308, 74, 2679, 13, 148, 80, 55, 4840, 1434, 2423, 540, 15, 3]
[2, 87, 90, 107, 76, 129, 1852, 30, 3]
[2, 87, 83, 149, 50, 9, 56, 664, 85, 2512, 15, 3]

这边可以看到转换过後的资料会长什麽样子

最前面的2是开头，最後面的3是结尾

detokenize可以把ID转换回文字

round_trip = tokenizers.en.detokenize(encoded)
for line in round_trip.numpy():
  print(line.decode('utf-8'))

and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n ' t test for curiosity .

用lookup可以将token的ID转换成token的文字

tokens = tokenizers.en.lookup(encoded)
tokens

<tf.RaggedTensor [[b'[START]', b'and', b'when', b'you', b'improve', b'search', b'##ability', b',', b'you', b'actually', b'take', b'away', b'the', b'one', b'advantage', b'of', b'print', b',', b'which', b'is', b's', b'##ere', b'##nd', b'##ip', b'##ity', b'.', b'[END]'], [b'[START]', b'but', b'what', b'if', b'it', b'were', b'active', b'?', b'[END]'], [b'[START]', b'but', b'they', b'did', b'n', b"'", b't', b'test', b'for', b'curiosity', b'.', b'[END]']]>

从上面的token可以看到，这个方法会把一些词性相关的subword呈现出来

<<: Flutter基础介绍与实作-Day7 Hello Flutter(1)

>>: Day 8 : 字串处理