[Neural Machine Translation: Theory and Implementation] Building an English-Chinese Translator from Scratch (II)

Introduction

Today we continue implementing the English-to-Chinese translation neural network.

Implementing the Translator

Building the Dataset (continued)

First, import the modules and functions we need:

import pickle as pkl  # needed below to load the saved sentence pairs

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

Load the seq_pairs we built yesterday and split them into English (source language) and Chinese (target language) sentences:

with open("data/eng-cn.pkl", "rb") as f:
    seq_pairs = pkl.load(f)

src_sentences = [pair[0] for pair in seq_pairs]
tgt_sentences = [pair[1] for pair in seq_pairs]

We create a separate tokeniser for each language, English and Chinese:

def create_tokeniser(sentences):
    # create a tokeniser specific to texts; filters = ' ' means no punctuation is stripped
    tokeniser = Tokenizer(filters = ' ')
    tokeniser.fit_on_texts(sentences)
    # convert every sentence to a sequence of integer token ids (done once)
    sequences = tokeniser.texts_to_sequences(sentences)
    # preview the first 3 sentences versus their word tokenised versions
    for sentence, token_ids in zip(sentences[:3], sequences[:3]):
        print("original: {} - word tokenised: {}".format(sentence, token_ids))
    return sequences, tokeniser
    
# word tokenise source and target sentences
src_word_tokenised, src_tokeniser = create_tokeniser(src_sentences)
tgt_word_tokenised, tgt_tokeniser = create_tokeniser(tgt_sentences)
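
To make the integer sequences less opaque, here is a minimal pure-Python sketch of the idea behind a word-level tokeniser: rank words by frequency, number them starting from 1, and replace each word with its number. The helper names build_word_index and texts_to_ids and the toy sentences are made up for illustration. This is not the Keras implementation, and it assumes the words in each sentence are already separated by spaces.

```python
from collections import Counter

def build_word_index(sentences):
    # count every space-separated, lower-cased word
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    # most frequent word gets id 1; id 0 is left unused (reserved for padding)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(sentences, word_index):
    # replace each known word by its integer id; unknown words are skipped
    return [[word_index[w] for w in s.lower().split() if w in word_index]
            for s in sentences]

# hypothetical toy data, not taken from the real dataset
toy_sentences = ["i am here", "i am not there"]
toy_index = build_word_index(toy_sentences)
toy_ids = texts_to_ids(toy_sentences, toy_index)

print(toy_index)  # {'i': 1, 'am': 2, 'here': 3, 'not': 4, 'there': 5}
print(toy_ids)    # [[1, 2, 3], [1, 2, 4, 5]]
```

Each sentence becomes a list of ids whose length equals its number of words, which is why the sequences have different lengths and need padding later.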

Next, collect the vocabulary and the vocabulary size for each language:

# source and target vocabulary dictionaries
src_vocab_dict = src_tokeniser.word_index
tgt_vocab_dict = tgt_tokeniser.word_index

src_vocab_size = len(src_vocab_dict) + 1 # 6819 tokens in total
tgt_vocab_size = len(tgt_vocab_dict) + 1 # 3574 tokens in total
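
The + 1 is there because word_index starts numbering at 1, while id 0 is reserved and later used for padding, so the ids run from 0 up to len(word_index). The short sketch below illustrates this with a made-up vocabulary; toy_word_index, PAD_ID and one_hot are hypothetical names used only for this example.

```python
# hypothetical toy vocabulary in the same shape as tokeniser.word_index
toy_word_index = {"i": 1, "am": 2, "here": 3}
PAD_ID = 0  # reserved id, never assigned to a word

# 3 words plus the reserved id 0
toy_vocab_size = len(toy_word_index) + 1

def one_hot(token_id, size):
    # vector of zeros with a single 1 at position token_id
    vector = [0] * size
    vector[token_id] = 1
    return vector

largest_id = max(toy_word_index.values())
print(one_hot(largest_id, toy_vocab_size))  # [0, 0, 0, 1]
print(one_hot(PAD_ID, toy_vocab_size))      # [1, 0, 0, 0]
```

With a size of only len(toy_word_index), the largest id would fall outside the vector and one_hot would raise an IndexError.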

Then compute the maximum length of the token sequences:

src_max_seq_length = len(max(src_word_tokenised, key = len)) # 38
tgt_max_seq_length = len(max(tgt_word_tokenised, key = len)) # 46

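To build the training feature vectors X and the label vectors y, we first have to zero-pad every sentence that is shorter than the maximum length. For this we use the pad_sequences() function imported earlier. Note that the padding argument must be set to "post" so that the zeros are appended at the end of each sequence; the default, "pre", puts them at the front: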

src_sentences_padded = pad_sequences(src_word_tokenised, maxlen = src_max_seq_length, padding = "post")  # shape: (26388, 38)
tgt_sentences_padded = pad_sequences(tgt_word_tokenised, maxlen = tgt_max_seq_length, padding = "post")  # shape: (26388, 46)

# increase 1 dimension
src_sentences_padded = src_sentences_padded.reshape(*src_sentences_padded.shape, 1) # shape: (26388, 38, 1)
tgt_sentences_padded = tgt_sentences_padded.reshape(*tgt_sentences_padded.shape, 1) # shape: (26388, 46, 1)
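
For intuition about the steps above, here is a small pure-Python sketch of post-padding versus pre-padding, followed by the extra trailing axis. The pad helper and the toy sequences are hypothetical and only imitate the padding behaviour; unlike pad_sequences(), the helper does not truncate sequences longer than maxlen.

```python
def pad(sequences, maxlen, padding="post", pad_id=0):
    # fill each sequence with pad_id up to maxlen (no truncation is handled)
    padded = []
    for seq in sequences:
        filler = [pad_id] * (maxlen - len(seq))
        padded.append(seq + filler if padding == "post" else filler + seq)
    return padded

# hypothetical toy token sequences of different lengths
toy_sequences = [[1, 2, 3], [1, 2, 4, 5], [6]]
toy_max_len = max(len(seq) for seq in toy_sequences)  # 4

post_padded = pad(toy_sequences, toy_max_len, padding="post")
pre_padded = pad(toy_sequences, toy_max_len, padding="pre")
print(post_padded)  # [[1, 2, 3, 0], [1, 2, 4, 5], [6, 0, 0, 0]]
print(pre_padded)   # [[0, 1, 2, 3], [1, 2, 4, 5], [0, 0, 0, 6]]

# add a trailing axis of size 1: (3, 4) becomes (3, 4, 1)
with_axis = [[[token_id] for token_id in seq] for seq in post_padded]
print(with_axis[0])  # [[1], [2], [3], [0]]
```

With "post" padding every sentence starts at position 0 and the zeros only fill the tail, which matches the shapes noted in the comments above.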

That is all for today. Tomorrow we will continue building the dataset. Good night!
