NLP - Romanized Taiwanese: Word Embeddings Using Keras

In word embeddings, every word is represented by an n-dimensional dense vector; words with similar meanings have similar vectors.
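To make "similar vectors" concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three 3-dimensional vectors are invented for illustration only; they are not the output of any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for illustration only.
vectors = {
    "tsán": [0.9, 0.8, 0.1],          # "great"
    "tshut-sik": [0.8, 0.9, 0.2],     # "excellent"
    "khióng-pòo": [-0.7, -0.9, 0.1],  # "horrible"
}

sim_close = cosine_similarity(vectors["tsán"], vectors["tshut-sik"])
sim_far = cosine_similarity(vectors["tsán"], vectors["khióng-pòo"])
print(sim_close > sim_far)
```

With these made-up values the two positive words point in nearly the same direction, while the negative word points the opposite way.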

To build word embeddings, we can use Embedding() from the Keras library. Embedding() can either learn custom word embeddings or load pretrained ones.

embedding_layer = Embedding(200, 32, input_length=50)

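First argument: the vocabulary size, i.e. the number of unique words in the text.
Second argument: the dimension of each word vector.
Third argument (input_length): the length of each input sentence.
For each input sentence, Embedding() produces a 2D array: each row corresponds to a word, and the columns hold that word's vector components.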
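The three arguments can be pictured with a plain lookup table. This is only a sketch of the idea, not how Keras implements the layer; the names table and embed are hypothetical, and the small random initial values are an assumption.

```python
import random

random.seed(0)

vocab_size = 200    # first argument: number of distinct integer indices
embedding_dim = 32  # second argument: length of each word vector
input_length = 50   # input_length: number of indices per sentence

# One row of `embedding_dim` numbers per vocabulary index.
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

def embed(sentence_indices):
    """Look up one row of the table for every index in the sentence."""
    return [table[i] for i in sentence_indices]

sentence = [random.randrange(vocab_size) for _ in range(input_length)]
output = embed(sentence)
print(len(output), len(output[0]))  # rows = words, columns = dimensions
```

Training the layer amounts to adjusting the numbers in the table; the lookup itself stays the same.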

In this post we learn custom word embeddings rather than loading pretrained ones.

First, import the required libraries:

from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding  # public import path; keras.layers.embeddings is missing in newer releases

We will use the following dataset:

corpus = [
    # Positive Reviews

    'tse sī tsi̍t tshut tshut-sik ê tiān-iánn',   # This is an excellent movie
    'tse tiān-iánn tsiok tsán guá kah-ì',        # The movie was fantastic, I like it

    # Negative Reviews
    'khióng-pòo ê tshut-ián',    # horrible acting
    'guá bô kah-ì tse tiān-iánn',  # I did not like the movie
    'tse tiān-iánn si̍t-tsāi-sī khióng-pòo', # The movie was horrible
   
]

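Remove the '-' from the words. If it is left in, the later preprocessing splits each hyphenated word into two words, because the default filters of Keras's one_hot treat '-' as a separator.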

import re
from nltk.tokenize import word_tokenize
# Remove hyphens so that each hyphenated word stays a single token
def remove_re(corpus):
    results = []
    for text in corpus:
        text = re.sub(r'-', "", text)
        results.append(text)
    return results
corpus = remove_re(corpus)
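The effect of the hyphen can be seen with a toy tokenizer. simple_tokens below is a hypothetical stand-in that splits on whitespace and hyphens; it only imitates the splitting behaviour and is not the Keras tokenizer.

```python
import re

def simple_tokens(text, separators=r"[-\s]+"):
    """Toy tokenizer: split on whitespace and, by default, on hyphens too."""
    return [tok for tok in re.split(separators, text) if tok]

sentence = 'tse sī tsi̍t tshut tshut-sik ê tiān-iánn'

with_hyphens = simple_tokens(sentence)
without_hyphens = simple_tokens(re.sub(r'-', '', sentence))

print(len(with_hyphens), with_hyphens)
print(len(without_hyphens), without_hyphens)
```

With the hyphens left in, tshut-sik and tiān-iánn each fall apart into two tokens; after removal, the sentence keeps its seven words.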

Count the number of unique words in the corpus.

all_words = []
for sent in corpus:
    tokenize_word = word_tokenize(sent)
    for word in tokenize_word:
        all_words.append(word)
       
unique_words = set(all_words)
print(len(unique_words))
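word_tokenize needs NLTK's tokenizer data to be downloaded first. Because this corpus is separated by whitespace and contains no punctuation, a plain str.split() should give the same count; the sketch below assumes exactly that and repeats the corpus so it runs on its own.

```python
corpus = [
    'tse sī tsi̍t tshut tshut-sik ê tiān-iánn',
    'tse tiān-iánn tsiok tsán guá kah-ì',
    'khióng-pòo ê tshut-ián',
    'guá bô kah-ì tse tiān-iánn',
    'tse tiān-iánn si̍t-tsāi-sī khióng-pòo',
]

# Same hyphen removal as remove_re(), written with str.replace.
cleaned = [text.replace('-', '') for text in corpus]

# Whitespace tokenization is enough here because the corpus has no punctuation.
unique_words = {word for text in cleaned for word in text.split()}
print(len(unique_words))
```

The count is a lower bound for the vocabulary size passed to Embedding(); choosing a larger value leaves room for the hashing step that follows.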

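Next, the words must be converted to integers before Embedding() can read them. We use the one_hot function from keras.preprocessing.text, which maps each word to an integer between 1 and vocab_length - 1.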

vocab_length = 30
embedded_sentences = [one_hot(sent, vocab_length) for sent in corpus]
print(embedded_sentences)

[[3, 25, 29, 15, 24, 1, 23], [3, 23, 22, 28, 17, 6], [7, 1, 20], [17, 23, 6, 3, 23], [3, 23, 2, 7]]
We can see that the first sentence has 7 words, so the first list contains 7 integers.
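Despite its name, one_hot does not build one-hot vectors: it hashes each word to an integer, so two different words can land on the same index. In the output above, the fourth sentence [17, 23, 6, 3, 23] gives 23 to both bô and tiān-iánn. The sketch below shows the idea with a hypothetical helper, stable_one_hot. It uses MD5 so that results are repeatable between runs, which is my assumption; Keras uses Python's built-in hash by default, so the indices will not match the ones printed above.

```python
import hashlib

def stable_one_hot(text, n):
    """Map each whitespace-separated word to an integer in [1, n - 1]."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode('utf-8')).hexdigest()
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

vocab_length = 30
sentences = ['guá bô kahì tse tiāniánn', 'tse tiāniánn tsiok tsán guá kahì']
encoded = [stable_one_hot(s, vocab_length) for s in sentences]
print(encoded)
```

Index 0 is never produced, which leaves it free for padding. A larger vocab_length makes collisions less likely but does not rule them out.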

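Next, we fix the sentence length and fill the empty positions with 0, so that all sentences have the same length and can be read by Embedding(). We use the pad_sequences() function.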

padded_sentences = pad_sequences(embedded_sentences, maxlen=12, padding='post') 
print(padded_sentences)

[[ 3 25 29 15 24 1 23 0 0 0 0 0]
[ 3 23 22 28 17 6 0 0 0 0 0 0]
[ 7 1 20 0 0 0 0 0 0 0 0 0]
[17 23 6 3 23 0 0 0 0 0 0 0]
[ 3 23 2 7 0 0 0 0 0 0 0 0]]
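What padding='post' does can be sketched in a few lines of plain Python. pad_post is a hypothetical helper, not part of Keras; for sequences longer than maxlen it keeps the last maxlen items, which mirrors the Keras default truncating='pre'.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # too long: keep the last `maxlen` items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

embedded_sentences = [[3, 25, 29, 15, 24, 1, 23], [3, 23, 22, 28, 17, 6],
                      [7, 1, 20], [17, 23, 6, 3, 23], [3, 23, 2, 7]]

for row in pad_post(embedded_sentences, maxlen=12):
    print(row)
```

The real pad_sequences() returns a NumPy array rather than a list of lists.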

Now we can build the model.

model = Sequential()
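# input_length matches maxlen=12 from pad_sequences();
# each word gets a 5-dimensional vector, as in the output below
model.add(Embedding(vocab_length, 5, input_length=12))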
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

Inspect the vectors produced by Embedding(): one 2D array per sentence, with one row per word position.

output = model.predict(padded_sentences)
print(output)

[[[-0.01115279 -0.0053846 0.0145705 0.01441126 -0.01934116]
[ 0.01724459 0.03577454 0.02544147 0.0369082 0.02247829]
[-0.00657413 0.04421231 0.03926947 0.01498995 0.00432252]
[-0.01672726 0.04325547 -0.01818988 0.01232086 0.03949806]
[ 0.03714544 -0.03660127 0.03566999 -0.03256686 0.03914088]
[-0.00261252 0.01996125 -0.03446733 -0.01299053 0.00557587]
[ 0.01985036 0.02891095 0.04272795 0.03223069 0.01777556]
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]]
....
....
....
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]
[-0.02478491 0.03551097 -0.03647963 -0.0455409 -0.04592093]]]
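In the first block of the output, the last five rows are identical because positions 8 to 12 of the first padded sentence all hold index 0, and every index 0 is looked up in the same table row. A minimal sketch with made-up weights (table is hypothetical, not the model's real weights):

```python
import random

random.seed(1)
vocab_length, embedding_dim = 30, 5

# Hypothetical weights; a real layer would hold its own values.
table = [[round(random.uniform(-0.05, 0.05), 4) for _ in range(embedding_dim)]
         for _ in range(vocab_length)]

first_sentence = [3, 25, 29, 15, 24, 1, 23, 0, 0, 0, 0, 0]
rows = [table[i] for i in first_sentence]

padding_rows = rows[7:]
print(all(row == table[0] for row in padding_rows))
print(len(rows), len(rows[0]))
```

For the whole batch of five sentences, the output therefore has one such 12-row block per sentence.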

Next, we can use this data for deep learning.
