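[Text Analysis] 3-5 Word Embedding Models

Overview

Using the probabilistic properties of a Markov model, we combine words, the smallest units of language, into sentences, and then combine sentences into an article.

Markov model: a statistical model that records probabilities. Starting from any state,
              it moves to the next state according to those probabilities; the probabilities of all transitions leaving a state sum to 100%.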
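
To make the definition concrete, here is a minimal sketch of a transition table. The words and probabilities are made up for illustration and do not come from the data used in this post; the point is only that the outgoing probabilities of every state add up to 1.

```python
import math

# Hypothetical transition table: each word maps to the words that may follow it
# and the probability of each transition.
transitions = {
    "我": {"喜欢": 0.5, "讨厌": 0.5},
    "喜欢": {"猫": 0.7, "狗": 0.3},
    "讨厌": {"下雨": 1.0},
}


def row_sums(table):
    """Return the total outgoing probability of every state."""
    return {state: sum(nexts.values()) for state, nexts in table.items()}


sums = row_sums(transitions)
for state, total in sums.items():
    print(state, total)
```

If any row summed to something other than 1, the table would not describe a valid Markov chain.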

Implementation

Loading the data

with open("test.txt", encoding="utf-8") as f:
    sentences = f.readlines()

sentences = [s.strip() for s in sentences]

Read the text file and split it into lines.

Sentence splitting

import re
import string

delims = [
    "，", "。", "；", "：", "！",
    "?", "？", ";", ":", "!",
    ",", ".", "\"", "'", "“",
    "‘", "’", "(", ")", "”",
    "（", "）", "%", "％", "@",
    "~", "`", "～", "｀", "#",
    "、", "/", "\\", "<", ">",
    "《", "》", "／", "｛", "｝",
    "{", "}", "[", "]", "［",
    "］", "|", "｜", "\n", "\r",
    " ", "\t", "　", '+', '=', '*', '^', '·'
]\
    + list("0123456789") \
    + list(string.punctuation)
escaped = re.escape(''.join(delims))
exclusions = '['+escaped+']'
## Character class containing every sentence delimiter

splitsen = []
for s in sentences:
    cleans = re.sub(exclusions, ' ', s)
    subs = cleans.split()
    splitsen.extend(subs)
## Replace delimiters with spaces, split on whitespace, and store the pieces in a list


for idx, s in enumerate(splitsen):
    splitsen[idx] = 'S'+s+"E"

## Add a start marker (S) and an end marker (E) to each sentence

Word segmentation
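
The cleaning step above can be tried on a single line. This is a sketch that assumes a much smaller delimiter set than the full list (the name `sample_delims` and the helper `split_and_mark` are made up for this example); the mechanism is the same: build a character class, replace every delimiter with a space, split, then wrap each piece in S and E.

```python
import re

# Reduced, hypothetical delimiter set; the full list works the same way.
sample_delims = ["，", "。", "！", "？", ",", ".", " "]
pattern = "[" + re.escape("".join(sample_delims)) + "]"


def split_and_mark(line):
    """Replace delimiters with spaces, split, and wrap each piece in S...E."""
    pieces = re.sub(pattern, " ", line).split()
    return ["S" + p + "E" for p in pieces]


marked = split_and_mark("今天天气很好，我们去公园。")
print(marked)  # ['S今天天气很好E', 'S我们去公园E']
```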

import jieba

# Load the larger dictionary file (expected in the working directory)
jieba.load_userdict('dict.txt.big')
words = []
for s in splitsen:
    ws = list(jieba.cut(s))
    words.extend(ws)

Building the dictionary

def build_word_dict(words):

    word_dict = {}
    for i in range(1, len(words)):
        if words[i-1] not in word_dict:
            word_dict[words[i-1]] = {}
        if words[i] not in word_dict[words[i-1]]:
            word_dict[words[i-1]][words[i]] = 0
        word_dict[words[i-1]][words[i]] += 1

    return word_dict

word_dict = build_word_dict(words)

print(words)
print(word_dict["人"])
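
To see what the dictionary looks like, here is a small sketch on a hand-written token list (the tokens are invented for illustration). It uses `collections.Counter` as a compact equivalent of the nested-dict counting above and adds a hypothetical helper, `to_probabilities`, that turns the raw counts into transition probabilities.

```python
from collections import Counter, defaultdict


def count_transitions(tokens):
    """Count how often each token is immediately followed by each other token."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table


def to_probabilities(counts):
    """Normalize follower counts so that they sum to 1."""
    total = sum(counts.values())
    return {word: freq / total for word, freq in counts.items()}


# Made-up tokens standing in for the segmented text, markers included.
tokens = ["S", "我", "爱", "猫", "E",
          "S", "我", "爱", "狗", "E",
          "S", "我", "养", "猫", "E"]

table = count_transitions(tokens)
probs = to_probabilities(table["我"])
print(dict(table["我"]))  # {'爱': 2, '养': 1}
print(probs)
```

Note that the end marker is also followed by the start marker of the next sentence, so `table["E"]["S"]` is counted as well. That link is what lets the generated chain continue from one sentence into the next.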

Random generation

# Sum the total number of occurrences

def wordListSum(wordList):
    sumfreq = 0
    for word, freq in wordList.items():
        sumfreq += freq
    #print(sumfreq)
    return sumfreq

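from random import randint

# Randomly pick the next word according to the probability distribution


def retrieveRandomWord(wordList):
    #print(wordList)
    # Draw a random integer in 1..n, where n is the total count
    randIndex = randint(1, wordListSum(wordList))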
    for word, freq in wordList.items():
        randIndex -= freq
        if randIndex <= 0:
            return word


# Generate a Markov chain of length 100
length = 100
chain = ""
currentWord = "生活"
for i in range(0, length):
    chain += currentWord+""
    print(currentWord,"=>",word_dict[currentWord])
    currentWord = retrieveRandomWord(word_dict[currentWord])

#print(chain)
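
As an alternative sketch, the weighted draw can be delegated to the standard library's `random.choices`, which accepts the counts directly as weights. The version below is not the code used above; it also stops early when the current word has no recorded successor (for example, the very last word of the corpus), a case in which indexing the dictionary directly would raise a `KeyError`. The names `next_word`, `generate`, and `toy_table` are made up for this example.

```python
import random


def next_word(successors, rng=random):
    """Pick one follower, weighted by how often it was observed."""
    words = list(successors)
    weights = list(successors.values())
    return rng.choices(words, weights=weights, k=1)[0]


def generate(table, start, length, rng=random):
    """Walk the chain for at most `length` words, stopping at a dead end."""
    out = []
    current = start
    for _ in range(length):
        out.append(current)
        successors = table.get(current)
        if not successors:  # no known follower: stop instead of failing
            break
        current = next_word(successors, rng)
    return "".join(out)


# Tiny hypothetical table with a single path, so the result is deterministic.
toy_table = {"S": {"猫": 1}, "猫": {"E": 1}}
result = generate(toy_table, "S", 10)
print(result)  # S猫E
```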

Stripping the sentence markers

import re
reply = re.split('S|E',chain)
reply = [s for s in reply if s != '']
for x in reply:
    print(x)
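
One caveat: the delimiter list removes digits and punctuation but not Latin letters, so any capital S or E in the text itself is also treated as a marker by `re.split('S|E', ...)`. A possible workaround, sketched below, assumes the generated words are kept in a list instead of being concatenated into one string, and that the markers are distinct tokens (`START` and `END` here are hypothetical names, not part of the code above).

```python
import re

START, END = "<s>", "</s>"  # hypothetical marker tokens


def tokens_to_sentences(tokens):
    """Join tokens into sentences, cutting wherever a marker token appears."""
    sentences, current = [], []
    for tok in tokens:
        if tok in (START, END):
            if current:
                sentences.append("".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        sentences.append("".join(current))
    return sentences


# Splitting on letters breaks words that contain S or E.
broken = re.split("S|E", "SUSB很好用E")
print(broken)  # ['', 'U', 'B很好用', '']

generated = [START, "USB", "很", "好用", END, START, "SEO", "入门", END]
sentences = tokens_to_sentences(generated)
print(sentences)  # ['USB很好用', 'SEO入门']
```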
