[Python] Natural Language Toolkit

http://www.nltk.org/
NLTK is a widely used Python library for natural language processing.

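import nltk
# Opens the interactive NLTK downloader; nltk.download('stopwords') fetches only the stop-word corpus used below
nltk.download()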

https://ithelp.ithome.com.tw/upload/images/20201012/20119608XCBfXhAwoT.jpg

pip3 install html5lib
Collecting html5lib
  Downloading https://files.pythonhosted.org/packages/6c/dd/a834df6482147d48e225a49515aabc28974ad5a4ca3215c18a882565b028/html5lib-1.1-py2.py3-none-any.whl (112kB)
     |████████████████████████████████| 112kB 328kB/s
Collecting webencodings
  Downloading https://files.pythonhosted.org/packages/f4/24/2a3e3df732393fed8b3ebf2ec078f05546de641fe1b667ee316ec1dcf3b7/webencodings-0.5.1-py2.py3-none-any.whl
Requirement already satisfied: six>=1.9 in c:\python37\lib\site-packages (from html5lib) (1.13.0)
Installing collected packages: webencodings, html5lib
Successfully installed html5lib-1.1 webencodings-0.5.1
import nltk
# Use NLTK's stop-word corpus to remove stop words
from nltk.corpus import stopwords
# Use the urllib module to fetch the web page
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
print(html)
b'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en">\n<head>\n\n
.
.
.
<span id="toTopHover"></span><img width="40" height="40" alt="To Top" src="/images/[email protected]"></a>\n\n</body>\n</html>\n'
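The output above starts with b'...' and shows \n escapes because response.read() returns bytes, not str. BeautifulSoup accepts bytes directly, but if the page is to be handled as text first, it has to be decoded. A minimal sketch, assuming the page is UTF-8 encoded; the raw value below is a made-up stand-in for the real response body so the example runs without network access:

```python
# Hypothetical stand-in for response.read(); the real body is much longer.
raw = b'<!DOCTYPE html>\n<html lang="en">\n<body>Caf\xc3\xa9</body>\n</html>\n'

# Assumption: the page is UTF-8. With a live response, the declared charset
# can be looked up via response.headers.get_content_charset().
decoded = raw.decode("utf-8", errors="replace")

print(type(raw).__name__)      # bytes
print(type(decoded).__name__)  # str
print(decoded)
```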
# Strip the HTML tags to turn the fetched page into clean text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)
PHP: Hypertext PreprocessorDownloadsDocumentationGet InvolvedHelpGetting StartedIntroductionA simple tutorialLanguage ReferenceBasic syntaxTypesVariablesConstantsExpressionsOperatorsControl
.
.
.
The list of changes is recorded in theChangeLog.Older News EntriesUpcoming conferencesPHP Conference China 2020International PHP Conference Munich 2020International PHP Conference Berlin 2020PHP.RUHR 2020 - Web Development & Digital CommerceUser Group EventsSpecial ThanksSocial media@official_phpCopyright © 2001-2020 The PHP GroupMy PHP.netContactOther PHP.net sitesPrivacy policy
# Split the text into tokens on whitespace
tokens = text.split()
print(tokens)
['PHP:', 'Hypertext', 'PreprocessorDownloadsDocumentationGet', 'InvolvedHelpGetting', 'StartedIntroductionA', 'simple', 'tutorialLanguage', 'ReferenceBasic',...
..., 'media@official_phpCopyright', '©', '2001-2020', 'The', 'PHP', 'GroupMy', 'PHP.netContactOther', 'PHP.net', 'sitesPrivacy', 'policy']
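Tokens such as 'PreprocessorDownloadsDocumentationGet' are glued together because get_text(strip=True) joins the text of neighbouring tags with no separator. A minimal sketch of one way around this, passing separator=" " to get_text; the HTML fragment is invented for illustration, and the built-in "html.parser" is used here instead of html5lib only to keep the example dependency-light:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment resembling a navigation menu.
sample_html = "<ul><li>Downloads</li><li>Documentation</li><li>Get Involved</li></ul>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# No separator: text from adjacent tags runs together.
glued = sample_soup.get_text(strip=True)
# With a separator: each piece of text stays a separate word.
spaced = sample_soup.get_text(separator=" ", strip=True)

print(glued.split())
print(spaced.split())
```

With the separator in place, a plain split() yields one token per word instead of one token per run of menu items.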
# Build a copy of the token list with the stop words removed
# (the match is case-sensitive, so capitalized tokens such as 'The' are kept)
sr = set(stopwords.words('english'))
clean_tokens = [token for token in tokens if token not in sr]
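The filtering step above compares each token to the stop-word list exactly as it appears, so 'The' or 'and,' slip through because NLTK's English stop words are all lowercase and carry no punctuation. A minimal sketch of normalizing tokens before the comparison; STOP_WORDS and remove_stop_words are hypothetical names, and the small hand-written set stands in for stopwords.words('english') so the example runs without downloading the corpus:

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')).
STOP_WORDS = {"the", "of", "is", "in", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercased, punctuation-stripped form is a stop word."""
    kept = []
    for token in tokens:
        normalized = token.lower().strip(string.punctuation)
        # Tokens that are only ASCII punctuation normalize to "" and are dropped too.
        if normalized and normalized not in stop_words:
            kept.append(token)
    return kept

sample = ["The", "list", "of", "changes", "is", "recorded", "in", "the", "ChangeLog."]
print(remove_stop_words(sample))
```

The original token is kept in the result, so only the comparison is normalized, not the output.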
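# Use NLTK to count how often each token occurs; FreqDist() provides the frequency counts.
# Note: this counts the unfiltered tokens list, which is why stop words such as "and" still appear in the output; pass clean_tokens instead to count the filtered tokens.
freq = nltk.FreqDist(tokens)
for key, val in freq.items():
  print(str(key) + ':' + str(val))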
PHP::1
Hypertext:1
PreprocessorDownloadsDocumentationGet:1
InvolvedHelpGetting:1
StartedIntroductionA:1
simple:1
tutorialLanguage:1
ReferenceBasic:1
syntaxTypesVariablesConstantsExpressionsOperatorsControl:1
StructuresFunctionsClasses:1
and:45
.
.
.
2001-2020:1
GroupMy:1
PHP.netContactOther:1
PHP.net:1
sitesPrivacy:1
policy:1
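Most entries in the listing above occur once, so printing every item buries the frequent words. A minimal sketch of listing only the top entries with most_common; it uses collections.Counter with an invented token list so it runs without NLTK, on the assumption that nltk.FreqDist subclasses Counter (as it does in NLTK 3) and therefore accepts the same call:

```python
from collections import Counter

# Hypothetical token list; in the post this would be tokens or clean_tokens.
sample_tokens = ["PHP", "and", "the", "PHP", "and", "and"]
sample_freq = Counter(sample_tokens)

# Print the two most frequent tokens in the same key:count format as above.
for word, count in sample_freq.most_common(2):
    print(word + ':' + str(count))
```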
