[Python]Natural Language Toolkit

http://www.nltk.org/
NLTK 是一个主流用於自然语言处理的 Python 库

import nltk
nltk.download()

pip3 install html5lib

Collecting html5lib
  Downloading https://files.pythonhosted.org/packages/6c/dd/a834df6482147d48e225a49515aabc28974ad5a4ca3215c18a882565b028/html5lib-1.1-py2.py3-none-any.whl (112kB)
     |████████████████████████████████| 112kB 328kB/s
Collecting webencodings
  Downloading https://files.pythonhosted.org/packages/f4/24/2a3e3df732393fed8b3ebf2ec078f05546de641fe1b667ee316ec1dcf3b7/webencodings-0.5.1-py2.py3-none-any.whl
Requirement already satisfied: six>=1.9 in c:\python37\lib\site-packages (from html5lib) (1.13.0)
Installing collected packages: webencodings, html5lib
Successfully installed html5lib-1.1 webencodings-0.5.1

import nltk

#使用 NLTK 删除停止词
from nltk.corpus import stopwords

#使用 urllib模组来抓取网页
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
print (html)

b'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en">\n<head>\n\n
.
.
.
<span id="toTopHover"></span><img width="40" height="40" alt="To Top" src="/images/[email protected]"></a>\n\n</body>\n</html>\n'

#去掉HTML标记，将抓取的网页转换为乾净的文字
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"html5lib")
text = soup.get_text(strip=True)
print (text)

PHP: Hypertext PreprocessorDownloadsDocumentationGet InvolvedHelpGetting StartedIntroductionA simple tutorialLanguage ReferenceBasic syntaxTypesVariablesConstantsExpressionsOperatorsControl
.
.
.
The list of changes is recorded in theChangeLog.Older News EntriesUpcoming conferencesPHP Conference China 2020International PHP Conference Munich 2020International PHP Conference Berlin 2020PHP.RUHR 2020 - Web Development & Digital CommerceUser Group EventsSpecial ThanksSocial media@official_phpCopyright © 2001-2020 The PHP GroupMy PHP.netContactOther PHP.net sitesPrivacy policy

#将文字分词
tokens = [t for t in text.split()]
print (tokens)

['PHP:', 'Hypertext', 'PreprocessorDownloadsDocumentationGet', 'InvolvedHelpGetting', 'StartedIntroductionA', 'simple', 'tutorialLanguage', 'ReferenceBasic',...
..., 'media@official_phpCopyright', '©', '2001-2020', 'The', 'PHP', 'GroupMy', 'PHP.netContactOther', 'PHP.net', 'sitesPrivacy', 'policy']

#通过对列表中的标记进行遍历并删除其中的停止词
clean_tokens = tokens[:]
sr = stopwords.words('english')
for token in tokens:
  if token in stopwords.words('english'):
    clean_tokens.remove(token)

#使用 Python NLTK 来计算每个词的出现频率。NLTK 中的FreqDist( ) 函式可以实现词频统计的功能
freq = nltk.FreqDist(tokens)
for key,val in freq.items():
  print (str(key) + ':' + str(val))

PHP::1
Hypertext:1
PreprocessorDownloadsDocumentationGet:1
InvolvedHelpGetting:1
StartedIntroductionA:1
simple:1
tutorialLanguage:1
ReferenceBasic:1
syntaxTypesVariablesConstantsExpressionsOperatorsControl:1
StructuresFunctionsClasses:1
and:45
.
.
.
2001-2020:1
GroupMy:1
PHP.netContactOther:1
PHP.net:1
sitesPrivacy:1
policy:1

<<: Day 30-ASP.NET & SQL资料库制作留言板(下)

>>: [Day 29] 为什麽欧洲我最喜欢的是义大利