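TF-IDF is a formula for measuring how important a word is to a document.
A term's TF-IDF value is proportional to the number of times the term appears in the document and inversely proportional to how often it appears across the corpus.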
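TF (term frequency): how often a term appears in a document.
n_{i,j}: the number of times term t_i appears in document d_j
Σ_k n_{k,j}: the total number of terms in document d_j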
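Written out from the two definitions above, term frequency is the term's count divided by the document's total term count. The formula itself is not shown in the text, so the following is a reconstruction in the standard notation and should be read as an assumption:

```latex
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}
```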
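IDF (inverse document frequency): a measure of how important a term is across the whole corpus.
|D|: the total number of documents in the corpus
1 + |{j : t_i ∈ d_j}|: the number of documents that contain the term t_i, plus 1
1: added to the denominator so that it is never 0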
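Putting those symbols together gives the usual smoothed form of inverse document frequency, and TF-IDF is then the product of term frequency and IDF. The formula is not shown in the text, so this is a reconstruction from the definitions above; here tf_{i,j} stands for the term frequency of t_i in d_j, and the base of the logarithm is a matter of convention.

```latex
\mathrm{idf}_i = \log \frac{|D|}{1 + |\{\, j : t_i \in d_j \,\}|}
\qquad
\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i
```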
Suppose an article contains 100 words in total and 「大角怪」 appears 5 times,
and 「大角怪」 appears in 1,000 articles out of a corpus of 10,000,000 articles.
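A minimal sketch of the arithmetic for this example. The base-10 logarithm is an assumption (the text does not state a base), and the helper name `tf_idf_example` is hypothetical; the 1 in the denominator follows the IDF definition given earlier.

```python
from math import log10


def tf_idf_example(term_count, doc_length, docs_with_term, total_docs):
    """Return (tf, idf, tf_idf) for a single term and a single document."""
    tf = term_count / doc_length
    # Add 1 to the denominator so a term found in no document cannot divide by zero.
    # Base 10 is an assumption; the text does not say which base it uses.
    idf = log10(total_docs / (1 + docs_with_term))
    return tf, idf, tf * idf


tf, idf, score = tf_idf_example(5, 100, 1_000, 10_000_000)
print(f"TF = {tf:.2f}, IDF = {idf:.4f}, TF-IDF = {score:.4f}")
```

Without the +1 in the denominator, the IDF would be exactly log10(10,000) = 4 and the score 0.05 × 4 = 0.2; with it, both values come out marginally smaller.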
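from math import log


def tf(term, doc, normalize=True):
    # Lowercase the document and split it on whitespace into tokens
    doc = doc.lower().split()
    if normalize:
        # Relative frequency: occurrences divided by the document's token count
        result = doc.count(term.lower()) / float(len(doc))
    else:
        # Raw occurrence count
        result = doc.count(term.lower()) / 1
    return result


def idf(term, docs):
    # Number of documents that contain the term as a whitespace-separated token
    num_text_with_term = len(
        [True for doc in docs if term.lower() in doc.lower().split()])
    try:
        # This variant adds 1 to the natural log rather than to the denominator
        return 1.0 + log(len(docs) / num_text_with_term)
    except ZeroDivisionError:
        # The term appears in no document
        return 1.0


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)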
corpus = {
    'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.',
    'b': 'Professor Plumb has a green plant in his study ',
    'c': "Miss Scarlett watered Professor Plumb's green plant while he was away from his office last week.",
}
# lower() converts a string to lowercase
# split() splits a string into a list of tokens on whitespace
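QUERY_TERMS = ['green']
for term in [t.lower() for t in QUERY_TERMS]:
    # Term frequency of the query term in each document
    for doc in sorted(corpus):
        print('TF(%s): %s' % (doc, term), tf(term, corpus[doc]))
    # Inverse document frequency of the query term over the whole corpus
    print('IDF: %s' % (term, ), idf(term, corpus.values()), "\n")
    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], corpus.values())
        print('TF-IDF(%s): %s' % (doc, term), score, "\n")
# With several query terms, the tf*idf scores would be added up per document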
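The summing step mentioned in the last comment can be sketched as follows: for a query with more than one term, each document's score is the sum of its per-term TF-IDF values. This is a self-contained sketch rather than the post's own code; `rank_documents`, the two-term query, and the three short documents are hypothetical, and the scoring functions repeat the 1 + log(N / df) variant used in the code above.

```python
from collections import defaultdict
from math import log


def tf(term, doc):
    # Relative frequency of the term among the document's whitespace tokens
    tokens = doc.lower().split()
    return tokens.count(term.lower()) / len(tokens)


def idf(term, docs):
    # df = number of documents containing the term
    df = sum(1 for doc in docs if term.lower() in doc.lower().split())
    # 1 + ln(N / df), falling back to 1.0 when no document contains the term
    return 1.0 + log(len(docs) / df) if df else 1.0


def rank_documents(query_terms, corpus):
    """Sum tf * idf over all query terms per document; highest score first."""
    docs = list(corpus.values())
    scores = defaultdict(float)
    for term in query_terms:
        term_idf = idf(term, docs)
        for name, doc in corpus.items():
            scores[name] += tf(term, doc) * term_idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus and query, for illustration only
toy_corpus = {
    'x': 'green plant in the study',
    'y': 'green green tea',
    'z': 'a red plant',
}
ranking = rank_documents(['green', 'plant'], toy_corpus)
for name, score in ranking:
    print(name, round(score, 4))
```

Summing per-term scores is one simple way to combine terms; it rewards documents that match several query terms or match one term very densely.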
import nltk
terms = "Develop daily routines before and after school—for example, things to pack for school in the morning (like hand sanitizer and a backup mask) and things to do when you return home (like washing hands immediately and washing worn cloth masks). Wash your hands immediately after taking off a mask.People who live in multi-generational households may find it difficult to take precautions to protect themselves from COVID-19 or isolate those who are sick, especially if space in the household is limited and many people live in the same household. CDC recently created guidance for multi-generational households. Although the guidance was developed as part of CDC’s outreach to tribal communities, the information could be useful for all families, including those with both children and older adults in the same home."
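# Split the passage into sentences and tokenize each one into lowercase words;
# each sentence is treated as one document (a list of tokens)
text = [sentence.lower().split() for sentence in terms.split('. ')]
# Build an nltk TextCollection from the tokenized documents
tc = nltk.TextCollection(text)
# The term to search for
term = 'a'
# Index of the document to score
idx = 0
print('TF(%s): %s' % (idx, term), tc.tf(term, text[idx]))
# If a term does not appear in the corpus, 0.0 is returned.
print('IDF: %s' % (term, ), tc.idf(term))
print('TF-IDF(%s): %s' % (idx, term), tc.tf_idf(term, text[idx]))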
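To make the three calls above concrete, here is a plain-Python sketch of what they are understood to compute: `tf` is the term's count divided by the document length, `idf` is the natural log of the number of documents over the number of documents containing the term (0.0 when none do), and `tf_idf` is their product. The `simple_`-prefixed function names and the toy documents are hypothetical, and this approximates the library's behaviour rather than reproducing its source.

```python
from math import log


def simple_tf(term, tokens):
    # Share of the document's tokens that equal the term
    return tokens.count(term) / len(tokens)


def simple_idf(term, documents):
    # ln(number of documents / number of documents containing the term);
    # 0.0 if no document contains the term
    matches = sum(1 for tokens in documents if term in tokens)
    return log(len(documents) / matches) if matches else 0.0


def simple_tf_idf(term, tokens, documents):
    return simple_tf(term, tokens) * simple_idf(term, documents)


# Hypothetical tokenized documents, for illustration only
documents = [
    ['wash', 'your', 'hands', 'after', 'taking', 'off', 'a', 'mask'],
    ['pack', 'a', 'backup', 'mask'],
    ['cdc', 'created', 'guidance'],
]
print(simple_tf('a', documents[0]))                 # 1 of 8 tokens
print(simple_idf('a', documents))                   # ln(3 / 2)
print(simple_idf('school', documents))              # term absent from every document
print(simple_tf_idf('a', documents[0], documents))  # product of the first two values
```

Note that each document must be a list of tokens: if plain strings are passed instead, `count` and `in` operate on characters and substrings rather than on words.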