Continuing the approach from the previous post (link), we switch to a different dataset and see how well NLP BM25 search performs.
Dataset source: COVID-19 metadata.csv, downloaded from Kaggle.
Dataset description:
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 500,000 scholarly articles, including over 200,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.
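The full dataset is 54 GB; we use only one CSV file, metadata.csv, which contains 63,572 papers. Its column structure is as follows:
The code is nearly identical to the previous post's, so please download it directly from GitHub; it is not reposted here.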
We search the abstract column using three keywords: Taiwan, COVID, vaccine.
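As a rough illustration of this step (not the code from the GitHub repository), the sketch below reads a CSV, keeps only rows with a non-empty abstract, and tokenizes both the abstracts and the query. The in-memory sample, the `title` column, and the helper names `tokenize` and `load_abstracts` are assumptions made for the example; only the `abstract` column name comes from the post.

```python
import csv
import io
import re

# Hypothetical stand-in for metadata.csv; the real file has far more rows and columns.
SAMPLE_CSV = """title,abstract
Vaccine rollout in Taiwan,Taiwan began its vaccine programme early.
Untitled note,
COVID modelling,A model of COVID transmission in cities.
"""

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def load_abstracts(handle):
    """Return (row, tokens) pairs for rows whose abstract is non-empty."""
    docs = []
    for row in csv.DictReader(handle):
        abstract = (row.get("abstract") or "").strip()
        if abstract:
            docs.append((row, tokenize(abstract)))
    return docs

docs = load_abstracts(io.StringIO(SAMPLE_CSV))
query = tokenize("Taiwan COVID vaccine")
```

For the real file, the same function could be called with `open("metadata.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` sample.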
Once the search completes, we list the 5 papers with the highest BM25 scores and save them to a file.
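To make the scoring and ranking step concrete, here is a minimal pure-Python sketch of Okapi BM25 with a top-N selection. It is not the author's implementation (which may well rely on a library); the function names `bm25_scores` and `top_n`, the parameter defaults `k1=1.5` and `b=0.75`, and the four-document toy corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(corpus, query, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against `query` (Okapi BM25)."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def top_n(scores, n=5):
    """Return (index, score) pairs for the n highest-scoring documents."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Toy corpus (made up) standing in for the tokenized abstracts.
corpus = [
    "taiwan vaccine trial results".split(),
    "covid outbreak in europe".split(),
    "influenza vaccine efficacy".split(),
    "hospital staffing survey".split(),
]
query = "taiwan covid vaccine".split()
scores = bm25_scores(corpus, query)
best = top_n(scores, n=5)
```

Note that BM25 adds up per-term contributions, so a document does not need to contain every keyword to rank highly.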
The results are as follows: the first paper contains the keywords Taiwan and vaccine, but not COVID.
The other papers contain COVID, Taiwan, or vaccine.
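One possible explanation for a top hit that lacks one of the keywords, not verified against this dataset, is that BM25 weights each term by its inverse document frequency: a term that appears in many abstracts contributes less than a rare one. The sketch below uses made-up document frequencies purely to show the effect; the numbers and the helper name `bm25_idf` are assumptions.

```python
import math

def bm25_idf(n_docs, doc_freq):
    """IDF weight of a term that appears in doc_freq of n_docs documents."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)

# Hypothetical document frequencies in a 1,000-abstract corpus (made-up numbers).
N_DOCS = 1000
doc_freq = {"covid": 600, "vaccine": 80, "taiwan": 5}
idf = {term: bm25_idf(N_DOCS, df) for term, df in doc_freq.items()}

# Terms ordered from most to least influential on the score.
ranked_terms = sorted(idf, key=idf.get, reverse=True)
```

Under these assumed frequencies, matching the rare terms outweighs matching the common one, which would be consistent with the ranking observed above.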
Finally, we take the first paper's abstract --> word cloud --> save the image --> 3-sentence LSA summary.
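The pipeline above can be approximated without external libraries, as sketched below. This is a simplified stand-in, not the author's code: the word frequencies are the counts a word cloud would visualize (the image rendering itself is omitted), and the summarizer is a rank-1 simplification of LSA that scores sentences by the first right singular vector of the term-by-sentence matrix, found by power iteration. The stopword list, the sample text, and all function names are assumptions.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "were", "for", "on", "with"}

def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def lsa_summary(text, n_sentences=3, iterations=100):
    """Pick the sentences with the largest weight in the dominant latent topic."""
    sentences = split_sentences(text)
    if len(sentences) <= n_sentences:
        return sentences
    counts = [Counter(content_words(s)) for s in sentences]
    m = len(sentences)
    # Gram matrix G = A^T A, where A is the term-by-sentence count matrix.
    gram = [[sum(counts[i][t] * counts[j][t] for t in counts[i]) for j in range(m)]
            for i in range(m)]
    # Power iteration approximates the dominant eigenvector of G,
    # which is the first right singular vector of A.
    v = [1.0] * m
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    top = sorted(range(m), key=lambda i: abs(v[i]), reverse=True)[:n_sentences]
    # Return the chosen sentences in their original order.
    return [sentences[i] for i in sorted(top)]

# Made-up text standing in for the top-ranked abstract.
text = ("Vaccine trials started in Taiwan. The vaccine was tested on adults in Taiwan. "
        "Weather was mild. Vaccine safety results were positive. Funding came from grants.")
freqs = Counter(content_words(text))   # input a word cloud would be drawn from
summary = lsa_summary(text, n_sentences=3)
```

A full LSA summarizer would typically use several singular vectors rather than one; the single-topic version is kept here only to stay short and dependency-free.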