Today we'll scrape another well-known forum: Dcard.
Compared with yesterday's PTT, scraping Dcard is a little more involved,
but once you understand the tricks, most websites can be scraped smoothly.
Below is the code used in the video.
import requests, bs4

url = "https://www.dcard.tw/f"
# Send a browser-like User-Agent so the request is not rejected
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'}
htmlfile = requests.get(url, headers=headers)
objsoup = bs4.BeautifulSoup(htmlfile.text, 'lxml')
articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
number = 0
for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_='cgoejl-3 jMiYgp')
    comment = article.find('div', class_='uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reaction count:", emotion.text)
    print("Comment count:", comment.text)
    print("=" * 100)
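One note before moving on: long class names such as 'tgn9uw-0 bReysV' are generated by Dcard's CSS-in-JS build and change whenever the site is redeployed, so always re-check them in your browser's DevTools before running. The find_all parsing pattern itself can be tried on a small inline snippet; the HTML below is made up purely for illustration and uses Python's built-in html.parser so no extra install is needed:

```python
import bs4

# Made-up HTML mimicking the structure the scraper expects
html = """
<article class="tgn9uw-0 bReysV">
  <a>First post</a>
  <div class="cgoejl-3 jMiYgp">12</div>
  <div class="uj732l-2 ghvDya">3</div>
</article>
<article class="tgn9uw-0 bReysV">
  <a>Second post</a>
  <div class="cgoejl-3 jMiYgp">7</div>
  <div class="uj732l-2 ghvDya">0</div>
</article>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')
# A class_ string with spaces matches the exact class attribute value
articles = soup.find_all('article', class_='tgn9uw-0 bReysV')
titles = [a.find('a').text for a in articles]
print(titles)  # ['First post', 'Second post']
```

If this pattern returns an empty list against the live site with requests alone, the content is being rendered by JavaScript, which is exactly why the Selenium versions below exist.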
# Change C:\spider\ to the path of chromedriver.exe on your computer
from selenium import webdriver
import bs4

driverPath = 'C:\\spider\\chromedriver.exe'
browser = webdriver.Chrome(executable_path=driverPath)
url = 'https://www.dcard.tw/f'
browser.get(url)
# Parse the JavaScript-rendered page source instead of the raw HTTP response
objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
number = 0
for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_='cgoejl-3 jMiYgp')
    comment = article.find('div', class_='uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reaction count:", emotion.text)
    print("Comment count:", comment.text)
    print("=" * 100)
# Change C:\spider\ to the path of chromedriver.exe on your computer
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import bs4, time

driverPath = 'C:\\spider\\chromedriver.exe'
browser = webdriver.Chrome(executable_path=driverPath)
url = 'https://www.dcard.tw/f'
browser.get(url)
move = browser.find_element_by_tag_name('body')
time.sleep(3)
# Scroll down one page so that more posts are loaded dynamically
move.send_keys(Keys.PAGE_DOWN)
time.sleep(3)
objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
number = 0
for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_='cgoejl-3 jMiYgp')
    comment = article.find('div', class_='uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reaction count:", emotion.text)
    print("Comment count:", comment.text)
    print("=" * 100)
# Change C:\spider\ to the path of chromedriver.exe on your computer
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import bs4, time

page = int(input("How many times should the page scroll down? "))
driverPath = 'C:\\spider\\chromedriver.exe'
browser = webdriver.Chrome(executable_path=driverPath)
url = 'https://www.dcard.tw/f'
browser.get(url)
number = 0
counter = 0
post_title = []  # titles already printed, used to skip duplicates
while page > counter:
    move = browser.find_element_by_tag_name('body')
    time.sleep(1)
    move.send_keys(Keys.PAGE_DOWN)
    time.sleep(1)
    # Re-parse the page after each scroll; earlier posts will reappear,
    # so only print a post the first time its title is seen
    objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
    articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
    for article in articles:
        title = article.find('a')
        emotion = article.find('div', class_='cgoejl-3 jMiYgp')
        comment = article.find('div', class_='uj732l-2 ghvDya')
        if title.text not in post_title:
            number += 1
            post_title.append(title.text)
            print("Post number:", number)
            print("Post title:", title.text)
            print("Reaction count:", emotion.text)
            print("Comment count:", comment.text)
            print("=" * 100)
    counter += 1
print(post_title)
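A side note on the de-duplication above: checking `title.text not in post_title` scans the whole list on every lookup, so the check gets slower as more posts accumulate. A set gives constant-time membership tests while a companion list preserves first-seen order. A minimal sketch of the same logic, with made-up titles:

```python
# First-seen de-duplication, as in the while loop above,
# but with a set for O(1) membership checks
seen = set()          # fast "have I printed this already?" lookups
ordered_titles = []   # keeps first-seen order, like post_title above
for t in ["A", "B", "A", "C", "B"]:
    if t not in seen:
        seen.add(t)
        ordered_titles.append(t)
print(ordered_titles)  # ['A', 'B', 'C']
```

For the handful of posts on one Dcard page the list version is perfectly fine; the set only starts to matter once you scroll through hundreds of posts.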
The video and code in this post are provided for research purposes only. Please don't maliciously scrape large volumes of data and put a burden on the target site!
If anything in the video was unclear or wrong, feel free to leave a comment and let me know. Thank you for your feedback.