Today we'll scrape another well-known forum: Dcard.
Compared with yesterday's PTT, scraping Dcard is a little more involved,
but once you understand the tricks, most websites can be scraped smoothly.
Below is the code used in the video.
import requests, bs4

url = "https://www.dcard.tw/f"
# Send a browser-like User-Agent so the request is not rejected
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'}
htmlfile = requests.get(url, headers=headers)
objsoup = bs4.BeautifulSoup(htmlfile.text, 'lxml')
articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
number = 0
for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_='cgoejl-3 jMiYgp')
    comment = article.find('div', class_='uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reaction count:", emotion.text)
    print("Comment count:", comment.text)
    print("=" * 100)
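One note before moving on: long class names such as 'tgn9uw-0 bReysV' are generated by Dcard's CSS-in-JS build and change whenever the site is redeployed, so always re-check them in your browser's DevTools before running. The find_all parsing pattern itself can be tried on a small inline snippet; the HTML below is made up purely for illustration and uses Python's built-in html.parser so no extra install is needed:

```python
import bs4

# Made-up HTML mimicking the structure the scraper expects
html = """
<article class="tgn9uw-0 bReysV">
  <a>First post</a>
  <div class="cgoejl-3 jMiYgp">12</div>
  <div class="uj732l-2 ghvDya">3</div>
</article>
<article class="tgn9uw-0 bReysV">
  <a>Second post</a>
  <div class="cgoejl-3 jMiYgp">7</div>
  <div class="uj732l-2 ghvDya">0</div>
</article>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')
# A class_ string with spaces matches the exact class attribute value
articles = soup.find_all('article', class_='tgn9uw-0 bReysV')
titles = [a.find('a').text for a in articles]
print(titles)  # ['First post', 'Second post']
```

If this pattern returns an empty list against the live site with requests alone, the content is being rendered by JavaScript, which is exactly why the Selenium versions below exist.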
# Change C:\spider\ to the path of chromedriver.exe on your computer
from selenium import webdriver
import bs4

driverPath = 'C:\\spider\\chromedriver.exe'
browser = webdriver.Chrome(executable_path=driverPath)
url = 'https://www.dcard.tw/f'
browser.get(url)
# Parse the JavaScript-rendered page source instead of the raw HTTP response
objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
number = 0
for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_='cgoejl-3 jMiYgp')
    comment = article.find('div', class_='uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reaction count:", emotion.text)
    print("Comment count:", comment.text)
    print("=" * 100)
# Change C:\spider\ to the path of chromedriver.exe on your computer
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import bs4, time

driverPath = 'C:\\spider\\chromedriver.exe'
browser = webdriver.Chrome(executable_path=driverPath)
url = 'https://www.dcard.tw/f'
browser.get(url)
move = browser.find_element_by_tag_name('body')
time.sleep(3)
# Scroll down one page so that more posts are loaded dynamically
move.send_keys(Keys.PAGE_DOWN)
time.sleep(3)
objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
number = 0
for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_='cgoejl-3 jMiYgp')
    comment = article.find('div', class_='uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reaction count:", emotion.text)
    print("Comment count:", comment.text)
    print("=" * 100)
# Change C:\spider\ to the path of chromedriver.exe on your computer
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import bs4, time

page = int(input("How many times should the page scroll down? "))
driverPath = 'C:\\spider\\chromedriver.exe'
browser = webdriver.Chrome(executable_path=driverPath)
url = 'https://www.dcard.tw/f'
browser.get(url)
number = 0
counter = 0
post_title = []  # titles already printed, used to skip duplicates
while page > counter:
    move = browser.find_element_by_tag_name('body')
    time.sleep(1)
    move.send_keys(Keys.PAGE_DOWN)
    time.sleep(1)
    # Re-parse the page after each scroll; earlier posts will reappear,
    # so only print a post the first time its title is seen
    objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
    articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
    for article in articles:
        title = article.find('a')
        emotion = article.find('div', class_='cgoejl-3 jMiYgp')
        comment = article.find('div', class_='uj732l-2 ghvDya')
        if title.text not in post_title:
            number += 1
            post_title.append(title.text)
            print("Post number:", number)
            print("Post title:", title.text)
            print("Reaction count:", emotion.text)
            print("Comment count:", comment.text)
            print("=" * 100)
    counter += 1
print(post_title)
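A side note on the de-duplication above: checking `title.text not in post_title` scans the whole list on every lookup, so the check gets slower as more posts accumulate. A set gives constant-time membership tests while a companion list preserves first-seen order. A minimal sketch of the same logic, with made-up titles:

```python
# First-seen de-duplication, as in the while loop above,
# but with a set for O(1) membership checks
seen = set()          # fast "have I printed this already?" lookups
ordered_titles = []   # keeps first-seen order, like post_title above
for t in ["A", "B", "A", "C", "B"]:
    if t not in seen:
        seen.add(t)
        ordered_titles.append(t)
print(ordered_titles)  # ['A', 'B', 'C']
```

For the handful of posts on one Dcard page the list version is perfectly fine; the set only starts to matter once you scroll through hundreds of posts.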
The video and code in this post are provided for research purposes only. Please don't maliciously scrape large volumes of data and put a burden on the target site!
If anything in the video was unclear or wrong, feel free to leave a comment and let me know. Thank you for your feedback.