Day 26: Scraping Popular Posts on Dcard

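Today we'll scrape another well-known forum: Dcard.
Compared with yesterday's PTT, scraping Dcard is a little more involved,
but once you understand how it works, most websites can be scraped without much trouble.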

Below is the code used in the video.

import requests, bs4

url = "https://www.dcard.tw/f"
# Send a browser-like User-Agent header with the request
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'}
htmlfile = requests.get(url, headers = headers)
objsoup = bs4.BeautifulSoup(htmlfile.text, 'lxml')

# Collect every <article> element that carries the post class
articles = objsoup.find_all('article', class_ = 'tgn9uw-0 bReysV')

number = 0

for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_ = 'cgoejl-3 jMiYgp')
    comment = article.find('div', class_ = 'uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reactions:", emotion.text)
    print("Comments:", comment.text)
    print("="*100)
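
One thing to watch in the loop above: `article.find(...)` returns `None` when nothing matches, and calling `.text` on `None` raises an `AttributeError`. Class names such as 'cgoejl-3 jMiYgp' look auto-generated, so it is reasonable to assume they may change over time. Here is a minimal sketch of a guard for that case. The helper name `safe_text` and the "N/A" fallback are my own additions for illustration, not something from the video.

```python
def safe_text(node, default="N/A"):
    """Return the stripped text of a parsed node, or `default` if the node is None.

    `node` is expected to be whatever article.find(...) returned:
    either a tag object with a .text attribute, or None when nothing matched.
    """
    if node is None:
        return default
    return node.text.strip()
```

With this in place, a line like `print("Reactions:", safe_text(emotion))` prints the fallback instead of stopping the whole script when one element is missing.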
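
# Change C:\\spider\\ to the path of chromedriver.exe on your computer
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import bs4
driverPath = 'C:\\spider\\chromedriver.exe'
# Newer Selenium 4 releases no longer accept executable_path; pass the path through a Service object
browser = webdriver.Chrome(service=Service(driverPath))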
url = 'https://www.dcard.tw/f'
browser.get(url)

objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
articles = objsoup.find_all('article', class_ = 'tgn9uw-0 bReysV')

number = 0

for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_ = 'cgoejl-3 jMiYgp')
    comment = article.find('div', class_ = 'uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reactions:", emotion.text)
    print("Comments:", comment.text)
    print("="*100)
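
# Change C:\\spider\\ to the path of chromedriver.exe on your computer
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import bs4, time

driverPath = 'C:\\spider\\chromedriver.exe'
# Newer Selenium 4 releases no longer accept executable_path; pass the path through a Service object
browser = webdriver.Chrome(service=Service(driverPath))
url = 'https://www.dcard.tw/f'
browser.get(url)

# Scroll down one page by sending PAGE_DOWN to the page body, pausing before and after
move = browser.find_element(By.TAG_NAME, 'body')
time.sleep(3)
move.send_keys(Keys.PAGE_DOWN)
time.sleep(3)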

objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
articles = objsoup.find_all('article', class_ = 'tgn9uw-0 bReysV')

number = 0

for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_ = 'cgoejl-3 jMiYgp')
    comment = article.find('div', class_ = 'uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reactions:", emotion.text)
    print("Comments:", comment.text)
    print("="*100)
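
# Change C:\\spider\\ to the path of chromedriver.exe on your computer
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import bs4, time

page = int(input("Enter the number of times to scroll the page down: "))
driverPath = 'C:\\spider\\chromedriver.exe'
# Newer Selenium 4 releases no longer accept executable_path; pass the path through a Service object
browser = webdriver.Chrome(service=Service(driverPath))
url = 'https://www.dcard.tw/f'
browser.get(url)


number = 0
counter = 0
post_title = []

while page > counter:
    # Scroll down one page, pausing before and after
    move = browser.find_element(By.TAG_NAME, 'body')
    time.sleep(1)
    move.send_keys(Keys.PAGE_DOWN)
    time.sleep(1)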

    objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
    articles = objsoup.find_all('article', class_ = 'tgn9uw-0 bReysV')



    for article in articles:
        title = article.find('a')
        emotion = article.find('div', class_ = 'cgoejl-3 jMiYgp')
        comment = article.find('div', class_ = 'uj732l-2 ghvDya')
        
        if title.text not in post_title:
            number += 1
            post_title.append(title.text)
            print("Post number:", number)
            print("Post title:", title.text)
            print("Reactions:", emotion.text)
            print("Comments:", comment.text)
            print("="*100)
            
    counter += 1
    
print(post_title)
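
The last script re-parses the whole page after every scroll, so the same posts come back again and again; the `if title.text not in post_title` check is what keeps each one from being printed twice. Checking membership in a list gets slower as the list grows, so for longer runs a set can do the lookup while a list keeps the order. The sketch below shows only that idea on plain strings. The function name `unique_in_order` is hypothetical and not part of the original code.

```python
def unique_in_order(titles):
    """Return the titles with duplicates removed, keeping first-seen order."""
    seen = set()      # fast membership checks
    result = []       # preserves the order in which titles first appeared
    for title in titles:
        if title not in seen:
            seen.add(title)
            result.append(title)
    return result


# Simulated titles collected over two scrolls, with overlap between them
collected = ["post A", "post B", "post A", "post C", "post B"]
print(unique_in_order(collected))
```

In the scraper itself the same effect could be had by keeping a `seen` set next to `post_title` and checking the set instead of the list.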

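This video and code are provided for research purposes only. Please do not scrape in bulk or maliciously and put a burden on the other party's site!
If anything in the video is unclear or wrong, feel free to leave a comment and let me know. Thank you for your feedback.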
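
On the point about not burdening the site: the scripts above already pause with `time.sleep` between scrolls. If you want to make that pacing explicit, a small helper can enforce a minimum gap between actions. This is only a sketch under my own assumptions; the class name `Throttle` and the interval values are illustrative and do not come from the video.

```python
import time


class Throttle:
    """Make sure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)  # sleep only for the time still owed
        self._last = time.monotonic()


# Example: three actions spaced at least 0.1 seconds apart
throttle = Throttle(0.1)
for step in range(3):
    throttle.wait()
    print("action", step)
```

Calling `throttle.wait()` before each scroll or request keeps the pace steady even when the work in between takes a variable amount of time.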

