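[Day 13] Storing scraped PTT articles as JSON (Hands-on PTT crawler 3/3)

Recap

The previous post walked through writing a crawler that keeps crawling PTT articles page after page.

Before we start

This post continues the PTT crawler: today we save the scraped article content to a JSON file.

Expected result

The scraped article content is stored in a JSON file.

Implementation

First, define a list that will hold the info for every article.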

article_list = []

Next, store each article's info as a dictionary and append that dictionary to the list that holds all article info.

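title_div = art.find('div', class_='title')
title = title_div.getText().strip()
# Deleted posts have no <a> inside the title div, so they get no link.
# Defaulting to None keeps `link` defined for every article.
link = None
if title_div.a is not None:
    link = 'https://www.ptt.cc' + title_div.a['href'].strip()
author = art.find('div', class_='author').getText().strip()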
article = {
    'title': title,
    'link': link,
    'author': author
}
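
The dictionary-per-article idea can be tried without requesting anything from PTT. What follows is only a minimal sketch: `build_article` is a hypothetical helper that is not part of the crawler, and the titles, path, and author names are made-up sample values. It assumes a post without a link (for example a deleted one) should be stored with `None` as its link.

```python
BASE_URL = 'https://www.ptt.cc'

article_list = []


def build_article(title, href, author):
    """Hypothetical helper: pack one article's fields into a dict.

    href is None when the post has no link (e.g. it was deleted).
    """
    link = BASE_URL + href.strip() if href else None
    return {'title': title, 'link': link, 'author': author}


# Made-up sample rows standing in for parsed 'r-ent' blocks
article_list.append(
    build_article('[問卦] sample title', '/bbs/Gossiping/M.123.A.456.html', 'someone'))
article_list.append(build_article('(deleted post)', None, '-'))

for article in article_list:
    print(article)
```

Keeping every article as a plain dict in a plain list matters later: that shape maps directly onto a JSON array of objects.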

You can now print the list of article info to check that it looks right. Here is the code so far, put together.

import requests
from bs4 import BeautifulSoup

article_list = []

def get_resp(url):
    cookies = {
        'over18': '1'
    }
    resp = requests.get(url, cookies=cookies)
    if resp.status_code != 200:
        return 'error'
    else:
        return resp

def get_articles(resp):
    soup = BeautifulSoup(resp.text, 'html5lib')
    arts = soup.find_all('div', class_='r-ent')
    for art in arts:
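        title_div = art.find('div', class_='title')
        title = title_div.getText().strip()
        # Deleted posts have no <a> inside the title div, so they get no link.
        # Defaulting to None keeps `link` defined for every article.
        link = None
        if title_div.a is not None:
            link = 'https://www.ptt.cc' + title_div.a['href'].strip()
        author = art.find('div', class_='author').getText().strip()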
        article = {
            'title': title,
            'link': link,
            'author': author
        }
        article_list.append(article)
        # print(f'title: {title}\nlink: {link}\nauthor: {author}')
    # Use a CSS selector to locate the URL of the next page to crawl
    next_url = 'https://www.ptt.cc' + \
        soup.select_one(
            '#action-bar-container > div > div.btn-group.btn-group-paging > a:nth-child(2)')['href']
    return next_url

# True only when this file is run directly
if __name__ == '__main__':
    # URL of the first page
    url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
    # Crawl 10 pages for now
    for now_page_number in range(10):
        print(f'crawling {url}')
        resp = get_resp(url)
        if resp != 'error':
            url = get_articles(resp)
        print(f'======={now_page_number+1}/10=======')
    print(article_list)

Next, save the list of article info as a JSON file. We use Python's built-in json module, so remember to import json.

import json

with open('ptt-articles.json', 'w', encoding='utf-8') as f:
    json.dump(article_list, f, indent=2,
              sort_keys=True, ensure_ascii=False)
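
To see why `ensure_ascii=False` is passed, it may help to compare the two outputs side by side. This is a small sketch using a made-up sample record rather than real crawler output; it uses `json.dumps` (string output) only so the result can be printed directly.

```python
import json

# Made-up sample record; real titles come from the crawler
sample = [{'title': '[問卦] 範例標題', 'author': 'someone'}]

escaped = json.dumps(sample)                       # default: ensure_ascii=True
readable = json.dumps(sample, ensure_ascii=False)  # keep characters as-is

print(escaped)   # non-ASCII characters appear as \uXXXX escapes
print(readable)  # Chinese text stays human-readable
```

Both strings decode to the same data, so the option only affects how readable the file is when opened in an editor. Because the characters are written unescaped, opening the file with `encoding='utf-8'` is what keeps the output portable across platforms.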

Now add the JSON-saving code to the crawler project.

import requests
import json
from bs4 import BeautifulSoup

article_list = []

def get_resp(url):
    cookies = {
        'over18': '1'
    }
    resp = requests.get(url, cookies=cookies)
    if resp.status_code != 200:
        return 'error'
    else:
        return resp

def get_articles(resp):
    soup = BeautifulSoup(resp.text, 'html5lib')
    arts = soup.find_all('div', class_='r-ent')
    for art in arts:
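        title_div = art.find('div', class_='title')
        title = title_div.getText().strip()
        # Deleted posts have no <a> inside the title div, so they get no link.
        # Defaulting to None keeps `link` defined for every article.
        link = None
        if title_div.a is not None:
            link = 'https://www.ptt.cc' + title_div.a['href'].strip()
        author = art.find('div', class_='author').getText().strip()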
        article = {
            'title': title,
            'link': link,
            'author': author
        }
        article_list.append(article)
    # Use a CSS selector to locate the URL of the next page to crawl
    next_url = 'https://www.ptt.cc' + \
        soup.select_one(
            '#action-bar-container > div > div.btn-group.btn-group-paging > a:nth-child(2)')['href']
    return next_url

# True only when this file is run directly
if __name__ == '__main__':
    # URL of the first page
    url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
    # Crawl 10 pages for now
    for now_page_number in range(10):
        print(f'crawling {url}')
        resp = get_resp(url)
        if resp != 'error':
            url = get_articles(resp)
        print(f'======={now_page_number+1}/10=======')
    # Save the list of all article info to a JSON file
    with open('ptt-articles.json', 'w', encoding='utf-8') as f:
        json.dump(article_list, f, indent=2,
                  sort_keys=True, ensure_ascii=False)

We have now crawled article info from multiple consecutive PTT pages and stored it in a JSON file.
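
One way to check the saved file is to load it back with `json.load` and compare it with the original list. The sketch below is only an illustration: it writes made-up sample data to a temporary directory instead of using the real `ptt-articles.json`, and the variable names are assumptions.

```python
import json
import os
import tempfile

# Made-up sample data standing in for article_list
articles = [
    {'title': '[問卦] 範例標題',
     'link': 'https://www.ptt.cc/bbs/Gossiping/M.123.A.456.html',
     'author': 'someone'},
    {'title': '(deleted post)', 'link': None, 'author': '-'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'ptt-articles.json')
    # Write with the same options the crawler uses
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(articles, f, indent=2, sort_keys=True, ensure_ascii=False)
    # Read the file back into Python objects
    with open(path, encoding='utf-8') as f:
        loaded = json.load(f)

print(len(loaded), 'articles loaded')
```

`sort_keys=True` changes only the order of keys in the file, so the loaded dictionaries still compare equal to the originals. A Python `None` is written as JSON `null` and comes back as `None`.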