Having Fun with Line bot in Python - 27: Web Scraping (Part 2)

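Continuing from the previous post: in ordinary scraping, pulling back a large amount of data is not a problem. This time, though, the results are meant for users on Line, and sending too much at once is a nuisance for them, so we need to cap the number of items we return.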

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.atmovies.com.tw/movie/new/')
r.encoding = 'utf-8'

soup = BeautifulSoup(r.text, 'lxml')

filmTitle = soup.select('div.filmTitle a')

content = ""

for i, data in enumerate(filmTitle):
    # stop after the first ten entries (i runs from 0 to 9)
    if i >= 10:
        break
    content += data.text + "\n" + "http://www.atmovies.com.tw" + data['href'] + "\n\n"

print(content)
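
Because the loop above depends on the live page, it is awkward to check that the limit really works. Here is a minimal sketch that runs offline. It assumes a made-up HTML snippet shaped like the markup the `div.filmTitle a` selector expects (the real page may differ), and `build_content` is a hypothetical helper name, not something from the site or from any library. It also swaps plain string concatenation for `urljoin`, which is one way to avoid a doubled or missing slash in the link.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://www.atmovies.com.tw/'

# Made-up sample markup for illustration; the real page may be structured differently.
SAMPLE_HTML = ''.join(
    '<div class="filmTitle"><a href="/movie/film{0}/">Film {0}</a></div>'.format(n)
    for n in range(1, 16)
)


def build_content(html, limit=10):
    """Return a text listing of at most `limit` film titles and links."""
    # html.parser ships with Python, so this sketch does not need lxml.
    soup = BeautifulSoup(html, 'html.parser')
    content = ""
    for i, data in enumerate(soup.select('div.filmTitle a')):
        if i >= limit:
            break
        # urljoin resolves the href against the base URL.
        content += data.text + "\n" + urljoin(BASE_URL, data['href']) + "\n\n"
    return content


print(build_content(SAMPLE_HTML, limit=3))
```

Passing the HTML in as a string keeps the limiting logic separate from the network request, so the same function can later be fed `r.text` from `requests`.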

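The code above uses a for loop to cap the number of items. Other approaches give the same result, for example taking only the first ten elements of the list.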

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.atmovies.com.tw/movie/new/')
r.encoding = 'utf-8'

soup = BeautifulSoup(r.text, 'lxml')

filmTitle = soup.select('div.filmTitle a')[:10]

print(filmTitle)
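
Slicing is also safe when the page lists fewer than ten films: a slice that runs past the end of a list simply returns what is there. A small sketch, with a plain list of placeholder titles standing in for the list-like result of `select()`, plus `itertools.islice` for the case where the source is an iterator rather than a list:

```python
from itertools import islice

titles = ['Film A', 'Film B', 'Film C']  # placeholder data, not real titles

# Slicing past the end does not raise; it returns the items that exist.
first_ten = titles[:10]
print(first_ten)

# islice applies the same kind of cap to any iterable, including generators.
first_two = list(islice((t.upper() for t in titles), 2))
print(first_two)
```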

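Alternatively, use the limit argument of find_all(). Note that this call returns the div elements rather than the links inside them.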

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.atmovies.com.tw/movie/new/')
r.encoding = 'utf-8'

soup = BeautifulSoup(r.text, 'lxml')

# class is a reserved word in Python, so BeautifulSoup uses class_ instead
filmTitle = soup.find_all('div', class_="filmTitle", limit=10)
print(filmTitle)
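
Since `find_all('div', ...)` matches the `<div>` tags, getting the title and link takes one more step. A rough sketch, again on made-up sample markup (an assumption about the page structure, not a copy of it), that looks up the `<a>` inside each matched `<div>`:

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration; the real page may differ.
SAMPLE_HTML = (
    '<div class="filmTitle"><a href="/movie/a/">Film A</a></div>'
    '<div class="filmTitle"><a href="/movie/b/">Film B</a></div>'
    '<div class="filmTitle"><a href="/movie/c/">Film C</a></div>'
)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# limit=2 stops the search after two matching <div> tags.
divs = soup.find_all('div', class_='filmTitle', limit=2)

links = []
for div in divs:
    a = div.find('a')  # the link nested inside each title <div>
    if a is not None:
        links.append((a.text, a['href']))

print(links)
```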

Next comes the final step: adding the scraper to the Line bot.

import os

from flask import Flask, request, abort

from linebot import (
    LineBotApi, WebhookHandler
)
from linebot.exceptions import (
    InvalidSignatureError
)
from linebot.models import *

import requests
from bs4 import BeautifulSoup

app = Flask(__name__)

line_bot_api = LineBotApi(os.environ.get('CHANNEL_ACCESS_TOKEN'))
handler = WebhookHandler(os.environ.get('CHANNEL_SECRET'))


@app.route("/callback", methods=['POST'])
def callback():
    # get X-Line-Signature header value
    signature = request.headers['X-Line-Signature']

    # get request body as text
    body = request.get_data(as_text=True)
    app.logger.info("Request body: " + body)

    # handle webhook body
    try:
        handler.handle(body, signature)
    except InvalidSignatureError:
        print("Invalid signature. Please check your channel access token/channel secret.")
        abort(400)

    return 'OK'

@handler.add(MessageEvent, message=TextMessage)
def handle_message(event):
    # '本周新片' means "new films this week"; it is the keyword the user sends
    if event.message.text == '本周新片':
        r = requests.get('http://www.atmovies.com.tw/movie/new/')
        r.encoding = 'utf-8'

        soup = BeautifulSoup(r.text, 'lxml')
        content = []
        for i, data in enumerate(soup.select('div.filmTitle a')):
            if i > 20:
                break
            content.append(data.text + '\n' + 'http://www.atmovies.com.tw' + data['href'])

        line_bot_api.reply_message(
            event.reply_token,
            TextSendMessage(text='\n\n'.join(content))
        )

if __name__ == "__main__":
    app.run()
# Without this block, you need to set an environment variable such as FLASK_APP=app.py
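
Limiting the number of entries does not by itself bound the length of the reply text. The sketch below caps both. It assumes a 5000-character ceiling for a text message, which is the figure I recall from the Messaging API documentation and should be checked against the current docs. `build_reply` is a hypothetical helper, and the sample entries are placeholders.

```python
def build_reply(entries, max_items=20, max_chars=5000):
    """Join entries with blank lines, stopping before either limit is exceeded."""
    parts = []
    length = 0
    for entry in entries[:max_items]:
        # Account for the '\n\n' separator placed between entries.
        extra = len(entry) + (2 if parts else 0)
        if length + extra > max_chars:
            break
        parts.append(entry)
        length += extra
    return '\n\n'.join(parts)


# Placeholder entries in the same "title, newline, link" shape used by the bot.
sample = [
    'Film {0}\nhttp://www.atmovies.com.tw/movie/film{0}/'.format(n)
    for n in range(30)
]
reply = build_reply(sample)
print(len(reply.split('\n\n')))
```

In the handler, the result could be passed as `TextSendMessage(text=build_reply(content))` in place of the plain `'\n\n'.join(content)`.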
