Implementing a Web Crawler in Python - Part 3

Hands-On Practice

Now that we know what requests and BeautifulSoup each do, it's time to put them together! In this post we'll scrape Cookpad, a recipe website.

import requests
from bs4 import BeautifulSoup

response = requests.get('https://cookpad.com/tw')
print(response.status_code)

soup = BeautifulSoup(response.content, "html.parser")
print(soup)
403

<!DOCTYPE html>

<html dir="ltr" lang="zh-TW">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>很抱歉,此页面读取失败。</title>
<script>
          if (typeof Turbolinks !== 'undefined') {
            location.reload();
          }
        </script>
<!-- Px -->
<script>
          window._pxAppId = 'PXFqtAw5et';
          window._pxJsClientSrc = '/FqtAw5et/init.js';
          window._pxFirstPartyEnabled = true;
          window._pxVid = '';
          window._pxUuid = '32b1303b-2908-11ec-af10-7267594b6b65';
          window._pxHostUrl = '/FqtAw5et/xhr';
          window._PXFqtAw5et = {
            locale: 'zh-TW',
            translation: {
              'zh-TW': [
        {
            "selector": "#px-form-head span",
            "text": "遇到问题 ? 请提供更多资讯"
        },
        {
            "selector": "#px-form div label[for=opt1]",
            "text": "我没有看到任何验证码"
        },
        {
            "selector": "#px-form div label[for=opt2]",
            "text": "我已解决验证码问题,但又出现另一组验证码"
        },
        {
            "selector": "#px-form div label[for=opt3]",
            "text": "我已解决多个验证码问题,但仍无法进入该连结"
        },
        {
            "selector": "#px-form div label[for=opt4]",
            "text": "其他(请详细说明)"
        },
        {
            "selector": "#px-form h4:nth-of-type(1)",
            "text": "附加资讯:"
        },
        {
            "selector": "#px-form-submit",
            "text": "发送"
        }
    ]
            }
          };
        </script>
<script defer="" src="/FqtAw5et/captcha/captcha.js?a=c&u=32b1303b-2908-11ec-af10-7267594b6b65&v=&m=0"></script>
<!-- Custom Script -->
<link href="https://fonts.googleapis.com/css?family=Open+Sans:300" rel="stylesheet"/>
<style>
          html, body {
            margin: 0;
            padding: 0;
            font-family: 'Open Sans', sans-serif;
            color: #000;
          }
          a {
            color: #c5c5c5;
            text-decoration: none;
          }
          .container {
            align-items: center;
            display: flex;
            flex: 1;
            justify-content: space-between;
            flex-direction: column;
            height: 100%;
          }
          .container > div {
            width: 100%;
            display: flex;
            justify-content: center;
          }
          .container > div > div {
            display: flex;
            width: 80%;
          }
          .customer-logo-wrapper {
            padding-top: 2rem;
            flex-grow: 0;
            background-color: #fff;
            visibility: (null);
          }
          .customer-logo {
            border-bottom: 1px solid #000;
          }
          .customer-logo > img {
            padding-bottom: 1rem;
            max-height: 50px;
            max-width: 100%;
          }
          .page-title-wrapper {
            flex-grow: 2;
          }
          .page-title {
            flex-direction: column-reverse;
          }
          .content-wrapper {
            flex-grow: 5;
          }
          .content {
            flex-direction: column;
          }
          .page-footer-wrapper {
            align-items: center;
            flex-grow: 0.2;
            background-color: #000;
            color: #c5c5c5;
            font-size: 70%;
          }
          @media (min-width: 768px) { html, body { height: 100%; } }
        </style>
<!-- Custom CSS -->
</head>
<body>
<section class="container">
<div class="customer-logo-wrapper">
<div class="customer-logo"><img alt="Logo" src="https://assets-global.cpcdn.com/assets/logo_cookpad_large-827bc0b34d5c7ab322d3ff8de882e9f828d06bc5ae46d09c88d25aaf02686132.png"/></div>
</div>
<div class="page-title-wrapper">
<div class="page-title">
<h1>请确认您不是机器人</h1>
</div>
</div>
<div class="content-wrapper">
<div class="content">
<div id="px-captcha"></div>
<p>很抱歉,此页面读取失败。系统侦测到您的电脑网路发出异常流量。</p> <p>可能会发生下列情况:</p> <ul> <li>Javascript 因某个扩充软件失效或者被阻挡。例如:ad blockers</li> <li>您的浏览器不支持 cookie</li> </ul> <p>请确认开启Javascript 和 cookies,以确保浏览顺利。</p>
<p> Ref ID: #32b1303b-2908-11ec-af10-7267594b6b65 </p>
</div>
</div>
<div class="page-footer-wrapper">
<div class="page-footer">
<p> Powered by <a href="https://www.perimeterx.com/whywasiblocked">PerimeterX</a> , Inc </p>
</div>
</div>
</section>
</body>
</html>

Why did this happen?
The status code returned for our request is 403, which means "the server understood the request, but the client is not authorized to access this resource". In other words, the site figured out that we are a bot and blocked us!
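
As a quick illustration (a minimal sketch, not part of the original script), you can branch on response.status_code to detect this kind of block before handing the HTML to BeautifulSoup:

import requests

response = requests.get('https://cookpad.com/tw')

# 403 means the server refused the request, most likely bot detection
if response.status_code == 403:
    print('Blocked by the site (HTTP 403); try sending a browser-like User-Agent')
elif response.ok:
    print('Fetched the page successfully')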

So what can we do? Does that mean scraping is off the table?

Of course not! We won't be beaten that easily. If the site can tell we are a bot, we simply send it a fake header (specifically, a browser-like User-Agent) so it believes a real person is browsing!

import requests
from bs4 import BeautifulSoup

# Use a fake User-Agent header
user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0)'
headers = {'User-Agent': user_agent}

response = requests.get('https://cookpad.com/tw', headers=headers)
print(response.status_code)

soup = BeautifulSoup(response.content, "html.parser")
print(soup)

With the fake header in place, we can successfully fetch the page content! The full output is far too long, so only the beginning is shown here:

200
<!DOCTYPE html>

<html class="js js--off" dir="ltr" lang="zh">
<head>
<title>Cookpad 全球最大食谱社群-超过5百万道家常料理|天天享受烹饪趣!</title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport">
<meta content="Cookpad" property="og:site_name"/>
<meta content="284DFFB29E8DABE16C08409C5C68F3C6" name="msvalidate.01"/>
<meta content="ZpmT8328xJFtRmIaSnJmAEnseeQQik7RTa2VfTs14ag" name="google-site-verification"/>
<meta content="Cookpad 全球最大食谱社群-超过5百万道家常料理|天天享受烹饪趣!" property="og:title"/><meta content="找食谱吗?寻找平台记录自己的私房料理吗?这边的食谱通通任你免费浏览和收藏。欢迎加入这个的料理社群,一起天天享受烹饪趣!" name="description"/><meta content="找食谱吗?寻找平台记录自己的私房料理吗?这边的食谱通通任你免费浏览和收藏。欢迎加入这个的料理社群,一起天天享受烹饪趣!" property="og:description"/><meta content="//assets-global.cpcdn.com/assets/logo_ogp-cd3e10480377d7af945a23f409e7d311ced9cda1984e881875c74e555fadbc2f.png" property="og:image"/><meta content="1200" property="og:image:width"/><meta content="630" property="og:image:height"/><link href="https://cookpad.com/tw" rel="canonical"/><link href="https://cookpad.com/tw" hreflang="zh-tw" rel="alternate"/><meta content="https://cookpad.com/tw" property="og:url"/>
<meta content="authenticity_token" name="csrf-param"/>
<meta content="PCc0KK5zF0zqCXaAx5Rvx3yTrql-2LVAz4BXNrc6NlyYgvzQoFjj5j3gOQQMREFtrrYNHtvLXafr5AfUdGs9ew" name="csrf-token"/>
<script>
//<![CDATA[
window.LOCALE = 'zh-tw'
//]]>
</script>
<link data-turbolinks-track="reload" href="//assets-global.cpcdn.com/packs/css/v2/application-23dbf47e.css" media="all" rel="stylesheet"/>
<link data-turbolinks-track="reload" href="//assets-global.cpcdn.com/packs/css/print-ffebe649.css" media="print" rel="stylesheet"/>
<style media="all" type="text/css">
      [data-visible-to] { display: none; }
      [data-hidden-from-guest] { display: none; }
  </style>
<script type="text/javascript">
        document.documentElement.className = document.documentElement.className.replace("js--off","js--on")
      </script>
<script type="text/javascript">
  window.__webpack_public_path__ = "//assets-global.cpcdn.com/packs/"
</script>
<script data-turbolinks-track="reload" src="//assets-global.cpcdn.com/packs/js/0-77a18e050e7a38661f11.chunk.js"></script>
<script data-turbolinks-track="reload" src="//assets-global.cpcdn.com/packs/js/3-be1654a56f358d39438b.chunk.js"></script>
<script data-turbolinks-track="reload" src="//assets-global.cpcdn.com/packs/js/application-8812d5e182bd85394e80.js"></script>
<script type="text/javascript">
      (function(){
          window._pxAppId = 'PXFqtAw5et';
          var p = document.getElementsByTagName('script')[0],
              s = document.createElement('script');
          s.async = 1;
          s.src = '/FqtAw5et/init.js';
          p.parentNode.insertBefore(s,p);
      }());
  </script>
<link href="https://use.typekit.net/zbz2cyk.css" rel="stylesheet"/>
<script>
  (function(i,s,o,g,r,a,m){i["GoogleAnalyticsObject"]=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,"script","//www.google-analytics.com/analytics.js","ga");
</script>
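
Now that the request succeeds, we can let BeautifulSoup do its job. Below is a minimal sketch of pulling a few things out of the parsed page; the link listing at the end is purely illustrative, since Cookpad's actual markup may change at any time:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0)'
headers = {'User-Agent': user_agent}

response = requests.get('https://cookpad.com/tw', headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# Print the page title we already saw in the output above
print(soup.title.text)

# List the first few links on the page (illustrative only)
for link in soup.find_all('a', href=True)[:10]:
    print(urljoin('https://cookpad.com/tw', link['href']))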

There is actually a package for fake headers too! We can use a package called fake-useragent to generate fake User-Agent strings automatically.

As before, open a terminal and run the following command to install it:

pip install fake-useragent

It is also very easy to use:

from fake_useragent import UserAgent

ua = UserAgent()

From there you can pick whichever User-Agent suits your needs:

# Safari User-Agent
user_agent = ua.safari

# IE User-Agent
user_agent = ua.ie

# Chrome User-Agent
user_agent = ua.chrome

# Randomly chosen User-Agent
user_agent = ua.random

By letting fake-useragent generate fake headers for us automatically, the problem of being recognized as a bot by the site is solved!
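
Putting the pieces together, here is a sketch of the same Cookpad request using a randomly generated User-Agent from fake-useragent instead of a hand-written string:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}  # a randomly chosen, browser-like User-Agent string

response = requests.get('https://cookpad.com/tw', headers=headers)
print(response.status_code)

soup = BeautifulSoup(response.content, 'html.parser')
print(soup.title.text)

ua.random is especially handy when you make many requests, since rotating the User-Agent makes your traffic look less uniform than sending the same string every time.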

