Now that we know what requests and BeautifulSoup can do, let's put them together! Next, we'll use the recipe site cookpad as our scraping target.
import requests
from bs4 import BeautifulSoup
response = requests.get('https://cookpad.com/tw')
print(response.status_code)
soup = BeautifulSoup(response.content, "html.parser")
print(soup)
403
<!DOCTYPE html>
<html dir="ltr" lang="zh-TW">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>很抱歉,此页面读取失败。</title>
<script>
if (typeof Turbolinks !== 'undefined') {
location.reload();
}
</script>
<!-- Px -->
<script>
window._pxAppId = 'PXFqtAw5et';
window._pxJsClientSrc = '/FqtAw5et/init.js';
window._pxFirstPartyEnabled = true;
window._pxVid = '';
window._pxUuid = '32b1303b-2908-11ec-af10-7267594b6b65';
window._pxHostUrl = '/FqtAw5et/xhr';
window._PXFqtAw5et = {
locale: 'zh-TW',
translation: {
'zh-TW': [
{
"selector": "#px-form-head span",
"text": "遇到问题 ? 请提供更多资讯"
},
{
"selector": "#px-form div label[for=opt1]",
"text": "我没有看到任何验证码"
},
{
"selector": "#px-form div label[for=opt2]",
"text": "我已解决验证码问题,但又出现另一组验证码"
},
{
"selector": "#px-form div label[for=opt3]",
"text": "我已解决多个验证码问题,但仍无法进入该连结"
},
{
"selector": "#px-form div label[for=opt4]",
"text": "其他(请详细说明)"
},
{
"selector": "#px-form h4:nth-of-type(1)",
"text": "附加资讯:"
},
{
"selector": "#px-form-submit",
"text": "发送"
}
]
}
};
</script>
<script defer="" src="/FqtAw5et/captcha/captcha.js?a=c&u=32b1303b-2908-11ec-af10-7267594b6b65&v=&m=0"></script>
<!-- Custom Script -->
<link href="https://fonts.googleapis.com/css?family=Open+Sans:300" rel="stylesheet"/>
<style>
html, body {
margin: 0;
padding: 0;
font-family: 'Open Sans', sans-serif;
color: #000;
}
a {
color: #c5c5c5;
text-decoration: none;
}
.container {
align-items: center;
display: flex;
flex: 1;
justify-content: space-between;
flex-direction: column;
height: 100%;
}
.container > div {
width: 100%;
display: flex;
justify-content: center;
}
.container > div > div {
display: flex;
width: 80%;
}
.customer-logo-wrapper {
padding-top: 2rem;
flex-grow: 0;
background-color: #fff;
visibility: (null);
}
.customer-logo {
border-bottom: 1px solid #000;
}
.customer-logo > img {
padding-bottom: 1rem;
max-height: 50px;
max-width: 100%;
}
.page-title-wrapper {
flex-grow: 2;
}
.page-title {
flex-direction: column-reverse;
}
.content-wrapper {
flex-grow: 5;
}
.content {
flex-direction: column;
}
.page-footer-wrapper {
align-items: center;
flex-grow: 0.2;
background-color: #000;
color: #c5c5c5;
font-size: 70%;
}
@media (min-width: 768px) { html, body { height: 100%; } }
</style>
<!-- Custom CSS -->
</head>
<body>
<section class="container">
<div class="customer-logo-wrapper">
<div class="customer-logo"><img alt="Logo" src="https://assets-global.cpcdn.com/assets/logo_cookpad_large-827bc0b34d5c7ab322d3ff8de882e9f828d06bc5ae46d09c88d25aaf02686132.png"/></div>
</div>
<div class="page-title-wrapper">
<div class="page-title">
<h1>请确认您不是机器人</h1>
</div>
</div>
<div class="content-wrapper">
<div class="content">
<div id="px-captcha"></div>
<p>很抱歉,此页面读取失败。系统侦测到您的电脑网路发出异常流量。</p> <p>可能会发生下列情况:</p> <ul> <li>Javascript 因某个扩充软件失效或者被阻挡。例如:ad blockers</li> <li>您的浏览器不支持 cookie</li> </ul> <p>请确认开启Javascript 和 cookies,以确保浏览顺利。</p>
<p> Ref ID: #32b1303b-2908-11ec-af10-7267594b6b65 </p>
</div>
</div>
<div class="page-footer-wrapper">
<div class="page-footer">
<p> Powered by <a href="https://www.perimeterx.com/whywasiblocked">PerimeterX</a> , Inc </p>
</div>
</div>
</section>
</body>
</html>
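Why did this happen?
The request came back with status code 403, which means "the server understood the request, but the client does not have permission to access the resource." In other words, the site figured out that we're a bot and blocked us!
So what now? Does this mean we can't scrape at all?
Of course not! We're not going to be beaten that easily! There's always a way around: since the site has spotted that we're a bot, we'll send it a fake header so that it thinks a real person is browsing!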
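To make the status check concrete, here is a minimal sketch that turns a status code into a readable message using only the standard library. The helper names `describe_status` and `looks_blocked` are made up for this illustration, and treating 403 and 429 as "probably blocked" is a rule of thumb, not something the site documents.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code with its standard reason phrase, e.g. '403 Forbidden'."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Not a status code the standard library knows about.
        return f"{code} (unrecognized status)"


def looks_blocked(code: int) -> bool:
    """Heuristic (an assumption): 403 and 429 often mean the site refused a bot."""
    return code in (HTTPStatus.FORBIDDEN, HTTPStatus.TOO_MANY_REQUESTS)


for code in (200, 403):
    print(describe_status(code), "-> blocked?", looks_blocked(code))
```

In a real script you would pass `response.status_code` to these helpers before parsing. requests also offers `response.raise_for_status()`, which raises an `HTTPError` for 4xx and 5xx responses.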
import requests
from bs4 import BeautifulSoup
# Use a fake header
user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0)'
headers = {'User-Agent': user_agent}
response = requests.get('https://cookpad.com/tw', headers=headers)
print(response.status_code)
soup = BeautifulSoup(response.content, "html.parser")
print(soup)
With the fake header we can scrape the page content successfully!! There's far too much of it, so I haven't pasted the whole thing here~
200
<!DOCTYPE html>
<html class="js js--off" dir="ltr" lang="zh">
<head>
<title>Cookpad 全球最大食谱社群-超过5百万道家常料理|天天享受烹饪趣!</title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport">
<meta content="Cookpad" property="og:site_name"/>
<meta content="284DFFB29E8DABE16C08409C5C68F3C6" name="msvalidate.01"/>
<meta content="ZpmT8328xJFtRmIaSnJmAEnseeQQik7RTa2VfTs14ag" name="google-site-verification"/>
<meta content="Cookpad 全球最大食谱社群-超过5百万道家常料理|天天享受烹饪趣!" property="og:title"/><meta content="找食谱吗?寻找平台记录自己的私房料理吗?这边的食谱通通任你免费浏览和收藏。欢迎加入这个的料理社群,一起天天享受烹饪趣!" name="description"/><meta content="找食谱吗?寻找平台记录自己的私房料理吗?这边的食谱通通任你免费浏览和收藏。欢迎加入这个的料理社群,一起天天享受烹饪趣!" property="og:description"/><meta content="//assets-global.cpcdn.com/assets/logo_ogp-cd3e10480377d7af945a23f409e7d311ced9cda1984e881875c74e555fadbc2f.png" property="og:image"/><meta content="1200" property="og:image:width"/><meta content="630" property="og:image:height"/><link href="https://cookpad.com/tw" rel="canonical"/><link href="https://cookpad.com/tw" hreflang="zh-tw" rel="alternate"/><meta content="https://cookpad.com/tw" property="og:url"/>
<meta content="authenticity_token" name="csrf-param"/>
<meta content="PCc0KK5zF0zqCXaAx5Rvx3yTrql-2LVAz4BXNrc6NlyYgvzQoFjj5j3gOQQMREFtrrYNHtvLXafr5AfUdGs9ew" name="csrf-token"/>
<script>
//<![CDATA[
window.LOCALE = 'zh-tw'
//]]>
</script>
<link data-turbolinks-track="reload" href="//assets-global.cpcdn.com/packs/css/v2/application-23dbf47e.css" media="all" rel="stylesheet"/>
<link data-turbolinks-track="reload" href="//assets-global.cpcdn.com/packs/css/print-ffebe649.css" media="print" rel="stylesheet"/>
<style media="all" type="text/css">
[data-visible-to] { display: none; }
[data-hidden-from-guest] { display: none; }
</style>
<script type="text/javascript">
document.documentElement.className = document.documentElement.className.replace("js--off","js--on")
</script>
<script type="text/javascript">
window.__webpack_public_path__ = "//assets-global.cpcdn.com/packs/"
</script>
<script data-turbolinks-track="reload" src="//assets-global.cpcdn.com/packs/js/0-77a18e050e7a38661f11.chunk.js"></script>
<script data-turbolinks-track="reload" src="//assets-global.cpcdn.com/packs/js/3-be1654a56f358d39438b.chunk.js"></script>
<script data-turbolinks-track="reload" src="//assets-global.cpcdn.com/packs/js/application-8812d5e182bd85394e80.js"></script>
<script type="text/javascript">
(function(){
window._pxAppId = 'PXFqtAw5et';
var p = document.getElementsByTagName('script')[0],
s = document.createElement('script');
s.async = 1;
s.src = '/FqtAw5et/init.js';
p.parentNode.insertBefore(s,p);
}());
</script>
<link href="https://use.typekit.net/zbz2cyk.css" rel="stylesheet"/>
<script>
(function(i,s,o,g,r,a,m){i["GoogleAnalyticsObject"]=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,"script","//www.google-analytics.com/analytics.js","ga");
</script>
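Printing the whole soup gets unwieldy quickly, so in practice you would usually pull out just the pieces you need. The sketch below parses a few tags copied from the output above (inlined here so nothing is fetched) and reads the title, the site name, and the canonical URL. Which fields to extract is only an example choice.

```python
from bs4 import BeautifulSoup

# A few tags copied from the response above, inlined so no request is sent.
html = """
<html><head>
<title>Cookpad 全球最大食谱社群-超过5百万道家常料理|天天享受烹饪趣!</title>
<meta content="Cookpad" property="og:site_name"/>
<link href="https://cookpad.com/tw" rel="canonical"/>
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text inside the <title> tag.
page_title = soup.title.string
# find() returns the first matching tag; ["content"] reads its attribute.
site_name = soup.find("meta", attrs={"property": "og:site_name"})["content"]
canonical_url = soup.find("link", attrs={"rel": "canonical"})["href"]

print(page_title)
print(site_name)
print(canonical_url)
```

Against the live site you would build `soup` from `response.content` exactly as in the code earlier, and the same lookups should apply as long as the page still contains those tags.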
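There's also a package for generating fake headers! We can use one called fake-useragent to produce a fake User-Agent header automatically.
As before, open a terminal and run the following command to install it: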
pip install fake-useragent
It's simple to use, too:
from fake_useragent import UserAgent
ua = UserAgent()
Then pick whichever header you need:
# Safari UA
user_agent = ua.safari
# IE UA
user_agent = ua.ie
# Chrome UA
user_agent = ua.chrome
# Randomly generated UA
user_agent = ua.random
With fake-useragent generating the fake header for us, we can get around the site recognizing us as a bot!
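Putting the two pieces together, here is one possible sketch of a small helper that builds the headers dictionary for requests. The names `pick_user_agent`, `build_headers`, and `FALLBACK_USER_AGENT` are hypothetical, and falling back to the fixed User-Agent string from earlier when fake-useragent is missing or fails is an assumption added for robustness, not part of the package itself.

```python
# Fixed fallback: the User-Agent string used earlier in this article.
FALLBACK_USER_AGENT = (
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; "
    "Trident/5.0; IEMobile/9.0)"
)


def pick_user_agent() -> str:
    """Return a random UA from fake-useragent, or the fallback if that fails."""
    try:
        from fake_useragent import UserAgent  # optional dependency
        return UserAgent().random
    except Exception:
        # Package not installed, or it could not load its browser data.
        return FALLBACK_USER_AGENT


def build_headers(user_agent=None) -> dict:
    """Build a headers dict; an explicit user_agent overrides the random one."""
    return {"User-Agent": user_agent or pick_user_agent()}


print(build_headers())
```

The result can be passed straight to the call used earlier, for example `requests.get('https://cookpad.com/tw', headers=build_headers())`.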