Day27 Gin with Colly

What is Colly?

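Colly is a web crawling framework for Golang. A web crawler, in short, is a tool that automatically collects and parses data from the web.

In this chapter we will look at how to use Colly to collect data from a specific domain and site.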

Installation

go get -u github.com/gocolly/colly

How to Use Colly?

app/crawler/collier.go

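package crawler

import (
	"github.com/gocolly/colly"
	"github.com/sirupsen/logrus"
	"ironman-2021/app/middleware"
)

// Collier visits url and logs every stage of the crawl.
func Collier(url string) {
	c := colly.NewCollector(
		colly.UserAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)"),
	)
	// Called before each request is sent.
	// logrus joins the arguments like fmt.Sprint, so no space separates them here.
	c.OnRequest(func(r *colly.Request) {
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Info("Visiting", r.URL)
	})
	// Called when a request fails; logged at error level.
	c.OnError(func(_ *colly.Response, err error) {
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Error("Visiting Failed, err: ", err)
	})
	// Called when a response arrives; writes the raw body to the log.
	c.OnResponse(func(r *colly.Response) {
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Info("Visited, body: ", string(r.Body))
	})
	// Called once the page has been fully processed.
	c.OnScraped(func(r *colly.Response) {
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Info("Finished", r.Request.URL)
	})
	// Log the error returned by Visit instead of dropping it silently.
	if err := c.Visit(url); err != nil {
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Error("Visit returned an error: ", err)
	}
}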
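  • First we create a Collector instance named c.
  • Then we define what to do at each stage of the crawl. OnResponse writes the crawled result to the log; the other callbacks simply log which stage is running.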

main.go

server.GET("/crawler", func(c *gin.Context) {
	crawler.Collier("https://ithelp.ithome.com.tw/users/20129737/ironman/4014")
	c.String(http.StatusOK, "Finished Collier")
})

Finally, we add a simple GET API to the main program to trigger the crawl.
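As written, Collier returns nothing, so the endpoint answers 200 even when the crawl fails. Below is a minimal sketch of one way to surface failures, assuming Collier were changed to return the error from c.Visit. To stay dependency-free it uses net/http and httptest from the standard library instead of Gin, and crawlHandler, statusFor and errVisit are hypothetical names that do not appear in the original project.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errVisit stands in for an error returned by a crawl (hypothetical).
var errVisit = errors.New("visit failed")

// statusFor maps the result of a crawl to an HTTP status code.
func statusFor(err error) int {
	if err != nil {
		return http.StatusBadGateway
	}
	return http.StatusOK
}

// crawlHandler wraps a crawl function (assumed signature) in an HTTP handler.
func crawlHandler(crawl func(target string) error, target string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		err := crawl(target)
		w.WriteHeader(statusFor(err))
		if err != nil {
			fmt.Fprintf(w, "crawl failed: %v", err)
			return
		}
		fmt.Fprint(w, "Finished Collier")
	}
}

func main() {
	// A crawl function that always fails, to exercise the error path.
	failing := func(string) error { return errVisit }

	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/crawler", nil)
	crawlHandler(failing, "https://example.com")(rec, req)

	fmt.Println(rec.Code, rec.Body.String())
}
```

The same mapping could be applied inside the Gin handler by passing the chosen status code to c.String.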

logs/2021-10-10.log

time="101010-10-10 1010:1010:1010" level=info msg="Health CheckInfo" name="Flynn Sun"
time="101010-10-10 1010:1010:1010" level=info msg="| 200 |      5.3626ms |      172.19.0.1 | GET | /hc |"
time="101010-10-10 1010:1010:1010" level=info msg="Visitinghttps://ithelp.ithome.com.tw/articles/10279931" name=Collier
time="101010-10-10 1010:1010:1010" level=info msg="Visited, body: <!DOCTYPE html>\n<html lang=\"zh-TW\">\n\n<head>\n    <meta charset=\"utf-8\">\n<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">\n<meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n\n\n<title>Day25 Gin with API Test - iT 邦帮忙::一起帮忙解决难题,拯救 IT 人的一天</title>\n\n<meta name=\"description\" content=\"What is API Test? 我们可以把它想成Unit Test单元测试的一种,不过它所涵盖的最好集合不像以往的UnitTest可能以Function为主,而是Endpoint。 透过API T...\"/>\n<meta name=\"keywords\" content=\"iT邦帮忙,iThome\">\n<meta name=\"author\" content=\"iThome\">\n<meta property=\"og:site_name\" content=\"iT 邦帮忙::一起帮忙解决难题,拯救 IT 人的一天\"/>\n<meta property=\"og:url\" content=\"https://ithelp.ithome.com.tw/articles/10279931\"/>\n<meta property=\"og:type\" content=\"website\"/>\n<meta property=\"og:title\" content=\"Day25 Gin with API Test - iT 邦帮忙::一起帮忙解决难题,拯救 IT 人的一天\"/>\n<meta property=\"og:image\" content=\"https://ithelp.ithome.com.tw/upload/images/20211010/20129737oKVtf3CBHN.png\"/>\n<meta property=\"og:description\" content=\"What is API Test? 
我们可以把它想成Unit Test单元测试的一种,不过它所涵盖的最好集合不像以往的UnitTest可能以Function为主,而是Endpoint。 透过API T...\"/>\n<meta property=\"fb:app_id\" content=\"137875859607921\" />\n\n<link rel=\"apple-touch-icon\" sizes=\"57x57\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-57x57.png\">\n<link rel=\"apple-touch-icon\" sizes=\"60x60\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-60x60.png\">\n<link rel=\"apple-touch-icon\" sizes=\"72x72\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-72x72.png\">\n<link rel=\"apple-touch-icon\" sizes=\"76x76\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-76x76.png\">\n<link rel=\"apple-touch-icon\" sizes=\"114x114\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-114x114.png\">\n<link rel=\"apple-touch-icon\" sizes=\"120x120\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-120x120.png\">\n<link rel=\"apple-touch-icon\" sizes=\"144x144\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-144x144.png\">\n<link rel=\"apple-touch-icon\" sizes=\"152x152\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-152x152.png\">\n<link rel=\"apple-touch-icon\" sizes=\"180x180\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-180x180.png\">\n<link rel=\"icon\" type=\"image/png\" href=\"https://ithelp.ithome.com.tw/storage/favicons/favicon-32x32.png\" sizes=\"32x32\">\n<link rel=\"icon\" type=\"image/png\"
...
href=\"https://ithelp.ithome.com.tw/storage/favicons/android-chrome-192x192.png\" sizes=\"192x192\">\n<link rel=\"icon\" type=\"image/png\" v>\n                                <div><a href=\"#\" class=\"invitation-list__account\">{{ result.account }}</a>\n                                </div>\n                            </div>\n                        </li>\n                    </ul>\n                </div>\n                <div class=\"modal-footer\">\n                    <a type=\"button\" class=\"btn btn-main\" data-dismiss=\"modal\">关闭</a>\n                </div>\n            </div>\n        </div>\n    </div>\n    </body>\n\n</html>" name=Collier
time="101010-10-10 1010:1010:1010" level=info msg="Finishedhttps://ithelp.ithome.com.tw/articles/10279931" name=Collier
time="101010-10-10 1010:1010:1010" level=info msg="| 200 |    623.3349ms |      172.19.0.1 | GET | /crawler |"

In the end, the crawl record shows up in the log.
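One detail in the log above: the message reads "Visitinghttps://..." with no space after "Visiting". logrus formats the arguments of Info the way fmt.Sprint does, and fmt.Sprint only inserts a space between two operands when neither of them is a string. A minimal sketch using only the standard library; visitMessage is a hypothetical helper and the URL is a placeholder.

```go
package main

import (
	"fmt"
	"net/url"
)

// visitMessage builds a message the way a Sprint-style logger would:
// the prefix and the parsed URL are passed straight to fmt.Sprint.
func visitMessage(prefix, rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return prefix + rawURL
	}
	return fmt.Sprint(prefix, u)
}

func main() {
	const page = "https://example.com/page" // placeholder URL

	fmt.Println(visitMessage("Visiting", page))   // Visitinghttps://example.com/page
	fmt.Println(visitMessage("Visiting: ", page)) // Visiting: https://example.com/page
}
```

Putting the separator into the message string itself, as in "Visiting: ", is enough to make the log line readable.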

Dig deeper

Next comes a more demanding crawl.

First, look at the structure of the page we crawled above:

(https://ithelp.ithome.com.tw/users/20129737/ironman/4014)

<!DOCTYPE html>
<html lang="zh-TW">

<head>
    <meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">

<title>fmt.Println(&quot;从零开始的Golang生活&quot;) :: 2021 iThome 铁人赛</title>

<meta name="description" content="讲述一位Python Developer如何从零开始学习Go,并透过该角度进行解析。"/>
<meta name="keywords" content="iT邦帮忙,iThome">
<meta name="author" content="iThome">
<meta property="og:site_name" content="iT 邦帮忙::一起帮忙解决难题,拯救 IT 人的一天"/>
<meta property="og:url" content="https://ithelp.ithome.com.tw/users/20129737/ironman/4014"/>
<meta property="og:type" content="website"/>
<meta property="og:title" content="fmt.Println(&quot;从零开始的Golang生活&quot;) :: 2021 iThome 铁人赛"/>
<meta property="og:image" content="https://ithelp.ithome.com.tw/images/ironman/13th/fb.jpg"/>
<meta property="og:description" content="讲述一位Python Developer如何从零开始学习Go,并透过该角度进行解析。"/>
<meta property="fb:app_id" content="137875859607921" />

<link rel="apple-touch-icon" sizes="57x57" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-57x57.png">
<link rel="apple-touch-icon" sizes="60x60" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-60x60.png">
<link rel="apple-touch-icon" sizes="72x72" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-72x72.png">
<link rel="apple-touch-icon" sizes="76x76" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-76x76.png">
<link rel="apple-touch-icon" sizes="114x114" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-114x114.png">
<link rel="apple-touch-icon" sizes="120x120" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-120x120.png">
<link rel="apple-touch-icon" sizes="144x144" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-144x144.png">
<link rel="apple-touch-icon" sizes="152x152" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-152x152.png">
<link rel="apple-touch-icon" sizes="180x180" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-180x180.png">
<link rel="icon" type="image/png" href="https://ithelp.ithome.com.tw/storage/favicons/favicon-32x32.png" sizes="32x32">
<link rel="icon" type="image/png" href="https://ithelp.ithome.com.tw/storage/favicons/android-chrome-192x192.png" sizes="192x192">
<link rel="icon" type="image/png" href="https://ithelp.ithome.com.tw/storage/favicons/favicon-96x96.png" sizes="96x96">
<link rel="icon" type="image/png" href="https://ithelp.ithome.com.tw/storage/favicons/favicon-16x16.png" sizes="16x16">
<link rel="manifest" href="https://ithelp.ithome.com.tw/storage/favicons/manifest.json">
<link rel="mask-icon" href="https://ithelp.ithome.com.tw/storage/favicons/safari-pinned-tab.svg" color="#5bbad5">
<meta name="msapplication-TileColor" content="#da532c">
<meta name="msapplication-TileImage" content="https://ithelp.ithome.com.tw/storage/favicons/mstile-144x144.png">
<meta name="theme-color" content="#ffffff">

<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/bootstrap.min.css">
<link rel="stylesheet" href="//ajax.googleapis.com/ajax/libs/jqueryui/1.11.3/themes/smoothness/jquery-ui.css"/>
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css">
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Lato:400,700">
<link rel="stylesheet" href="//cdn.jsdelivr.net/simplemde/latest/simplemde.min.css">
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/sweetalert.css">
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/lib/select2/css/select2.min.css">
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/google.css">
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/style.css?202008271142">
<!-- highlight -->
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/railscasts.css">
<!-- end -->
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<!--messenger css-->
    </head>

<body>
    <div class="header">
    <div class="header__inner clearfix">
        <h1 class="header__logo pull-left"><a href="/"><img src="https://ithelp.ithome.com.tw/storage/image/logo.svg" alt="iT邦帮忙" class="img-responsive"></a></h1>
        <div class="header__promote">
            <div class="a12word pull-right">
                <div class="a12word__box">
                    <script type="text/javascript" src="https://itadapi.ithome.com.tw/media/serve?type=T2&channel=ithome_forum&encoding=Utf8"> </script>
                </div>
                <div class="a12word__box">
                    <script type="text/javascript" src="https://itadapi.ithome.com.tw/media/serve?type=T3&channel=ithome_forum&encoding=Utf8"> </script>
                </div>
                <div class="a12word__box">
                    <script type="text/javascript" src="https://itadapi.ithome.com.tw/media/serve?type=T4&channel=ithome_forum&encoding=Utf8"> </script>
                </div>
            </div>

            <div class="a970 pull-right">
                <script src="https://itadapi.ithome.com.tw/media/serve?type=B1&channel=ithome_forum&encoding=Utf8"></script>
            </div>
        </div>
    </div>
.............
    </body>

</html>

Suppose we only want to extract the title of each Ironman article. The titles always sit under the following path:

<body><div class="board leftside profile-main"><div class="ir-profile-content"><div class="profile-list__content">

Every <div class="profile-list__content"> contains

<h3 class="qa-list__title"><a class="qa-list__title-link"> title </a>

...
<body>
	...
	<div class="board leftside profile-main">
		<div class="ir-profile-content">
			...
			<div class="profile-list__content">
				...
				<h3 class="qa-list__title">
					<a href="https://ithelp.ithome.com.tw/articles/10267570" class="qa-list__title-link">
						Day4 Variable
					</a>
				</h3>
			...
		</div>
		...
...

So we use XPath to parse the page and pull out the titles we want.

app/crawler/collier.go


// Needs the strings package and an htmlquery import (presumably github.com/antchfx/htmlquery).
c.OnResponse(func(r *colly.Response) {
	doc, err := htmlquery.Parse(strings.NewReader(string(r.Body)))
	if err != nil {
		// Log and stop here; Fatal would exit the process and take the server down.
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Error("Visited failed, error: ", err)
		return
	}
	// Each profile-list__content block holds one article entry.
	titles := htmlquery.Find(doc, `//div[@class="board leftside profile-main"]//div[@class="ir-profile-content"]//div[@class="profile-list__content"]`)
	for _, node := range titles {
		title := htmlquery.FindOne(node, `//h3[@class="qa-list__title"]//a[@class="qa-list__title-link"]/text()`)
		if title == nil {
			continue // this block has no title link
		}
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Info("Title: ", htmlquery.InnerText(title))
	}
})
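  • When the response arrives, we first locate every div[@class="profile-list__content"] in order.
  • Then a for loop finds the title within each of those scopes.
  • Finally, htmlquery.InnerText() turns the node into a string, which is written to the log.

The data written to the log looks like this: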

time="111110-10-10 1010:1010:1010" level=info msg="Visiting: https://ithelp.ithome.com.tw/users/20129737/ironman/4014" name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day1 Why Go?\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day2 Develop Environment For Go\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day3 First Go application\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day4 Variable\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day5 Type\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day6 Array and Slice\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day7 Map and Struct\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day8 Function and Interface\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day9 Goroutine\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day10 Sync.WaitGroup & Sync.Map\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Finished: https://ithelp.ithome.com.tw/users/20129737/ironman/4014" name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="| 200 |    692.5716ms |    192.168.16.1 | GET | /crawler |"
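The titles in this log still carry the newlines and indentation of the HTML source. Here is a small sketch of one way to tidy them before logging, using only the standard library; cleanTitle is a hypothetical helper that is not part of the original code.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanTitle collapses every run of whitespace (spaces, tabs, newlines)
// into a single space and drops leading and trailing whitespace.
func cleanTitle(raw string) string {
	return strings.Join(strings.Fields(raw), " ")
}

func main() {
	// A title shaped like the ones in the log above.
	raw := "\n        Day10 Sync.WaitGroup & Sync.Map\n      "
	fmt.Printf("%q\n", cleanTitle(raw)) // "Day10 Sync.WaitGroup & Sync.Map"
}
```

In the crawler, the call would wrap the extracted text, for example cleanTitle(htmlquery.InnerText(title)).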

Summary

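In this chapter we used Colly to crawl an Ironman series page and print all of its titles. The next time we need to crawl a specific domain or data set, we are not limited to Python; Go is a good choice too!

The code for this chapter is available at the link below for reference.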

https://github.com/Neskem/Ironman-2021/tree/Day-27

