Day 9: Setting Up Prometheus (1)

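Yesterday we got Prometheus to scrape a few metrics. To understand the state of our own services, though, we need to provide metrics ourselves. For a web server, that could mean metrics about HTTP requests, hardware information about the machine, and information about the database; all of these are important for monitoring a service.

So how do we produce metrics ourselves? This is called instrumentation (I have not found a good Traditional Chinese translation yet; Simplified Chinese sources seem to render it as 「检测」). There are two approaches: exporting through a client library, or going through an exporter / integration. An exporter is usually an executable that collects information for us and converts it into a format Prometheus accepts. Some tools also expose Prometheus metrics natively, in which case no extra exporter is needed; Caddy, which I use a lot, is one example.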
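To make the idea concrete, here is a minimal sketch of what an exporter or client library ultimately hands to Prometheus: plain text in the Prometheus exposition format. This is not how a real client library is implemented; the function name, the metric name, and the label values are all made up for illustration.

```python
def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Label values are not escaped here, so this sketch only suits simple strings.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        # Only emit braces when there is at least one label.
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric, for illustration only.
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        {(("method", "GET"), ("code", "200")): 42},
    ), end="")
```

A real setup would serve this text over HTTP on a path such as /metrics so that Prometheus can scrape it.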

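Setting up node exporter

Let's start by setting up an exporter to expose some metrics. Here I document the process for node exporter.

node exporter mainly exposes information about the machine itself, including CPU, memory, and disk usage. If my write-up is not clear enough, you can also refer to the tutorial in Grafana's official documentation. (I deploy with Docker; otherwise, Prometheus also provides a version that runs directly on the machine.)

Deployment is straightforward: just add this config under services in docker-compose.yml: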

node-exporter:
  image: prom/node-exporter:v1.2.2
  restart: unless-stopped
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
  command:
    - '--path.procfs=/host/proc'
    - '--path.rootfs=/rootfs'
    - '--path.sysfs=/host/sys'
    - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    - '--no-collector.arp'
    - '--no-collector.netstat'
    - '--no-collector.netdev'
    - '--no-collector.softnet'

Then add node exporter as a target in prometheus.yml:

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  # Add this!
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]

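Now restart the services (docker-compose up -d) and you should see the metrics exposed by node exporter. For example, I queried the disk usage. (I am on WSL2, which explains the mount paths that show up.)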
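A disk usage query like this can also be sent to the Prometheus HTTP API instead of typed into the UI. The sketch below only builds the request URL. node_filesystem_avail_bytes and node_filesystem_size_bytes are metrics node exporter normally exposes, but the expression itself, the helper name, and the localhost:9090 address are my assumptions, not the exact query I ran.

```python
from urllib.parse import urlencode

# Assumed PromQL expression: the fraction of each filesystem that is in use.
DISK_USAGE_QUERY = "1 - node_filesystem_avail_bytes / node_filesystem_size_bytes"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Assumes Prometheus is reachable on localhost:9090.
    print(build_query_url("http://localhost:9090", DISK_USAGE_QUERY))
```

Opening the printed URL with curl or a browser should return the result as JSON, provided Prometheus is reachable at that address.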

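Some issues with deploying node exporter from the Docker image

Next, let's talk about the problems you run into when using Docker. After today's experience, I really think the README's advice against using Docker is reasonable; I ran into quite a bit of trouble. Still, since I prefer to manage all of these services with Docker, that is what I ended up doing.

(Docker Desktop for Windows/Mac and Docker EE for Windows Server only) Host networking is not available

The README in the repo says network_mode should be host, but after starting the container I could not connect to it no matter what I tried, while the same setup worked fine on a Linux machine. It took me a long time to find this passage in the Docker documentation: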

The host networking driver only works on Linux hosts, and is not supported on Docker Desktop for Mac, Docker Desktop for Windows, or Docker EE for Windows Server.

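So not every flavor of Docker supports host mode... I had never noticed that before.

Multiple bind-mount volumes are required

The example in the GitHub repo only mounts /, whereas the Grafana tutorial mounts three paths. I figured /proc and /sys would be included under / anyway, and the README also says: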

The node_exporter will use path.rootfs as prefix to access host filesystem.

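So at first I only mounted / at /rootfs. Later I noticed that some collectors did not seem to be working, and after rereading the documentation carefully I found this sentence: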

Be aware that any non-root mount points you want to monitor will need to be bind-mounted into the container.

In other words, each path has to be mounted separately. Looking through the current source code, there appear to be only these three options:

path.procfs
path.sysfs
path.rootfs

var (
	// The path of the proc filesystem.
	procPath   = kingpin.Flag("path.procfs", "procfs mountpoint.").Default(procfs.DefaultMountPoint).String()
	sysPath    = kingpin.Flag("path.sysfs", "sysfs mountpoint.").Default("/sys").String()
	rootfsPath = kingpin.Flag("path.rootfs", "rootfs mountpoint.").Default("/").String()
)

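As for why they need to be mounted separately: if you are curious and do not know the reason, look up what procfs and sysfs are.

Some collectors are unavailable (without host networking)

Some of you may have noticed that I disabled a few collectors in the config above. The reason is that those collectors need to access /host/proc/net (that is, /proc/net on the host). If you leave them enabled, you may find errors like these in the node exporter log: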

node-exporter_1  | level=error ts=2021-09-23T12:48:38.490Z caller=collector.go:169 msg="collector failed" name=netdev duration_seconds=2.49e-05 err="couldn't get netstats: open /host/proc/net/dev: no such file or directory"
node-exporter_1  | level=error ts=2021-09-23T12:48:38.491Z caller=collector.go:169 msg="collector failed" name=softnet duration_seconds=2.45e-05 err="could not get softnet statistics: open /host/proc/net/softnet_stat: no such file or directory"
node-exporter_1  | level=error ts=2021-09-23T12:48:38.492Z caller=collector.go:169 msg="collector failed" name=netstat duration_seconds=1.43e-05 err="couldn't get netstats: open /host/proc/net/netstat: no such file or directory"
node-exporter_1  | level=error ts=2021-09-23T12:48:38.492Z caller=collector.go:169 msg="collector failed" name=arp duration_seconds=0.0001197 err="could not get ARP entries: open /host/proc/net/arp: no such file or directory"

Those files are visible on the host, but they cannot be found inside the container. I later found the following passage in the Metricbeat documentation:

The system network metricset uses data from /proc/net/dev, or /hostfs/proc/net/dev when using -system.hostfs=/hostfs. The only way to make this file contain the host’s network devices is to use the --net=host flag. This is due to Linux namespacing; simply bind mounting the host’s /proc to /hostfs/proc is not sufficient.

This is not about node exporter, but it shows that accessing /proc/net requires the container to use the host network.
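One way to see the effect of network namespaces is to compare which interfaces /proc/net/dev lists inside the container and on the host. The sketch below parses that file's format. The helper name and the sample text are made up for illustration, and this is not how node exporter itself reads the file.

```python
# Made-up sample in the layout of /proc/net/dev: two header lines, then
# one line per interface with 8 receive and 8 transmit counters.
SAMPLE = (
    "Inter-|   Receive                |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
    "    lo: 100 1 0 0 0 0 0 0 100 1 0 0 0 0 0 0\n"
    "  eth0: 2048 16 0 0 0 0 0 0 4096 32 0 0 0 0 0 0\n"
)


def parse_net_dev(text: str) -> dict:
    """Return {interface: {"rx_bytes": ..., "tx_bytes": ...}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue
        # Field 0 is received bytes, field 8 is transmitted bytes.
        stats[name.strip()] = {
            "rx_bytes": int(fields[0]),
            "tx_bytes": int(fields[8]),
        }
    return stats


if __name__ == "__main__":
    # Linux only: reads the file for the current network namespace.
    with open("/proc/net/dev") as f:
        print(sorted(parse_net_dev(f.read())))
```

Run on the host and then inside a container that does not use host networking, the two interface lists should differ, which is the namespacing the Metricbeat passage describes.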

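Exporting Caddy metrics

Besides machine metrics, we probably also want web server metrics. I originally planned to export gunicorn metrics through statsd exporter, but then realized that Caddy already provides metrics itself, and it can record the frontend traffic along the way too.

Following the official documentation, I first tried to hit http://caddy:2019/metrics from another container on the same bridge network, only to find that Caddy's admin API listens on localhost only. In that case, I will just expose the metrics on a separate port myself.

Add the following block to the Caddyfile, and the metrics become available at http://caddy:3939/metrics.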

:3939 {
        metrics /metrics
}

A quick test with curl looks fine.
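The same check can be scripted. This is a minimal sketch, assuming the endpoint from the Caddyfile block is reachable at http://caddy:3939/metrics from wherever the script runs. The helper names are hypothetical.

```python
from urllib.request import urlopen


def metric_names(body: str, prefix: str = "caddy_") -> set:
    """Collect metric names starting with `prefix` from a scrape response body."""
    names = set()
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return names


def fetch_metric_names(url: str) -> set:
    """Fetch a metrics endpoint and return the Caddy metric names it exposes."""
    with urlopen(url, timeout=5) as resp:
        return metric_names(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Assumed address: only resolvable from a container on the same network.
    print(sorted(fetch_metric_names("http://caddy:3939/metrics")))
```

An empty result would suggest the metrics block is not being served on that port.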

Next, update prometheus.yml so that it scrapes the Caddy metrics:

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
  # Add this!
  - job_name: caddy
    static_configs:
      - targets: ["caddy:3939"]

After restarting the services, open the Prometheus UI and type caddy; some metrics should show up.

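Summary

I had hoped to cover Alertmanager today as well, but dealing with node exporter took too much time. Unless there is a particular reason, it really does seem better not to deploy it with Docker. Maybe later I can look into how to manage a component deployed directly on the host when every other service is deployed with Docker Compose.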
