以Postgresql为主,再聊聊资料库 递回 又见递回


情境:有许多 sensor, 一直传送资料,我们想取得最新的资料.

create unlogged table it201011c (
  id bigint not null primary key
, sensor_id int not null
, val int not null
);

insert into it201011c
select n
     , ceil(random() * 1e5)
     , random() * 100 
  from generate_series(1,1e7) n;

输入一千万笔资料,注意到这次使用了 unlogged table.
为了方便起见,id 越大,代表资料越新.
实务上可以使用id或是时间戳.

一般的作法,利用 window function 的 row_number() , rank()

select id, sensor_id, val
  from (select *
             , row_number() over(partition by sensor_id order by id desc) rn
          from it201011c) a
 where a.rn = 1;
 
(100000 rows)
Time: 14941.510 ms (00:14.942)
取得十万笔资料,速度还不赖.

explain (analyze,timing,costs)
select id, sensor_id, val
  from (select *
             , row_number() over(partition by sensor_id order by id desc) rn
          from it201011c) a
 where a.rn = 1;
 
+-------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                QUERY PLAN                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| Subquery Scan on a  (cost=1658507.15..1983502.60 rows=49999 width=16) (actual time=10466.416..18693.056 rows=100000 loops=1)              |
|   Filter: (a.rn = 1)                                                                                                                      |
|   Rows Removed by Filter: 9900000                                                                                                         |
|   ->  WindowAgg  (cost=1658507.15..1858504.35 rows=9999860 width=24) (actual time=10466.414..17789.984 rows=10000000 loops=1)             |
|         ->  Sort  (cost=1658507.15..1683506.80 rows=9999860 width=16) (actual time=10466.401..12646.441 rows=10000000 loops=1)            |
|               Sort Key: it201011c.sensor_id, it201011c.id DESC                                                                            |
|               Sort Method: external merge  Disk: 254472kB                                                                                 |
|               ->  Seq Scan on it201011c  (cost=0.00..154053.60 rows=9999860 width=16) (actual time=2.679..1516.206 rows=10000000 loops=1) |
| Planning Time: 0.254 ms                                                                                                                   |
| Execution Time: 19242.739 ms                                                                                                              |
+-------------------------------------------------------------------------------------------------------------------------------------------+

观察到 Sort Key: it201011c.sensor_id, it201011c.id DESC   
执行时间在 19~14 秒.

建立 index.

create index on it201011c(sensor_id, id desc);
commit;
analyze it201011c;

select id, sensor_id, val
  from (select *
             , row_number() over(partition by sensor_id order by id desc) rn
          from it201011c) a
 where a.rn = 1;

Time: 46248.666 ms (00:46.249)

explain (analyze,timing,costs)
select id, sensor_id, val
  from (select *
             , row_number() over(partition by sensor_id order by id desc) rn
          from it201011c) a
 where a.rn = 1;
 
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                               QUERY PLAN                                                                                |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Subquery Scan on a  (cost=0.43..820237.17 rows=49999 width=16) (actual time=3.364..43062.106 rows=100000 loops=1)                                                       |
|   Filter: (a.rn = 1)                                                                                                                                                    |
|   Rows Removed by Filter: 9900000                                                                                                                                       |
|   ->  WindowAgg  (cost=0.43..695238.92 rows=9999860 width=24) (actual time=3.360..41889.384 rows=10000000 loops=1)                                                      |
|         ->  Index Scan using it201011c_sensor_id_id_idx on it201011c  (cost=0.43..520241.37 rows=9999860 width=16) (actual time=3.328..35215.809 rows=10000000 loops=1) |
| Planning Time: 0.172 ms                                                                                                                                                 |
| Execution Time: 43078.930 ms                                                                                                                                            |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

虽然 index 建立的方式,也符合排序的需求,也正确的使用了 index, 但是实际存取时,反而变慢了....

因为我们建立的 index 是 id desc,与 id 生长方面相反,但要取得最新,就是要抓大的,而且还要多透过index,
没index 时,还要利用 external merge  Disk: 254472kB, 但反而速度较快.
所以资料库的优化,并不是建立 index 就必然加快速度.
 
看过昨天的利用递回来加快,那我们来看看如何利用递回来加快这样的查询.
我们透过与昨天的类似的方式,来尝试使用以下 SQL Command

with recursive t1 as (
select sensor_id, val
  from it201011c
 where id in (
       select id
         from it201011c
        where sensor_id is not null
        order by sensor_id, id desc
        limit 1
       )
union all
select (select sensor_id, val
          from it201011c
         where id in (
               select id
                 from it201011c a
                where a.sensor_id > t1.sensor_id
                  and a.sensor_id is not null
                order by a.sensor_id, a.id desc
                limit 1
              )
       )
  from t1
 where t1.sensor_id is not null
)
select sensor_id, val
  from t1
 where t1.sensor_id is not null;


ERROR:  42601: subquery must return only one column
LINE 12: select (select sensor_id, val

出现了错误讯息.....

不过方法是人想出来的,既然只能一个栏位,那我们就建立 type, 来包含两个栏位,
type 本身只算一个栏位.

create type ty1011 as (sensor_id int, val int);

with recursive t1 as (
select (sensor_id, val)::ty1011 as sr
  from it201011c
 where id in (
       select id
         from it201011c
        where sensor_id is not null
        order by sensor_id, id desc
        limit 1
       )
union all
select (select (sensor_id, val)::ty1011 as sr
          from it201011c
         where id in (
               select id
                 from it201011c a
                where a.sensor_id > (t1.sr).sensor_id
                  and a.sensor_id is not null
                order by a.sensor_id, a.id desc
                limit 1
              )
       )
  from t1
 where (t1.sr).sensor_id is not null
)
select (t1.sr).sensor_id, (t1.sr).val
  from t1
 where t1.* is not null;
 
Time: 12371.356 ms (00:12.371)
Time: 5858.400 ms (00:05.858)

第一次执行会久一些 12秒,第二次就降到 5秒.

这样我们就可以利用递回,来取得十万个sensor的最新资料,而且速度也较快.
Execution Time: 3929.956 ms    <-- 第三次, 查看执行计画时,更快了!
执行计画就不列出了,有兴趣的,可以自己测试看看.


<<:  【Tailwind CSS 教学 - 14】透过 Tailwind 达到容器与内容分离

>>:  Ray Dalio 的浅白经济学笔记

自动化测试,让你上班拥有一杯咖啡的时间 | Day 25 - 学习 blur 的用法

此系列文章会同步发文到个人部落格,有兴趣的读者可以前往观看喔。 focus 必须是 DOM 元素才...

[Day 28] 阿嬷都看得懂的怎麽操纵 DOM

阿嬷都看得懂的怎麽操纵 DOM 前两天我们介绍了厉害的神灯精灵,以及 JavaScript 这个神灯...

Day20 Vue元件中的网页模板

我们现在应该大概了元件到底在干嘛了,今天的话我们就来聊聊HTML模板怎麽写在我们的元件里面,通常我们...

IT 铁人赛 k8s 入门30天 -- day11 Helm - Package Manager

前言 今天要来讲说 Helm 这个工具 因为 Helm 版本会不断推陈出新, 因此在今天的内容着重再...

离职倒数30天:成年人生活的基础

因为我提离职时还剩两个月的假,所以其实我已经放假放了一个月了。当微软员工只剩最後30天,我想回顾一下...