PostgreSQL 有 Array, 让我们在处理资料或是分析资料时,
十分地便利.
情境:某电商有10万种商品,上个月有一千万笔销售纪录,每笔销售有1~10种商品.
我们使用 Array,可以很方便的建立以下的 table.
create table it201012a (
id int generated always as identity
, items int[] not null
);
产生测试资料的函数
create or replace function ranitemarr()
returns int[]
as $$
select array_agg(ceil(random() * 100000)::int)
from generate_series(1, ceil(random() * 10)::int);
$$
language sql;
insert into it201012a(items)
select ranitemarr()
from generate_series(1,1e7);
commit;
接下来我们可以开发一个array内元素所有可能组合的函数.
在 stackoverflow 中 已经有一个这样的范例
https://stackoverflow.com/questions/30515990/postgresql-find-all-possible-combinations-permutations-in-recursive-query
递回啊~~~ 递回的方式,虽然已经有开发完成,但是颇占篇幅,今天先不介绍.
Python 的 itertools 里面也有 combinations 可以利用
https://docs.python.org/zh-cn/3/library/itertools.html#itertools.combinations
我们将利用 combinations 来开发三个PlPython函数.
create or replace function pyarrcombin (lst int[])
returns setof int[]
as $$
from itertools import combinations
return [list(t) for r in range(1, len(lst)+1) for t in combinations(lst,r)]
$$ language plpython3u;
create or replace function pyarrcombin (lst int[], n int)
returns setof int[]
as $$
from itertools import combinations
return [list(t) for t in combinations(lst,n)]
$$ language plpython3u;
create or replace function pyarrcombin (lst int[], n int, f int)
returns setof int[]
as $$
from itertools import combinations
return [list(t) for t in combinations(lst,n) if f in t]
$$ language plpython3u;
-------
看一下使用案例
产生全部的组合
select pyarrcombin(array[1,2,3]);
+-------------+
| pyarrcombin |
+-------------+
| {1} |
| {2} |
| {3} |
| {1,2} |
| {1,3} |
| {2,3} |
| {1,2,3} |
+-------------+
(7 rows)
产生符合指定的数目元素
select pyarrcombin(array[1,2,3],2);
+-------------+
| pyarrcombin |
+-------------+
| {1,2} |
| {1,3} |
| {2,3} |
+-------------+
(3 rows)
产生符合指定的数目元素,且包含指定元素
select pyarrcombin(array[1,2,3],2,3);
+-------------+
| pyarrcombin |
+-------------+
| {1,3} |
| {2,3} |
+-------------+
(2 rows)
--------
使用产生组合的函数,来看应用.
先来帮测试table 建立 index 吧.
create index on it201012a using gin(items);
analyze it201012a;
commit;
-------
select *
from it201012a
fetch first 3 rows only;
+----+---------------------------------------------------------------+
| id | items |
+----+---------------------------------------------------------------+
| 1 | {77367,65384,94905,83558,89606,66216,13380,89461,73248,65171} |
| 2 | {30193,61007,48154} |
| 3 | {42786} |
+----+---------------------------------------------------------------+
(3 rows)
假设以 61007 ,查询有哪些纪录有包含.
select *
from it201012a
where items @> array[61007];
+---------+---------------------------------------------------------------+
| id | items |
+---------+---------------------------------------------------------------+
| 2 | {30193,61007,48154} |
| 2706 | {61007,39231,7845,41394,22950,57433,41700,84929,90262} |
....
| 9997914 | {61007} |
+---------+---------------------------------------------------------------+
(537 rows)
查询与 61007 搭配的品项,出现次数依序降幂.
select count(*)
, pyarrcombin(items, 2, 61007)
from it201012a
where items @> array[61007]
group by 2
order by 1 desc;
+-------+---------------+
| count | pyarrcombin |
+-------+---------------+
| 2 | {61007,9629} |
| 2 | {42904,61007} |
| 2 | {61007,80684} |
| 2 | {20941,61007} |
| 2 | {61007,10792} |
| 2 | {61007,70903} |
...
| 1 | {1439,61007} |
+-------+---------------+
(3245 rows)
Time: 62.261 ms
可见得 61007 与 其他品项一起购买的情况是有不少,但没有特别明显的组合.
----
接下来找 61007 的哼哈二将
select count(*)
, pyarrcombin(items, 3, 61007)
from it201012a
where items @> array[61007]
group by 2
order by 1 desc;
+-------+---------------------+
| count | pyarrcombin |
+-------+---------------------+
| 1 | {97,15451,61007} |
| 1 | {97,16040,61007} |
...
| 1 | {99987,61007,68303} |
+-------+---------------------+
(9911 rows)
Time: 77.938 ms
速度都还不错.
---------
销量前三名的品项
select a.item
, count(*)
from (select unnest(items) as item
from it201012a) a
group by a.item
order by 2 desc
fetch first 3 rows only;
+-------+-------+
| item | count |
+-------+-------+
| 78311 | 662 |
| 13034 | 652 |
| 42643 | 646 |
+-------+-------+
(3 rows)
以78311 来查询
select count(*)
, pyarrcombin(items, 2, 78311)
from it201012a
where items @> array[78311]
group by 2
order by 1 desc;
+-------+---------------+
| count | pyarrcombin |
+-------+---------------+
| 3 | {76373,78311} |
| 2 | {4643,78311} |
| 2 | {65731,78311} |
| 2 | {49658,78311} |
| 2 | {87291,78311} |
...
| 1 | {1429,78311} |
+-------+---------------+
(4082 rows)
Time: 32.002 ms
-----------
因为产生的测试资料为乱数产生,会与现实的可能有较高相关性的资料不同.
今天介绍的应用方式,不只可以用在电商,应用的范围很广,希望能带给大家在资料分析
时有新的方式可以应用.
荀子劝学篇中有一段是这样的: 「积土成山,风雨兴焉;积水成渊,蛟龙生焉;.....。故不积蹞步,无以...
因为是打算以复习的角度来进行,所以会从最基础的JAVA,慢慢前进到物件导向的部分! 1.注解(//)...
经过了两个多星期後,我们终於开始进入 presentation layer 的部分。Presenta...
今天的题目是要我们在一个整数阵列中找到子阵列(subarray),也就是撷取阵列中相连的一部分,求出...