消灭暴力扫描,世界属于渐进

本文最后更新于 2024年3月16日 上午

数仓建设过程中大部分表都是增量表,当计算过去一段时间的聚合指标时,常规的实现方式会重复扫描分区,带来大量计算的浪费。本文我们将介绍一些增量计算的方式,避免重复扫描分区,提高计算效率~

现象

当我们要计算过去一段时间的指标(如过去30天的成交量)时,最直接的实现方式就是暴力扫描多个分区的天表数据:

1
2
3
4
5
6
7
8
9
10
11
12
SELECT  pk
-- 滑动窗口:开始和结束时间每天变化
,SUM(IF(dt > '${bizdate-nd}' AND dt <= '${bizdate}', m, 0)) AS m_sum_nd
,MAX(IF(dt > '${bizdate-nd}' AND dt <= '${bizdate}', m, 0)) AS m_max_nd
-- 累计窗口:开始时间固定,结束时间变化
,SUM(IF(dt > '${startdate}' AND dt <= '${bizdate}', m, 0)) AS m_sum_td
,MAX(IF(dt > '${startdate}' AND dt <= '${bizdate}', m, 0)) AS m_max_td
FROM a_1d
-- 假定startdate > bizdate-nd
WHERE dt > '${bizdate-nd}'
AND dt <= '${bizdate}'
GROUP BY pk;

这样就会浪费大量资源:

  • 反复扫描重复分区,产出耗时长。
  • 当在同一个作业耦合不同周期指标时,若下游需产出时效保证,会重复计算短周期指标。

本文就结合阿里云MaxCompute计算服务,来讨论几种解决方案~

技术优化

递推计算

  • 适用于指标计算存在递推公式:避免暴力扫描,效率大大提升
    • 累计窗口:SUM、MAX等都可递推计算

i=0nai=i=0n1ai+an\sum^n_{i=0} a_i = \sum^{n-1}_{i=0} a_i + a_n

maxi=0nai=max(maxi=0n1ai,an)\max^n_{i=0} a_i = \max(\max^{n-1}_{i=0} a_i, a_n)

  • 滑动窗口:SUM可递推计算,但MAX不可以, 因为最大值可以处于临界窗口边界

i=1nai=i=0n1ai+ana0\sum^n_{i=1} a_i = \sum^{n-1}_{i=0} a_i + a_n - a_0

  • 缺点
    • 这世界没有银弹,当改用增量计算后,依赖需设置为自依赖,补数据时需串行执行,若需求紧急,需再暴力扫描初始化,开发成本高。

JOIN实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
SELECT  COALESCE(today.pk, yst.pk)                                                   AS pk
,SUM(COALESCE(m_sum_nd, 0) + COALESCE(today.m_1d, 0) - COALESCE(nd.m_1d, 0)) AS m_sum_nd
,SUM(COALESCE(m_sum_td, 0) + COALESCE(today.m_1d, 0)) AS m_sum_td
,GREATEST(m_max_td, today.m_1d) AS m_max_td
FROM (
SELECT pk
,m_1d
FROM a_1d
WHERE dt = '${bizdate}'
) today
FULL OUTER JOIN (
SELECT pk
,m_sum_nd
,m_sum_td
,m_max_td
FROM a_nd
WHERE dt = '${bizdate-1d}'
) yst
ON today.pk = yst.pk
LEFT JOIN (
SELECT pk
,m_1d
FROM a_1d
WHERE dt = '${bizdate-nd}'
) nd
ON COALESCE(today.pk, yst.pk) = nd.pk;
  • 缺点:join算子消耗大,且不同周期join会串行执行,当维度和周期过多时,效率提升不明显。

UNION实现

  • 优化:可并行,效率更高。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
SELECT  pk
,SUM(m_sum_nd + m_1d - m_np1d) AS m_sum_nd
FROM (
SELECT pk
,m_sum_nd
,0 AS m_1d
,0 AS m_np1d
FROM a_nd
WHERE dt = '${bizdate-1d}'
UNION ALL
SELECT pk
,0 AS m_sum_nd
,m_1d
,0 AS m_np1d
FROM a_1d
WHERE dt = '${bizdate}'
UNION ALL
SELECT pk
,0 AS m_sum_nd
,0 AS m_1d
,m_1d AS m_np1d
FROM a_1d
WHERE dt = '${bizdate-nd}'
) a
GROUP BY pk;
  • 缺点:字段对齐比较麻烦,可能需要一个copliot:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
You are a SQL expert that knows use progressive method to calculate past periods metircs.
For example, when you need calculate past 30 days metric each day, you can use recurrence formula:
use yesterday 30 days metric plus today 1 day metric substract 30 days ago 1 day metric

Pay attention to use variable like ${bizdate-30}, not use date functions.

Use the following format:

Question: "Question here"
Answer: "SQL Query to run"

Here an example:
Question: "Suppose we have two tables:

CREATE TABLE a_1d (
pk STRING
,m_1d BIGINT
,n_1d BIGINT
,dt STRING
);

CREATE TABLE a_nd (
pk STRING
,m_7d BIGINT
,n_7d BIGINT
,m_30d BIGINT
,n_30d BIGINT
,dt STRING
);

How to calculate passed 7, 30 days metrics using SUM aggregation progressively by a_1d and a_nd?

Answer:

SELECT pk
,SUM(m_7d + m_1d - m_8d) AS m_7d
,SUM(n_7d + n_1d - n_8d) AS n_7d
,SUM(m_30d + m_1d - m_31d) AS m_30d
,SUM(n_30d + n_1d - n_31d) AS n_30d
FROM (
SELECT pk
,m_nd
,n_nd
,0 AS m_1d
,0 AS n_1d
,0 AS m_8d
,0 AS n_8d
,0 AS m_31d
,0 AS n_31d
FROM a_nd
WHERE dt = '${bizdate-1}'
UNION ALL
SELECT pk
,0 AS m_nd
,0 AS n_nd
,m_1d
,n_1d
,0 AS m_8d
,0 AS n_8d
,0 AS m_31d
,0 AS n_31d
FROM a_1d
WHERE dt = '${bizdate}'
UNION ALL
SELECT pk
,0 AS m_nd
,0 AS n_nd
,0 AS m_1d
,0 AS n_1d
,m_1d AS m_8d
,n_1d AS n_8d
,0 AS m_31d
,0 AS n_31d
FROM a_1d
WHERE dt = '${bizdate-7}'
UNION ALL
SELECT pk
,0 AS m_nd
,0 AS n_nd
,0 AS m_1d
,0 AS n_1d
,0 AS m_8d
,0 AS n_8d
,m_1d AS m_31d
,n_1d AS n_31d
FROM a_1d
WHERE dt = '${bizdate-30}'
) a
GROUP BY pk;"

Question: "Suppose we have two tables:
{1d table DDL}
and
{nd table DDL}

How to calculate passed {periods} days metrics using SUM aggregation progressively by {1d table name} and {nd table name} ? "

Answer: "

不可递推指标

处理nd滑动窗口中计算MAX等不可递推指标时,可以用MAP、ARRAY等复杂类型数据结构存放过去时间周期的指标,然后直接对复杂类型中的指标进行聚合计算:

  1. 每天增量维护复杂数据结构:移除(n+1)d指标并添加1d的指标。
  2. 利用复杂类型函数聚合计算:过滤出相应时间范围的指标聚合。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
SELECT  pk
,ARRAY_MAX(MAP_VALUES(MAP_FILTER(m_map, (k, v) -> k > '${bizdate-nd}'))) AS m_max_nd
FROM (
SELECT COALESCE(today.pk, yst.pk) AS pk
-- 维护过去nd的map
,MAP_CONCAT(MAP_FILTER(m_map, (k, v) -> k > '${bizdate-nd}'), MAP('${bizdate}', m_1d)) AS m_map
FROM (
SELECT pk
,m_max_nd
,m_map
FROM a_nd
WHERE dt = '${bizdate-1d}'
) yst
FULL JOIN (
SELECT pk
,m_1d
FROM a_1d
WHERE dt = '${bizdate}'
) today
ON yst.pk = today.pk
) mid;
  • 缺点:复杂类型使用技巧较高,当指标类型很多时,需存放过多指标,可能也会有性能问题。

渐进计算

修改代码的ROI还是太低,有没有更简单的方法?渐进计算带来了一些曙光,只需在暴力扫描的代码上添加一行参数即可:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
-- 开启渐进计算
SET odps.progressive.enable=true;

SELECT pk
-- 滑动窗口:开始和结束时间每天变化
,SUM(IF(dt > '${bizdate-nd}' AND dt <= '${bizdate}', m, 0)) AS m_sum_nd
,MAX(IF(dt > '${bizdate-nd}' AND dt <= '${bizdate}', m, 0)) AS m_max_nd
-- 累计窗口:开始时间固定,结束时间变化
,SUM(IF(dt > '${startdate}' AND dt <= '${bizdate}', m, 0)) AS m_sum_td
,MAX(IF(dt > '${startdate}' AND dt <= '${bizdate}', m, 0)) AS m_max_td
FROM a_1d
-- 假定startdate > bizdate-nd
WHERE dt > '${bizdate-nd}'
AND dt <= '${bizdate}'
GROUP BY pk;
  • 原理简析
    • 第一次运行时产生可复用的中间表,比如天表,月表等。
    • 再结合当天增量数据和中间表,生成最终结果。
  • 特点
    • 第一次由于需要生成中间表,消耗时间长,所以需要在周期前需提前运行一次。
    • 中间表的合并、拆分、生命周期由平台关联,用户不用关心。
  • 缺点
    • 由于是通用的计算优化,效果可能没有之前的增量效率高。
    • 高级特性使用较少,遇到问题时可能较难解决。

规范建模

TODO: 利用规范化研发自动生成指标,定义即研发, 计算优化统一由平台进行,不用用户关心。

总结

本文讨论了几种优化暴力扫描的技巧,但我希望这些技巧以后都不用用户操心,可以自动集成在开发平台中,开发人员只需关心业务逻辑即可~


消灭暴力扫描,世界属于渐进
https://syntomic.cn/2023/12/09/消灭暴力扫描,世界属于渐进/
作者
syntomic
发布于
2023年12月9日
更新于
2024年3月16日
许可协议