OceanBase 分析函数

2021-06-09 16:01 更新

简介

分析函数(某些数据库下也叫做窗口函数)与聚合函数类似,计算总是基于一组行的集合,不同的是,聚合函数一组只能返回一行,而分析函数每组可以返回多行,组内每一行都是基于窗口的逻辑计算的结果。分析函数可以显著优化需要 self-join 的查询。

分析函数语法

“窗口”也称为 FRAME,OceanBase 数据库同时支持 ROWS 与 RANGE 两种 FRAME 语义,前者是基于物理行偏移的窗口,后者则是基于逻辑值偏移的窗口。

分析函数语法如下:

analytic_function:
  analytic_function([ arguments ]) OVER (analytic_clause)

analytic_clause:
  [ query_partition_clause ] [ order_by_clause [ windowing_clause ] ]

query_partition_clause:
  PARTITION BY { expr[, expr ]... | ( expr[, expr ]... ) }

order_by_clause:
  ORDER [ SIBLINGS ] BY{ expr | position | c_alias } [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ] [, { expr | position | c_alias } [ ASC | DESC ][ NULLS FIRST | NULLS LAST ]]...

windowing_clause:
  { ROWS | RANGE } { BETWEEN { UNBOUNDED PRECEDING | CURRENT ROW | value_expr {
  PRECEDING | FOLLOWING } } AND{ UNBOUNDED FOLLOWING | CURRENT ROW | value_expr { 
  PRECEDING | FOLLOWING } } | { UNBOUNDED PRECEDING | CURRENT ROW| value_expr 
  PRECEDING}}

SUM/MIN/MAX/COUNT/AVG

声明

SUM 的语法为:SUM([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

MIN 的语法为:MIN([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

MAX 的语法为:MAX([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

COUNT 的语法为:COUNT({ * | [ DISTINCT | ALL ] expr }) [ OVER (analytic_clause) ]

AVG 的语法为:AVG([ DISTINCT | ALL ] expr) [ OVER(analytic_clause) ]

说明

以上分析函数都有对应的聚合函数,其中,SUM 返回 expr 的和,MIN/MAX 返回 expr 的最小值/最大值,COUNT 返回窗口中查询的行数,AVG 返回 expr 的平均值。

对于 COUNT 函数,如果指定了 expr,即返回 expr 不为 NULL 的统计个数,如果指定 COUNT(*) 返回所有行的统计数目。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.17 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.03 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient>select last_name, sum(salary) over(partition by job_id) totol_s, min(salary) over(partition by job_id) min_s, max(salary) over(partition by job_id) max_s, count(*) over(partition by job_id) count_s from exployees;
+-----------+---------+-------+-------+---------+
| last_name | totol_s | min_s | max_s | count_s |
+-----------+---------+-------+-------+---------+
| jim       |    2000 |  2000 |  2000 |       1 |
| mike      |   36000 | 11000 | 13000 |       3 |
| lily      |   36000 | 11000 | 13000 |       3 |
| tom       |   36000 | 11000 | 13000 |       3 |
+-----------+---------+-------+-------+---------+
4 rows in set (0.01 sec)

NTH_VALUE/FIRST_VALUE/LAST_VALUE

声明

NTH_VALUE 的语法为:NTH_VALUE (measure_expr, n) [ FROM { FIRST | LAST } ] [ { RESPECT | IGNORE } NULLS ] OVER (analytic_clause)

FIRST_VALUE 的语法为:FIRST_VALUE { (expr) [ {RESPECT | IGNORE} NULLS ] | (expr [ {RESPECT | IGNORE} NULLS ])} OVER (analytic_clause)

LAST_VALUE 的语法为:LAST_VALUE { (expr) [ {RESPECT | IGNORE} NULLS ] | (expr [ {RESPECT | IGNORE} NULLS ])} OVER (analytic_clause)

说明

NTH_VALUE 函数表示第几个值,方向由 [ FROM { FIRST | LAST } ] 确定,默认为 FROM FIRST,含有是否忽略 NULL 值的标志。其窗口为统一的 analytic_clause。这里 n 应该是正数,如果 n 是 NULL,函数将返回错误;如果 n 大于窗口内所有的行数,此函数将返回 NULL。

FIRST_VALUE 和 LAST_VALUE 表示从第一个开始计数或者是从最后一个开始计数。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> select last_name, first_value(salary) over(partition by job_id) totol_s, last_value(salary) over(partition by job_id) min_s, max(salary) over(partition by job_id) max_s from exployees;
+-----------+---------+-------+-------+
| last_name | totol_s | min_s | max_s |
+-----------+---------+-------+-------+
| jim       |    2000 |  2000 |  2000 |
| mike      |   12000 | 11000 | 13000 |
| lily      |   12000 | 11000 | 13000 |
| tom       |   12000 | 11000 | 13000 |
+-----------+---------+-------+-------+
4 rows in set (0.01 sec)

LEAD/LAG

声明

LEAD 的语法为:LEAD { ( value_expr [, offset [, default]]) [ { RESPECT | IGNORE } NULLS ] | ( value_expr [ { RESPECT | IGNORE } NULLS ] [, offset [, default]] )} OVER ([ query_partition_clause ] order_by_clause)

LAG 的语法为:LAG { ( value_expr [, offset [, default]]) [ { RESPECT | IGNORE } NULLS ] | ( value_expr [ { RESPECT | IGNORE } NULLS ] [, offset [, default]] )} OVER ([ query_partition_clause ] order_by_clause)

说明

LEAD 和 LAG 含义为可以在一次查询中取出当前行的同一个字段的前面或后面第 N 行的数据,这种操作可以使用相同表的自连接来实现,但 LEAD/LAG 窗口函数有更高的效率。

其中,value_expr 是要做比对的字段,offset 是 value_expr 的偏移量,default 参数的默认值为 NULL,即如果在 LEAD/LAG 没有显示的设置 default 值的情况下,返回值为 NULL。例如:对 LAG 来说,当前行为 4,offset 值为 6,这时候所要找的数据就是第 -2 行,不存在此行即返回 default 的值。

[ { RESPECT | IGNORE } NULLS ] 的语法为是否考虑 NULL 值,默认为 RESPECT,考虑 NULL 值。

注意 LEAD/LAG 两个函数后必须有 order_by_clause,数据应该在一个列上排序之后才能有前多少行后多少行的概念。query_partition_clause 是可选的,如果没有 query_partition_clause,就是全局的数据。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, lead(salary) over(order by salary) lead, lag(salary) over(order by salary) lag from exployees;
+-----------+-------+-------+
| last_name | lead  | lag   |
+-----------+-------+-------+
| jim       | 11000 |  NULL |
| tom       | 12000 |  2000 |
| mike      | 13000 | 11000 |
| lily      |  NULL | 12000 |
+-----------+-------+-------+
4 rows in set (0.01 sec)

STDDEV/VARIANCE/STDDEV_SAMP/STDDEV_POP

声明

VARIANCE 的语法为:VARIANCE([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

STDDEV 的语法为:STDDEV([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

STDDEV_SAMP 的语法为:STDDEV_SAMP(expr) [ OVER (analytic_clause) ]

STDDEV_POP 的语法为:STDDEV_POP(expr) [ OVER (analytic_clause) ]

说明

VARIANCE 返回的是 expr 的方差,expr 可能是数值类型或者可以转换成数值类型的类型,方差的类型和输入的值的类型相同。

STDDEV 返回的是 expr 的标准差,参数类型方面和 VARIANCE 的相同。

STDDEV_SAMP 返回的是样本标准差。

STDDEV_POP 返回的是总体标准差。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, stddev(salary) over(order by salary) std, variance(salary) over(order by salary) var, stddev_pop(salary) over() std_pop, stddev_samp(salary) over() from exployees;
+-----------+-------------------+--------------------+-------------------+----------------------------+
| last_name | std               | var                | std_pop           | stddev_samp(salary) over() |
+-----------+-------------------+--------------------+-------------------+----------------------------+
| jim       |                 0 |                  0 | 4387.482193696061 |          5066.228051190222 |
| tom       |              4500 |           20250000 | 4387.482193696061 |          5066.228051190222 |
| mike      | 4496.912521077347 | 20222222.222222224 | 4387.482193696061 |          5066.228051190222 |
| lily      | 4387.482193696061 |           19250000 | 4387.482193696061 |          5066.228051190222 |
+-----------+-------------------+--------------------+-------------------+----------------------------+
4 rows in set (0.00 sec)

NTILE

声明

NTILE(expr) OVER ([ query_partition_clause ] order_by_clause)

说明

NTILE 函数将分区中已经排序的行划分为大小尽可能相同的指定数量的分组,并返回给每行组号。expr 如果是 NULL,则返回 NULL。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, ntile(10) over(partition by job_id order by salary) ntl from exployees;
+-----------+------+
| last_name | ntl  |
+-----------+------+
| jim       |    1 |
| tom       |    1 |
| mike      |    2 |
| lily      |    3 |
+-----------+------+
4 rows in set (0.01 sec)

ROW_NUMBER

声明

ROW_NUMBER( ) OVER ([ query_partition_clause ] order_by_clause)

说明

ROW_NUMBER 函数按照 order_by_clause 子句中指定的行的顺序,为每一行分配一个编号。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, row_number() over(partition by job_id order by salary) ntl from exployees;
+-----------+------+
| last_name | ntl  |
+-----------+------+
| jim       |    1 |
| tom       |    1 |
| mike      |    2 |
| lily      |    3 |
+-----------+------+
4 rows in set (0.00 sec)

RANK/DENSE_RANK/PERCENT_RANK

RANK 的语法为:RANK( ) OVER ([ query_partition_clause ] order_by_clause)

DENSE_RANK 的语法为:DENSE_RANK( ) OVER([ query_partition_clause ] order_by_clause)

PERCENT_RANK 的语法为:PERCENT_RANK( ) OVER ([ query_partition_clause ] order_by_clause)

说明

RANK 计算每一行数据在某列上的排序,该列由 order_by_clause 中的列决定。例如,按照 salary 排序可以看出员工的收入排名。

DENSE_RANK 的语义基本和 RANK 函数相同,但是 RANK 的排序中间会有‘跳过’,但是 DENSE_RANK 中不会有。

PERCENT_RANK 的语义基本和 RANK 函数相同,但是 PERCENT_RANK 排序的结果是百分比,计算的是给定行的百分比。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.10 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, rank() over(partition by job_id order by salary) rank, dense_rank() over(partition by job_id order by salary) dense_rank, percent_rank() over(partition by job_id order by salary) percent_rank from exployees;
+-----------+------+------------+----------------------------------+
| last_name | rank | dense_rank | percent_rank                     |
+-----------+------+------------+----------------------------------+
| jim       |    1 |          1 | 0.000000000000000000000000000000 |
| tom       |    1 |          1 | 0.000000000000000000000000000000 |
| mike      |    2 |          2 | 0.500000000000000000000000000000 |
| lily      |    3 |          3 | 1.000000000000000000000000000000 |
+-----------+------+------------+----------------------------------+
4 rows in set (0.01 sec)

CUME_DIST

声明

CUME_DIST() OVER ([ query_partition_clause ] order_by_clause)

说明

该函数计算一个值的分布,返回值为大于 0 小于等于 1 的值。作为一个分析函数,CUME_DIST 在升序情况下计算比当前行的特定列小的数据的占比。例如如下例子中,按 job_id 分组并在薪水排序的情况下,每行数据在窗口内的排序列上的占比。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.10 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, cume_dist() over(partition by job_id order by salary) cume_dist from exployees;
+-----------+----------------------------------+
| last_name | cume_dist                        |
+-----------+----------------------------------+
| jim       | 1.000000000000000000000000000000 |
| tom       | 0.333333333333333333333333333333 |
| mike      | 0.666666666666666666666666666667 |
| lily      | 1.000000000000000000000000000000 |
+-----------+----------------------------------+
4 rows in set (0.01 sec)


以上内容是否对您有帮助:
在线笔记
App下载
App下载

扫描二维码

下载编程狮App

公众号
微信公众号

编程狮公众号