压缩表示日期的一种数据格式。阿里巴巴集团安全部( http://weibo.com/alisec )数据团队在清洗大数据的过程中,经常需要保存精度到天的时间数据,比如说某人在历史哪些天登录过。这样的数据很多,需要关注存储成本。daybits是为解决这个问题而设计的压缩表示日期数据的数据格式。
daybits := <beforeYears>#<years>
beforeYears := <year>;<year>; ...
years := <year>;<year>;<year>; ...
year := <quarter>,<quarter>,<quarter>,<quarter>
quarter = base64_str
其中#之前是2013以前的数据,如果#不存在,或者#之后的是2013年开始的数据。年中每个季度的数据使用逗号分隔,每个季度的数据是一个base64字符串,原始数据是一个byte数组,每个byte可以表示8天。
函数名 | 返回数据类型 | 函数类型 | UDF实现类 | 函数介绍 |
daybits_concat | String | UDAF | com.alibaba.daybits.support.odps.udf.DayBitsConcat | 聚合函数,输入参数是日期,返回daybits格式字符串,用于在原始明细数据中构建daybits数据 |
daybits_merge | String | UDAF | com.alibaba.daybits.support.odps.udf.DayBitsMerge | 聚合函数,输入参数是daybits数据,返回是合并之后的daybits数据 |
daybits_count | Bigint | UDF | com.alibaba.daybits.support.odps.udf.DayBitsCount | 查看daybits数据中存在多少天,输入是daybits数据,可以指定开始和结束时间 |
daybits_exists | Boolean | UDF | com.alibaba.daybits.support.odps.udf.DayBitsExists | 查看daybits数据中是否存在指定范围的日期,输入是daybits数据和开始和结束时间 |
daybits_explain | String | UDF | com.alibaba.daybits.support.odps.udf.DayBitsExplain | 把daybits解析为可读的字符串,输入是daybits数据,可以指定开始和结束时间 |
daybits_first | BigInt | UDF | com.alibaba.daybits.support.odps.udf.DayBitsFirst | 返回第日期 |
daybits_last | BigInt | UDF | com.alibaba.daybits.support.odps.udf.DayBitsLast | 返回最后日期 |
daybits_and | String | UDF | com.alibaba.daybits.support.odps.udf.DayBitsAnd | daybits数据的交集 |
daybits_or | String | UDF | com.alibaba.daybits.support.odps.udf.DayBitsOr | dyabits数据的并集 |
daybits_get | Boolean | UDF | com.alibaba.daybits.support.odps.udf.DayBitsGet | 判断某个日期在daybits中是否存在 |
daybits_set | String | UDF | com.alibaba.daybits.support.odps.udf.DayBitsSet | 设置daybits某一天为true |
用途:聚合函数,用于通过原始数据构建daybits。
函数定义:
STRING daybits_concat(STRING date)
参数:
date yyyymmdd格式日期字符串
返回值: daybits格式字符串
示例:
select member_id, daybits_concat(ds)
from my_table
group by member_id
用途:聚合函数,用于通过合并多个daybits。
函数定义:
STRING daybits_merge(STRING daybits)
参数:
daybits daybits格式字符串
返回值: daybits格式字符串
示例:
select member_id, daybits_merge(event_trace)
from my_table
group by member_id
用途:设置daybits数据中某一天的值
函数定义:
STRING daybits_set(STRING daybits, STRING date)
STRING daybits_set(STRING daybits, BIGINT date)
STRING daybits_set(STRING daybits, STRING date, BOOLEAN value)
STRING daybits_set(STRING daybits, BIGINT date, BOOLEAN value)
参数:
daybits daybits格式字符串
date yyyymmdd格式日期字符串,或等价整数,比如20140701
返回值: daybits格式字符串
示例:
select member_id, daybits_set(event_trace, '20140322')
from my_table
select member_id, daybits_set(event_trace, 20140322)
from my_table
select member_id, daybits_set(event_trace, '20140322', false)
from my_table
select member_id, daybits_set(event_trace, 20140322, false)
from my_table
用途: 判断daybits数据中某一天的值
函数定义:
BOOLEAN daybits_get(STRING daybits, STRING date)
BOOLEAN daybits_get(STRING daybits, BIGINT date)
参数:
daybits daybits格式字符串
date yyyymmdd格式日期字符串,或等价整数,比如20140701
返回值: 是否已经设置
示例:
select daybits_get(';AAAAAChCywMgAg==', 20140205) from dual
select daybits_get(';AAAAAChCywMgAg==', '20140205‘) from dual
用途: 判断daybits数据中存在的天数
函数定义:
BIGINT daybits_count(STRING daybits)
BIGINT daybits_count(STRING daybits, STRING start)
BIGINT daybits_count(STRING daybits, STRING start, STRING end)
BIGINT daybits_count(STRING daybits, BIGINT start)
BIGINT daybits_count(STRING daybits, BIGINT start, BIGINT end)
参数:
daybits daybits格式字符串
start 开始日期 yyyymmdd格式日期字符串,或等价整数,比如20140701
end 结束日期 yyyymmdd格式日期字符串,或等价整数,比如20140701
返回值: 是否已经设置
示例:
-- ;AAAAAChCywMgAg==的结果是: 20140205,20140207,20140211,20140216,20140218,20140219,20140221,20140224,20140225,20140226,20140227,20140311,20140315
select daybits_count(';AAAAAChCywMgAg==') from dual -- 返回14
select daybits_count(';AAAAAChCywMgAg==', 20140205) from dual -- 返回13
select daybits_count(';AAAAAChCywMgAg==', 20140205, 20140207) from dual -- 返回2
-- 查找20140301~20140331期间出现的记录
select *
from my_tabel
where daybits_count(event_trace, 20140301, 20140331) > 0
用途: 判断daybits数据中存在的天数
函数定义:
BOOLEAN daybits_exists(STRING daybits)
BOOLEAN daybits_exists(STRING daybits, STRING start)
BOOLEAN daybits_exists(STRING daybits, STRING start, STRING end)
BOOLEAN daybits_exists(STRING daybits, BIGINT start)
BOOLEAN daybits_exists(STRING daybits, BIGINT start, BIGINT end)
参数:
daybits daybits格式字符串
start 开始日期 yyyymmdd格式日期字符串,或等价整数,比如20140701
end 结束日期 yyyymmdd格式日期字符串,或等价整数,比如20140701
返回值: 是否存在
示例:
select daybits_exists(';AAAAAChCywMgAg==', 20140206, 20140207) from dual -- 返回true
-- 查找20140301~20140331期间出现的记录
select *
from my_tabel
where daybits_exists(event_trace, 20140301, 20140331)
用途: 将daybits字符串解析为可读的日期字符串
函数定义:
STRING daybits_explain(STRING daybits)
STRING daybits_explain(STRING daybits, STRING start)
STRING daybits_explain(STRING daybits, STRING start, STRING end)
STRING daybits_explain(STRING daybits, BIGINT start)
STRING daybits_explain(STRING daybits, BIGINT start, BIGINT end)
参数:
daybits daybits格式字符串
start 开始日期 yyyymmdd格式日期字符串,或等价整数,比如20140701
end 结束日期 yyyymmdd格式日期字符串,或等价整数,比如20140701
返回值: 以逗号分隔的日期
示例:
-- 返回 20140205,20140207,20140211,20140216,20140218,20140219,20140221,20140224,20140225,20140226,20140227,20140311,20140315
select daybits_explain(';AAAAAChCywMgAg==') from dual
select daybits_explain(';AAAAAChCywMgAg==', 20140205, 20140207) from dual -- 返回20140205,20140207
用途: 返回daybits字符串中的首次日期
函数定义:
BIGINT daybits_first(STRING daybits)
BIGINT daybits_first(STRING daybits, STRING start)
BIGINT daybits_first(STRING daybits, STRING start, STRING end)
BIGINT daybits_first(STRING daybits, BIGINT start)
BIGINT daybits_first(STRING daybits, BIGINT start, BIGINT end)
参数:
daybits daybits格式字符串
start 开始日期 yyyymmdd格式日期字符串,或等价整数,比如20140701
end 结束日期 yyyymmdd格式日期字符串,或等价整数,比如20140701
返回值: 返回yyyymmdd格式的整数
示例:
select daybits_first(';AAAAAChCywMgAg==') from dual -- 返回 20140205
select daybits_first(';AAAAAChCywMgAg==', 20140206, 20140207) from dual -- 返回20140207
用途: 返回daybits字符串中的末次日期
函数定义:
BIGINT daybits_last(STRING daybits)
BIGINT daybits_last(STRING daybits, STRING start)
BIGINT daybits_last(STRING daybits, STRING start, STRING end)
BIGINT daybits_last(STRING daybits, BIGINT start)
BIGINT daybits_last(STRING daybits, BIGINT start, BIGINT end)
参数:
daybits daybits格式字符串
start 开始日期 yyyymmdd格式日期字符串,或等价整数,比如20140701
end 结束日期 yyyymmdd格式日期字符串,或等价整数,比如20140701
返回值: 返回yyyymmdd格式的整数
示例:
select daybits_last(';AAAAAChCywMgAg==') from dual -- 返回 20140315
select daybits_last(';AAAAAChCywMgAg==', 20140206, 20140207) from dual -- 返回20140207
用途: 返回两个daybits数据的交集
函数定义:
STRING daybits_and(STRING daybits_a, STRING daybits_b)
参数:
daybits_a daybits格式字符串
daybits_b daybits格式字符串
返回值: 返回两个daybits数据的交集,daybits格式字符串
示例:
select daybits_and(';AAAAAChCywMgAg==', ';AAAAAChCywMgAg==') from dual -- 返回 ';AAAAAChCywMgAg=='
用途: 返回两个daybits数据的并集
函数定义:
STRING daybits_or(STRING daybits_a, STRING daybits_b)
参数:
daybits_a daybits格式字符串
daybits_b daybits格式字符串
返回值: 返回两个daybits数据的并集,daybits格式字符串
示例:
select daybits_or(';AAAAAChCywMgAg==', ';AAAAAChCywMgAg==') from dual -- 返回 ';AAAAAChCywMgAg=='