Skip to content

Commit

Permalink
feat: clickhouse
Browse files Browse the repository at this point in the history
  • Loading branch information
li-zeyuan committed Mar 31, 2024
1 parent 056e218 commit 2e85d95
Show file tree
Hide file tree
Showing 24 changed files with 577 additions and 11 deletions.
14 changes: 7 additions & 7 deletions _config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -14,9 +14,9 @@ timezone: Asia/Shanghai
# jekyll-seo-tag settings › https://github.com/jekyll/jekyll-seo-tag/blob/master/docs/usage.md
# ↓ --------------------------

title: Here # the main title
title: Ahern # the main title

tagline: 广阔天地,大有作为 # it will display as the sub-title
tagline: 广阔天地,大有作为 # it will display as the sub-title

description: >- # used by seo meta and the atom feed
A minimal, responsive and feature-rich Jekyll theme for technical writing.
Expand All @@ -34,7 +34,7 @@ twitter:
social:
# Change to your full name.
# It will be displayed as the default author of the posts and the copyright owner in the Footer
name: Here
name: Ahern
email: [email protected] # change to your email address
links:
# The first element serves as the copyright owner's link
Expand Down Expand Up @@ -76,7 +76,7 @@ theme_mode: # [light | dark]
img_cdn:

# the avatar on sidebar, support local or CORS resources
avatar:
avatar: ./assets/avatar.png

# The URL of the site-wide social preview image used in SEO `og:image` meta tag.
# It can be overridden by a customized `page.image` in front matter.
Expand All @@ -86,14 +86,14 @@ social_preview_image: # string, local or CORS resources
toc: true

comments:
active: # The global switch for posts comments, e.g., 'disqus'. Keep it empty means disable
active: utterances # The global switch for posts comments, e.g., 'disqus'. Keep it empty means disable
# The active options are as follows:
disqus:
shortname: # fill with the Disqus shortname. › https://help.disqus.com/en/articles/1717111-what-s-a-shortname
# utterances settings › https://utteranc.es/
utterances:
repo: # <gh-username>/<repo>
issue_term: # < url | pathname | title | ...>
repo: li-zeyuan/li-zeyuan.github.io # <gh-username>/<repo>
issue_term: pathname # < url | pathname | title | ...>
# Giscus options › https://giscus.app
giscus:
repo: # <gh-username>/<repo>
Expand Down
4 changes: 4 additions & 0 deletions _data/authors.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
ahern:
name: Ahern
twitter: https://twitter.com/ahren_utf
url: https://github.com/li-zeyuan
1 change: 0 additions & 1 deletion _posts/.placeholder

This file was deleted.

51 changes: 48 additions & 3 deletions _posts/2024-03-27-architecture_mode.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,12 @@ title: 常见架构模式
date: 2024-03-27 00:00:00 +0800
categories: [root, architecture]
tags: [architecture]
author: ahern
---

### 分层模式
每一层有特定的角色和职责;请求逐层向下传递,并逐层向上返回
![img.png](./assets/images/img_6.png){:height="10%" width="50%"}
每一层有特定的角色和职责;请求逐层向下传递,并逐层向上返回
- 展示层(View):用户UI页面,请求输入和响应展示
- 控制层(Control):执行业务逻辑
- 应用层(Service):控制层和数据层的桥梁
Expand All @@ -34,8 +35,8 @@ tags: [architecture]
缺点:

### 事件驱动
事件产生并发送到Channel,由事件调度器调度到不同的处理器执行
![img.png](./assets/images/img_8.png){:height="10%" width="50%"}
事件产生并发送到Channel,由事件调度器调度到不同的处理器执行
- 事件产生:各种业务场景下触发生成事件
- Channel:事件队列
- 调度器:从Channel中取出事件,并调度到不同的处理器
Expand All @@ -46,11 +47,55 @@ tags: [architecture]
- 大型复杂系统下,处理异步事件

### 阶段事件驱动
![img.png](./assets/images/img.png){:height="10%" width="50%"}
简称SEDA(stage event driver architecture),各个stage之间通过event通讯,stage内处理event使用线程池异步处理。结合了事件驱动和多线程模式优点。

场景:
- Pipeline处理
- 可异步

优点:
- 易扩展
- 事件驱动,异步,解耦
- 充分利用多线程模型

### Pipeline管道-过滤器
![img.png](./assets/images/img_1.png){:height="10%" width="50%"}
统一过滤器之间通讯协议,每个管道都是非定向的和点对点,接收一个源的输入和输出到另一个源

场景:
- 流水线处理
- 语法分词

优点:
- 可插拔、易扩展

缺点:
- 不适合交互性系统
- 过滤器之间频繁解析和反解析导致性能损失

### 微服务
![img.png](./assets/images/img_3.png){:height="10%" width="50%"}
将系统划分多个独立服务,每个服务可以独立部署,拥有自己的api边界,管理自己的数据库,可以是不同的开发语言。

场景:
- 适合大型系统

优点:
- 容灾性好
- 配合k8s扩缩容

缺点:
- 系统设计必须容忍服务失败
- 分布式事务问题
- 运维复杂

### 基于空间
![img.png](./assets/images/img_4.png){:height="10%" width="50%"}

### 参考
- [阶段式服务器模型](https://zh.wikipedia.org/wiki/%E9%98%B6%E6%AE%B5%E5%BC%8F%E6%9C%8D%E5%8A%A1%E5%99%A8%E6%A8%A1%E5%9E%8B)
- [架构师必须了解的 5 种最佳软件架构模式](https://www.infoq.cn/article/vrquohwkwjjghb5wvx1y)
- [程序员必知的几种软件架构模式](https://www.infoq.cn/article/6rx047oohjlrdipd1bc2)
- [阶段式服务器模型](https://zh.wikipedia.org/wiki/%E9%98%B6%E6%AE%B5%E5%BC%8F%E6%9C%8D%E5%8A%A1%E5%99%A8%E6%A8%A1%E5%9E%8B)
- [SEDA架构实现](https://blog.51cto.com/SoyTechnology/3346495)
- [基于空间的架构](http://www.uml.org.cn/zjjs/202212164.asp)
11 changes: 11 additions & 0 deletions _posts/2024-03-29-architecture_answer.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
---
title: 关于架构设计问答
date: 2024-03-29 00:00:00 +0800
categories: [root, architecture]
tags: [architecture, interview]
author: ahern
---

### 你会如何做架构设计改造?为什么?
- 模版:1、复杂来源,2、解决方案,3、评估标准,4、技术实现
- 参考:https://www.cnblogs.com/edisonchou/p/architecture_design_learning_in_5mins_part3.html
193 changes: 193 additions & 0 deletions _posts/2024-03-29-clickhouse.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,193 @@
---
title: Clickhouse笔记
date: 2024-03-29 00:00:00 +0800
categories: [root, clickhouse]
tags: [clickhouse]
author: ahern
---
## 概述
- 端口8123:tcp转发
- 端口9000:http转发
- 端口9009:interserverHTTPPort
- 列式存储:磁盘存储按列存储
- 多核并行:分区间并行处理数据(查询,合并等)
- 多种表引擎:常用MergeTree Family,Log Family,Integrations
- 写入建议:一批大于1000行,或每秒不超过一个写入请求

## 数据类型
- 整型:Int8、Int16、Int32、Int64;UInt8、UInt16、UInt32、UInt64
- 浮点型:Float32、Float64
- 布尔值:无该类型,用UInt8代替
- 字符串:String,FixedString(N)
- 时间类型:Date,Datetime,Datetime64
- LowCardinality: 对数据类型进行二次字典编码;修改底层数据存储
- 适用场景:原始数据冗长,去重后的计数值<1000
- 优点:降低磁盘存储空间,提高查询性能
- 缺点:写性能有所下降
- 参考
- https://blog.csdn.net/jiangshouzhuang/article/details/103268340
- https://github.com/ClickHouse/clickhouse-presentations/blob/master/meetup19/string_optimization.pdf
- https://blog.csdn.net/jiangshouzhuang/article/details/103268340
- 枚举:对比LowCardinality,枚举更加适合静态字典的场景
- 数组
- 不推荐多维数组
- 参考:https://clickhouse.com/docs/en/sql-reference/data-types/

## 目录结构
```
root@/data/clickhouse# tree -L 1
.
|-- data // 数据、表元数据
|-- format_schemas
|-- log // ck-server日志文件
`-- tmp
root@/data/clickhouse/data# tree -L 1
.
|-- data // 数据、索引文件
|-- dictionaries_lib
|-- flags
|-- metadata // 表元数据
|-- metadata_dropped
|-- preprocessed_configs
|-- status
`-- store // data下数据文件是以软连接到store目录
// 表partition
root@/data/clickhouse/data/data/{database}/{table}# tree -L 1
.
|-- 20220917_20_20_0 // patition
|-- 20220917_21_21_0
|-- detached // 记录损坏partition
`-- format_version.txt // version
// partition目录
root@/data/clickhouse/data/data/{database}/{table}/{patition}# tree -L 1
.
|-- checksums.txt // 校验文件
|-- columns.txt // 列信息(字段名、类型)
|-- count.txt // 总数
|-- data.bin // 数据
|-- data.mrk3 // 数据标记文件,索引文件会用到该标记
|-- default_compression_codec.txt // 压缩
|-- minmax_timestamp.idx // 分区minimal索引
|-- partition.dat // 分区信息
`-- primary.idx // 主键索引
```
## partition命名规则
```
20220917_1_1_0
[分区名]-[最小分区块编号]-[最大分区块编号]-[合并数次]
分区名:跟partition by参数有关,有整数字符串、日期、哈希值
分区块编号:新生成的分区自增
合并数次:合并一次加1
```

## 表引擎

#### MergeTree
- 支持索引和分区
- partition by(optional):指定分区规则,一般是按时间
- primary key(optional):主键,只提供一级索引,没有唯一约束
- order by(required):分区内排序,主键必须是order by字段的前缀字段
- settings(optional):一些额外控制参数,如index_granularity索引粒度,默认8192
- TTL:支持列ttl,表级ttl
- 参考:https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree

#### ReplacingMergeTree
- 在MergeTree的基础上增加去重功能
- 入参为版本字段:如engine =ReplacingMergeTree(create_time)
- 根据order by字段进行去重,不能跨分区去重
- 在分区合并时才进行去重
- 参考:https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree

#### SummingMergeTree
- 入参为需要汇总的字段:如ENGINE = SummingMergeTree([columns])
- 以order by列为维度
- 同一分区才会做聚合处理
- 分区合并时进行聚合
- 参考:https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/summingmergetree

#### Distributed
![img.png](./assets/images/img_2.png){:height="10%" width="50%"}
- 分布式表:逻辑表,对该表进行操作时,会被路由到本地,然后汇总结果返回给用户
- 本地表:实际存储数据的表
- 常见分布式集群方案:
- MergeTree + Distributed
- MergeTree + Distributed+集群复制
- ReplicatedMergeTree + Distributed
- 参考
- https://zhuanlan.zhihu.com/p/161242274
- https://clickhouse.com/docs/en/engines/table-engines/special/distributed
- https://www.cnblogs.com/yisany/p/13524018.html

#### \*MergeTree与Replicated\*MergeTree区别
- \*MergeTree数据同步依赖数据库同步机制,不依赖zookeeper
- Replicated\*MergeTree依赖zookeeper
- 参考:https://juejin.cn/post/6875235444909408263

## 索引(MergeTree)
参考:https://sobriver.top/2021/07/07/%E7%BC%96%E7%A8%8B/clickhouse/clickhouse%E7%B4%A2%E5%BC%95%E5%8E%9F%E7%90%86%E4%BB%8B%E7%BB%8D/

#### 主键索引
- 没有唯一约束
- 稀疏索引
- 由建表语句index_granularity指定索引粒度,默认8192

#### 分区索引
- 记录分区下分区字段对应原始数据的最小和最大值
- 查询语句指定分区字段时,通过该索引快速定位到分区

#### 跳数索引
- INDEX index_name expr TYPE type(...) GRANULARITY granularity_value,type:minmax, set, bloom_filter等
- minmax:指定一个值范围
- set(max_rows):保存表达式去重复后值,适用重复性高的字段
- bloom_filter([false_positive]):布隆过滤器,false_positive为误报率
- ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed):对String, FixedString 和 Map类型数据有效,可用于优化 EQUALS, LIKE 和 IN表达式。
- tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed):适用全文搜索

#### 稀疏索引
- 分区数据已经按order by字段排序,在这个基础上再创建索引(二级索引)

#### 参考
- 类型:https://clickhouse.com/docs/en/guides/improving-query-performance/skipping-indexes#skip-index-types
- 布隆过滤器参数计算:https://hur.st/bloomfilter/
- https://blog.csdn.net/haveanybody/article/details/123919938

## explain
#### 是否走索引
- explain indexes = 1
- https://www.modb.pro/db/161379

## 副本同步原理
![img.png](https://raw.githubusercontent.com/li-zeyuan/access/master/img/202210241830075.png){:height="10%" width="50%"}
- 1、client向某个server1发送写入请求
- 2、server1写入本地
- 3、同步operation log到zookeeper
- 4、其他server监听到operation log变化并拉取operation log
- 5、解析operation log,并从server1拉取数据
- 总结:
- 1、谁处理client请求,谁负责。负责同步operation log数据到zookeeper
- 2、zookeeper不参与实质的data数据传输,只负责log同步

## 分区操作
- 参考:https://clickhouse.com/docs/en/sql-reference/statements/alter/partition

## 数据压缩
- 数据类型存储:https://aop.pub/artical/database/clickhouse/datatype-storage/
- 压缩算法选型:https://blog.csdn.net/neweastsun/article/details/130974311;https://chistadata.com/compression-algorithms-and-codecs-in-clickhouse/
- 压缩算法:https://developer.aliyun.com/article/780586

## 总结
- clickhouse采用列式存储,适合OLAP场景
- 多种数据类型,不支持Bool,LowCardinality对数据类型进行二次编码
- 常用MergeTree序列表引擎,关键字段:partition by,primary key,order by,settings
- 索引:主键索引,分区索引,跳数索引
- 常用分区操作

## 参考
- 官方文档:https://clickhouse.com/docs/en/intro
- https://blog.csdn.net/qq_40378034/article/details/120256757
- BitMap及其在ClickHouse中的应用:https://zhuanlan.zhihu.com/p/480345952

48 changes: 48 additions & 0 deletions _posts/2024-03-29-clickhouse_troubleshooting.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
---
title: Clickhouse故障排查
date: 2024-03-29 00:00:00 +0800
categories: [root, clickhouse]
tags: [clickhouse,troubleshooting]
author: ahern
---

## 写入时间字段(不带时区)时区问题

#### 复现
1. insert 语句没有带时区,时间字段的值不会发生变化
```sql
insert into trace_names_all
values ('2022-07-15 00:00:08',
'test_service_name',
'test_span_name');
```

2. 修改clickhouse-server时区:UTC -> UTC+8
- 修改前:时区为UTC时插入的数据:
```sql
insert into trace_names_all
values ('2022-07-15 00:00:00',
'test_service_name',
'test_span_name');
```
- **修改时区后2022-07-15 00:00:00.000 变成 2022-07-15 08:00:00.000,加8小时**
┌───────────────timestamp─┬─serviceName───────┬─spanName───────┐
│ 2022-07-15 08:00:00.000 │ test_service_name │ test_span_name │
└─────────────────────────┴───────────────────┴────────────────┘
- 修改后:时区为UTC+8插入的数据:
```sql
insert into trace_names_all
values ('2022-07-15 00:00:08',
'test_service_name',
'test_span_name');
```
┌───────────────timestamp─┬─serviceName───────┬─spanName───────┐
│ 2022-07-15 00:00:08.000 │ test_service_name │ test_span_name │
└─────────────────────────┴───────────────────┴────────────────┘

#### 原因
插入时间字段时(不带时区),ch默认该字段时区为ch的时区。

#### 解决

Loading

0 comments on commit 2e85d95

Please sign in to comment.