chore: clean lz4 and zstd version dependency and fix docker readme.
zebozhuang committed Dec 28, 2023
2 parents 6fe28bf + dbda6cb commit 31bff26
Showing 65 changed files with 1,141 additions and 589 deletions.
105 changes: 54 additions & 51 deletions README.md
@@ -2,29 +2,32 @@
 [Chinese Document](README_zh.md)

 ### Abstract
-Compass is a platform to diagnose computing engines and schedulers around big data ecosystem, which aims to improve the
-efficiency of troubleshooting and reduce the complexity of problem tuning. It automatically gathers logs and metrics,
-runs with heuristic rules to identify problem and offers tuning advice.
+Compass is a platform for diagnosing computing engines and schedulers in the big data ecosystem, aiming to improve
+the efficiency of troubleshooting and reduce the complexity of problem tuning. It automatically collects logs and
+metrics, and uses heuristic rules to identify problems and provide tuning advice. In addition, for logs, ChatGPT is
+used to provide diagnostic suggestions. The logs are automatically aggregated into templates using the drain algorithm,
+which can be used for manual intervention, etc., to improve the automation of diagnosis and optimization solutions.

 ### Feature
-1. Non-invasive, timely diagnosis, no need to modify the original platform code
-2. Compatible with multiple versions of different components, such as Spark 2.4+, Flink 1.2+, Hadoop 2.4+, DolphinScheduler 2.x+, Airflow, etc
-3. Supports diagnosing various scheduling job issues, such as failure, abnormal elapsed time, abnormal baseline, etc
-4. Supports diagnosing various engine task issues, such as data skew, big table scan, memory waste, long tail task, etc
-5. Supports capturing log exceptions and offers advice or solutions
-
-### Engine Support
+1. Non-invasive, timely diagnosis, no need to modify the original platform code.
+2. Compatible with multiple versions of different components, such as Spark 2.4+, Flink 1.2+, Hadoop 2.4+, DolphinScheduler 2.x+, Airflow, etc.
+3. Supports diagnosing various scheduling job issues, such as failure, abnormal elapsed time, abnormal baseline, etc.
+4. Supports diagnosing various engine task issues, such as data skew, big table scan, memory waste, long tail task, etc.
+5. Supports capturing log exceptions and offers advice or solutions.
+6. Supports ChatGPT to diagnose abnormal logs and provide solutions; uses the drain algorithm to aggregate templates, saving costs.
+
+### Feature Support
+- [x] ChatGPT
 - [x] Spark
 - [x] Flink
 - [x] MapReduce
 - [ ] Trino
-- [ ] Other (any suggestions are welcome and highly valued)...
-
-### Scheduler Support
+- [ ] Spark Tez
 - [x] Airflow
 - [x] DolphinScheduler
 - [ ] Azkaban
 - [ ] Oozie
+- [ ] Debezium (Synchronize PostgreSQL data to PostgreSQL)
 - [ ] Other (any suggestions are welcome and highly valued)...

 ### Documents
@@ -36,49 +39,49 @@ runs with heuristic rules to identify problem and offers tuning advice.
 ### Community
 You are welcome to join the community to use or develop Compass.
 - Submit an [issue](https://github.com/cubefs/compass/issues).
-- Submit a pull request; please read the [contributing guideline](https://github.com/cubefs/compass/blob/main/CONTRIBUTING.md),
-- Discuss [ideas & questions](https://github.com/cubefs/compass/discussions)
+- Submit a pull request; please read the [contributing guideline](https://github.com/cubefs/compass/blob/main/CONTRIBUTING.md).
+- Discuss [ideas & questions](https://github.com/cubefs/compass/discussions).

 We usually reply quickly.

 ### Categories of Diagnosis

-| Category | Scope | Dimension | Description |
-|-------------|-------------|------------|-----------------|
-|Failed task |Scheduler|Runtime Analysis|The task still fails after retries within a running cycle|
-|First failed task|Scheduler|Runtime Analysis|The task fails on the first attempt but succeeds after retrying within a running cycle
-|Long-term failed task|Scheduler|Runtime Analysis|The task keeps failing in every running cycle|
-|Exceed base-time task|Scheduler|Time Analysis|The run ends earlier or later than normal|
-|Abnormal time-elapsed task|Scheduler|Time Analysis|The elapsed time of the task is either too short or too long compared to normal|
-|Long time-consuming task|Scheduler|Time Analysis|The elapsed time of the task exceeds 2 hours|
-|Failed SQL task|Spark|Runtime Analysis|failed to run SQL|
-|Shuffle failed task|Spark|Runtime Analysis|failed to run task due to being unable to shuffle successfully|
-|Memory Overflow|Spark|Runtime Analysis|There is not enough memory to run the task|
-|CPU waste|Spark,MapReduce|Resource Analysis|The usage of CPU is not high|
-|Memory waste|Spark|Resource Analysis|The usage of memory is not high|
-|Large table scan|Spark,MapReduce|Efficiency Analysis|Scans too many rows of a large table because of missing partitions or filters|
-|Memory overflow warning|Spark|Efficiency Analysis|Too much data (size or rows) is broadcast from driver to executor, which may cause memory overflow|
-|Data skew|Spark,MapReduce|Efficiency Analysis|The maximum data size per processing unit (task/map/reduce) is larger than the median|
-|Abnormal time-consuming job|Spark|Efficiency Analysis|There is a high ratio of idle time during the run of the job |
-|Abnormal time-consuming stage|Spark|Efficiency Analysis|There is a high ratio of idle time during the run of the stage|
-|Long tail task|Spark,MapReduce|Efficiency Analysis|The maximum running time of a processing unit (task/map/reduce) is much larger than the median|
-|HDFS read/write stuck|Spark|Efficiency Analysis|The data processing rate of each task is much slower than in a normal stage|
-|Speculative tasks|Spark,MapReduce|Efficiency Analysis|There are too many speculative tasks because the executor is processing slowly|
-|Abnormal global sort|Spark|Efficiency Analysis|The whole Spark application contains only one task|
-|Abnormal GC|MapReduce|Efficiency Analysis|The ratio of GC time to CPU time is high|
-|High memory usage|Flink|Resource Analysis|The usage of the memory is high|
-|Low memory usage|Flink|Resource Analysis|The usage of the memory is low|
-|Abnormal jobmanager memory|Flink|Resource Analysis|The jobmanager memory is abnormal when there are too many taskmanagers|
-|No data processing|Flink|Resource Analysis|There is no data processing in a job|
-|No data in partial task|Flink|Resource Analysis|There is no data processing in some taskmanagers|
-|Optimize taskmanager memory|Flink|Resource Analysis|The taskmanager memory should be optimized because the allocated memory is abnormal|
-|Not enough Parallel|Flink|Resource Analysis|The parallelism of the Flink job is too low|
-|High CPU usage|Flink|Resource Analysis|The usage of the CPU is high|
-|Low CPU usage|Flink|Resource Analysis|The usage of the CPU is low|
-|High Maximum CPU usage|Flink|Resource Analysis|The peak CPU usage is high|
-|Slow operators|Flink|Runtime Analysis|There are slow operators in a Flink job|
-|Back pressure|Flink|Runtime Analysis|There is back pressure in a Flink job|
-|High delay|Flink|Runtime Analysis|There is high delay in a Flink job|
+| Category                      | Scope           | Dimension           | Description                                                                                             |
+|-------------------------------|-----------------|---------------------|---------------------------------------------------------------------------------------------------------|
+| Failed task                   | Scheduler       | Runtime Analysis    | The task still fails after retries within a running cycle                                               |
+| First failed task             | Scheduler       | Runtime Analysis    | The task fails on the first attempt but succeeds after retrying within a running cycle                  |
+| Long-term failed task         | Scheduler       | Runtime Analysis    | The task keeps failing in every running cycle                                                           |
+| Exceed base-time task         | Scheduler       | Time Analysis       | The run ends earlier or later than normal                                                               |
+| Abnormal time-elapsed task    | Scheduler       | Time Analysis       | The elapsed time of the task is either too short or too long compared to normal                         |
+| Long time-consuming task      | Scheduler       | Time Analysis       | The elapsed time of the task exceeds 2 hours                                                            |
+| Failed SQL task               | Spark           | Runtime Analysis    | Failed to run SQL                                                                                       |
+| Shuffle failed task           | Spark           | Runtime Analysis    | Failed to run task due to being unable to shuffle successfully                                          |
+| Memory Overflow               | Spark           | Runtime Analysis    | There is not enough memory to run the task                                                              |
+| CPU waste                     | Spark,MapReduce | Resource Analysis   | The usage of CPU is not high                                                                            |
+| Memory waste                  | Spark           | Resource Analysis   | The usage of memory is not high                                                                         |
+| Large table scan              | Spark,MapReduce | Efficiency Analysis | Scans too many rows of a large table because of missing partitions or filters                           |
+| Memory overflow warning       | Spark           | Efficiency Analysis | Too much data (size or rows) is broadcast from driver to executor, which may cause memory overflow      |
+| Data skew                     | Spark,MapReduce | Efficiency Analysis | The maximum data size per processing unit (task/map/reduce) is larger than the median                   |
+| Abnormal time-consuming job   | Spark           | Efficiency Analysis | There is a high ratio of idle time during the run of the job                                            |
+| Abnormal time-consuming stage | Spark           | Efficiency Analysis | There is a high ratio of idle time during the run of the stage                                          |
+| Long tail task                | Spark,MapReduce | Efficiency Analysis | The maximum running time of a processing unit (task/map/reduce) is much larger than the median          |
+| HDFS read/write stuck         | Spark           | Efficiency Analysis | The data processing rate of each task is much slower than in a normal stage                             |
+| Speculative tasks             | Spark,MapReduce | Efficiency Analysis | There are too many speculative tasks because the executor is processing slowly                          |
+| Abnormal global sort          | Spark           | Efficiency Analysis | The whole Spark application contains only one task                                                      |
+| Abnormal GC                   | MapReduce       | Efficiency Analysis | The ratio of GC time to CPU time is high                                                                |
+| High memory usage             | Flink           | Resource Analysis   | The usage of the memory is high                                                                         |
+| Low memory usage              | Flink           | Resource Analysis   | The usage of the memory is low                                                                          |
+| Abnormal jobmanager memory    | Flink           | Resource Analysis   | The jobmanager memory is abnormal when there are too many taskmanagers                                  |
+| No data processing            | Flink           | Resource Analysis   | There is no data processing in a job                                                                    |
+| No data in partial task       | Flink           | Resource Analysis   | There is no data processing in some taskmanagers                                                        |
+| Optimize taskmanager memory   | Flink           | Resource Analysis   | The taskmanager memory should be optimized because the allocated memory is abnormal                     |
+| Not enough Parallel           | Flink           | Resource Analysis   | The parallelism of the Flink job is too low                                                             |
+| High CPU usage                | Flink           | Resource Analysis   | The usage of the CPU is high                                                                            |
+| Low CPU usage                 | Flink           | Resource Analysis   | The usage of the CPU is low                                                                             |
+| High Maximum CPU usage        | Flink           | Resource Analysis   | The peak CPU usage is high                                                                              |
+| Slow operators                | Flink           | Runtime Analysis    | There are slow operators in a Flink job                                                                 |
+| Back pressure                 | Flink           | Runtime Analysis    | There is back pressure in a Flink job                                                                   |
+| High delay                    | Flink           | Runtime Analysis    | There is high delay in a Flink job                                                                      |
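
Several rows in the table (Data skew, Long tail task) describe the same kind of heuristic: compare the largest value among processing units with the median. The following Python sketch illustrates that idea only. The function names and the threshold of 10 are hypothetical and are not taken from the Compass code base, which may define its rules differently.

```python
from statistics import median


def max_to_median_ratio(values):
    """Return max(values) / median(values); 0.0 for empty input or a zero median."""
    if not values:
        return 0.0
    mid = median(values)
    if mid == 0:
        return 0.0
    return max(values) / mid


def is_skewed(values, threshold=10.0):
    """Flag a stage whose largest unit exceeds `threshold` times the median.

    The default threshold is a made-up value for illustration.
    """
    return max_to_median_ratio(values) > threshold


# Bytes read per task in one hypothetical stage: one task reads far more than the rest.
bytes_per_task = [100, 120, 110, 90, 5000]
print(is_skewed(bytes_per_task))
```

The same ratio applies to running times per task for the long tail check; only the input metric changes.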



27 changes: 22 additions & 5 deletions README_zh.md
@@ -2,7 +2,9 @@

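 [English document](README.md)

-Compass (罗盘) is a diagnosis platform for big data tasks that aims to help users troubleshoot problems more efficiently and reduce the cost of abnormal tasks.
+Compass is a platform for diagnosing computing engines and schedulers in the big data ecosystem, aiming to improve the efficiency of troubleshooting and reduce the complexity of problem tuning.
+It automatically collects logs and metrics. Besides using heuristic rules to identify problems and provide tuning advice, it also uses ChatGPT to provide diagnostic suggestions for logs.
+Logs are automatically aggregated into templates using the drain algorithm, which can be used for manual intervention, etc., improving the automation of diagnosis and optimization solutions.

 Its main features are as follows: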

@@ -12,7 +14,22 @@
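 - Supports workflow-level anomaly diagnosis, identifying various failures and abnormal baseline durations.
 - Supports engine-level anomaly diagnosis, covering 14 anomaly types such as data skew, large table scan, and memory waste.
 - Supports writing log-matching rules and adjusting anomaly thresholds, so they can be tuned for your own scenarios.
-- Supports one-click diagnosis of all Spark/MapReduce tasks (including tasks not submitted through the scheduling platform)
+- Supports one-click diagnosis of all Spark/MapReduce tasks (including tasks not submitted through the scheduling platform).
+- Supports ChatGPT to diagnose abnormal logs and provide solutions; uses the drain algorithm to aggregate templates, saving costs.
+
+## Supported Components
+- [x] ChatGPT
+- [x] Spark
+- [x] Flink
+- [x] MapReduce
+- [ ] Trino
+- [ ] Spark Tez
+- [x] Airflow
+- [x] DolphinScheduler
+- [ ] Azkaban
+- [ ] Oozie
+- [ ] Debezium (synchronize PostgreSQL data to PostgreSQL)
+- [ ] Other (we welcome constructive suggestions for other tasks)...

 ## Documents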

@@ -24,9 +41,9 @@

 You are welcome to join the community to ask about usage or to become a Compass developer. Here is how to get help:

-- Submit an [issue](https://github.com/cubefs/compass/issues).
-- Submit a pull request; please read the [contributing guideline](https://github.com/cubefs/compass/blob/main/CONTRIBUTING.md),
-- Discuss [Idea & Question](https://github.com/cubefs/compass/discussions).
+- Submit an [issue](https://github.com/cubefs/compass/issues)
+- Submit a pull request; please read the [contributing guideline](https://github.com/cubefs/compass/blob/main/CONTRIBUTING.md)
+- Discuss [Idea & Question](https://github.com/cubefs/compass/discussions)

 We will reply as soon as possible.