chore: clean lz4 and zstd version dependency and fix docker readme.

cubefs · Dec 28, 2023 · 31bff26
2 parents 6fe28bf + dbda6cb
commit 31bff26

Showing 65 changed files with 1,141 additions and 589 deletions.
diff --git a/README.md b/README.md
@@ -2,29 +2,32 @@
 [Chinese Document](README_zh.md)
 
 ### Abstract
-Compass is a platform to diagnose computing engines and schedulers around big data ecosystem, which aims to improve the 
-efficiency of troubleshooting and reduce the complexity of problem tuning. It automatically gathers logs and metrics,
-runs with heuristic rules to identify problem and offers tuning advice.
+Compass is a platform for diagnosing computing engines and schedulers in the big data ecosystem, aiming to improve
+the efficiency of troubleshooting and reduce the complexity of problem tuning. It automatically collects logs and
+metrics, and uses heuristic rules to identify problems and provide tuning advice. In addition, ChatGPT is used to
+provide diagnostic suggestions for logs. Logs are automatically aggregated into templates using the drain algorithm;
+these templates can support manual intervention and help automate diagnosis and optimization.
 
 ### Feature
-1. Non-invasive, in-time diagnosis, no need to modify the original platform code
-2. Compatible with multiple version for different componts such Spark 2.4+、Flink 1.2+、Hadoop 2.4+, DolphinScheduler 2.x+, Airflow, etc
-3. Supports diagnostics for kinds of scheduling job issues, such as failure, abnormal elapsed time, abnormal baseline, etc
-4. Supports diagnostics for kinds of engine task issues, such as data skew, big table scan,  memory waste, long tail task, etc
-5. Supports diagnostics for capturing log exception and offers advise or solution
-
-### Engine Support
+1. Non-invasive, timely diagnosis with no need to modify the original platform code.
+2. Compatible with multiple versions of different components, such as Spark 2.4+, Flink 1.2+, Hadoop 2.4+, DolphinScheduler 2.x+, Airflow, etc.
+3. Supports diagnostics for various scheduling job issues, such as failure, abnormal elapsed time, abnormal baseline, etc.
+4. Supports diagnostics for various engine task issues, such as data skew, big table scan, memory waste, long tail task, etc.
+5. Supports capturing log exceptions and offers advice or solutions.
+6. Supports ChatGPT to diagnose abnormal logs and provide solutions; uses the drain algorithm to aggregate templates, saving costs.
+
+### Feature Support
+- [x] ChatGPT
 - [x] Spark
 - [x] Flink
 - [x] Mapreduce
 - [ ] Trino
-- [ ] Other(Any suggestions are welcomed, high valued)...
-
-### Scheduler Support
+- [ ] Spark Tez
 - [x] Airflow
 - [x] DolphinScheduler
 - [ ] Azkaban
 - [ ] Oozie
+- [ ] Debezium (Synchronize Postgresql data to Postgresql)
 - [ ] Other (any suggestions are welcome and highly valued)...
 
 ###  Documents
@@ -36,49 +39,49 @@ runs with heuristic rules to identify problem and offers tuning advice.
 ### Community
 Welcome to join the community for the usage or development of Compass.
 - Submit an [issue](https://github.com/cubefs/compass/issues).
-- Submit a pull request, please read the [contributing guideline](https://github.com/cubefs/compass/blob/main/CONTRIBUTING.md),
-- Discuss [idea & question](https://github.com/cubefs/compass/discussions)
+- Submit a pull request, please read the [contributing guideline](https://github.com/cubefs/compass/blob/main/CONTRIBUTING.md).
+- Discuss [idea & question](https://github.com/cubefs/compass/discussions).
 
 We usually reply quickly.
 
 ### Categories of Diagnosis
 
-|   Category  |     Scope   |  Dimension |   Description      |
-|-------------|-------------|------------|-----------------|
-|Failed task  |Scheduler|Runtime Analysis|Fail to run task successfully after retrying per running cycle|
-|First failed task|Scheduler|Runtime Analysis|Fail to run task first time but succeed after retrying per running cycle 
-|Long-term failed task|Scheduler|Runtime Analysis|Keep failing to run task every running cycle|
-|Exceed base-time task|Scheduler|Time Analysis|The run ends earlier or later than normal|
-|Abnormal time-elapsed task|Scheduler|Time Analysis|The elapsed time of task is either too short or too long compared to the normal|
-|Long time-consuming task|Scheduler|Time Analysis|The elapsed time of task is exceed 2 hours|
-|Failed SQL task|Spark|Runtime Analysis|failed to run sql|
-|Shuffle failed task|Spark|Runtime Analysis|failed to run task due to being unable to shuffle successfully|
-|Memory Overflow|Spark|Runtime Analysis|There is not enough memory to run task|
-|CPU waste|Spark,MapReduce|Resource Analysis|The usage of CPU is not high|
-|Memory waste|Spark|Resource Analysis|The usage of Memory is not high|
-|Large table scan|Spark,MapReduce|Efficiency Analysis|Scan too many rows of large table due to no partitions or no filters|
-|Memory overflow warning|Spark|Efficiency Analysis|The size or rows of data broadcast from driver to executor is too many, which may cause memory overflow|
-|Data skew|Spark,MapReduce|Efficiency Analysis|The maximum data each processing unit(task/map/reduce) is larger than the median|
-|Abnormal time-consuming job|Spark|Efficiency Analysis|There is a higher ratio of idle time during the run of the job |
-|Abnormal time-consuming stage|Spark|Efficiency Analysis|There is a higher ratio of idle time during the run of the stage|
-|Long tail task|Spark,MapReduce|Efficiency Analysis|The maximum running time of a processing unit(task/map/reduce) is much larger than the median|
-|Hdfs read/write stuck|Spark|Efficiency Analysis|The rate of processing data each task is much slower than that in a normal stage|
-|Speculative tasks|Spark,MapReduce|Efficiency Analysis|There are too many speculative tasks because of the executor is processing slowly|
-|Abnormal global sort|Spark|Efficiency Analysis|The whole Spark application contains only one task|
-|Abnormal gc|MapReduce|Efficiency Analysis|There is a higher ratio gc time compared to CPU time|
-|High memory usage|Flink|Resource Analysis|The usage of the memory is high|
-|Low memory usage|Flink|Resource Analysis|The usage of the memory is low|
-|Abnormal jobmanager memory|Flink|Resource Analysis|The memory of jobmanager is abnormal if there is too many taskmanager|
-|No data processing|Flink|Resource Analysis|There is no data processing in a job|
-|No data in partial task|Flink|Resource Analysis|There is no data processing in partial taskmanagers|
-|Optimize taskmanager memory|Flink|Resource Analysis|Optimize the memory of taskmanager due to the abnormal memory given|
-|Not enough Parallel|Flink|Resource Analysis|There is less parallel for flink job|
-|High CPU usage|Flink|Resource Analysis|The usage of the CPU is high|
-|Low CPU usage|Flink|Resource Analysis|The usage of the CPU is low|
-|High Maximum CPU usage|Flink|Resource Analysis|The peek of the CPU is high|
-|Slow operators|Flink|Runtime Analysis|There are slow operators in a flink job|
-|Back pressure|Flink|Runtime Analysis|There is back pressure in a flink job|
-|High delay|Flink|Runtime Analysis|There is high delay in a flink job|
+| Category                      | Scope           | Dimension           | Description                                                                                             |
+|-------------------------------|-----------------|---------------------|---------------------------------------------------------------------------------------------------------|
+| Failed task                   | Scheduler       | Runtime Analysis    | Fail to run task successfully after retrying per running cycle                                          |
+| First failed task             | Scheduler       | Runtime Analysis    | Fail to run task first time but succeed after retrying per running cycle                                |
+| Long-term failed task         | Scheduler       | Runtime Analysis    | Keep failing to run task every running cycle                                                            |
+| Exceed base-time task         | Scheduler       | Time Analysis       | The run ends earlier or later than normal                                                               |
+| Abnormal time-elapsed task    | Scheduler       | Time Analysis       | The elapsed time of task is either too short or too long compared to the normal                         |
+| Long time-consuming task      | Scheduler       | Time Analysis       | The elapsed time of task exceeds 2 hours                                                                |
+| Failed SQL task               | Spark           | Runtime Analysis    | Failed to run sql                                                                                       |
+| Shuffle failed task           | Spark           | Runtime Analysis    | Failed to run task due to being unable to shuffle successfully                                          |
+| Memory Overflow               | Spark           | Runtime Analysis    | There is not enough memory to run task                                                                  |
+| CPU waste                     | Spark,MapReduce | Resource Analysis   | The usage of CPU is not high                                                                            |
+| Memory waste                  | Spark           | Resource Analysis   | The usage of Memory is not high                                                                         |
+| Large table scan              | Spark,MapReduce | Efficiency Analysis | Scan too many rows of large table due to no partitions or no filters                                    |
+| Memory overflow warning       | Spark           | Efficiency Analysis | The size or rows of data broadcast from driver to executor is too many, which may cause memory overflow |
+| Data skew                     | Spark,MapReduce | Efficiency Analysis | The maximum data each processing unit(task/map/reduce) is larger than the median                        |
+| Abnormal time-consuming job   | Spark           | Efficiency Analysis | There is a higher ratio of idle time during the run of the job                                          |
+| Abnormal time-consuming stage | Spark           | Efficiency Analysis | There is a higher ratio of idle time during the run of the stage                                        |
+| Long tail task                | Spark,MapReduce | Efficiency Analysis | The maximum running time of a processing unit(task/map/reduce) is much larger than the median           |
+| Hdfs read/write stuck         | Spark           | Efficiency Analysis | The rate of processing data each task is much slower than that in a normal stage                        |
+| Speculative tasks             | Spark,MapReduce | Efficiency Analysis | There are too many speculative tasks because the executor is processing slowly                          |
+| Abnormal global sort          | Spark           | Efficiency Analysis | The whole Spark application contains only one task                                                      |
+| Abnormal gc                   | MapReduce       | Efficiency Analysis | There is a higher ratio of gc time compared to CPU time                                                 |
+| High memory usage             | Flink           | Resource Analysis   | The usage of the memory is high                                                                         |
+| Low memory usage              | Flink           | Resource Analysis   | The usage of the memory is low                                                                          |
+| Abnormal jobmanager memory    | Flink           | Resource Analysis   | The memory of jobmanager is abnormal if there are too many taskmanagers                                 |
+| No data processing            | Flink           | Resource Analysis   | There is no data processing in a job                                                                    |
+| No data in partial task       | Flink           | Resource Analysis   | There is no data processing in partial taskmanagers                                                     |
+| Optimize taskmanager memory   | Flink           | Resource Analysis   | Optimize the memory of taskmanager due to the abnormal memory given                                     |
+| Not enough Parallel           | Flink           | Resource Analysis   | There is less parallel for flink job                                                                    |
+| High CPU usage                | Flink           | Resource Analysis   | The usage of the CPU is high                                                                            |
+| Low CPU usage                 | Flink           | Resource Analysis   | The usage of the CPU is low                                                                             |
+| High Maximum CPU usage        | Flink           | Resource Analysis   | The peak of the CPU is high                                                                             |
+| Slow operators                | Flink           | Runtime Analysis    | There are slow operators in a flink job                                                                 |
+| Back pressure                 | Flink           | Runtime Analysis    | There is back pressure in a flink job                                                                   |
+| High delay                    | Flink           | Runtime Analysis    | There is high delay in a flink job                                                                      |
 
 
 

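The "Data skew" and "Long tail task" rows in the table above both describe the same kind of heuristic: compare the maximum value across processing units with the median. The following is a minimal Python sketch of that idea only. The function name `is_skewed`, the `ratio_threshold` default of 10, and the sample sizes are hypothetical and chosen for illustration; they are not taken from Compass's code or configuration.

```python
from statistics import median


def is_skewed(values, ratio_threshold=10.0):
    """Return True when the largest value exceeds ratio_threshold times the median.

    `values` holds one measurement per processing unit (task/map/reduce),
    for example bytes read or running time in seconds.
    """
    if not values:
        return False
    mid = median(values)
    if mid <= 0:
        # A zero or negative median gives no meaningful ratio; report no skew.
        return False
    return max(values) / mid > ratio_threshold


# Hypothetical per-task input sizes.
balanced = [100, 110, 95, 105, 120]
skewed = [100, 110, 95, 105, 5000]

print(is_skewed(balanced))
print(is_skewed(skewed))
```

Using the median rather than the mean keeps a single oversized unit from inflating the baseline it is compared against.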
diff --git a/README_zh.md b/README_zh.md
@@ -2,7 +2,9 @@
 
 [English document](README.md)
 
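-Compass is a diagnosis platform for big data tasks that aims to make troubleshooting more efficient for users and to reduce the cost of abnormal tasks.
+Compass is a platform for diagnosing computing engines and schedulers in the big data ecosystem, aiming to improve troubleshooting efficiency and reduce the complexity of problem tuning.
+It automatically collects logs and metrics. Besides using heuristic rules to identify problems and provide tuning advice, it also uses ChatGPT to provide diagnostic suggestions for logs.
+Logs are automatically aggregated into templates with the drain algorithm; the templates can be used for manual intervention and similar purposes, improving diagnosis automation and optimization capabilities.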
 
 Its main features are as follows:
 
@@ -12,7 +14,22 @@
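 - Supports anomaly diagnosis at the workflow layer, identifying various failures and baseline elapsed-time anomalies.
 - Supports anomaly diagnosis at the engine layer, covering 14 anomaly types such as data skew, large table scan, and memory waste.
 - Supports writing various log matching rules and adjusting anomaly thresholds, so you can tune them to your actual scenario.
-- Supports one-click diagnosis of all Spark/MapReduce tasks (including tasks not submitted through the scheduling platform)
+- Supports one-click diagnosis of all Spark/MapReduce tasks (including tasks not submitted through the scheduling platform).
+- Supports using ChatGPT to diagnose abnormal logs and provide solutions; uses the drain algorithm to aggregate templates, saving costs.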
+
+## Supported Components
+- [x] ChatGPT
+- [x] Spark
+- [x] Flink
+- [x] Mapreduce
+- [ ] Trino
+- [ ] Spark Tez
+- [x] Airflow
+- [x] DolphinScheduler
+- [ ] Azkaban
+- [ ] Oozie
+- [ ] Debezium (Synchronize Postgresql data to Postgresql)
+- [ ] Other (constructive suggestions for other task types are very welcome)...
 
 ## Documents
 
@@ -24,9 +41,9 @@
 
 Welcome to join the community to ask about usage or to become a Compass developer. Here are the ways to get help:
 
-- Submit an [issue](https://github.com/cubefs/compass/issues).
-- Submit a pull request; please read the [contributing guideline](https://github.com/cubefs/compass/blob/main/CONTRIBUTING.md),
-- Discuss [Idea & Question](https://github.com/cubefs/compass/discussions).
+- Submit an [issue](https://github.com/cubefs/compass/issues)。
+- Submit a pull request; please read the [contributing guideline](https://github.com/cubefs/compass/blob/main/CONTRIBUTING.md)。
+- Discuss [Idea & Question](https://github.com/cubefs/compass/discussions)。
 
 We will reply as soon as possible.