update docs and readme for release 2.4 (#328)
Signed-off-by: chenxu <[email protected]>
Co-authored-by: chenxu <[email protected]>
xuchen-plus and dmetasoul01 authored Sep 15, 2023
1 parent a930c09 commit 582508b
Showing 15 changed files with 56 additions and 46 deletions.
5 changes: 4 additions & 1 deletion README-CN.md
@@ -8,7 +8,7 @@ SPDX-License-Identifier: Apache-2.0

<img src='https://github.com/lfai/artwork/blob/main/lfaidata-assets/lfaidata-project-badge/sandbox/color/lfaidata-project-badge-sandbox-color.svg' alt="LF AI & Data Sandbox Project" height='180'>

LakeSoul is an open-source, cloud-native lakehouse framework with highly scalable metadata management, ACID transactions, efficient and flexible upsert operations, schema evolution, and unified batch-and-streaming processing.
LakeSoul is an open-source, cloud-native lakehouse framework with highly scalable metadata management, ACID transactions, efficient and flexible upsert operations, schema evolution, and unified batch-and-streaming processing. LakeSoul supports reading and writing lakehouse table data from multiple computing engines, including Spark, Flink, Presto, and PyTorch, covering batch, streaming, MPP, and AI computing modes. LakeSoul supports storage systems such as HDFS and S3.
![LakeSoul Architecture](website/static/img/lakeSoulModel.png)

LakeSoul was developed by DMetaSoul and was formally donated to the Linux Foundation AI & Data foundation in May 2023, where it is incubating as a Sandbox project.
@@ -17,11 +17,14 @@ LakeSoul is purpose-built for row- and column-level incremental processing of data on data-lake cloud storage

LakeSoul achieves high write throughput for upserts on hash-partitioned primary-key tables through an LSM-Tree-like approach, while a highly optimized merge-on-read implementation preserves read performance (see the [performance comparison](https://lakesoul-io.github.io/zh-Hans/blog/2023/04/21/lakesoul-2.2.0-release)). LakeSoul manages metadata through PostgreSQL, providing highly scalable metadata management and high-concurrency transaction capability.

LakeSoul implements its native metadata and IO layers in Rust and wraps them with C/Java/Python interfaces, so that both big-data and AI computing frameworks can integrate with it.

LakeSoul supports concurrent streaming and batch reads and writes, with full CDC-semantics compatibility for both; with automatic schema evolution and exactly-once semantics, it makes building end-to-end streaming data warehouses easy.

For more features and comparisons with other products, see the [feature introduction](https://lakesoul-io.github.io/zh-Hans/docs/intro).

# Tutorials
* [Lakehouse meets AI: data preprocessing and model training in Python](https://github.com/lakesoul-io/LakeSoul/tree/main/python/examples): LakeSoul connects the lakehouse and AI seamlessly, enabling a modern Data+AI architecture.
* [Whole-database CDC ingestion tutorial](https://lakesoul-io.github.io/zh-Hans/docs/Tutorials/flink-cdc-sink): LakeSoul synchronizes entire databases such as MySQL through Flink CDC, with automatic table creation, automatic DDL changes, and exactly-once guarantees.
* [Flink SQL tutorial](https://lakesoul-io.github.io/zh-Hans/docs/Usage%20Docs/flink-lakesoul-connector): LakeSoul supports Flink streaming and batch reads and writes. Streaming reads and writes fully support Flink changelog semantics, including row-level streaming inserts, updates, and deletes.
* [Multi-stream merge into wide tables tutorial](https://lakesoul-io.github.io/zh-Hans/docs/Tutorials/mutil-stream-merge): LakeSoul natively merges multiple streams that share a primary key (their other columns may differ) into one table automatically, eliminating joins.
11 changes: 9 additions & 2 deletions README.md
@@ -17,6 +17,9 @@ SPDX-License-Identifier: Apache-2.0
[Chinese Introduction](README-CN.md)

LakeSoul is a cloud-native Lakehouse framework that supports scalable metadata management, ACID transactions, efficient and flexible upsert operation, schema evolution, and unified streaming & batch processing.

LakeSoul supports reading and writing lakehouse table data from multiple computing engines, including Spark, Flink, Presto, and PyTorch, and covers batch, streaming, MPP, and AI computing modes. It supports storage systems such as HDFS and S3.

![LakeSoul Arch](website/static/img/lakeSoulModel.png)

LakeSoul was originally created by DMetaSoul and was donated to Linux Foundation AI & Data as a sandbox project in May 2023.
@@ -25,6 +28,8 @@ LakeSoul implements incremental upserts for both row and column and allows concu

LakeSoul uses an LSM-Tree-like structure to support updates on hash-partitioned tables with primary keys, and achieves very high write throughput while providing optimized merge-on-read performance (refer to [Performance Benchmarks](https://lakesoul-io.github.io/blog/2023/04/21/lakesoul-2.2.0-release)). LakeSoul scales metadata management and achieves ACID control by using PostgreSQL.

LakeSoul implements its native metadata and IO layers in Rust and provides C/Java/Python interfaces, enabling integration with a wide range of big-data and AI computing frameworks.

LakeSoul supports concurrent batch or streaming read and write. Both reads and writes support CDC semantics, and together with automatic schema evolution and exactly-once guarantees, this makes constructing realtime data warehouses easy.
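For a first taste of the engine integration, the Spark shell can be launched with the LakeSoul catalog enabled. A minimal sketch, assuming a LakeSoul metadata PostgreSQL instance is reachable and described by a `lakesoul_home` properties file (as in the Docker Compose quick start):

```bash
# Start spark-shell with LakeSoul as the default catalog;
# lakesoul_home points at the metadata DB connection properties (assumed to exist)
export lakesoul_home=/path/to/lakesoul.properties
spark-shell \
  --packages com.dmetasoul:lakesoul-spark:2.4.0-spark-3.3 \
  --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
  --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \
  --conf spark.sql.defaultCatalog=lakesoul
```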

More detailed features please refer to our doc page: [Documentations](https://lakesoul-io.github.io/docs/intro)
@@ -35,6 +40,7 @@ Follow the [Quick Start](https://lakesoul-io.github.io/docs/Getting%20Started/se
# Tutorials
Please find tutorials on the doc site:

* Check out [Examples of Python Data Processing and AI Model Training on LakeSoul](https://github.com/lakesoul-io/LakeSoul/tree/main/python/examples) to see how LakeSoul connects AI to the lakehouse and builds a unified, modern data infrastructure.
* Check out the [LakeSoul Flink CDC Whole Database Synchronization Tutorial](https://lakesoul-io.github.io/docs/Tutorials/flink-cdc-sink) on how to sync an entire MySQL database into LakeSoul in realtime, with auto table creation, auto DDL sync and exactly-once guarantee.
* Check out [Flink SQL Usage](https://lakesoul-io.github.io/docs/Usage%20Docs/flink-lakesoul-connector) on using Flink SQL to read or write LakeSoul in both batch and streaming mode, with support for Flink changelog stream semantics and row-level upsert and delete.
* Check out the [Multi Stream Merge and Build Wide Table Tutorial](https://lakesoul-io.github.io/docs/Tutorials/mutil-stream-merge) on how to merge multiple streams with the same primary key (and different other columns) concurrently without joins.
@@ -53,6 +59,9 @@ Please find usage documentations in doc site:
[Usage Documentation (Chinese)](https://lakesoul-io.github.io/zh-Hans/docs/Usage%20Docs/setup-meta-env)

# Feature Roadmap
* Data Science and AI
- [x] Native Python Reader (without PySpark)
- [x] PyTorch Dataset and distributed training
* Meta Management ([#23](https://github.com/lakesoul-io/LakeSoul/issues/23))
- [x] Multiple Level Partitioning: Multiple range partition and at most one hash partition
- [x] Concurrent write with auto conflict resolution
@@ -74,8 +83,6 @@ Please find usage documentations in doc site:
- [ ] Materialized View
- [ ] Incremental MV Build
- [ ] Auto query rewrite
* Data Science
- [ ] Native Python Reader (without PySpark)
* Spark Integration
- [x] Table/Dataframe API
- [x] SQL support with catalog except upsert
2 changes: 1 addition & 1 deletion website/docs/01-Getting Started/01-setup-local-env.md
@@ -47,7 +47,7 @@ After unpacking spark package, you could find LakeSoul distribution jar from htt
wget https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/spark/spark-3.3.2-bin-hadoop-3.tgz
tar xf spark-3.3.2-bin-hadoop-3.tgz
export SPARK_HOME=${PWD}/spark-3.3.2-bin-hadoop3
wget https://github.com/lakesoul-io/LakeSoul/releases/download/v2.3.0/lakesoul-spark-2.3.0-spark-3.3.jar -P $SPARK_HOME/jars
wget https://github.com/lakesoul-io/LakeSoul/releases/download/v2.4.0/lakesoul-spark-2.4.0-spark-3.3.jar -P $SPARK_HOME/jars
```
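With the jar in place, the shell can be started with LakeSoul's session extension and catalog enabled. A minimal sketch, reusing the configuration keys from the Docker Compose example in these docs:

```bash
# Launch spark-shell with the LakeSoul SQL extension and catalog as defaults
$SPARK_HOME/bin/spark-shell \
  --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
  --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \
  --conf spark.sql.defaultCatalog=lakesoul
```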

:::tip
2 changes: 1 addition & 1 deletion website/docs/01-Getting Started/02-docker-compose.mdx
@@ -40,7 +40,7 @@
docker run --net lakesoul-docker-compose-env_default --rm -ti \
-v $(pwd)/lakesoul.properties:/opt/spark/work-dir/lakesoul.properties \
--env lakesoul_home=/opt/spark/work-dir/lakesoul.properties bitnami/spark:3.3.1 \
spark-shell \
--packages com.dmetasoul:lakesoul-spark:2.3.0-spark-3.3 \
--packages com.dmetasoul:lakesoul-spark:2.4.0-spark-3.3 \
--conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
--conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \
--conf spark.sql.defaultCatalog=lakesoul \
8 changes: 4 additions & 4 deletions website/docs/02-Tutorials/02-flink-cdc-sink/index.md
@@ -67,12 +67,12 @@ You can see that there is currently only one `default` database in LakeSoul, and
## 2. Start the sync job
### 2.1 Start a local Flink Cluster
You can download [Flink 1.14.5](https://archive.apache.org/dist/flink/flink-1.14.5/flink-1.14.5-bin-scala_2.12.tgz) from the Flink download page.
You can download [Flink 1.17](https://www.apache.org/dyn/closer.lua/flink/flink-1.17.1/flink-1.17.1-bin-scala_2.12.tgz) from the Flink download page.
Unzip the downloaded Flink installation package:
```bash
tar xf flink-1.14.5-bin-scala_2.12.tgz
export FLINK_HOME=${PWD}/flink-1.14.5
tar xf flink-1.17.1-bin-scala_2.12.tgz
export FLINK_HOME=${PWD}/flink-1.17.1
```

Then start a local Flink Cluster:
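The standard launcher script starts it (the same command appears in the Chinese version of this tutorial):

```bash
# Start a local Flink standalone cluster; the web UI listens on localhost:8081
$FLINK_HOME/bin/start-cluster.sh
```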
@@ -90,7 +90,7 @@
Submit a LakeSoul Flink CDC Sink job to the Flink cluster started above:
```bash
./bin/flink run -ys 1 -yjm 1G -ytm 2G \
-c org.apache.flink.lakesoul.entry.MysqlCdc \
lakesoul-flink-2.3.0-flink-1.14.jar \
lakesoul-flink-2.4.0-flink-1.17.jar \
--source_db.host localhost \
--source_db.port 3306 \
--source_db.db_name test_cdc \
```
14 changes: 7 additions & 7 deletions website/docs/03-Usage Docs/02-setup-spark.md
@@ -16,14 +16,14 @@ To use `spark-shell`, `pyspark` or `spark-sql` shells, you should include LakeSo

#### Use Maven Coordinates via --packages
```bash
spark-shell --packages com.dmetasoul:lakesoul-spark:2.3.0-spark-3.3
spark-shell --packages com.dmetasoul:lakesoul-spark:2.4.0-spark-3.3
```

#### Use Local Packages
You can find the LakeSoul packages from our release page: [Releases](https://github.com/lakesoul-io/LakeSoul/releases).
Download the jar file and pass it to `spark-submit`.
```bash
spark-submit --jars "lakesoul-spark-2.3.0-spark-3.3.jar"
spark-submit --jars "lakesoul-spark-2.4.0-spark-3.3.jar"
```

Or you could directly put the jar into `$SPARK_HOME/jars`
@@ -34,7 +34,7 @@ Include maven dependencies in your project:
<dependency>
<groupId>com.dmetasoul</groupId>
<artifactId>lakesoul</artifactId>
<version>2.3.0-spark-3.3</version>
<version>2.4.0-spark-3.3</version>
</dependency>
```

@@ -93,7 +93,7 @@ spark.sql.sources.default lakesoul
## Setup Flink Project or Job

### Required Flink Version
Currently Flink 1.14 is supported.
Since 2.4.0, Flink version 1.17 is supported.

### Setup Metadata Database Connection for Flink

@@ -133,7 +133,7 @@ If access to the Hadoop environment is required, the Hadoop Classpath environmen
```bash
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
```
For details, please refer to: [Flink on Hadoop](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/resource-providers/yarn/)
:::

:::tip
@@ -144,7 +144,7 @@
:::
### Add LakeSoul Jar to Flink's directory
Download LakeSoul Flink Jar from: https://github.com/lakesoul-io/LakeSoul/releases/download/v2.3.0/lakesoul-flink-2.3.0-flink-1.14.jar
Download LakeSoul Flink Jar from: https://github.com/lakesoul-io/LakeSoul/releases/download/v2.4.0/lakesoul-flink-2.4.0-flink-1.17.jar
And put the jar file under `$FLINK_HOME/lib`. After this, you could start flink session cluster or application as usual.
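A minimal sketch of these two steps, assuming the 2.4.0 jar downloaded above sits in the current directory (`start-cluster.sh` is standard Flink tooling):

```bash
# Put the LakeSoul connector on Flink's classpath, then start a session cluster
cp lakesoul-flink-2.4.0-flink-1.17.jar $FLINK_HOME/lib/
$FLINK_HOME/bin/start-cluster.sh
```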

@@ -155,6 +155,6 @@ Add the following to your project's pom.xml
<dependency>
<groupId>com.dmetasoul</groupId>
<artifactId>lakesoul</artifactId>
<version>2.3.0-flink-1.14</version>
<version>2.4.0-flink-1.17</version>
</dependency>
```
8 changes: 4 additions & 4 deletions website/docs/03-Usage Docs/05-flink-cdc-sync.md
@@ -21,9 +21,9 @@ In the Stream API, the main functions of LakeSoul Sink are:

## How to use the command line
### 1. Download LakeSoul Flink Jar
It can be downloaded from the LakeSoul Release page: https://github.com/lakesoul-io/LakeSoul/releases/download/v2.3.0/lakesoul-flink-2.3.0-flink-1.14.jar.
It can be downloaded from the LakeSoul Release page: https://github.com/lakesoul-io/LakeSoul/releases/download/v2.4.0/lakesoul-flink-2.4.0-flink-1.17.jar.

The currently supported Flink version is 1.14.
The currently supported Flink version is 1.17.

### 2. Start the Flink job

@@ -60,7 +60,7 @@ export LAKESOUL_PG_PASSWORD=root
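The fold above hides the metadata connection setup from section 2.1. A hedged sketch of the environment variables involved, with illustrative values that must match your PostgreSQL deployment (variable names per LakeSoul's setup docs):

```bash
# Connection settings for LakeSoul's PostgreSQL metadata store (values illustrative)
export LAKESOUL_PG_DRIVER=com.lakesoul.shaded.org.postgresql.Driver
export LAKESOUL_PG_URL="jdbc:postgresql://localhost:5432/lakesoul_test?stringtype=unspecified"
export LAKESOUL_PG_USERNAME=lakesoul_test
export LAKESOUL_PG_PASSWORD=root
```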
#### 2.2 Start sync job
```bash
bin/flink run -c org.apache.flink.lakesoul.entry.MysqlCdc \
lakesoul-flink-2.3.0-flink-1.14.jar \
lakesoul-flink-2.4.0-flink-1.17.jar \
--source_db.host localhost \
--source_db.port 3306 \
--source_db.db_name default \
```

@@ -79,7 +79,7 @@
Description of required parameters:
| Parameter | Meaning | Value Description |
|----------------|------------------------------------|-------------------------------------------- |
| -c | The task runs the main function entry class | org.apache.flink.lakesoul.entry.MysqlCdc |
| Main package | Task running jar | lakesoul-flink-2.3.0-flink-1.14.jar |
| Main package | Task running jar | lakesoul-flink-2.4.0-flink-1.17.jar |
| --source_db.host | The address of the MySQL database | |
| --source_db.port | MySQL database port | |
| --source_db.user | MySQL database username | |
4 changes: 2 additions & 2 deletions website/docs/03-Usage Docs/06-flink-lakesoul-connector.md
@@ -16,14 +16,14 @@ LakeSoul provides Flink Connector which implements the Dynamic Table interface,

To set up the Flink environment, please refer to [Setup Spark/Flink Job/Project](../03-Usage%20Docs/02-setup-spark.md)

Introduce LakeSoul dependency: package and compile the lakesoul-flink folder to get lakesoul-flink-2.3.0-flink-1.14.jar.
Introduce LakeSoul dependency: package and compile the lakesoul-flink folder to get lakesoul-flink-2.4.0-flink-1.17.jar.

In order to use Flink to create LakeSoul tables, it is recommended to use the Flink SQL Client, which supports direct use of Flink SQL commands to operate on LakeSoul tables. In this document, Flink SQL means entering statements directly in the Flink SQL Client CLI, whereas the Table API is used inside a Java project.

Switch to the Flink folder and execute the following command to start the SQL Client.
```bash
# Start Flink SQL Client
bin/sql-client.sh embedded -j lakesoul-flink-2.4.0-flink-1.17.jar
bin/sql-client.sh embedded -j lakesoul-flink-2.4.0-flink-1.17.jar
```
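Once the SQL Client is running, a LakeSoul table can be created with plain Flink SQL. A minimal sketch run non-interactively via `-f` (the `connector`, `hashBucketNum`, and `path` options are assumptions for illustration; see the DDL section below for the authoritative option names):

```bash
# Write a small DDL script and execute it with the Flink SQL Client.
# The WITH options are illustrative; consult the DDL reference for exact names.
cat > create_user_info.sql <<'EOF'
CREATE TABLE user_info (
    `id` INT,
    name STRING,
    score INT,
    PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
    'connector' = 'lakesoul',
    'hashBucketNum' = '2',
    'path' = 'file:///tmp/lakesoul/user_info'
);
EOF
bin/sql-client.sh embedded -j lakesoul-flink-2.4.0-flink-1.17.jar -f create_user_info.sql
```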

## 2. DDL
@@ -37,10 +37,10 @@ https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-without-hadoop.tgz

LakeSoul release jars can be downloaded from the GitHub Releases page: https://github.com/lakesoul-io/LakeSoul/releases . After downloading, put the jar into the jars directory under your Spark installation:
```bash
wget https://github.com/lakesoul-io/LakeSoul/releases/download/v2.3.0/lakesoul-spark-2.3.0-spark-3.3.jar -P $SPARK_HOME/jars
wget https://github.com/lakesoul-io/LakeSoul/releases/download/v2.4.0/lakesoul-spark-2.4.0-spark-3.3.jar -P $SPARK_HOME/jars
```

If GitHub is hard to reach, it can also be downloaded from: https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/lakesoul-spark-2.3.0-spark-3.3.jar
If GitHub is hard to reach, it can also be downloaded from: https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/lakesoul-spark-2.4.0-spark-3.3.jar

:::tip
Since version 2.1.0, LakeSoul's own dependencies have been shaded into a single jar. Earlier versions were released as multiple jars in a tar.gz archive.
@@ -40,7 +40,7 @@
docker run --net lakesoul-docker-compose-env_default --rm -ti \
-v $(pwd)/lakesoul.properties:/opt/spark/work-dir/lakesoul.properties \
--env lakesoul_home=/opt/spark/work-dir/lakesoul.properties bitnami/spark:3.3.1 \
spark-shell \
--packages com.dmetasoul:lakesoul-spark:2.3.0-spark-3.3 \
--packages com.dmetasoul:lakesoul-spark:2.4.0-spark-3.3 \
--conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
--conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \
--conf spark.sql.defaultCatalog=lakesoul \
@@ -67,12 +67,12 @@ SHOW TABLES IN default;
## 2. Start the sync job

### 2.1 Start a local Flink Cluster
You can download [Flink 1.14.5](https://archive.apache.org/dist/flink/flink-1.14.5/flink-1.14.5-bin-scala_2.12.tgz) from the Flink download page, or from our [domestic mirror](https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/flink-1.14.5-bin-scala_2.12.tgz) (identical to the Apache release).
You can download [Flink 1.17](https://www.apache.org/dyn/closer.lua/flink/flink-1.17.1/flink-1.17.1-bin-scala_2.12.tgz) from the Flink download page.

Unpack the downloaded Flink archive:
```bash
tar xf flink-1.14.5-bin-scala_2.12.tgz
export FLINK_HOME=${PWD}/flink-1.14.5
tar xf flink-1.17.1-bin-scala_2.12.tgz
export FLINK_HOME=${PWD}/flink-1.17.1
```

Then start a local Flink Cluster:
@@ -90,7 +90,7 @@ $FLINK_HOME/bin/start-cluster.sh
```bash
./bin/flink run -ys 1 -yjm 1G -ytm 2G \
-c org.apache.flink.lakesoul.entry.MysqlCdc \
lakesoul-flink-2.3.0-flink-1.14.jar \
lakesoul-flink-2.4.0-flink-1.17.jar \
--source_db.host localhost \
--source_db.port 3306 \
--source_db.db_name test_cdc \
# @@ -105,7 +105,7 @@
--server_time_zone UTC
```

The lakesoul-flink jar can be downloaded from the [Github Release](https://github.com/lakesoul-io/LakeSoul/releases/) page. If GitHub is hard to reach, it can also be downloaded from: https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/lakesoul-flink-2.3.0-flink-1.14.jar
The lakesoul-flink jar can be downloaded from the [Github Release](https://github.com/lakesoul-io/LakeSoul/releases/) page. If GitHub is hard to reach, it can also be downloaded from: https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/lakesoul-flink-2.4.0-flink-1.17.jar

On the Flink web UI at http://localhost:8081, click Running Jobs and check that the LakeSoul job is in the `Running` state.
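As a command-line alternative to the web UI, job status can also be checked with the standard Flink CLI (not LakeSoul-specific):

```bash
# List currently running jobs on the local cluster
$FLINK_HOME/bin/flink list -r
```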
