update docs and readme for release 2.4 (#328)
Signed-off-by: chenxu <[email protected]>
Co-authored-by: chenxu <[email protected]>
xuchen-plus and dmetasoul01 authored Sep 15, 2023
1 parent a930c09 commit 582508b
Showing 15 changed files with 56 additions and 46 deletions.
5 changes: 4 additions & 1 deletion README-CN.md
@@ -8,7 +8,7 @@ SPDX-License-Identifier: Apache-2.0

<img src='https://github.com/lfai/artwork/blob/main/lfaidata-assets/lfaidata-project-badge/sandbox/color/lfaidata-project-badge-sandbox-color.svg' alt="LF AI & Data Sandbox Project" height='180'>

LakeSoul is an open-source, cloud-native lakehouse framework with highly scalable metadata management, ACID transactions, efficient and flexible upsert operations, schema evolution, and unified batch-and-streaming processing.
LakeSoul is an open-source, cloud-native lakehouse framework with highly scalable metadata management, ACID transactions, efficient and flexible upsert operations, schema evolution, and unified batch-and-streaming processing. LakeSoul supports reading and writing lakehouse table data from multiple computing engines, including Spark, Flink, Presto, and PyTorch, covering batch, streaming, MPP, and AI computing modes. LakeSoul supports storage systems such as HDFS and S3.
![LakeSoul Architecture](website/static/img/lakeSoulModel.png)

LakeSoul was developed by DMetaSoul and was formally donated to the Linux Foundation AI & Data foundation in May 2023, where it is incubating as a Sandbox project.
@@ -17,11 +17,14 @@ LakeSoul is purpose-built for row- and column-level incremental processing of data on data-lake cloud storage

LakeSoul achieves high write throughput for upserts on hash-partitioned primary-key tables through an LSM-Tree-like approach, while a highly optimized merge-on-read implementation preserves read performance (see the [performance comparison](https://lakesoul-io.github.io/zh-Hans/blog/2023/04/21/lakesoul-2.2.0-release)). LakeSoul manages metadata through PostgreSQL, providing highly scalable metadata management and high-concurrency transaction capability.

LakeSoul implements its native metadata and IO layers in Rust and wraps them with C/Java/Python interfaces, so that both big-data and AI computing frameworks can integrate with it.

LakeSoul supports concurrent streaming and batch reads and writes, with full CDC-semantics compatibility for both; with automatic schema evolution and exactly-once semantics, it makes building end-to-end streaming data warehouses easy.

For more features and comparisons with other products, see the [feature introduction](https://lakesoul-io.github.io/zh-Hans/docs/intro).

# Tutorials
* [Lakehouse meets AI: data preprocessing and model training in Python](https://github.com/lakesoul-io/LakeSoul/tree/main/python/examples): LakeSoul connects the lakehouse and AI seamlessly, enabling a modern Data+AI architecture.
* [Whole-database CDC ingestion tutorial](https://lakesoul-io.github.io/zh-Hans/docs/Tutorials/flink-cdc-sink): LakeSoul synchronizes entire databases such as MySQL through Flink CDC, with automatic table creation, automatic DDL changes, and exactly-once guarantees.
* [Flink SQL tutorial](https://lakesoul-io.github.io/zh-Hans/docs/Usage%20Docs/flink-lakesoul-connector): LakeSoul supports Flink streaming and batch reads and writes. Streaming reads and writes fully support Flink changelog semantics, including row-level streaming inserts, updates, and deletes.
* [Multi-stream merge into wide tables tutorial](https://lakesoul-io.github.io/zh-Hans/docs/Tutorials/mutil-stream-merge): LakeSoul natively merges multiple streams that share a primary key (their other columns may differ) into one table automatically, eliminating joins.
11 changes: 9 additions & 2 deletions README.md
@@ -17,6 +17,9 @@ SPDX-License-Identifier: Apache-2.0
[Chinese Introduction](README-CN.md)

LakeSoul is a cloud-native Lakehouse framework that supports scalable metadata management, ACID transactions, efficient and flexible upsert operation, schema evolution, and unified streaming & batch processing.

LakeSoul supports reading and writing lakehouse table data from multiple computing engines, including Spark, Flink, Presto, and PyTorch, and covers batch, streaming, MPP, and AI computing modes. It supports storage systems such as HDFS and S3.

![LakeSoul Arch](website/static/img/lakeSoulModel.png)

LakeSoul was originally created by DMetaSoul and was donated to Linux Foundation AI & Data as a sandbox project in May 2023.
@@ -25,6 +28,8 @@ LakeSoul implements incremental upserts for both row and column and allows concu

LakeSoul uses an LSM-Tree-like structure to support updates on hash-partitioned tables with primary keys, and achieves very high write throughput while providing optimized merge-on-read performance (refer to [Performance Benchmarks](https://lakesoul-io.github.io/blog/2023/04/21/lakesoul-2.2.0-release)). LakeSoul scales metadata management and achieves ACID control by using PostgreSQL.

LakeSoul implements its native metadata and IO layers in Rust and provides C/Java/Python interfaces, enabling integration with a wide range of big-data and AI computing frameworks.

LakeSoul supports concurrent batch or streaming read and write. Both reads and writes support CDC semantics, and together with automatic schema evolution and exactly-once guarantees, this makes constructing realtime data warehouses easy.
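For a first taste of the engine integration, the Spark shell can be launched with the LakeSoul catalog enabled. A minimal sketch, assuming a LakeSoul metadata PostgreSQL instance is reachable and described by a `lakesoul_home` properties file (as in the Docker Compose quick start):

```bash
# Start spark-shell with LakeSoul as the default catalog;
# lakesoul_home points at the metadata DB connection properties (assumed to exist)
export lakesoul_home=/path/to/lakesoul.properties
spark-shell \
  --packages com.dmetasoul:lakesoul-spark:2.4.0-spark-3.3 \
  --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
  --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \
  --conf spark.sql.defaultCatalog=lakesoul
```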

More detailed features please refer to our doc page: [Documentations](https://lakesoul-io.github.io/docs/intro)
@@ -35,6 +40,7 @@ Follow the [Quick Start](https://lakesoul-io.github.io/docs/Getting%20Started/se
# Tutorials
Please find tutorials on the doc site:

* Check out [Examples of Python Data Processing and AI Model Training on LakeSoul](https://github.com/lakesoul-io/LakeSoul/tree/main/python/examples) to see how LakeSoul connects AI to the lakehouse and builds a unified, modern data infrastructure.
* Check out the [LakeSoul Flink CDC Whole Database Synchronization Tutorial](https://lakesoul-io.github.io/docs/Tutorials/flink-cdc-sink) on how to sync an entire MySQL database into LakeSoul in realtime, with auto table creation, auto DDL sync and exactly-once guarantee.
* Check out [Flink SQL Usage](https://lakesoul-io.github.io/docs/Usage%20Docs/flink-lakesoul-connector) on using Flink SQL to read or write LakeSoul in both batch and streaming mode, with support for Flink changelog stream semantics and row-level upsert and delete.
* Check out the [Multi Stream Merge and Build Wide Table Tutorial](https://lakesoul-io.github.io/docs/Tutorials/mutil-stream-merge) on how to merge multiple streams with the same primary key (and different other columns) concurrently without joins.
@@ -53,6 +59,9 @@ Please find usage documentations in doc site:
[Usage Documentation (Chinese)](https://lakesoul-io.github.io/zh-Hans/docs/Usage%20Docs/setup-meta-env)

# Feature Roadmap
* Data Science and AI
- [x] Native Python Reader (without PySpark)
- [x] PyTorch Dataset and distributed training
* Meta Management ([#23](https://github.com/lakesoul-io/LakeSoul/issues/23))
- [x] Multiple Level Partitioning: Multiple range partition and at most one hash partition
- [x] Concurrent write with auto conflict resolution
@@ -74,8 +83,6 @@ Please find usage documentations in doc site:
- [ ] Materialized View
- [ ] Incremental MV Build
- [ ] Auto query rewrite
* Data Science
- [ ] Native Python Reader (without PySpark)
* Spark Integration
- [x] Table/Dataframe API
- [x] SQL support with catalog except upsert
2 changes: 1 addition & 1 deletion website/docs/01-Getting Started/01-setup-local-env.md
@@ -47,7 +47,7 @@ After unpacking spark package, you could find LakeSoul distribution jar from htt
wget https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/spark/spark-3.3.2-bin-hadoop-3.tgz
tar xf spark-3.3.2-bin-hadoop-3.tgz
export SPARK_HOME=${PWD}/spark-3.3.2-bin-hadoop3
wget https://github.com/lakesoul-io/LakeSoul/releases/download/v2.3.0/lakesoul-spark-2.3.0-spark-3.3.jar -P $SPARK_HOME/jars
wget https://github.com/lakesoul-io/LakeSoul/releases/download/v2.4.0/lakesoul-spark-2.4.0-spark-3.3.jar -P $SPARK_HOME/jars
```
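With the jar in place, the shell can be started with LakeSoul's session extension and catalog enabled. A minimal sketch, reusing the configuration keys from the Docker Compose example in these docs:

```bash
# Launch spark-shell with the LakeSoul SQL extension and catalog as defaults
$SPARK_HOME/bin/spark-shell \
  --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
  --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \
  --conf spark.sql.defaultCatalog=lakesoul
```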

:::tip
2 changes: 1 addition & 1 deletion website/docs/01-Getting Started/02-docker-compose.mdx
@@ -40,7 +40,7 @@
docker run --net lakesoul-docker-compose-env_default --rm -ti \
-v $(pwd)/lakesoul.properties:/opt/spark/work-dir/lakesoul.properties \
--env lakesoul_home=/opt/spark/work-dir/lakesoul.properties bitnami/spark:3.3.1 \
spark-shell \
--packages com.dmetasoul:lakesoul-spark:2.3.0-spark-3.3 \
--packages com.dmetasoul:lakesoul-spark:2.4.0-spark-3.3 \
--conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
--conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \
--conf spark.sql.defaultCatalog=lakesoul \
8 changes: 4 additions & 4 deletions website/docs/02-Tutorials/02-flink-cdc-sink/index.md
@@ -67,12 +67,12 @@ You can see that there is currently only one `default` database in LakeSoul, and
## 2. Start the sync job
### 2.1 Start a local Flink Cluster
You can download [Flink 1.14.5](https://archive.apache.org/dist/flink/flink-1.14.5/flink-1.14.5-bin-scala_2.12.tgz) from the Flink download page.
You can download [Flink 1.17](https://www.apache.org/dyn/closer.lua/flink/flink-1.17.1/flink-1.17.1-bin-scala_2.12.tgz) from the Flink download page.
Unzip the downloaded Flink installation package:
```bash
tar xf flink-1.14.5-bin-scala_2.12.tgz
export FLINK_HOME=${PWD}/flink-1.14.5
tar xf flink-1.17.1-bin-scala_2.12.tgz
export FLINK_HOME=${PWD}/flink-1.17.1
```

Then start a local Flink Cluster:
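The standard launcher script starts it (the same command appears in the Chinese version of this tutorial):

```bash
# Start a local Flink standalone cluster; the web UI listens on localhost:8081
$FLINK_HOME/bin/start-cluster.sh
```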
@@ -90,7 +90,7 @@
Submit a LakeSoul Flink CDC Sink job to the Flink cluster started above:
```bash
./bin/flink run -ys 1 -yjm 1G -ytm 2G \
-c org.apache.flink.lakesoul.entry.MysqlCdc \
lakesoul-flink-2.3.0-flink-1.14.jar \
lakesoul-flink-2.4.0-flink-1.17.jar \
--source_db.host localhost \
--source_db.port 3306 \
--source_db.db_name test_cdc \
```
14 changes: 7 additions & 7 deletions website/docs/03-Usage Docs/02-setup-spark.md
@@ -16,14 +16,14 @@ To use `spark-shell`, `pyspark` or `spark-sql` shells, you should include LakeSo

#### Use Maven Coordinates via --packages
```bash
spark-shell --packages com.dmetasoul:lakesoul-spark:2.3.0-spark-3.3
spark-shell --packages com.dmetasoul:lakesoul-spark:2.4.0-spark-3.3
```

#### Use Local Packages
You can find the LakeSoul packages from our release page: [Releases](https://github.com/lakesoul-io/LakeSoul/releases).
Download the jar file and pass it to `spark-submit`.
```bash
spark-submit --jars "lakesoul-spark-2.3.0-spark-3.3.jar"
spark-submit --jars "lakesoul-spark-2.4.0-spark-3.3.jar"
```

Or you could directly put the jar into `$SPARK_HOME/jars`
@@ -34,7 +34,7 @@ Include maven dependencies in your project:
<dependency>
<groupId>com.dmetasoul</groupId>
<artifactId>lakesoul</artifactId>
<version>2.3.0-spark-3.3</version>
<version>2.4.0-spark-3.3</version>
</dependency>
```

@@ -93,7 +93,7 @@ spark.sql.sources.default lakesoul
## Setup Flink Project or Job

### Required Flink Version
Currently Flink 1.14 is supported.
Since 2.4.0, Flink version 1.17 is supported.

### Setup Metadata Database Connection for Flink

@@ -133,7 +133,7 @@ If access to the Hadoop environment is required, the Hadoop Classpath environmen
```bash
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
```
For details, please refer to: [Flink on Hadoop](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/resource-providers/yarn/)
:::

:::tip
@@ -144,7 +144,7 @@
:::
### Add LakeSoul Jar to Flink's directory
Download LakeSoul Flink Jar from: https://github.com/lakesoul-io/LakeSoul/releases/download/v2.3.0/lakesoul-flink-2.3.0-flink-1.14.jar
Download LakeSoul Flink Jar from: https://github.com/lakesoul-io/LakeSoul/releases/download/v2.4.0/lakesoul-flink-2.4.0-flink-1.17.jar
And put the jar file under `$FLINK_HOME/lib`. After this, you could start flink session cluster or application as usual.
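A minimal sketch of these two steps, assuming the 2.4.0 jar downloaded above sits in the current directory (`start-cluster.sh` is standard Flink tooling):

```bash
# Put the LakeSoul connector on Flink's classpath, then start a session cluster
cp lakesoul-flink-2.4.0-flink-1.17.jar $FLINK_HOME/lib/
$FLINK_HOME/bin/start-cluster.sh
```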

@@ -155,6 +155,6 @@ Add the following to your project's pom.xml
<dependency>
<groupId>com.dmetasoul</groupId>
<artifactId>lakesoul</artifactId>
<version>2.3.0-flink-1.14</version>
<version>2.4.0-flink-1.17</version>
</dependency>
```
8 changes: 4 additions & 4 deletions website/docs/03-Usage Docs/05-flink-cdc-sync.md
@@ -21,9 +21,9 @@ In the Stream API, the main functions of LakeSoul Sink are:

## How to use the command line
### 1. Download LakeSoul Flink Jar
It can be downloaded from the LakeSoul Release page: https://github.com/lakesoul-io/LakeSoul/releases/download/v2.3.0/lakesoul-flink-2.3.0-flink-1.14.jar.
It can be downloaded from the LakeSoul Release page: https://github.com/lakesoul-io/LakeSoul/releases/download/v2.4.0/lakesoul-flink-2.4.0-flink-1.17.jar.

The currently supported Flink version is 1.14.
The currently supported Flink version is 1.17.

### 2. Start the Flink job

@@ -60,7 +60,7 @@ export LAKESOUL_PG_PASSWORD=root
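The fold above hides the metadata connection setup from section 2.1. A hedged sketch of the environment variables involved, with illustrative values that must match your PostgreSQL deployment (variable names per LakeSoul's setup docs):

```bash
# Connection settings for LakeSoul's PostgreSQL metadata store (values illustrative)
export LAKESOUL_PG_DRIVER=com.lakesoul.shaded.org.postgresql.Driver
export LAKESOUL_PG_URL="jdbc:postgresql://localhost:5432/lakesoul_test?stringtype=unspecified"
export LAKESOUL_PG_USERNAME=lakesoul_test
export LAKESOUL_PG_PASSWORD=root
```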
#### 2.2 Start sync job
```bash
bin/flink run -c org.apache.flink.lakesoul.entry.MysqlCdc \
lakesoul-flink-2.3.0-flink-1.14.jar \
lakesoul-flink-2.4.0-flink-1.17.jar \
--source_db.host localhost \
--source_db.port 3306 \
--source_db.db_name default \
```

@@ -79,7 +79,7 @@
Description of required parameters:
| Parameter | Meaning | Value Description |
|----------------|------------------------------------|-------------------------------------------- |
| -c | The task runs the main function entry class | org.apache.flink.lakesoul.entry.MysqlCdc |
| Main package | Task running jar | lakesoul-flink-2.3.0-flink-1.14.jar |
| Main package | Task running jar | lakesoul-flink-2.4.0-flink-1.17.jar |
| --source_db.host | The address of the MySQL database | |
| --source_db.port | MySQL database port | |
| --source_db.user | MySQL database username | |
4 changes: 2 additions & 2 deletions website/docs/03-Usage Docs/06-flink-lakesoul-connector.md
@@ -16,14 +16,14 @@ LakeSoul provides Flink Connector which implements the Dynamic Table interface,

To set up the Flink environment, please refer to [Setup Spark/Flink Job/Project](../03-Usage%20Docs/02-setup-spark.md)

Introduce LakeSoul dependency: package and compile the lakesoul-flink folder to get lakesoul-flink-2.3.0-flink-1.14.jar.
Introduce LakeSoul dependency: package and compile the lakesoul-flink folder to get lakesoul-flink-2.4.0-flink-1.17.jar.

In order to use Flink to create LakeSoul tables, it is recommended to use the Flink SQL Client, which supports direct use of Flink SQL commands to operate on LakeSoul tables. In this document, Flink SQL means entering statements directly in the Flink SQL Client CLI, whereas the Table API is used inside a Java project.

Switch to the Flink folder and execute the following command to start the SQL Client.
```bash
# Start Flink SQL Client
bin/sql-client.sh embedded -j lakesoul-flink-2.4.0-flink-1.17.jar
bin/sql-client.sh embedded -j lakesoul-flink-2.4.0-flink-1.17.jar
```
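Once the SQL Client is running, a LakeSoul table can be created with plain Flink SQL. A minimal sketch run non-interactively via `-f` (the `connector`, `hashBucketNum`, and `path` options are assumptions for illustration; see the DDL section below for the authoritative option names):

```bash
# Write a small DDL script and execute it with the Flink SQL Client.
# The WITH options are illustrative; consult the DDL reference for exact names.
cat > create_user_info.sql <<'EOF'
CREATE TABLE user_info (
    `id` INT,
    name STRING,
    score INT,
    PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
    'connector' = 'lakesoul',
    'hashBucketNum' = '2',
    'path' = 'file:///tmp/lakesoul/user_info'
);
EOF
bin/sql-client.sh embedded -j lakesoul-flink-2.4.0-flink-1.17.jar -f create_user_info.sql
```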

## 2. DDL
@@ -37,10 +37,10 @@ https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-without-hadoop.tgz

LakeSoul release jars can be downloaded from the GitHub Releases page: https://github.com/lakesoul-io/LakeSoul/releases . After downloading, put the jar into the jars directory under your Spark installation:
```bash
wget https://github.com/lakesoul-io/LakeSoul/releases/download/v2.3.0/lakesoul-spark-2.3.0-spark-3.3.jar -P $SPARK_HOME/jars
wget https://github.com/lakesoul-io/LakeSoul/releases/download/v2.4.0/lakesoul-spark-2.4.0-spark-3.3.jar -P $SPARK_HOME/jars
```

If GitHub is hard to reach, it can also be downloaded from: https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/lakesoul-spark-2.3.0-spark-3.3.jar
If GitHub is hard to reach, it can also be downloaded from: https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/lakesoul-spark-2.4.0-spark-3.3.jar

:::tip
Since version 2.1.0, LakeSoul's own dependencies have been shaded into a single jar. Earlier versions were released as multiple jars in a tar.gz archive.
@@ -40,7 +40,7 @@
docker run --net lakesoul-docker-compose-env_default --rm -ti \
-v $(pwd)/lakesoul.properties:/opt/spark/work-dir/lakesoul.properties \
--env lakesoul_home=/opt/spark/work-dir/lakesoul.properties bitnami/spark:3.3.1 \
spark-shell \
--packages com.dmetasoul:lakesoul-spark:2.3.0-spark-3.3 \
--packages com.dmetasoul:lakesoul-spark:2.4.0-spark-3.3 \
--conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
--conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \
--conf spark.sql.defaultCatalog=lakesoul \
@@ -67,12 +67,12 @@ SHOW TABLES IN default;
## 2. Start the sync job

### 2.1 Start a local Flink Cluster
You can download [Flink 1.14.5](https://archive.apache.org/dist/flink/flink-1.14.5/flink-1.14.5-bin-scala_2.12.tgz) from the Flink download page, or from our [domestic mirror](https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/flink-1.14.5-bin-scala_2.12.tgz) (identical to the Apache release).
You can download [Flink 1.17](https://www.apache.org/dyn/closer.lua/flink/flink-1.17.1/flink-1.17.1-bin-scala_2.12.tgz) from the Flink download page.

Unpack the downloaded Flink archive:
```bash
tar xf flink-1.14.5-bin-scala_2.12.tgz
export FLINK_HOME=${PWD}/flink-1.14.5
tar xf flink-1.17.1-bin-scala_2.12.tgz
export FLINK_HOME=${PWD}/flink-1.17.1
```

Then start a local Flink Cluster:
@@ -90,7 +90,7 @@ $FLINK_HOME/bin/start-cluster.sh
```bash
./bin/flink run -ys 1 -yjm 1G -ytm 2G \
-c org.apache.flink.lakesoul.entry.MysqlCdc \
lakesoul-flink-2.3.0-flink-1.14.jar \
lakesoul-flink-2.4.0-flink-1.17.jar \
--source_db.host localhost \
--source_db.port 3306 \
--source_db.db_name test_cdc \
# @@ -105,7 +105,7 @@
--server_time_zone UTC
```

The lakesoul-flink jar can be downloaded from the [Github Release](https://github.com/lakesoul-io/LakeSoul/releases/) page. If GitHub is hard to reach, it can also be downloaded from: https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/lakesoul-flink-2.3.0-flink-1.14.jar
The lakesoul-flink jar can be downloaded from the [Github Release](https://github.com/lakesoul-io/LakeSoul/releases/) page. If GitHub is hard to reach, it can also be downloaded from: https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/lakesoul-flink-2.4.0-flink-1.17.jar

On the Flink web UI at http://localhost:8081, click Running Jobs and check that the LakeSoul job is in the `Running` state.
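As a command-line alternative to the web UI, job status can also be checked with the standard Flink CLI (not LakeSoul-specific):

```bash
# List currently running jobs on the local cluster
$FLINK_HOME/bin/flink list -r
```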
