Releases: lakesoul-io/LakeSoul
LakeSoul Release v2.3.0
v2.3.0 Release Notes
This is the first release since LakeSoul was donated to Linux Foundation AI & Data. This release contains the following major new features:
- Flink Connector for the Flink SQL/Table API to read and write LakeSoul in both batch and streaming mode, with support for Flink Changelog Stream semantics and row-level upsert and delete (see the sketch after this list). See docs: Flink Connector.
- Flink CDC Ingestion refactored to infer new tables and schema changes automatically from messages, enabling simpler development of CDC stream ingestion jobs for any kind of database or message queue.
- Global automatic compaction service. See docs: Auto Compaction Service.
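For a concrete feel of the new connector, here is a minimal sketch of reading and writing a LakeSoul table through the Flink Table API. The table name, path, and the 'connector'/'path' option keys are illustrative assumptions, not the confirmed DDL; see the Flink Connector docs for the actual catalog setup and options.

```scala
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

object LakeSoulFlinkSketch {
  def main(args: Array[String]): Unit = {
    // Batch mode for a one-shot query; inStreamingMode() enables the
    // changelog-stream semantics described above.
    val settings = EnvironmentSettings.newInstance().inBatchMode().build()
    val tEnv = TableEnvironment.create(settings)

    // Hypothetical DDL: the option keys here are assumptions, check the docs.
    tEnv.executeSql(
      """CREATE TABLE user_events (
        |  user_id BIGINT,
        |  event   STRING,
        |  ts      TIMESTAMP(3),
        |  PRIMARY KEY (user_id) NOT ENFORCED
        |) WITH (
        |  'connector' = 'lakesoul',
        |  'path'      = 'file:///tmp/lakesoul/user_events'
        |)""".stripMargin)

    // Row-level upsert: a later row with the same primary key replaces
    // the earlier one; DELETE works analogously on changelog streams.
    tEnv.executeSql(
      "INSERT INTO user_events VALUES (1, 'login', TIMESTAMP '2023-05-01 10:00:00')"
    ).await()

    tEnv.executeSql("SELECT * FROM user_events").print()
  }
}
```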
What's Changed
- [NativeIO] Native io misc improvements by @dmetasoul01 in #190
- optimize filesForScan by @F-PHantam in #192
- Add Definition Comments for com.dmetasoul.lakesoul.meta.entity by @YuChangHui in #193
- Implement Delta Join Interfaces for LakeSoulTable by @YuChangHui in #184
- [Flink] pack paranamer to flink release jar by @dmetasoul01 in #196
- [NativeIO] use tcmalloc as global allocator by @xuchen-plus in #204
- [NativeIO] fix memory leak in native reader by @xuchen-plus in #209
- [Flink] avoid cast global parameter to ParameterTool by @xuchen-plus in #207
- migrate arrow-rs and datafusion deps to new org by @xuchen-plus in #211
- Implement Global Automatic Disaggregated Compaction Service by @F-PHantam in #212
- Implement Flink ScanTableSource and LookupTableSource by @YuChangHui in #213
- fix data type timestamp with zone by @lypnaruto in #215
- [NativeIO] throw exception when LakeSoulArrowReader.hasNext by @Ceng23333 in #217
- [NativeIO]add rust clippy workflow && fix clippy error/warn by @Ceng23333 in #219
- add flink sql submitter(#199) by @Hades-888 in #221
- Update readme by @xuchen-plus in #222
- bump version to 2.3.0 by @xuchen-plus in #223
- update github links by @xuchen-plus in #224
- fix bug: requested file schema no change in stream task by @F-PHantam in #226
- [Flink]LakeSoulCatalog::listTables: list tableName instead of tablePath by @Ceng23333 in #227
- [Flink]fix parse error of LogicalTypeRoot::Date by @Ceng23333 in #228
- [NativeIO]panic when target datatype and source datatype mismatch by @Ceng23333 in #214
- [Flink]support flink decimal by @Ceng23333 in #232
- update LakeSoulTableSource.getChangelogMode by @Ceng23333 in #231
- [NativeIO]fix clippy warning by @Ceng23333 in #230
- Fix hash bucket num by @xuchen-plus in #233
- [Flink]add batch in flink sql submitter by @Hades-888 in #234
- disable tcmalloc by @xuchen-plus in #235
- [Project] add lakesoul project website code by @xuchen-plus in #237
- update load flink sql from hdfs in yarn application by @Hades-888 in #238
- [Flink]add Maven-test CI for lakesoul-flink by @lypnaruto in #239
- Add cross build for native io by @xuchen-plus in #241
- [Project] disable git lfs by @xuchen-plus in #243
- fix bugs for same bucket read by different stream tasks by @moresun in #245
- [Project] Add pr checks and deployment actions by @xuchen-plus in #244
- [Flink]fix FlinkDatatype::timestamp_ltz zone conversion && support FlinkDatatype::timestamp by @Ceng23333 in #246
- Prepare meta in maven test by @xuchen-plus in #247
- [Flink]Fix LookupSource FS configuration setting by @Ceng23333 in #248
- LakeSoul mysql cdc convert Datatype::datetime to timestamp with timezone by @F-PHantam in #249
- [Spark] Fix compatibility with spark 3.3.2 by @xuchen-plus in #251
- add flink source and sink ci test by @F-PHantam in #252
- [Flink] fix wrong logging config file in flink test by @xuchen-plus in #253
- [Flink] Move partition column fill to native io by @xuchen-plus in #254
- Fix datatype conversion from flink to spark by @Ceng23333 in #255
- [Flink] Add source failover test cases by @xuchen-plus in #256
- [Flink] LakeSoulSinkGlobalCommitter by @Ceng23333 in #257
- add LAKESOUL_PARTITION_SPLITTER as constant by @Ceng23333 in #260
- remove guava and commons-lang in common module by @xuchen-plus in #261
- Modify mysqlcdc sort key generation way by @F-PHantam in #263
- [Flink] Add sink failover test cases by @Ceng23333 in #259
- [Flink] Fix flink reader npe by @xuchen-plus in #265
- [Flink]complete test options of sink fail tests by @Ceng23333 in #266
- Refine meta partition values by @xuchen-plus in #267
- [Flink]Check schema migration at GlobalCommitter by @Ceng23333 in #269
- Fix meta exception handling by @xuchen-plus in #270
- Update website and readme for 2.3.0 release by @xuchen-plus in #271
v2.2.0
LakeSoul Release v2.2.0
v2.2.0 Release Notes
- Native IO is now enabled by default for the Flink CDC Sink and Spark SQL. Native IO is built on arrow-rs and Datafusion (https://github.com/apache/arrow-datafusion), with dedicated IO optimizations on top of arrow-rs's object store. Benchmarks show a 3x IO throughput improvement over parquet-mr with the Hadoop filesystem. Native IO supports both HDFS and S3 object storage (including S3-protocol-compatible storages), covers all data types in Spark and Flink, and has passed both TPC-H and CHBenchmark correctness tests.
- Snapshot read and incremental read support on Spark. LakeSoul's incremental read on Spark supports both batch mode and micro-batch streaming mode (see the sketch after this list).
- The default supported Spark version has been upgraded to Spark 3.3.
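As a sketch of what the new read modes look like from Spark: the `lakesoul` data source name matches the LakeSoul docs, while the option keys below (readtype, readstarttime, readendtime) are illustrative assumptions and may differ from the released API.

```scala
import org.apache.spark.sql.SparkSession

object LakeSoulReadSketch {
  def main(args: Array[String]): Unit = {
    // Assumes the LakeSoul catalog/extensions are already configured.
    val spark = SparkSession.builder().appName("lakesoul-read-sketch").getOrCreate()
    val tablePath = "s3://bucket/lakesoul/user_events" // illustrative path

    // Snapshot read: the table's state as of a point in time.
    val snapshot = spark.read.format("lakesoul")
      .option("readtype", "snapshot")               // assumed option key
      .option("readendtime", "2023-05-01 10:00:00") // assumed option key
      .load(tablePath)

    // Incremental batch read: changes between two timestamps.
    val increment = spark.read.format("lakesoul")
      .option("readtype", "incremental")
      .option("readstarttime", "2023-05-01 10:00:00")
      .option("readendtime", "2023-05-02 10:00:00")
      .load(tablePath)

    // Micro-batch streaming mode reads the same increments continuously.
    val stream = spark.readStream.format("lakesoul")
      .option("readstarttime", "2023-05-01 10:00:00")
      .load(tablePath)

    snapshot.show()
    increment.show()
    // stream would be consumed via stream.writeStream.start(...)
  }
}
```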
What's Changed
- [Feature] Timestamp based snapshot read, rollback and cleanup by @dmetasoul01 in #104
- [Flink] write timestamp to int64 instead of int96 in flink sink by @dmetasoul01 in #106
- Only one partition and compaction to parquet scan by @F-PHantam in #109
- Bump postgresql from 42.5.0 to 42.5.1 in /lakesoul-common by @dependabot in #111
- Incremental query by @lypnaruto in #110
- Add Benchmarks by @dmetasoul01 in #115
- Flink serde optimization by @dmetasoul01 in #117
- Develop/native io spark by @Ceng23333 in #118
- Fix CI with Maven Test by @Ceng23333 in #121
- Support Kafka multiple topics sync to LakeSoul by @F-PHantam in #122
- solve dependency problem of confluent jar by @F-PHantam in #124
- fix maven-test with native-io by @Ceng23333 in #125
- [NativeIO] Native io parquet writer implementation by @dmetasoul01 in #128
- [Spark] Streaming Read by @lypnaruto in #129
- [Spark] Upgrade Spark version to 3.3 for main branch by @dmetasoul01 in #132
- use Arrow Schema instead of HashMap for lakesoul_reader filter by @YuChangHui in #136
- [NativeIO] Native writer c and jnr-ffi interface by @dmetasoul01 in #137
- [NativeIO] fix native reader memory leak and double free by @dmetasoul01 in #138
- [NativeIO] Native writer with primary keys sort support by @dmetasoul01 in #141
- [NativeIO] Use ffi to pass arrow schema by @dmetasoul01 in #142
- [NativeIO][Flink] Implement Flink native writer by @dmetasoul01 in #143
- [NativeIO] fix callback object reference by @dmetasoul01 in #145
- [NativeIO] upgrade arrow-rs to 31 and datafusion to 17 by @dmetasoul01 in #148
- [NativeIO][Spark] Package native lib in lakesoul-spark jar by @dmetasoul01 in #149
- [NativeIO] use maven profile for native packaging. default to local native build by @dmetasoul01 in #150
- [NativeIO][Spark] Integrate nativeIO writer in lakesoul-spark by @F-PHantam in #151
- [NativeIO] Implement Sorted Stream Merger by @Ceng23333 in #147
- fix ParquetNativeFilterSuite by @Ceng23333 in #152
- [NativeIO][Bug] Fix flink writer panic by @dmetasoul01 in #154
- [NativeIO] optimize with smallvec for native merge by @dmetasoul01 in #155
- [NativeIO][Flink] fix flink writer batch reset in java before write by @dmetasoul01 in #157
- [NativeIO][Spark]Implement Interfaces for LakeSoulScanBuilder with Native-IO by @Ceng23333 in #156
- [NativeIO] Native hdfs object store by @dmetasoul01 in #159
- Add python api for snapshot and incremental query by @lypnaruto in #160
- fix memory leak; add columnar support by @dmetasoul01 in #164
- [NativeIO] upgrade arrow version to 11 by @dmetasoul01 in #173
- support date type for primary key in flink cdc by @moresun in #174
- [NativeIO][Flink] Fix Flink CDC Data Sort Bug and Handle DataType Change Issues from Mysql to LakeSoul by @F-PHantam in #175
- fix native_io_timestamp_conversion for default case by @Ceng23333 in #176
- Fix flink ci by @dmetasoul01 in #177
- fix invalid LakeSoulSQLConf max_row_group_size in native_io_writer by @YuChangHui in #179
- fix snapshot query default start time by @YuChangHui in #182
- fix unexpectedly closing partitionColumnVectors on closeCurrentBatch by @Ceng23333 in #185
- add support for non-pk streaming read in spark by @moresun in #188
- upgrade jffi to 1.3.11 for centos 7 by @dmetasoul01 in #189
- [Native-IO]add native_io support for empty schema and struct type by @Ceng23333 in #180
Full Changelog: https://github.com/meta-soul/LakeSoul/commits/v2.2.0
v2.1.1
What's Changed
This is a bug fix release for v2.1.0.
Fixed bugs:
- Support geometry/point type in flink cdc by @Ceng23333 in #93
- [BUG] fix pg password auth failed exception by @dmetasoul01 in #95
- Add checkpoint_mode to flink job entry by @Ceng23333 in #96
Full Changelog: 2.1.0...v2.1.1
v2.1.0
v2.1.0 Release Notes
LakeSoul 2.1.0 brings a new Flink CDC sink implementation that supports syncing all tables (with different schemas) of an entire MySQL database in one Flink job, with automatic schema sync and evolution, automatic new table creation, and an exactly-once guarantee. The currently supported Flink version is 1.14.
In 2.1.0 we also reimplemented the Spark catalog so that it can be used as a standalone catalog rather than a session catalog extension (see the configuration sketch below). This change avoids some inconsistencies in Spark's v2 table commands, e.g. show tables does not support v2 tables until Spark 3.3.
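Registering LakeSoul as a standalone catalog looks roughly like the following. The catalog and extension class names follow the LakeSoul documentation for the 2.x line, but treat them as assumptions and verify against your release.

```scala
import org.apache.spark.sql.SparkSession

// Registers LakeSoul as its own catalog named "lakesoul" rather than as a
// session catalog extension; class names per the 2.x docs (verify locally).
val spark = SparkSession.builder()
  .config("spark.sql.catalog.lakesoul",
          "org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog")
  .config("spark.sql.extensions",
          "com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension")
  .getOrCreate()

// v2 table commands then go through the standalone catalog explicitly:
spark.sql("SHOW TABLES IN lakesoul.default").show()
```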
Packages for Spark and Flink are separated into two Maven submodules. The Maven coordinates are com.dmetasoul:lakesoul-spark:2.1.0-spark-3.1.2 and com.dmetasoul:lakesoul-flink:2.1.0-flink-1.14. All required transitive dependencies have been shaded into the released jars.
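For sbt users, the same coordinates translate directly (plain `%` because the jars are shaded and not cross-built per Scala version):

```scala
// build.sbt -- direct rendering of the Maven coordinates above
libraryDependencies ++= Seq(
  "com.dmetasoul" % "lakesoul-spark" % "2.1.0-spark-3.1.2",
  "com.dmetasoul" % "lakesoul-flink" % "2.1.0-flink-1.14"
)
```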
Merged Pull Requests
- CDC support v1: add table property to identify change kind column by @dmetasoul01 in #1
- Cdc support v2 by @moresun in #3
- support merge into sql when can be converted to upsert by @dmetasoul01 in #4
- Optimize duplicate tests and code by @dmetasoul01 in #6
- support create hash partitioned table by sql by @dmetasoul01 in #7
- remove cdc filter from mergescan by @moresun in #9
- fix build error and some coding styles by @bakey in #10
- Update README.md by @moresun in #13
- add a cdc sink example by @dmetasoul01 in #17
- update all links in readme to relative by @dmetasoul01 in #18
- [Doc] add cdc cn doc by @dmetasoul01 in #19
- Bump fastjson from 1.2.75 to 1.2.83 by @dependabot in #38
- Catalog refactor by @dmetasoul01 in #45
- Bump mysql-connector-java from 8.0.19 to 8.0.28 by @dependabot in #46
- Bump postgresql from 42.2.14 to 42.3.3 by @dependabot in #47
- bump version to 2.0.0 by @dmetasoul01 in #48
- fix maven packaging by @dmetasoul01 in #55
- Feature flink order sink by @YangZH-v2 in #56
- add parquet-column dependency to fix bug where localEnv was unable to run by @YangZH-v2 in #64
- support exactly once semantics for flink write by @YangZH-v2 in #65
- fix filter bug when cdc column is not used by @F-PHantam in #68
- Align hash bucket and sort logic in flink with spark #60 by @YangZH-v2 in #69
- Split submodules for maven project by @F-PHantam in #70
- Bump postgresql from 42.3.3 to 42.4.1 in /lakesoul-common by @dependabot in #71
- add MergeNonNullOp for merge operator by @moresun in #73
- add docker compose for local test. fix maven install gpg signing by @dmetasoul01 in #76
- clean up unused code by @dmetasoul01 in #77
- fix MultiPartitionMergeBucketScan bug by @F-PHantam in #81
- Fix flink cdc write event order by @dmetasoul01 in #82
- supports database(namespace) & support mysql cdc using flink by @Ceng23333 in #85
- Bump snakeyaml from 1.30 to 1.31 in /lakesoul-spark by @dependabot in #88
- Support multiple tables sink for Flink CDC by @dmetasoul01 in #86
- flink cdc task add argument serverTimeZone by @F-PHantam in #90
- Fix maven dependency by @dmetasoul01 in #91
New Contributors
- @dmetasoul01 made their first contribution in #1
- @moresun made their first contribution in #3
- @bakey made their first contribution in #10
- @dependabot made their first contribution in #38
- @YangZH-v2 made their first contribution in #56
- @F-PHantam made their first contribution in #68
- @Ceng23333 made their first contribution in #85
Full Changelog: https://github.com/meta-soul/LakeSoul/commits/2.1.0
v2.0.1-spark-3.1.2
What's Changed
- fix maven packaging by @dmetasoul01 in #55
v2.0.0-spark-3.1.2
1. Catalog refactoring
- Replaced the Cassandra protocol with the Postgres protocol
- Rewrote the metadata functions for table, partition, and data operations on the PG protocol, using its transaction mechanism for commit conflict detection to guarantee ACID properties
- Bridged Spark with the metadata layer by translating Spark-related metadata operations into the underlying interface, decoupling the upper compute platform from the underlying storage layer
2. DDL
- Reworked Spark SQL DDL statements (CREATE, ALTER, etc.)
- Reworked Spark DataFrame/DataSet DDL operations (save, etc.)
3. Data Writing
- Reworked Spark SQL DML statements (INSERT INTO, UPDATE, etc.)
- Reworked Spark DataFrame/DataSet DML operations (the write function, etc.)
- Reworked the LakeSoulTable upsert function (see the sketch at the end of these notes)
- Reworked the LakeSoulTable compaction function, with support for mounting compacted data to Hive
4. Data Reading
- Reworked the various ParquetScan implementations, removed the write-version sorting mechanism, and adapted to the new UUID file list format in metadata
- Added a snapshot read function to LakeSoulTable to read historical data at a specified partition version
- Added a history rollback function to LakeSoulTable to roll back a specified partition to a historical version
- Added and modified the default MergeOperator function to make it easier for users to work with merge results
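A minimal sketch of the LakeSoulTable write path described in section 3: forPath, upsert, and compaction follow the documented Scala API of this era, while the commented rollback call is a hypothetical signature for the partition rollback feature in section 4.

```scala
import com.dmetasoul.lakesoul.tables.LakeSoulTable
import org.apache.spark.sql.SparkSession

object LakeSoulTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val tablePath = "/tmp/lakesoul/user_events" // illustrative path
    val table = LakeSoulTable.forPath(tablePath)

    // Row-level upsert keyed on the table's hash (primary key) columns.
    val updates = Seq((1L, "logout")).toDF("user_id", "event")
    table.upsert(updates)

    // Compact small files; compacted data can also be mounted to Hive.
    table.compaction()

    // Hypothetical rollback of one partition to a historical version:
    // table.rollbackPartition("date=2022-01-01", "2022-01-01 12:00:00")
  }
}
```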