Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

测试一下中文翻译成英语是否好用 #4

Open
hnwyllmm opened this issue May 13, 2024 · 4 comments
Open

测试一下中文翻译成英语是否好用 #4

hnwyllmm opened this issue May 13, 2024 · 4 comments

Comments

@hnwyllmm
Copy link
Owner

Question

我看官方文档中存在如下描述:

ob-spec

如果有一个电商 SaaS,【订单表】有 商户ID( int ) 和 订单号( varchar(32) ) 存在唯一约束,这个时候我们也应该用 这两个字段 做联合主键吗?

众所周知,MySQL 采用的是 B+Tree 的结构来存储数据,一般建议 ID是自增的,这样可以避免随机插入,导致 MySQL 数据页频繁分裂,进而导致检索数据时 IO 开销增加。

然而,商户ID + 订单号 也很明显不是自增的,所以我才有此疑问。
对于这种场景,如果 OceanBase 仍然推荐使用它们联合作为主键,是因为 LSM-Tree 能够很好地处理 裂页 问题么 ?

此外,除了主键索引,我们一般还会额外在关键业务表上创建几个二级索引。
在 MySQL 中,因为每个二级索引的叶子节点都会存储相应行的主键值。所以,如果主键较大,那么每个二级索引的叶子节点所需的空间也会更大,因为它们需要存储更大的主键值。

如果主键是 bigint 的自增ID,在每个二级索引上存储,关联主键也就占用 8个字节 的空间。
如果主键是 int + varchar(32) 这种联合索引,是不是就意味着,关联主键也要占用至少 4+32 个字节的空间呢?
如果有多个二级索引(一般2~4个),是不是也要重复多占用不少空间 ?这样的话,存储空间 以及 维护这些二级索引的开销不是也增大了许多么 ?

@hnwyllmm
Copy link
Owner Author

不按顺序回答你的问题。
先说“如果主键是 int + varchar(32) 这种联合索引,是不是就意味着,关联主键也要占用至少 4+32 个字节的空间呢?”
--> 在OceanBase 中 int也是按照8位来存储的,所以这里应该就是 8+32=40字节,一条数据比bigint自增ID多了32个字节。按照1亿个数据量来算,就多了大概3G的存储。如果有3个索引,也不过10G的存储。先不说你肯定不在乎这10G的存储空间(毕竟业务量都上亿了),OceanBase在存储时还会压缩的,最终的数据,即使是多副本,也通常比MySQL占用的存储空间少了50%,具体可以看看官网上的一些案例描述,或许会有案例可以帮到你们。


再看“自增ID作为主键”的问题,在分布式数据库上这是一个老生常谈的话题,看看gpt的回答:

写入热点:自增主键意味着新插入的行总是拥有比之前高的主键值。这在分布式数据库中可能会造成主键值集中在某个范围内,从而导致所有新的插入操作都集中在特定的分区或节点上。这种情况创建了所谓的“写入热点”,可能会降低系统的性能和扩展性。

分布式事务冲突:在分布式系统中保持主键的唯一自增值需要跨多个节点协调,这可能会导致无法避免的分布式事务冲突和协调开销。同步自增值可能会导致延迟和额外的性能损耗。

扩展性限制:自增主键强制了数据的插入顺序,这可能导致数据在物理存储上的不均匀分布,对于需要水平扩展的分布式数据库来说,这是一个不利因素。扩展操作需要平衡和迁移数据,而自增主键可能会使这一过程更加复杂和低效。


网上也有很多自增主键的博客。大概意思都是自增主键在非分区数据库上都挺和谐,但是分区数据库上很困难。
对此,OceanBase还对自增列做了优化,参考自增 noorder 模式:https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000000751258


在来看“对于这种场景,如果 OceanBase 仍然推荐使用它们联合作为主键,是因为 LSM-Tree 能够很好地处理 裂页 问题么 ?”
LSM-Tree不像B+树,没有列页的问题。新来的数据更新,不管是新增、更新还是删除,都是先在内存上操作的。在一定的平衡场景下,性能不会受到“数据太分散”相关的影响。而LSM的合并操作,都是在后台进行的。

@Issues-translate-bot
Copy link

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


Answering your questions out of order.
Let me first say, "If the primary key is a joint index such as int + varchar(32), does it mean that the associated primary key will also occupy at least 4+32 bytes of space?"
--> In OceanBase, int is also stored in 8 bits, so it should be 8+32=40 bytes. A piece of data is 32 bytes more than the bigint auto-increment ID. Calculated based on the amount of data of 100 million, there is approximately 3G of additional storage. If there are 3 indexes, it will only take 10G of storage. Putting aside the fact that you definitely don’t care about the 10G storage space (after all, the business volume is hundreds of millions), OceanBase will also compress it during storage. The final data, even with multiple copies, usually takes up 50% less storage space than MySQL. %, for details, you can check out some case descriptions on the official website. There may be cases that can help you.


Let’s look at the issue of “self-increasing ID as primary key”. This is a common topic in distributed databases. Take a look at gpt’s answer:

Write hotspots: Auto-incrementing primary keys means that newly inserted rows always have a higher primary key value than the previous one. In a distributed database, this may cause primary key values ​​to be concentrated in a certain range, causing all new insert operations to be concentrated on a specific partition or node. This situation creates so-called "write hotspots" that can reduce system performance and scalability.

Distributed transaction conflicts: Maintaining a unique auto-increment value for a primary key in a distributed system requires coordination across multiple nodes, which may lead to unavoidable distributed transaction conflicts and coordination overhead. Synchronous auto-increment may cause latency and additional performance loss.

Scalability limitations: Auto-incrementing primary keys enforce the insertion order of data, which may lead to uneven distribution of data on physical storage. This is a disadvantage for distributed databases that require horizontal expansion. Scaling operations require balancing and migrating data, and auto-incrementing primary keys can make this process more complex and inefficient.


There are also many blogs on the Internet that add primary keys. The general meaning is that auto-incrementing primary keys are quite harmonious on non-partitioned databases, but are very difficult on partitioned databases.
In this regard, OceanBase has also optimized the auto-increment column. Please refer to the auto-increment noorder mode: https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000000751258


Let's look at "For this scenario, if OceanBase still recommends using their union as the primary key, is it because LSM-Tree can handle the split page problem well?"
Unlike B+ tree, LSM-Tree does not have the problem of column pages. New data updates, whether they are added, updated or deleted, are first operated on the memory. In a certain balanced scenario, performance will not be affected by "data is too scattered". The merge operations of LSM are all performed in the background.

@hnwyllmm
Copy link
Owner Author

测试短一点的中文

@Issues-translate-bot
Copy link

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


Test shorter Chinese

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants