feat: introduce background gc into SwonDisk #41

Fischer0522 · 2024-09-18T07:56:33Z

GC Draft

In the current GC design, the core component is GCWorker, which selects victims, migrates data, and updates the indexes.

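Data layout

To track space utilization and carry out data migration, the concept of a Chunk is introduced. Its size is defined as 1024 blocks, the same as the size of a written Log. ChunkInfo is defined as follows: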

pub struct ChunkInfo {
    chunk_id: ChunkId,
    // valid_block counts all empty blocks plus the blocks marked as allocated.
    // GC uses it to choose a victim chunk. It is initialized to nblocks when the chunk is created
    // and decremented whenever a block is deallocated.
    // TODO: valid_block currently only tracks block deallocation; block reallocation still needs to be handled.
    valid_block: AtomicUsize,
    // bitmap of blocks in the chunk
    bitmap: Arc<Mutex<BitMap>>,
    nblocks: usize,
    free_space: AtomicUsize,
}

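The design is currently somewhat redundant. The fields are:

valid_block: used to compute the victim threshold. A chunk is picked as a victim only once its invalid blocks reach a certain proportion. Reallocation is not handled yet: when a deallocated block is allocated again, this counter is not incremented. In the current design this only happens when AllocTable's next_avail has completed a full pass and starts allocating from the beginning again.
free_space: Alloc++, DeAlloc--; used to skip full chunks when choosing a target chunk.
nblocks: the chunk size, mainly there to handle the last chunk when the total number of blocks is not evenly divisible.

valid_block and free_space need to be persisted, which is done through TxLog. Everything else can be recomputed automatically on restart. bitmap is a reference and does not need to be persisted.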

Victim

Victim selection is currently defined as a trait. Two policies are supported, greedy and round-robin scan, and more can be added later.

pub trait VictimPolicy: Send + Sync {
    fn pick_victim(&self, chunk_alloc_tables: &[ChunkInfo], threshold: f64) -> Option<Victim>;
}
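
A minimal sketch of what a greedy policy could look like under this trait. The types here are stand-ins, not the PR's definitions: `ChunkInfo` is reduced to the fields the policy reads, `Victim` is assumed to carry only a chunk id, and `threshold` is interpreted as the minimum invalid-block ratio a chunk must reach.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Simplified stand-ins for the PR's types (assumptions for illustration only).
pub type ChunkId = usize;

pub struct ChunkInfo {
    pub chunk_id: ChunkId,
    pub valid_block: AtomicUsize,
    pub nblocks: usize,
}

#[derive(Debug, PartialEq)]
pub struct Victim {
    pub chunk_id: ChunkId,
}

pub trait VictimPolicy: Send + Sync {
    fn pick_victim(&self, chunk_alloc_tables: &[ChunkInfo], threshold: f64) -> Option<Victim>;
}

/// Greedy: pick the chunk with the highest invalid-block ratio,
/// provided that ratio reaches `threshold`.
pub struct GreedyVictimPolicy;

impl VictimPolicy for GreedyVictimPolicy {
    fn pick_victim(&self, chunk_alloc_tables: &[ChunkInfo], threshold: f64) -> Option<Victim> {
        chunk_alloc_tables
            .iter()
            .filter(|c| c.nblocks > 0)
            .map(|c| {
                // Clamp so a stale counter can never underflow the subtraction.
                let valid = c.valid_block.load(Ordering::Acquire).min(c.nblocks);
                let invalid_ratio = (c.nblocks - valid) as f64 / c.nblocks as f64;
                (c.chunk_id, invalid_ratio)
            })
            .filter(|&(_, ratio)| ratio >= threshold)
            .max_by(|a, b| a.1.total_cmp(&b.1))
            .map(|(chunk_id, _)| Victim { chunk_id })
    }
}
```

A round-robin policy would implement the same trait but keep a cursor and return the first chunk past the cursor that reaches the threshold.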

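Concurrency control

GcWorker currently stops the world:

Background GC blocks foreground GC
Background GC blocks compaction
Background GC blocks foreground I/O (GC does not access data_buf, so I/O served from data_buf is safe and is not blocked)
Compaction blocks background GC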

pub type SharedStateRef = Arc<SharedState>;
pub struct SharedState {
    gc_in_progress: CvarMutex<bool>,
    compaction_in_progress: CvarMutex<bool>,
    gc_condvar: Condvar,
    compaction_condvar: Condvar,
}
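
A rough sketch of how the mutual blocking between GC and compaction could be expressed. This is a simplification, not the PR's code: it uses `std::sync::Mutex` and `std::sync::Condvar` instead of `CvarMutex`, keeps both flags under one lock with a single condition variable, and the `begin_*`/`end_*` method names are hypothetical.

```rust
use std::sync::{Arc, Condvar, Mutex};

#[derive(Default)]
struct Flags {
    gc_in_progress: bool,
    compaction_in_progress: bool,
}

/// Simplified variant of SharedState: one lock, one condvar (an assumption).
#[derive(Default)]
pub struct SharedState {
    flags: Mutex<Flags>,
    changed: Condvar,
}

pub type SharedStateRef = Arc<SharedState>;

impl SharedState {
    /// Block until neither compaction nor another GC is running, then mark GC as running.
    pub fn begin_gc(&self) {
        let guard = self.flags.lock().unwrap();
        let mut guard = self
            .changed
            .wait_while(guard, |f| f.compaction_in_progress || f.gc_in_progress)
            .unwrap();
        guard.gc_in_progress = true;
    }

    pub fn end_gc(&self) {
        self.flags.lock().unwrap().gc_in_progress = false;
        self.changed.notify_all();
    }

    /// Block until neither GC nor another compaction is running, then mark compaction as running.
    pub fn begin_compaction(&self) {
        let guard = self.flags.lock().unwrap();
        let mut guard = self
            .changed
            .wait_while(guard, |f| f.gc_in_progress || f.compaction_in_progress)
            .unwrap();
        guard.compaction_in_progress = true;
    }

    pub fn end_compaction(&self) {
        self.flags.lock().unwrap().compaction_in_progress = false;
        self.changed.notify_all();
    }

    pub fn gc_in_progress(&self) -> bool {
        self.flags.lock().unwrap().gc_in_progress
    }
}
```

Keeping both flags under one mutex avoids having to reason about lock ordering between two separate flag locks. It does not address I/O that is already in flight when GC starts.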

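TODO: GC does not wait for all I/O to complete before it starts, so GC runs concurrently, to some degree, with I/O that was still in flight when it began. How should this be solved?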

GcWorker

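GcWorker is created by DiskInner and runs as a separate background thread. For now it starts background GC periodically on a timer, picks a victim according to the VictimPolicy, migrates the victim's data, marks the old space as free, and finally updates the ReverseIndex and LogicalBlockTable. The struct is: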

pub(super) struct GcWorker<D> {
    victim_policy: VictimPolicyRef,
    logical_block_table: TxLsmTree<RecordKey, RecordValue, D>,
    reverse_index_table: Arc<ReverseIndexTable>,
    block_validity_table: Arc<AllocTable>,
    _tx_log_store: Arc<TxLogStore<D>>, // unused
    user_data_disk: Arc<D>,
    shared_state: SharedStateRef,
}

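Only periodic background GC is supported for now. Foreground GC is defined but not yet integrated with foreground reads and writes (Condvar or Channel?)

TODO: the background thread currently relies on std::thread::sleep; cross-platform support is needed later.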
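
For illustration, a periodic background thread that can be stopped promptly might look like the sketch below. It assumes `std` is available, so it does not address the cross-platform TODO, and it replaces a plain sleep with a timed condition-variable wait so that shutdown does not have to wait out a full interval. `PeriodicWorker` and its methods are hypothetical names, not part of the PR.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical helper: runs `round` once per `interval` until stopped.
pub struct PeriodicWorker {
    stop: Arc<(Mutex<bool>, Condvar)>,
    handle: Option<JoinHandle<()>>,
}

impl PeriodicWorker {
    pub fn spawn<F>(interval: Duration, mut round: F) -> Self
    where
        F: FnMut() + Send + 'static,
    {
        let stop = Arc::new((Mutex::new(false), Condvar::new()));
        let stop_in_thread = Arc::clone(&stop);
        let handle = thread::spawn(move || {
            let (lock, cvar) = &*stop_in_thread;
            let mut stopped = lock.lock().unwrap();
            loop {
                // Wait for `interval`, waking early if a stop is requested.
                let (guard, _timeout) = cvar
                    .wait_timeout_while(stopped, interval, |s| !*s)
                    .unwrap();
                stopped = guard;
                if *stopped {
                    break;
                }
                // Release the lock while the GC round runs so stop() is never blocked by it.
                drop(stopped);
                round();
                stopped = lock.lock().unwrap();
            }
        });
        Self { stop, handle: Some(handle) }
    }

    /// Request shutdown and wait for the thread to exit.
    pub fn stop(mut self) {
        let (lock, cvar) = &*self.stop;
        *lock.lock().unwrap() = true;
        cvar.notify_all();
        if let Some(handle) = self.handle.take() {
            handle.join().unwrap();
        }
    }
}
```

The same condition variable could also serve as the foreground trigger: a writer that detects space pressure would notify it to start a round early.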

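Data migration

1. Select a group of chunks according to the VictimPolicy. Following the JinDisk version, a WATERMARK is defined with a default value of 16; the loop terminates as soon as one iteration fails to pick a victim.
2. Select enough target HBAs for the valid blocks in the victims.
3. Iterate over all valid blocks: reverse-look-up the LBA, then look up the HBA through that LBA. If the two HBAs differ, the block is already stale but has not yet been reclaimed by compaction; discard it and save the LBA->HBA pair in the ReverseIndexTable.
4. Migrate the data. In the current implementation:
   1. Read the data from disk chunk by chunk.
   2. Group the target HBAs, gathering consecutive HBAs into one batch.
   3. Iterate over the victim HBAs, pick enough of them to fill the batch, and issue a single I/O that writes all their data into it.
5. Update the metadata:
   1. Update ChunkInfo: clear the corresponding bitmap and ChunkInfo.
   2. Update AllocTable: alloc all target HBAs.
   3. Update ReverseIndex: remove the victim HBAs and update the dealloc table so that compaction does not double free.
   4. Update the LSM: write the new mappings into it.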
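
Two of the steps above, the validity check and the batching of target HBAs, can be sketched as plain functions. This is an illustration under assumptions: `Hba` and `Lba` are taken to be `usize`, both indexes are modeled as in-memory `BTreeMap`s rather than the real ReverseIndexTable and TxLsmTree, and the function names are hypothetical.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

pub type Hba = usize;
pub type Lba = usize;

/// Split victim blocks into live and stale ones. A block is live when the
/// reverse mapping (HBA -> LBA) still agrees with the forward mapping
/// (LBA -> HBA). Blocks with no reverse entry are skipped.
pub fn split_live_and_stale(
    victim_hbas: &[Hba],
    reverse_index: &BTreeMap<Hba, Lba>,
    forward_index: &BTreeMap<Lba, Hba>,
) -> (Vec<(Lba, Hba)>, Vec<(Lba, Hba)>) {
    let mut live = Vec::new();
    let mut stale = Vec::new();
    for &hba in victim_hbas {
        if let Some(&lba) = reverse_index.get(&hba) {
            if forward_index.get(&lba) == Some(&hba) {
                live.push((lba, hba)); // must be migrated
            } else {
                stale.push((lba, hba)); // overwritten; record it, do not migrate
            }
        }
    }
    (live, stale)
}

/// Group target HBAs into runs of consecutive addresses so that each run
/// can be written with a single I/O.
pub fn group_contiguous(target_hbas: &[Hba]) -> Vec<Range<Hba>> {
    let mut sorted = target_hbas.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    let mut batches: Vec<Range<Hba>> = Vec::new();
    for hba in sorted {
        match batches.last_mut() {
            // Extend the current run when this HBA directly follows it.
            Some(last) if last.end == hba => last.end = hba + 1,
            _ => batches.push(hba..hba + 1),
        }
    }
    batches
}
```

For example, target HBAs 5, 6, 7, 20, 21 and 30 would be grouped into three batches, so the migration issues three writes instead of six.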


lucassong-mh · 2024-09-23T03:53:32Z

pub struct ChunkInfo {...}

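The name Chunk conflicts with the Chunk in L3. It could be named Segment instead, or, since it and the L3 chunk both represent 1024 blocks, the two could be distinguished as MetaChunk/DataChunk.
Could the information recorded in ChunkInfo be merged into the BVT? Extending the BVT to support looking up victim segments would directly inherit TxLog's properties and bring other benefits: L5 could look up and allocate blocks at segment granularity.
Regarding "the last chunk not being evenly divisible": the data region size can first be guaranteed to be a whole number of segments, which simplifies GC analysis and handling.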

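TODO: GC does not wait for all I/O to complete before it starts, so GC runs concurrently, to some degree, with I/O that was still in flight when it began. How should this be solved?

Concurrency control should not be that complicated. The layered design requires that a lower layer has no awareness of, or dependency on, an upper layer. So L4 only needs to expose a "wait for compaction to finish" API that L5 calls before running GC. L4 does not need to be aware of or wait on anything GC-related; everything related to L5 goes in callbacks.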

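Data migration

Note that steps 4 and 5 must not break the security guarantees the system already provides, so they need to run inside a single TX.

Read the data from disk chunk by chunk

This should read, in one go, the valid blocks obtained from step 3; there may be fewer valid blocks than a full segment.

Foreground GC is defined but not yet integrated with foreground reads and writes (Condvar or Channel?)

Foreground GC only needs to check, after a write completes, whether it should be triggered.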

pub(super) struct ReverseIndexTable {
index_table: Mutex<BTreeMap<Hba, Lba>>,
dealloc_table: Mutex<HashMap<Lba, Hba>>,
}

The naming ambiguity needs fixing: only index_table is the reverse lookup table; dealloc_table is used to avoid double deallocation.
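
One way to read the role of dealloc_table, sketched as its own type so the name no longer suggests a reverse index. The protocol shown here is an assumption, not taken from the PR: GC records each (LBA, HBA) pair it has already freed, and compaction consults the table before freeing an old HBA. `DeallocTable` and its method names are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

pub type Hba = usize;
pub type Lba = usize;

/// Hypothetical standalone table of blocks that GC has already deallocated.
#[derive(Default)]
pub struct DeallocTable {
    entries: Mutex<HashMap<Lba, Hba>>,
}

impl DeallocTable {
    /// Called by GC after it has freed `hba`, which used to back `lba`.
    pub fn record_freed_by_gc(&self, lba: Lba, hba: Hba) {
        self.entries.lock().unwrap().insert(lba, hba);
    }

    /// Called by compaction before freeing `hba`. Returns false, and consumes
    /// the entry, when GC has already freed this exact block.
    pub fn should_dealloc(&self, lba: Lba, hba: Hba) -> bool {
        let mut entries = self.entries.lock().unwrap();
        if entries.get(&lba) == Some(&hba) {
            entries.remove(&lba);
            false
        } else {
            true
        }
    }
}
```

Consuming the entry on the first hit keeps the table from growing without bound, at the cost of assuming compaction sees each stale record at most once.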


Fischer0522 added 12 commits September 6, 2024 22:57

feat: define ChunkAllocTable for GC

8c88e1e

feat: implement GcWorker draft

1d9076d

refactor: refactor victim policy

04ba0ed

refactor: AllocTable and ChunkInfo share the same BitMap

599d9fb

chore: add ut

e98794e

feat: implement index remapping

3f0c29d

feat: implemented stop-the-world GcWorker integrated with compaction and I/O request

554857e

chore: remove logger

731836c

chore: add some comment

821e393

feat: implemented batch migration

9b246e1

chore: fix ut

824393d

feat: avoid double deallocation in compaction

549a257

Fischer0522 added 6 commits September 23, 2024 19:25

chore: rename chunk to segment

16bc7c0

feat: use tx to migrate data

a2e47a7

refactor: use tx to migrate data and update meta

f5252f0

fix: record lba in dealloc_table to avoid double free

6f8e129

feat: persist segment table in tx_log

be0ed86

fix: align_up seg_buf

d0c9319

Fischer0522 force-pushed the feat/gc_action branch from 5328e65 to d0c9319 Compare October 2, 2024 05:44

Fischer0522 added 7 commits October 2, 2024 17:04

feat: support column_family for TxLsmTree

21928c8

chore: add ut

93b3741

feat: recover column_family from wal

4b1b42f

refactor: defined ColumnFamilyManager to manage memtable,sst and compactor

38defea

feat: supported cross plantform sleep

4257503

fix: init log_id

ce14728

chore: add ReserseKey

bad7271

Fischer0522 closed this Oct 31, 2024
