Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring proposal: cleaning up the internal APIs #545

Open
clarfonthey opened this issue Aug 20, 2024 · 8 comments
Open

Refactoring proposal: cleaning up the internal APIs #545

clarfonthey opened this issue Aug 20, 2024 · 8 comments

Comments

@clarfonthey
Copy link
Contributor

This is my writeup re: #543 (comment)

cc #290 as this should also strive to fix that as well. Ideally, by the end of the refactor, we should be able to add the following to the crate root:

#![deny(
    clippy::missing_safety_doc,
    clippy::missing_errors_doc,
    clippy::undocumented_unsafe_blocks
)]

Documenting the actual layout

Right now, the documentation for the crate on the actual layout of the hash table is incredibly lacking. It mostly just points to the C++ SwissTable implementation and says "we did that!" while ignoring the fact that, obviously, it is not quite that. There are a lot of subtleties to the Rust code that are completely omitted from the basic description; both the downsides and the upsides are completely undocumented, IMHO, since even if they do exist they're difficult to find in the code.

We should start out by describing the layout from its broadest points first and use this to design the API. I'm also going to be renaming some terms so we can use the more specific names for individual pieces inside the API.

So, here's how I've decided to describe it:

The main benefit of this hash table implementation is that unlike other implementations, instead of a unified bucket array, we have two separate data and control arrays. Having a small, dedicated control array which does not depend on the type inside the table makes it extremely easy to traverse the table using SIMD instructions. It also improves cache locality by keeping the control data as small as possible, so that any operation on the table can happen entirely in cache, even if the data itself is a cache miss.

While the control and data arrays are separate, we can conceptually treat them together as an array of buckets, which forms our hash table.

Control Array

The control array contains a series of tags which indicate the status of any given bucket. To make SIMD operations efficient, we split the tags into groups whose size depends on the specific SIMD implementation. This imposes a restriction on the control array that its size is at least a multiple of the group size, and its alignment is at least a multiple of the group alignment. We additionally require that the number of buckets, and hence the number of tags, be a power of two.

There are two special tags, EMPTY and DELETED, which correspond to buckets which have either always been empty, or who have recently become empty. All other tags are considered FULL.

Regardless of implementation, the tags are:

  • EMPTY: 0b1_1111111
  • DELETED: 0b1_0000000
  • FULL: 0b0_xxxxxxx, where xxxxxxx are seven bits from the hash

We can use the term NONFULL to refer to buckets which are either EMPTY or DELETED.

Data Array

The data array is simply an array of maybe-uninit items, where the exact value of these items depends on what the hash table is being used for. The alignment of these items further imposes alignment constraints on the control array, and similarly, the control array further constrains the alignment of these items. However, no additional padding is placed between the items in the array.

The data array is indexed in the reverse order of the control array, and located before the control array in memory. This lets us store a single pointer to the control array to represent both arrays: if cast to a tag, the pointer can offset n to get the control for the nth bucket, and if cast to data, the pointer can offset -n - 1 to get the data for the nth bucket.

Padding

Since the alignment of the control and data arrays may differ, the very start of the allocation for the hash table may contain some padding. Specifically, if:

  • The data has a lower alignment than the control
  • The size of the data array in bytes is not aligned to the control

Then the padding will be equal to this discrepancy, to ensure that the control array is aligned.

Redoing the raw table

Conceptually, let's reframe a raw table as being parameterized by N, where N is a power-of-two number of items in the table. Then, this table (specifically RawTableInner) contains:

  • bucket_mask (N - 1)
  • ctrl (pointer to the control array)
  • items (number of full buckets)
  • growth_left (N - items)

There have been some comments regarding whether the number of fields can be removed, but I'm going to focus on instead discussing the structures that are useful for operating on the table itself.

Note that these wrapper types don't have to exist for optimizing the exact contents of the raw table. Since the number of fields is small, we can think of these types as little API structures that let us encapsulate the data we need to perform a given operation, that we can then wrap in other types like iterators to do the things we want them to do. Although each individual operation will likely refer to an unsafe method which does not care about the specific types involved or the lifetimes, by having these higher-level wrappers even on the internals, we can reduce the potential to cause undefined behaviour by creating higher-level wrappers on them.

It's also worth noting that undefined data here is specifically reading uninitialized data and breaking lifetime rules. Although changing an item's hash without moving it to a different bucket or leaking an item might cause the table to work incorrectly, it shouldn't trigger UB, especially because PartialEq and Hash can't cause UB by themselves.

Also, the below types can be made to each have a "non-generic" version, but as a general rule, we should not be interacting between these versions of the code. Specifically, the correct usage would be, for example, One<T, A> to interact with Two<T, A>, which internally calls OneInner and TwoInner methods, rather than OneInner directly interacting with TwoInner. This helps us minimize the number of unsafe methods and hence the opportunities to cause UB, and as long as we add inline tags in the right places, it shouldn't cause any performance issues either.

Tag

Right now, we have a Group to indicate groups of control, and I propose adding in a Tag as well to act as a newtype over u8. The goal of this is to help make some internal APIs clearer (is this a byte because it's unknown data, or is it explicitly a control tag?) and also to have a place to put a lot of the associated constants, methods, etc., for groups.

If it turns out that this just adds way too much unnecessary unsafe code (there's no safe transmute for transparent newtypes yet), we can omit it. But at least, it's an API nicety that I would like to add.

RawBuckets<T, A>

This is a wrapper over the control pointer and something to indicate its length. Its job is to do all of the pointer arithmetic we care about for the table, namely:

  • Finding the start and end of the data
  • Finding the start and end of the control
  • Finding a particular group or tag inside the control
  • Finding a particular MaybeUninit<T> inside the data
  • Allocating and deallocating tables without any regard for their contents

It doesn't care about the number of filled buckets, the load of the table, etc., and only cares about performing these operations correctly.

OwnedBuckets<T, A>

This type owns a fixed set of buckets, but cannot be resized. Using an analogy, if RawTable<T, A> is Vec<T, A>, then this is Box<[T], A>.

The goal of this wrapper would be to allow deriving operations like Clone, Default, and Drop on types which own a table, like draining and into-iterators.

Right now, these types simply hold onto a raw allocation triple (pointer, layout, allocator) and then free the allocation when they drop. This makes it difficult to clone these types, since the actual structs that "own" the allocation aren't part of this allocation triple. Most likely, OwnedBuckets<T, A> would simply wrap a RawBuckets<T, A> and an allocator, but we could also precompute the layout and allocation start if we want to.

Probe<'a, T> and ProbeMut<'a, T>

This implements all the methods we need to actually probe the table for a given hash value. Like the various find methods on RawTable, it will simply fail if the table is too small to insert something.

Scan<'a, T>, ScanMut<'a, T>

These are the raw iterator types that we use for traversing the table in linear order, as we do for iterators.

Drain<'_, T>, IntoIter<T, A>

These are the extra iterators we need for draining the table. Internally, they share the same code, but the difference is whether they own the table and are responsible for dropping it.

RawTable<T, A>

This is, of course, the actual table, and it also has the ability to verify the table load, rehash the contents, and resize the table.

Specifics of implementation

This is the final part where I attempt to create my own "FAQ" section for questions that nobody asked, in no particular order or usefulness to anyone:

  • Instead of Bucket<T> which is awkward and represents sometimes a pointer and sometimes an index, we should instead operate on indices for the table and return MaybeUninit<T> references whenever data needs to be referenced.
  • Similarly, InsertSlot should go. It really doesn't add any safety guarantees over just passing an index, and since methods have the word insert in their names anyway, it's not like people are going to misuse it.
  • We should be very clear in the code whether something is a *const Group or a *const Tag, since a lot of the code is sloppy and uses *const u8 everywhere. This is part of why I prefer having a dedicated Tag type.
  • A lot of comments on private fields or methods use single comments, when they could use doc comments and show up on --document-private-items instead.
  • Internally, Scan, ScanMut, Drain, and IntoIter would likely all use the same RawIterRange type similar to the existing one, but we would explicitly use the safe wrappers for their implementations on HashMap and HashSet and HashTable instead of the internal one.
  • A large portion of these types exist because I would like to be able to just derive(Clone) on iterators or not add my own Drop implementation and have them work.

Nice writeup, but what now?

I'm basically writing up my plans here for a couple reasons:

  1. So people know they're coming and people don't step on each other's toes with merge conflicts
  2. To get a general consensus on the API design, the actual internal layout, etc. before I write up code
  3. So I don't just dump a refactor on someone's doorstep without them knowing ahead of time

I will likely be the one to do a lot of this refactoring, unless someone else is substantially more interested. I don't mind sharing the work, just know the reality of open-source work and that generally, people aren't that excited to implement other people's proposals.

The plan is to implement several of these types incrementally rather than doing it all at once. This also allows people to prepare for API breakages if they use the raw API.

@Amanieu
Copy link
Member

Amanieu commented Aug 21, 2024

I generally agree with this plan and would encourage you to continue working on this (though I have some thoughts about the specific, below). Most of the code in RawTable is quite old and predates MaybeUninit, which is why it is not used there. Also the RawTable code in particular is quite messy because of the split between RawTable and RawTableInner which exists solely to reduce compilation times (HashMap is heavily used in the ecosystem).

cc @JustForFun88 who has done a lot of work on documenting the internals of RawTable and optimizing its compilation times.

Some general thoughts:

  • I'd like to remove RawTable from the crate's public API. Almost all of the use cases for it are now provided through HashTable which exposes an entirely safe API. As far as I know, the only crate that actually needs the low level access to buckets that RawTable provides is https://github.com/jonhoo/griddle, which is no longer maintained.
  • Considering the above, it would make sense to make HashTable actually be the low-level implementation that was previously RawTable rather than keeping it as a wrapper around RawTable.
  • HashMap and HashSet could then be adapted to wrap HashTable instead of RawTable.

Now onto more specific thoughts:

The control array contains a series of tags which indicate the status of any given bucket. To make SIMD operations efficient, we split the tags into groups whose size depends on the specific SIMD implementation. This imposes a restriction on the control array that its size is at least a multiple of the group size, and its alignment is at least a multiple of the group alignment. We additionally require that the number of buckets, and hence the number of tags, be a power of two.

The actual control array is N + GROUP_SIZE tag bytes where the last group is a mirror of the first one. This avoids the need to bounds checks and explicit wrapping when searching within a group.

growth_left (N - items)

This is actually N - items - deleted.

Probe<'a, T> and ProbeMut<'a, T>

Scan<'a, T>, ScanMut<'a, T>

Internally, Scan, ScanMut, Drain, and IntoIter would likely all use the same RawIterRange type similar to the existing one, but we would explicitly use the safe wrappers for their implementations on HashMap and HashSet and HashTable instead of the internal one.

The primary motivation behind the current design of RawIter was to be able to have an iterator which didn't borrow the table so that you could make modifications to the table while iterating over it. This allows it to be the "common base" for all iterators, both mutable and immutable, and even those like ExtractIf which modify the control array as it iterates.

In hindsight, a better design would have been an iterator type that doesn't actually implement the Iterator trait but instead has an inherent next method that is passed a reference to the control array only for the duration of the call.

Finally, we wouldn't actually need separate Scan and ScanMut types since these would just correspond to Iter and IterMut on HashTable.

@clarfonthey
Copy link
Contributor Author

clarfonthey commented Aug 22, 2024

The actual control array is N + GROUP_SIZE tag bytes where the last group is a mirror of the first one. This avoids the need to bounds checks and explicit wrapping when searching within a group.

This bit actually confuses me a bit-- aren't we already aligned to a multiple of the group size, so, we don't need to add any extra padding? Or is this specifically for the case when N < GROUP_SIZE? I thought that ensuring that N is a power of two also meant that you would have an even number of groups to scan, but I could have been misread.

  • I'd like to remove RawTable from the crate's public API. Almost all of the use cases for it are now provided through HashTable which exposes an entirely safe API. As far as I know, the only crate that actually needs the low level access to buckets that RawTable provides is https://github.com/jonhoo/griddle, which is no longer maintained.

That's one benefit of being a published crate instead of a standard library API: we can just publish a breaking version and it's totally fine. I was assuming that ultimately, we'd end up with a breaking version anyway once we replaced the APIs, so, I'm down with this.

@Amanieu
Copy link
Member

Amanieu commented Aug 22, 2024

This bit actually confuses me a bit-- aren't we already aligned to a multiple of the group size, so, we don't need to add any extra padding? Or is this specifically for the case when N < GROUP_SIZE? I thought that ensuring that N is a power of two also meant that you would have an even number of groups to scan, but I could have been misread.

We are actually using the "unaligned groups" variant that was briefly covered in the CppCon presentation since that is the variant used by the C++ implementation of SwissTable in Abseil.

@clarfonthey
Copy link
Contributor Author

Oh, suddenly the triangular probing makes a lot more sense. Thank you for clarifying that.

@clarfonthey
Copy link
Contributor Author

Adding in a few extra notes here re: some other conversations that were had in other places:

  • Will want to look into the internals for dashmap specifically to ensure that the new API is usable by them. There is some concern that a lack of lifetime-erased types will cause issues with dashmap, but there may be a way around this.
  • indexmap was specifically concerned about the size of the various types being passed around, and will want to also focus on that as well.

@clarfonthey
Copy link
Contributor Author

clarfonthey commented Oct 3, 2024

Poking at this more: one other weird question I have regarding the alignment of the actual control table.

If we're using unaligned buckets, then why is the group width factored at all into the alignment of the control data? Presumably, we'd just need to align to the bucket array, right?

I'm also just looking at the alignment calculations in the current RawTableInner and think that we could probably replace the entire layout calculation with the following, now that it's stable:

Layout::from_size_align_unchecked(num_buckets + Group::SIZE, Group::SIZE).extend(Layout::array::<T>(num_buckets))

but I want to make sure that this is actually correct. Note that we'd be kind of abusing the semantics here since, while extend actually puts the padding between the two arrays, we'd just shift it to the beginning and ignore alignment for the groups. The resulting offset is the same regardless.


EDIT: Now realising my mistake is that the buckets are in reverse order. I'll redo that code to something that actually works.

@Amanieu
Copy link
Member

Amanieu commented Oct 3, 2024

There's an extra Group::SIZE entries in the control data which mirror the first Group::SIZE entries. This is needed to support unaligned group loads which point to the very end of the control array.

Also in the past we avoided using extend because it generated worse code due to extra overflow checks. It also increased compilation times slightly.

@clarfonthey
Copy link
Contributor Author

Realising now that basically all of my problems were due to accidentally flipping the two fields, so, now all the alignment makes sense.

Will try and take another stab at this tomorrow. Hopefully at least some of the code can be simplified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants